Tuesday, July 2, 2013

Protecting HBase against data center outages

By Lars Hofhansl

As HBase relies on HDFS for durability many of HDFS's idiosyncrasies need to be considered when deploying HBase into a production environment.

HBASE-5954 is still not finished, and in any event it is not clear whether the performance penalty incurred by syncing each log entry (i.e. each individual change) to disk is an acceptable performance trade off.

There are however a few practical considerations that can be applied right now.
It is important to keep in mind how HBase works:
  1. new data is collected in memory
  2. the memory contents are written into new immutable files (flush)
  3. smaller files are combined into larger immutable files and then deleted (compaction)
#3 is interesting when it comes power outages. As described here, data written to HDFS is typically only guaranteed to have reached N datanodes (3 by default), but not guaranteed to be persistent on disk. If the wrong N machines fail at the same time - as can happen during a data center power outage - recently written data can be lost.

In case #3 above, HBase rewrites an arbitrary amount of old data into new files and then deletes the old files. This in turn means that during a DC outage an arbitrary amount of data can be lost and that more specifically the data loss is not limited to the newest data.

Since version 1.1.1 HDFS support syncing a closed block to disk (HDFS-1539). When used together with HBase this guarantees that new files are persisted to disk before the old files are deleted, and thus compactions would no longer lead to data loss following DC outages.

Of course there is a price to pay. The client now needs to wait until the data is synced on the N datanodes, and more problematically this is likely to happen all at once when the datablock (64mb or larger) is closed. I found that with this setting new file creation takes between 50 and 100ms!

This performance penalty can be alleviated to some extend by syncing the data to disk early and asynchronously. Syncing early here has no detrimental effect as the all data written by HBase is immutable (a delay is only beneficial if a block is changed multiple times).

Linux/Posix can do that automatically with the proper fadvise and sync_file_range calls.
HDFS supports these since version 1.1.0 (HDFS-2465). Specifically, dfs.datanode.sync.behind.writes is useful here, since it will smooth out the sync cost as much as possible.

So... TL;DR: In our cluster we enable
  • dfs.datanode.sync.behind.writes
    and
  • dfs.datanode.synconclose
for HDFS, in order to get some amount of safety against DC power outages with acceptable performance.

2 comments:

  1. Do you need to mount the ext4 using dirsync option?

    ReplyDelete
  2. Yes. With ext4 you need to mount your volumes on the DataNodes with dirsync or you risk data loss on power outage.
    I recommend that in 2015 HBaseCon talk, but never updated this blog post, thanks for pointing it out.

    We did some performance testing with that and found that at least for HDFS used by HBase no performance penalty was observable.

    ReplyDelete