Wednesday, July 16, 2014

About HBase flushes and compactions

By Lars Hofhansl

The topic of flushes and compactions comes up frequently when using HBase. There are somewhat obscure configuration options around this, you hear terms like "write amplification", and you might see scary messages about blocked writes in the logs until a flush has finished.

Let's step back for a minute and explore what HBase is actually doing. The configuration parameters will then make more sense.

Unlike most traditional databases HBase stores its data in "Log Structured Merge" (LSM) Trees.

There are numerous academic treatments concerning LSM trees, so I won't go into details here.

Basically in HBase it works something like this:
  1. Edits (Puts, etc) are collected and sorted in memory (using a skip list specifically). HBase calls this the "memstore"
  2. When the memstore reached a certain size (hbase.hregion.memstore.flush.size) it is written (or flushed) to disk as a new "HFile"
  3. There is one memstore per region and column family
  4. Upon read, HBase performs a merge sort between all - partially sorted - memstore disk images (i.e. the HFiles)
From a correctness perspective that is all that is needed... But note that HBase would need to consider every memstore image ever written for sorting. Obviously that won't work. Each file needs to be seeked and read in order to find the next key in the sort. Hence eventually some of the HFiles need to be cleaned up and/or combined: compactions.

A compaction asynchronously reads two or more existing HFiles and rewrites the data into a single new HFile. The source HFiles are then deleted.

This reduces the work to be done at read time at the expense of rewriting the same data multiple times - this effect is called "write amplification". (There are some more nuances like major and minor compaction, which files to collect, etc, but that is besides the point for this discussion)

This can be tweaked to optimize either reads or writes.

If you let HBase accumulate many HFiles without compacting them, you'll achieve better write performance (the data is rewritten less frequently). If on the other hand you instruct HBase to compact many HFiles sooner you'll have better read performance, but now the same data is read and rewritten more often.

HBase allows to tweak when to start compacting HFiles and what is considered the maximum limit of HFiles to ensure acceptable read performance.

Generally flushes and compaction can commence in parallel. A scenario of particular interest is when clients write to HBase faster than the IO (disk and network) can absorb, i.e. faster than compactions can reduce the number of HFiles - manifested in an ever larger number of HFiles, eventually reaching the specified limit.
When this happens the memstores can continue to buffer the incoming data, but they cannot grow indefinitely - RAM is limited.

What should HBase do in this case? What can it do?
The only option is to disallow writes, and that is exactly what HBase does.

There are various knobs to tweak flushes and compactions:
  • hbase.hregion.memstore.flush.size
    The size a single memstore is allowed to reach before it is flushed to disk.
  • hbase.hregion.memstore.block.multiplier
    A memstore is temporarily allowed to grow to the maximum size times this factor.
    JVM global limit on aggregate memstore size before some of the memstore are force-flushed (in % of the heap).
    JVM memstore size limit before writes are blocked (in % of the heap)
  • hbase.hstore.compactionThreshold
    When a store (region and column family) has reach this many HFiles, HBase will start compacting HFiles
  • hbase.hstore.blockingStoreFiles
    HBase disallows further flushes until compactions have reduced the number of HFiles at least to this value. That means that now the memstores need to buffer all writes and hence eventually are subject blocking clients if compactions cannot keep up.
  • hbase.hstore.compaction.max
    The maximum number of HFiles a single - minor - compaction will consider.
  • hbase.hregion.majorcompactionTime interval between timed - major - compactions. HBase will trigger a compaction with this frequency even when no changes occurred.
  • hbase.hstore.blockingWaitTime
    Maximum time clients are blocked. After this time writes will be allowed again.
So when hbase.hstore.blockingStoreFiles HFiles are reached and the memstores are full (reaching
hbase.hregion.memstore.flush.size *
hbase.hregion.memstore.block.multiplier or due their aggregate size reaching writes are blocked for hbase.hstore.blockingWaitTime milliseconds.

Note that this is not a flaw of HBase but simply physics. When disks/network are too slow at some point clients needs to slowed down.

No comments:

Post a Comment