Monday, March 3, 2014

HBase GC tuning observations

By Lars Hofhansl

Java garbage collection tuning depends heavily on the application.

HBase is somewhat special: It produces a lot of short lived objects that typically does not outlive a single RPC request; at the same time most of the heap is (and should be) used for the block cache and the memstores that hold objects that typically live for a while.
In most setups HBase also requires reasonably short response times, many seconds of garbage collection are not acceptable.

Quick GC primer 

There is a lot of (better informed) literature out there about this topic, so here just the very basics.
  • In typical applications most object "die young" (they are created and soon are no longer needed). This is certainly true for HBase. This is called the "weak generational hypothesis". Most modern garbage collectors accommodate this by organizing the heap into generations.
  • HotSpot: manages four generations:
    1. Eden for all new objects
    2. Survivor I and II where surviving objects are promoted when eden is collected
    3. Tenured space. Objects surviving a few rounds (16 by default) of eden/survivor collection are promoted into the tenured space
    4. Perm gen for classes, interned strings, and other more or less permanent objects.
  • The principle costs to the garbage collector are (1) tracing all objects from the "root" objects and (2) collecting all unreachable objects.
    Obviously #1 is expensive when many objects need to be traced, and #2 is expensive when objects have to be moved (for example to reduce memory fragmentation)

Back to HBase

So what does that mean for HBase?

Most of the heap should be set aside for the memstores and the blockcache as that is the principal data that HBase manages in memory.

HBase already attempts to minimize the number of objects used (see cost #1 above) and keeps them at the same size as much as possible (cost #2 above).

The memstore is allocated in chunks of 2mb (by default), and the block cache stores blocks of 64k (also by default).
With that HBase keeps the number of objects small and avoids heap fragmentation (since all objects are roughly the same size and hence new objects can fit into "holes" created by previously released objects).

Most other objects and datastructures do not outlive a single request and can (and should) be collected quickly.

That means we want to keep eden as small as possible - just big enough to handle the garbage from a few concurrent requests. That allows us to use most of the heap for the useful data and keep minor compactions short.
We also want to optimize for latency rather than throughput.
And we'd want GC settings where do not move the objects in the tenured generation around and also collect new garbage as quickly as possible.
The young generation should be moving to avoid fragmentation (in fact all young gen collectors in Hotspot are moving).

TL;DR: With that all in mind here are the typical GC settings that I would recommend:

-Xmn256m - small eden space (maybe -Xmn512m, but not more than that)
-XX:+UseParNewGC - collect eden in parallel
-XX:+UseConcMarkSweepGC - use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70 - start collecting when 70% of the tenured gen are full to avoid collection under pressure
-XX:+UseCMSInitiatingOccupancyOnly - do not try to adjust CMS setting

Those should be good settings for starters. You should find average GC times around 50ms and even 99'ile GC times around 150ms, and absolute worst case 1500ms or so; all on contemporary hardware with heaps of 32GB or less.

There are efforts to move cached data off the Java to heap to reduce these pauses especially the worst case.

Wednesday, January 1, 2014

More HBase performance profiling

By Lars Hofhansl

Continuing my sporadic profiling sessions I identified a few more issues in HBase.


First, a note about profiling

"Instrumentating" profilers often distort the actual results and frequently lead to premature optimization.
For example in HBase that lead to caching of redundant information on each KeyValue wasting heap space. Using a sampler showed that none of the optimized methods were hotspots.
The instrumentation is still useful, but the result should always be validated by a "sampler" or, even better, real "performance counters".

I simply used the Sampler in jVisualVM, which ships with the JDK.

Here are some new issues I found and fixed: 


HBASE-9915 Performance: isSeeked() in EncodedScannerV2 always returns false
HBase optimizes seeks and ensures that (re)seeks that would land on a key on the current block continue scanning forward inside the block, rather than (1) consulting the HFile index again, (2) loading the block, and (3) scanning from the beginning of that block again.
There was a bug in HBase where this optimization did not happen for block-encoded HFiles. Nobody noticed that this never worked.
A performance improvement of 2-3x (200-300%) when block encoding is enabled.

HBASE-9807 block encoder unnecessarily copies the key for each reseek
Here we copied the entire row key just to compare it with the first key of the next index block when the HFile is block encoded.
10% improvement by itself. Much more together with HBASE-9915 mentioned above.

HBASE-10015 Replace intrinsic locking with explicit locks in StoreScanner
This one was somewhat surprising. The issue is that we're locking inside the StoreScanner on each operation of next/seek/etc - millions of times per second, just so that a parallel compaction/flush can invalidate the HFile readers, which happens a few times per hour.
According to conventional wisdom intrinsic locking (synchronized) should be preferred since the JVM does biased locking. In my tests, which were confirmed across JVMs and architectures, I found that by simply replacing intrinsic locks with ReentrantLock's there was 20-30% performance gain overall to be had.

HBASE-10117 Avoid synchronization in HRegionScannerImpl.isFilterDone
Another issue with synchronization. As described here the synchronization penalty is not just blocked threads, but also involves memory read and write fences when the lock is completely uncontended. It turns out that in the typical case the synchronization is not needed.
20-25% improvement

More to come.
Edit: Some spelling/grammar corrections.

Monday, July 29, 2013

HBase and data locality

By Lars Hofhansl

A quick note about data locality.

In distributed systems data locality is an important property. Data locality, here, refers to the ability to move the computation close to where the data is.

This is one of the key concepts of MapReduce in Hadoop. The distributed file system provides the framework with locality information, which is then used to split the computation into chunks with maximum locality.

When data is written in HDFS, one copy is written locally, another is written to another node in a different rack (if possible) and a third copy is written to another node in the same rack. For all practical purposes the two extra copies are written to random nodes in the cluster.

In typical HBase setups a RegionServer is co-located with an HDFS DataNode on the same physical machine. Thus every write is written locally and then to the two nodes as mentioned above. As long the regions are not moved between RegionServers there is good data locality: A RegionServer can serve most reads just from the local disk (and cache), provided short circuit reads are enabled (see HDFS-2246 and HDFS-347).

When regions are moved some data locality is lost and the RegionServers in question need to request the data over the network from remote DataNodes, until the data is rewritten locally (for example by running some compactions).

There are four basic events that can potentially destroy data locality:
  1. The HBase balancer decides to move a region to balance data sizes across RegionServers.
  2. A RegionServer dies. All its regions need to be relocated to another server.
  3. A table is disable and re-enabled.
  4. A cluster is stopped and restarted.
Case #1 is a trade off. Either a RegionServer hosts an over-proportional large portion of the data, or some of the data loses its locality.

In case #2 there is not much choice. That RegionServer and likely its machine died, so the regions need to be migrated to somewhere else where not all data is local (remember the "random" writes to the two other data nodes).

In case #4 HBase reads the previous locality information from the .META. table (whose regions - currently only one - are assigned to an RegionServer first), and then attempts to the assign the regions of all tables to the same RegionServers (fixed with HBASE-4402).

In case #3 the regions are reassigned in a round-robin fashion...

Wait what? If the a table is simply disabled and then re-enabled its regions are just randomly assigned?

Yep. In a large enough cluster you'll end up with close to 0% data locality, which means all data for all requests is pulled remotely from another machine. In some setups that means an outage, a time of unacceptably slow read performance.

This was brought up on the mailing lists again this week and we're fixing the old issue now (HBASE-6143, HBASE-9080).

Tuesday, July 2, 2013

Protecting HBase against data center outages

By Lars Hofhansl

As HBase relies on HDFS for durability many of HDFS's idiosyncrasies need to be considered when deploying HBase into a production environment.

HBASE-5954 is still not finished, and in any event it is not clear whether the performance penalty incurred by syncing each log entry (i.e. each individual change) to disk is an acceptable performance trade off.

There are however a few practical considerations that can be applied right now.
It is important to keep in mind how HBase works:
  1. new data is collected in memory
  2. the memory contents are written into new immutable files (flush)
  3. smaller files are combined into larger immutable files and then deleted (compaction)
#3 is interesting when it comes power outages. As described here, data written to HDFS is typically only guaranteed to have reached N datanodes (3 by default), but not guaranteed to be persistent on disk. If the wrong N machines fail at the same time - as can happen during a data center power outage - recently written data can be lost.

In case #3 above, HBase rewrites an arbitrary amount of old data into new files and then deletes the old files. This in turn means that during a DC outage an arbitrary amount of data can be lost and that more specifically the data loss is not limited to the newest data.

Since version 1.1.1 HDFS support syncing a closed block to disk (HDFS-1539). When used together with HBase this guarantees that new files are persisted to disk before the old files are deleted, and thus compactions would no longer lead to data loss following DC outages.

Of course there is a price to pay. The client now needs to wait until the data is synced on the N datanodes, and more problematically this is likely to happen all at once when the datablock (64mb or larger) is closed. I found that with this setting new file creation takes between 50 and 100ms!

This performance penalty can be alleviated to some extend by syncing the data to disk early and asynchronously. Syncing early here has no detrimental effect as the all data written by HBase is immutable (a delay is only beneficial if a block is changed multiple times).

Linux/Posix can do that automatically with the proper fadvise and sync_file_range calls.
HDFS supports these since version 1.1.0 (HDFS-2465). Specifically, dfs.datanode.sync.behind.writes is useful here, since it will smooth out the sync cost as much as possible.

So... TL;DR: In our cluster we enable
  • dfs.datanode.sync.behind.writes
  • dfs.datanode.synconclose
for HDFS, in order to get some amount of safety against DC power outages with acceptable performance.

Monday, May 6, 2013

HBase durability guarantees

By Lars Hofhansl

Like most other databases HBase logs changes to a write ahead log (WAL) before applying them (i.e. making them visible).

I wrote in more detail about HDFS' flush and sync semantics here: here. Also check out HDFS-744 and HBASE-5954.

Recently I got back to this area in the code and committed HBASE-7801 to HBase 0.94.7+, which streamlines the durability API for the HBase client.
The existing API was a mess, with some settings configured only via the table on the server (such as deferred log flush) and some only via the client per each Mutation (i.e. Put or Delete) such as Mutation.setWriteToWAL(...). In addition some table settings could be overridden by a client and some could not.

HBASE-7801 includes the necessary changes on both server and client to support a new API: Mutation.setDurablity(Durability).
A client can now control durability per Mutation - via passing a Durability Enum - as follows:
  1. USE_DEFAULT: do whatever the default is for the table. This is the client default.
  2. SKIP_WAL: do not write to the WAL
  3. ASYNC_WAL: write to the WAL asynchronously as soon as possible
  4. SYNC_WAL: write WAL synchronously
  5. FSYNC_WAL: write to the WAL synchronously and guarantee that the edit is on disk (not currently supported)
Prior to this change only the equivalent of USE_DEFAULT and SKIP_WAL was supported.

The changes are backward compatible. A server prior to 0.94.7 will gracefully ignore any client settings that it does not support.

The HTable.setWriteToWAL(boolean) API is deprecated in 0.94.7+ and was removed in 0.95+.

I filed a followup jira to do the same refactoring the table descriptors HBASE-8375.

Monday, April 22, 2013

HBaseCon 2013

I am lucky enough to be on the HBaseCon Program Committee again this year.

We received a great set of talks and we organized them into four tracks: Operations, Internals, Ecosystem, and Case Studies

Today the Session List was announced:

Obviously I am biased, but I think this will be a great event.