Wednesday, January 1, 2014

More HBase performance profiling

By Lars Hofhansl

Continuing my sporadic profiling sessions I identified a few more issues in HBase.


First, a note about profiling

"Instrumentating" profilers often distort the actual results and frequently lead to premature optimization.
For example in HBase that lead to caching of redundant information on each KeyValue wasting heap space. Using a sampler showed that none of the optimized methods were hotspots.
The instrumentation is still useful, but the result should always be validated by a "sampler" or, even better, real "performance counters".

I simply used the Sampler in jVisualVM, which ships with the JDK.

Here are some new issues I found and fixed: 


HBASE-9915 Performance: isSeeked() in EncodedScannerV2 always returns false
HBase optimizes seeks and ensures that (re)seeks that would land on a key on the current block continue scanning forward inside the block, rather than (1) consulting the HFile index again, (2) loading the block, and (3) scanning from the beginning of that block again.
There was a bug in HBase where this optimization did not happen for block-encoded HFiles. Nobody noticed that this never worked.
A performance improvement of 2-3x (200-300%) when block encoding is enabled.

HBASE-9807 block encoder unnecessarily copies the key for each reseek
Here we copied the entire row key just to compare it with the first key of the next index block when the HFile is block encoded.
10% improvement by itself. Much more together with HBASE-9915 mentioned above.

HBASE-10015 Replace intrinsic locking with explicit locks in StoreScanner
This one was somewhat surprising. The issue is that we're locking inside the StoreScanner on each operation of next/seek/etc - millions of times per second, just so that a parallel compaction/flush can invalidate the HFile readers, which happens a few times per hour.
According to conventional wisdom intrinsic locking (synchronized) should be preferred since the JVM does biased locking. In my tests, which were confirmed across JVMs and architectures, I found that by simply replacing intrinsic locks with ReentrantLock's there was 20-30% performance gain overall to be had.

HBASE-10117 Avoid synchronization in HRegionScannerImpl.isFilterDone
Another issue with synchronization. As described here the synchronization penalty is not just blocked threads, but also involves memory read and write fences when the lock is completely uncontended. It turns out that in the typical case the synchronization is not needed.
20-25% improvement

More to come.
Edit: Some spelling/grammar corrections.