HBase: 2015

Monday, December 21, 2015

Yet more on HBase GC tuning with CMS

By Lars Hofhansl

Some of my previous articles delve into GC tuning and more tuning for scanning.

We have since performed more tests with heavy write loads, and I need to revise my previous recommendation based on the findings.

I wrote a simple test tool that generates 5 million random keys of approximately 200 bytes; then it starts 50 threads that each pick 100k random batches of these keys and write them to HBase. I then start several instances of this tool and point them at an HBase cluster.
The result is that we see a lot of churn in HBase as new versions of Cells are written but the overall size of the data is kept more or less constant with compactions - so I can test very large write loads with limited disk space.

We find that for these kinds of loads the young generation of 512MB that I had recommended before is hopelessly undersized. We're seeing lots of premature promotions into tenured space, followed by minute-long full GC pauses that eventually cause the region servers to shut down.

For the tests I ran with a 16G heap and found a young gen of 2G is good.

Note that in CMS the size of the young gen is one of the knobs to trade latency for throughput. The smaller the young gen, the smaller the pauses tend to be, provided that all short-lived garbage fits into the young gen. If it does not, short-lived objects get promoted into the tenured space, and that will eventually lead to very long full collection pauses (minutes with 30G heaps).

So the goal is to size the young gen just large enough to fit the short-lived objects. In HBase we do have some guideposts for this: 40% of the heap (by default) is dedicated to the memstores, another 40% (again by default) to the block cache, and the rest (20%) to "day-to-day" garbage.

With that I now recommend the following for the young generation:

at least 512MB
after that 10-15% of the heap
but at most 3GB
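
The three rules above amount to clamping a fraction of the heap between a floor and a ceiling. A minimal sketch of that arithmetic follows; the function name and the 12.5% default (the midpoint of the 10-15% range) are illustrative assumptions, not anything provided by HBase or the JVM.

```python
def recommended_young_gen_mb(heap_mb: int, fraction: float = 0.125) -> int:
    """Clamp a fraction of the heap to the 512MB..3GB range.

    `fraction` is an assumed midpoint of the 10-15% guideline.
    """
    floor_mb = 512        # "at least 512MB"
    ceiling_mb = 3 * 1024  # "but at most 3GB"
    candidate_mb = int(heap_mb * fraction)
    return max(floor_mb, min(ceiling_mb, candidate_mb))


if __name__ == "__main__":
    for heap_gb in (2, 16, 32):
        size = recommended_young_gen_mb(heap_gb * 1024)
        print(f"{heap_gb}G heap -> -Xmn{size}m")
```

For the 16G heap used in the tests this lands on the same 2G young gen mentioned earlier; small heaps hit the floor and very large heaps hit the ceiling.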

For most setups the following should cover all the bases:

-Xmn2g - small young gen, i.e. eden plus the two survivor spaces (or even -Xmn3g for very large heaps)
-XX:+UseParNewGC - collect the young gen in parallel
-XX:+UseConcMarkSweepGC - use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70 - start collecting when 70% of the tenured gen is full to avoid collection under pressure
-XX:+UseCMSInitiatingOccupancyOnly - always use the configured occupancy fraction; do not let the JVM adjust the CMS trigger

In the end you have to test with your own workloads. Start with the smallest young gen size that you think you can get away with, then watch the GC log. If you see lots of "promotion failed" type messages, you need to increase the young gen (-Xmn); keep doing that until the promotion failures stop.
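
Watching the GC log for promotion failures can be automated with a few lines. This is only a sketch: it assumes the log is plain text and simply counts lines containing the phrase "promotion failed"; the function name is hypothetical, and the exact log layout varies by JVM version and logging flags.

```python
from typing import Iterable


def count_promotion_failures(lines: Iterable[str]) -> int:
    """Count log lines that mention a promotion failure."""
    return sum(1 for line in lines if "promotion failed" in line)
```

To use it, open whichever file your GC logging is written to (for example the one named by -Xloggc) and pass the file object to the function. A count that keeps growing between runs suggests the young gen is still too small.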

We'll be doing testing with G1GC soon.

Friday, May 8, 2015

My HBaseCon talk about HBase Performance Tuning

By Lars Hofhansl

HBaseCon 2015 was a blast as always. All the presentations and videos will be online soon.

In the meantime my presentation on "HBase Performance Tuning" can be found on SlideShare.

Monday, April 20, 2015

HBaseCon 2015

Don't forget to come to HBaseCon, the yearly get-together for all things HBase in San Francisco. May 7th, 2015.

We have a great collection of sessions this year:

A highly-trafficked HBase cluster with an uptime of sixteen months
An HBase deploy that spans three datacenters doing master-master replication between thousands of HBase nodes in each
Some nuggets on how Bigtable does it (and HBase could too)
How HBase is being used to record patient telemetry 24/7 on behalf of the Michael J. Fox Foundation to enable breakthroughs in Parkinson Disease research
How Pinterest and Microsoft are doing it in the cloud, how FINRA and Bloomberg are doing it out east, and how Rocketfuel, Yahoo! and Flipboard, etc., are representing the west

Among many others!

I'll be talking about HBase Tuning and have a brief cameo in the HBase 2.0 panel, talking about semantic versioning. Feel free to find me afterwards.

Sunday, March 8, 2015

HBase Scanning Internals - SEEK vs SKIP

By Lars Hofhansl, March 8th, 2015

Recently we ran into a problem where a mapper that scanned a region of about 10GB with a timerange that did not include any Cell (KeyValue) took upwards of 20 minutes to finish; it processed only about 8MB/s.

It turns out this was a known problem that has eluded a fix for a while: deciding at the scanner level whether to SEEK ahead past potentially many Cells or whether to power through and repeatedly SKIP the next Cell until the next useful Cell is reached.

Background

Scanning forward through a file, HBase has no knowledge about how many columns are to follow for the current row or how many versions there are for the current column (remember that every version of every column in HBase has its own Cell in the HFiles).

In order to deal with many columns or versions, HBase can issue SEEKs to the next row (seeking past all versions for all remaining columns for the row) or the next column (seeking past all remaining versions). HBase errs on the side of SEEK'ing frequently since SKIP'ing potentially 1000s or 100000s of times can be disastrous for performance (imagine a row with 100 columns and 10000 versions each - unlikely, but possible).

The problem is: SEEK'ing is about 10x as expensive as a single SKIP - depending on how many seek pointers into HFiles have to be reset.

Yet, in many cases we have rows with only a few or even just one column and just one version each. Now the SEEKs will cause a significant slowdown.

After much trying, there is finally a proposed solution:

HBASE-13109 Make better SEEK vs SKIP decisions during scanning

(0.98.12 and later)

How does it work?

HFiles are organized like B-Trees, and it is possible to determine the start key of the next block in each file.

A heuristic is now:
Will the SEEK we are about to execute get us into the next block of the HFile that is at the top of the heap used for the merge sorting between the HFiles?

If so, we will definitely benefit from seeking (the repeated SKIPs would eventually exhaust the current block and load the next one anyway).

If not, we'll likely benefit from repeated SKIP'ing. This is a heuristic only: the SEEK might allow us to seek past many Cells in HFiles not at the top of the heap, but that is unlikely.
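
The decision above boils down to one comparison between the key we want to seek to and the start key of the next block. The sketch below illustrates the idea only; it is not the code from HBASE-13109. The function name is hypothetical, keys are modeled as plain byte strings compared lexicographically (HBase uses its own Cell comparator), and falling back to a SEEK when the next block's start key is unknown is an assumption made here.

```python
from typing import Optional


def should_seek(seek_target: bytes, next_block_start: Optional[bytes]) -> bool:
    """Return True to SEEK, False to keep SKIP'ing inside the current block.

    seek_target:      key the scanner would like to jump to.
    next_block_start: start key of the next block of the HFile at the top
                      of the merge heap, or None if it is not known.
    """
    if next_block_start is None:
        # Assumption for this sketch: without block information, honor the SEEK.
        return True
    # The target lies in (or beyond) the next block: seeking pays off.
    # Otherwise the target is still inside the current block: SKIP instead.
    return seek_target >= next_block_start
```

For a row with only a few Cells, the next row's key usually falls inside the current block, so the function returns False and the scanner keeps SKIP'ing; for a row with thousands of versions, the target lies past the block boundary and the SEEK is executed.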

In all tests I did this performs equal to or (much) better than the current behavior.

The upshot is that now the HBase plumbing (filters, coprocessors, delete marker handling, skipping columns, etc) can continue to issue SEEKs where that is logically possible, and then at the scanner level it can decide whether to act upon the SEEK or to keep SKIP'ing inside the current block, with almost optimal performance.

TL;DR:
HBASE-13109 allows many queries against HBase to execute 2-3x faster, especially those that select specific columns, those that filter on timeranges, or those where many deleted columns for column families are encountered.
Queries that request all columns and that do not filter the data in any way will not benefit from this change.

You do not need to do anything to get that benefit (other than upgrading your HBase to at least the upcoming 0.98.12).

Monday, January 12, 2015

More HBase GC tuning

By Lars Hofhansl

My article on HBase GC tuning observations explores how to configure the garbage collector for HBase.

There is actually a bit more to it, especially when block encoding is enabled for a column family and the predominant access is via the scan API with row caching.

Block encoding currently requires HBase to materialize each KeyValue after decoding during scanning. This has the potential to produce a lot of garbage for each scan RPC, especially when the scan response is large, as might be the case when scanner caching is set to a larger value (see Scan.getCaching()).

My experiments show that in that case it is better to run with a larger young gen of 512MB (-Xmn512m) and - crucially - make sure that all per RPC garbage across all handlers actively performing scans fits into the survivor space. (Note that this statement is true whether or not block encoding is used. Block encoding just produces a lot more garbage).

HBase actually has a way to limit the size of an individual scan response by setting hbase.client.scanner.max.result.size.

Quick recap:

The HotSpot JVM divides the heap into PermGen, Tenured Gen, and the YoungGen. YoungGen itself is divided into Eden and two survivor spaces.

By default the survivor ratio is 8 (i.e. each survivor space is 1/8 of Eden; Eden and the two survivor spaces together add up to the configured young gen size).

What to do?

With -Xmn512m this comes to ~51MB for each of the two survivor spaces.
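
The ~51MB figure follows from the survivor ratio: with a ratio of 8, Eden is 8 parts and each survivor space is 1 part, so the young gen is split into 10 parts. A small sketch of that arithmetic (the function name is illustrative):

```python
def survivor_space_mb(young_gen_mb: float, survivor_ratio: int = 8) -> float:
    """Size of ONE survivor space.

    young gen = eden + 2 * survivor, and eden = survivor_ratio * survivor,
    so survivor = young gen / (survivor_ratio + 2).
    """
    return young_gen_mb / (survivor_ratio + 2)


if __name__ == "__main__":
    print(survivor_space_mb(512))  # one survivor space for -Xmn512m, in MB
```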

Now you want to set hbase.client.scanner.max.result.size such that the expected number of active handler threads times the max.result.size is less than each of the survivor spaces.

With 30 handlers (default in HBase as of 0.98) this comes to 1.7MB. Since not all handlers will always scan using the full buffer, 2MB is probably a good setting.
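
The per-handler budget is just the survivor space divided by the number of handlers that might be scanning at once. A quick sketch, with hypothetical function and parameter names:

```python
def max_result_size_mb(survivor_mb: float, handlers: int) -> float:
    """Largest per-scan response (in MB) such that responses from all
    handlers together still fit into one survivor space."""
    if handlers <= 0:
        raise ValueError("handlers must be positive")
    return survivor_mb / handlers


if __name__ == "__main__":
    # ~51.2MB survivor space (from -Xmn512m) shared by 30 handlers
    print(round(max_result_size_mb(51.2, 30), 2))
```

This is the worst case where every handler scans with a full buffer, which is why rounding up to 2MB is a reasonable compromise in practice.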

Makes sense, doesn't it? If per-scan results across all active handlers cannot fit into the survivor space, the collector has no choice but to promote to the tenured generation. That is exactly the scenario one would like to avoid, as we would slowly pollute the tenured gen with per-RPC garbage, eventually requiring a full GC to defragment.

A 2MB response also happens to be a good size for 1ge networks. 2MB will take at least 16ms to be transmitted, which is higher than the typical intra-datacenter latency of 0.1-1ms. With 10ge this would need to be reviewed, as 2MB can be sent over 10ge in 1.6ms.
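
The transmission times quoted above are the raw wire time of the payload at line rate, ignoring protocol overhead. A sketch of the calculation (the function name is illustrative, and 2MB is taken as 2,000,000 bytes to match the round numbers):

```python
def transmit_ms(payload_bytes: int, link_gbps: float) -> float:
    """Minimum time in milliseconds to push a payload through a link,
    ignoring latency and protocol overhead."""
    bits = payload_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * 1000.0


if __name__ == "__main__":
    for gbps in (1, 10):
        print(f"{gbps}ge: {transmit_ms(2_000_000, gbps):.1f} ms")
```

As long as the wire time is well above the network round-trip latency, the per-RPC overhead is amortized; on faster links a larger response size may be needed to get the same effect.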

TL;DR:

When using block encoding make sure #handlers * max.result.size < survivor space, and use a slightly larger young generation:

-Xmn512m (in hbase-env.sh)

hbase.client.scanner.max.result.size = 2097152 (in hbase-site.xml)

Update, January 31st, 2015
Since HBase versions 0.98 and later produce a little bit more garbage than 0.94 due to using protobuf, I am now generally recommending a young gen of 512MB for those versions of HBase.

The same reasoning goes for writes: when batching writes, make sure the batch sizes are around 2MB, so that they can temporarily fit into the survivor space.