Friday, May 8, 2015

My HBaseCon talk about HBase Performance Tuning

By Lars Hofhansl

HBaseCon 2015 was a blast as always. All the presentations and videos will be online soon.

In the meanwhile my presentation on "HBase Performance Tuning" can be found on SlideShare.

Monday, April 20, 2015

HBaseCon 2015

Don't forget to come to HBaseCon, the yearly get-together for all things HBase in San Francisco. May 7th, 2015.

We have great collection of sessions this year:

  • A highly-trafficked HBase cluster with an uptime of sixteen months
  • An HBase deploy that spans three datacenters doing master-master replication between thousands of HBase nodes in each
  • Some nuggets on how Bigtable does it (and HBase could too)
  • How HBase is being used to record patient telemetry 24/7 on behalf of the Michael J. Fox Foundation to enable breakthroughs in Parkinson Disease research
  • How Pinterest and Microsoft are doing it in the cloud, how FINRA and Bloomberg are doing it out east, and how Rocketfuel, Yahoo! and Flipboard, etc., are representing the west

Among many others!

I'll be talking about HBase Tuning and have a brief cameo in the HBase 2.0 panel, talking abount semantic versioning. Feel free to find me afterwards.

Sunday, March 8, 2015

HBase Scanning Internals - SEEK vs SKIP

By Lars Hofhansl, March 8th, 2015

Recently we ran into a problem where a mapper that scanned a region of about 10GB with a timerange that did not include any Cell (KeyValue) took upwards of 20 minutes to finish; it processed only about 8MB/s.

It turns out this was a known problem that has eluded a fix for while: Deciding at the scanner level whether to SEEK ahead past potentially many Cells or whether to power through and repeatedly SKIP the next Cell until the next useful Cell is reached.


Scanning forward through a file, HBase has no knowledge about how many columns are to follow for the current row or how many versions there are for the current column (remember that every version of every column in HBase has its own Cell in the HFiles).

In order to deal with many columns or versions, HBase can issue SEEKs to the next row (seeking past all versions for all remaining columns for the row) or the next column (seeking past all remaining versions). HBase errs on the side of SEEK'ing frequently since SKIP'ing potentially 1000' or 100000's of times can be disastrous for performance (imagine a row with 100 columns and 10000 versions each - unlikely, but possible).

The problem is: SEEK'ing is about 10x as expensive as a single SKIP - depending on how many seek pointers into HFiles have to be reset.

Yet, in many cases we have rows with only a few or even just one column and just one version each. Now the SEEKs will cause a significant slowdown.

After much trying finally there is a proposed solution:
HBASE-13109 Make better SEEK vs SKIP decisions during scanning
(0.98.12 and later)

How does it work?

HFiles are organized like B-Trees, and it is possible to determine the start key of the next block in each file.

A heuristic is now:
Will the SEEK we are about to execute get us into the next block of the HFile that is at top of the heap used for the merge sorting between the HFiles?

If so, we will definitely benefit from seeking (the repeated SKIPs would eventually exhaust the current block and load the next one anyway).

If not, we'll likely benefit from repeated SKIP'ing. This is a heuristic only, the SEEK might allow us to seek past manys Cell in HFiles not at the top of the heap, but that is unlikely.

In all tests I did this performs equal to or (much) better than the current behavior.

The upshot is that now the HBase plumbing (filters, coprocessors, delete marker handling, skipping columns, etc) can continue to issue SEEKs where that is logically possible, and then at the scanner level it can decide whether to act upon the SEEK or to keep SKIP'ing inside the current block, with almost optimal performance.

HBASE-13109 Allows many queries against HBase to execute 2-3x faster. Especially those that select specific columns, or those that filter on timeranges, or where many deleted columns for column families are encountered.
Queries that request all columns and that do not filter the data in any way will not benefit from this change.

You do need to do anything to get that benefit (other than upgrading your HBase to at least the upcoming 0.98.12).

Monday, January 12, 2015

More HBase GC tuning

By Lars Hofhansl

My article on hbase-gc-tuning-observations explores how to configure the garbage collector for HBase.

There is actually a bit more to it, especially when block encoding is enabled for a column family and the predominant access is via the scan API with row caching.

Block encoding currently requires HBase to materialize each KeyValue after decoding during scanning, and hence this has the potential to produce a lot of garbage for each scan RPC, especially when the scan response is large as might be the case when scanner caching is set to larger value (see Scan.getCaching())

My experiments show that in that case it is better to run with a larger young gen of 512MB (-Xmn512m) and - crucially - make sure that all per RPC garbage across all handlers actively performing scans fits into the survivor space. (Note that this statement is true whether or not block encoding is used. Block encoding just produces a lot more garbage).

HBase actually has a way to limit the size of an individual scan response by setting hbase.client.scanner.max.result.size.


Quick recap:

The Hotspot JVM divides the heap into PermGen, Tenured Gen, and the YoungGen. YoungGen itself is divided into Eden and two survivor spaces.
By default the survivor ratio is 8 (i.e. each survivor space is 1/8 of Eden; together their sizes add up to the configured young gen size) 


What to do?

With -Xmn512m this comes to ~51MB for each of the two survivor spaces.
hbase.client.scanner.max.result.size = 2097152 (in hbase-site.xml)

Update, January 31st, 2015
Since HBase versions 0.98 and later produce a little bit more garbage than 0.94 due to using protobuf, I am now generally recommending a young gen of 512mb for those versions of HBase.

And the same reasoning goes for writes, when batching writes make sure the batch sizes are around 2MB, so that they can temporarily fir into the survivor generation.

Thursday, October 23, 2014

Branching - Managing Open Source and Corporate goals (HBase)

By Lars Hofhansl

Company and Open Source goals are often at odds or at least not completely aligned.
Here's how we do things for HBase (and dependent projects) at Salesforce.

  1. We do not fork any of the projects. A "fork" here being a departure from the open source repository significant enough to prevent us from contributing patches back to the open source branches or to use open source updates against our repository.
  2. We do (almost) all work against the open source branches (0.98 currently).
  3. We have internal copies of the HBase repository and all dependent projects (Hadoop, ZooKeeper, etc).
  4. We have minimal patches in our own repositories. Mostly pom changes to defined where to pull dependencies from - for example we want to build our HBase against our build of Hadoop.
    Sometimes we have an odd patch or two that have not made it back to open source.
  5. We attach internal version numbers to our builds such as 0.98.4-sfdc-2.2.1, to indicate the exact version of what we're running in production.
  6. Everything we run in production is build automatically (via jenkins jobs) from source against these internal repositories. This allows to be agile in case of emergencies.
  7. Updates to the internal repository are manual (by design). We do not track the open source branches automatically. At our own pace, when we are ready, we move to a new upstream version, which most of the time allows us to remove some of one-off patches we had applied locally. For example we stayed at 0.98.4 for a while with some patches on top, and recently moved to 0.98.7, to which we had contributed all of the patches.
  8. All internal patches are eventually cleaned up and contributed back to open source, so that we can follow along the release train of minor version (0.98.4, 0.98.5, etc).
  9. Of course we keep an eye on and spend a lot of time with the open source releases to make sure they are stable and suitable for us to use a future internal release.
With this simple model we avoid forking, tack along with the open source releases, remain agile, and remain in full control over what exactly is deployed, completely at our own pace. Open source and corporate goals do not have to be at odds.

This might all be obvious; a bit of diligence is required to support both the open source goals for a project as well as the specific corporate goals.

Tuesday, August 12, 2014

HBase client response times

By Lars Hofhansl

When talking about latency expectations from HBase you'll hear various anecdotes and various war- and horror-stories where latency varies from a few milliseconds to many minutes.

In this blog post I will classify the latency conditions and their causes.

There are five principle causes to latency in HBase:

  1. Normal network round trip time (RTT) plus internal work that HBase has to do to retrieve or write the data. This type of latency is in the order of milliseconds - RTT + disk seeks, etc.
  2. Client retries due to moved regions or splits. Regions can be moved due to HBase deciding that the cluster is not balanced. As regions grow in size their are split.
    The HBase client rides over this by retrying a few times. You should make sure you have the client retry count and interval configured according to your needs. This can take in the order of one second. Note that this is independent on the size of the region as no data is actually moved - it remains in its location in HDFS - just the ownership is transferred to another region server.
  3. GC (see also hbase-gc-tuning-observations). This is an issue with nearly every Java application. HBase is no exception. A well tuned HBase install can keep this below 2s at all times, with average GC times around 50ms (this is for large heaps of 32GB)
  4. Server outages. This is were large tail end latency is caused. When a region server dies it takes - by default - 90s to detect, then that server's regions are reassigned to other servers, which then have replay logs and bring the regions online. Depending on the amount uncommitted data the region server had, this can take minutes. So in total this is in the order of a few minutes - if a client requests any data for any of the involved regions. Interactions with other region servers are not impacted.
  5. Writers overwhelming the cluster. Imagine a set of clients trying to write data into an HBase cluster faster than the cluster's hardware (network for 3-way replication, or disk) can absorb. HBase can buffer some spikes in memory (the memstore) but after some time of sustained write load has to force the clients to stop. In 0.94 a region server will just block any writers for a configurable maximum amount of time (90s by default). In 0.96 and later the server throws a RegionTooBusyException back to the client. The client then will have to retry until HBase has enough resource to accept the request. See also about-hbase-flushes-and-compactions. Depending on how fast HBase can compact excess HFiles this condition can last minutes.
All of the cases refer to any delays caused by HBase itself. Scanning large regions or writing a lot of data in bulk-write requests naturally have to adhere to physics. The data needs to loaded from disk - potentially from another machines - or it needs to be written across the network to three replicas. The time needed here depends on the particular network/disk setup.

The gist is that when things are smooth (and you have disabled Nagle's, i.e. enable tcpnodelay) you'll see latency of a few ms if things are in the blockcache, or < 20ms or so when disk seeks are needed.
The 99th percentile will include GCs, region splits, and region moves, and you should see something around 1-2s.

In the event of failures such as failed region servers or overwhelming the cluster with too many write requests, latency can go as high as a few minutes for requests to the involved regions.

Wednesday, July 16, 2014

About HBase flushes and compactions

By Lars Hofhansl

The topic of flushes and compactions comes up frequently when using HBase. There are somewhat obscure configuration options around this, you hear terms like "write amplification", and you might see scary messages about blocked writes in the logs until a flush has finished.

Let's step back for a minute and explore what HBase is actually doing. The configuration parameters will then make more sense.

Unlike most traditional databases HBase stores its data in "Log Structured Merge" (LSM) Trees.

There are numerous academic treatments concerning LSM trees, so I won't go into details here.

Basically in HBase it works something like this:
  1. Edits (Puts, etc) are collected and sorted in memory (using a skip list specifically). HBase calls this the "memstore"
  2. When the memstore reached a certain size (hbase.hregion.memstore.flush.size) it is written (or flushed) to disk as a new "HFile"
  3. There is one memstore per region and column family
  4. Upon read, HBase performs a merge sort between all - partially sorted - memstore disk images (i.e. the HFiles)
From a correctness perspective that is all that is needed... But note that HBase would need to consider every memstore image ever written for sorting. Obviously that won't work. Each file needs to be seeked and read in order to find the next key in the sort. Hence eventually some of the HFiles need to be cleaned up and/or combined: compactions.

A compaction asynchronously reads two or more existing HFiles and rewrites the data into a single new HFile. The source HFiles are then deleted.

This reduces the work to be done at read time at the expense of rewriting the same data multiple times - this effect is called "write amplification". (There are some more nuances like major and minor compaction, which files to collect, etc, but that is besides the point for this discussion)

This can be tweaked to optimize either reads or writes.

If you let HBase accumulate many HFiles without compacting them, you'll achieve better write performance (the data is rewritten less frequently). If on the other hand you instruct HBase to compact many HFiles sooner you'll have better read performance, but now the same data is read and rewritten more often.

HBase allows to tweak when to start compacting HFiles and what is considered the maximum limit of HFiles to ensure acceptable read performance.

Generally flushes and compaction can commence in parallel. A scenario of particular interest is when clients write to HBase faster than the IO (disk and network) can absorb, i.e. faster than compactions can reduce the number of HFiles - manifested in an ever larger number of HFiles, eventually reaching the specified limit.
When this happens the memstores can continue to buffer the incoming data, but they cannot grow indefinitely - RAM is limited.

What should HBase do in this case? What can it do?
The only option is to disallow writes, and that is exactly what HBase does.

There are various knobs to tweak flushes and compactions:
  • hbase.hregion.memstore.flush.size
    The size a single memstore is allowed to reach before it is flushed to disk.
  • hbase.hregion.memstore.block.multiplier
    A memstore is temporarily allowed to grow to the maximum size times this factor.
    JVM global limit on aggregate memstore size before some of the memstore are force-flushed (in % of the heap).
    JVM memstore size limit before writes are blocked (in % of the heap)
  • hbase.hstore.compactionThreshold
    When a store (region and column family) has reach this many HFiles, HBase will start compacting HFiles
  • hbase.hstore.blockingStoreFiles
    HBase disallows further flushes until compactions have reduced the number of HFiles at least to this value. That means that now the memstores need to buffer all writes and hence eventually are subject blocking clients if compactions cannot keep up.
  • hbase.hstore.compaction.max
    The maximum number of HFiles a single - minor - compaction will consider.
  • hbase.hregion.majorcompactionTime interval between timed - major - compactions. HBase will trigger a compaction with this frequency even when no changes occurred.
  • hbase.hstore.blockingWaitTime
    Maximum time clients are blocked. After this time writes will be allowed again.
So when hbase.hstore.blockingStoreFiles HFiles are reached and the memstores are full (reaching
hbase.hregion.memstore.flush.size *
hbase.hregion.memstore.block.multiplier or due their aggregate size reaching writes are blocked for hbase.hstore.blockingWaitTime milliseconds.

Note that this is not a flaw of HBase but simply physics. When disks/network are too slow at some point clients needs to slowed down.