Thursday, October 23, 2014

Branching - Managing Open Source and Corporate goals (HBase)

By Lars Hofhansl

Company and Open Source goals are often at odds or at least not completely aligned.
Here's how we do things for HBase (and dependent projects) at Salesforce.

  1. We do not fork any of the projects. A "fork" here being a departure from the open source repository significant enough to prevent us from contributing patches back to the open source branches or to use open source updates against our repository.
  2. We do (almost) all work against the open source branches (0.98 currently).
  3. We have internal copies of the HBase repository and all dependent projects (Hadoop, ZooKeeper, etc).
  4. We have minimal patches in our own repositories. Mostly pom changes to defined where to pull dependencies from - for example we want to build our HBase against our build of Hadoop.
    Sometimes we have an odd patch or two that have not made it back to open source.
  5. We attach internal version numbers to our builds such as 0.98.4-sfdc-2.2.1, to indicate the exact version of what we're running in production.
  6. Everything we run in production is build automatically (via jenkins jobs) from source against these internal repositories. This allows to be agile in case of emergencies.
  7. Updates to the internal repository are manual (by design). We do not track the open source branches automatically. At our own pace, when we are ready, we move to a new upstream version, which most of the time allows us to remove some of one-off patches we had applied locally. For example we stayed at 0.98.4 for a while with some patches on top, and recently moved to 0.98.7, to which we had contributed all of the patches.
  8. All internal patches are eventually cleaned up and contributed back to open source, so that we can follow along the release train of minor version (0.98.4, 0.98.5, etc).
  9. Of course we keep an eye on and spend a lot of time with the open source releases to make sure they are stable and suitable for us to use a future internal release.
With this simple model we avoid forking, tack along with the open source releases, remain agile, and remain in full control over what exactly is deployed, completely at our own pace. Open source and corporate goals do not have to be at odds.

This might all be obvious; a bit of diligence is required to support both the open source goals for a project as well as the specific corporate goals.

Tuesday, August 12, 2014

HBase client response times

By Lars Hofhansl

When talking about latency expectations from HBase you'll hear various anecdotes and various war- and horror-stories where latency varies from a few milliseconds to many minutes.

In this blog post I will classify the latency conditions and their causes.

There are five principle causes to latency in HBase:

  1. Normal network round trip time (RTT) plus internal work that HBase has to do to retrieve or write the data. This type of latency is in the order of milliseconds - RTT + disk seeks, etc.
  2. Client retries due to moved regions or splits. Regions can be moved due to HBase deciding that the cluster is not balanced. As regions grow in size their are split.
    The HBase client rides over this by retrying a few times. You should make sure you have the client retry count and interval configured according to your needs. This can take in the order of one second. Note that this is independent on the size of the region as no data is actually moved - it remains in its location in HDFS - just the ownership is transferred to another region server.
  3. GC (see also hbase-gc-tuning-observations). This is an issue with nearly every Java application. HBase is no exception. A well tuned HBase install can keep this below 2s at all times, with average GC times around 50ms (this is for large heaps of 32GB)
  4. Server outages. This is were large tail end latency is caused. When a region server dies it takes - by default - 90s to detect, then that server's regions are reassigned to other servers, which then have replay logs and bring the regions online. Depending on the amount uncommitted data the region server had, this can take minutes. So in total this is in the order of a few minutes - if a client requests any data for any of the involved regions. Interactions with other region servers are not impacted.
  5. Writers overwhelming the cluster. Imagine a set of clients trying to write data into an HBase cluster faster than the cluster's hardware (network for 3-way replication, or disk) can absorb. HBase can buffer some spikes in memory (the memstore) but after some time of sustained write load has to force the clients to stop. In 0.94 a region server will just block any writers for a configurable maximum amount of time (90s by default). In 0.96 and later the server throws a RegionTooBusyException back to the client. The client then will have to retry until HBase has enough resource to accept the request. See also about-hbase-flushes-and-compactions. Depending on how fast HBase can compact excess HFiles this condition can last minutes.
All of the cases refer to any delays caused by HBase itself. Scanning large regions or writing a lot of data in bulk-write requests naturally have to adhere to physics. The data needs to loaded from disk - potentially from another machines - or it needs to be written across the network to three replicas. The time needed here depends on the particular network/disk setup.

The gist is that when things are smooth (and you have disabled Nagle's, i.e. enable tcpnodelay) you'll see latency of a few ms if things are in the blockcache, or < 20ms or so when disk seeks are needed.
The 99th percentile will include GCs, region splits, and region moves, and you should see something around 1-2s.

In the event of failures such as failed region servers or overwhelming the cluster with too many write requests, latency can go as high as a few minutes for requests to the involved regions.

Wednesday, July 16, 2014

About HBase flushes and compactions

By Lars Hofhansl

The topic of flushes and compactions comes up frequently when using HBase. There are somewhat obscure configuration options around this, you hear terms like "write amplification", and you might see scary messages about blocked writes in the logs until a flush has finished.

Let's step back for a minute and explore what HBase is actually doing. The configuration parameters will then make more sense.

Unlike most traditional databases HBase stores its data in "Log Structured Merge" (LSM) Trees.

There are numerous academic treatments concerning LSM trees, so I won't go into details here.

Basically in HBase it works something like this:
  1. Edits (Puts, etc) are collected and sorted in memory (using a skip list specifically). HBase calls this the "memstore"
  2. When the memstore reached a certain size (hbase.hregion.memstore.flush.size) it is written (or flushed) to disk as a new "HFile"
  3. There is one memstore per region and column family
  4. Upon read, HBase performs a merge sort between all - partially sorted - memstore disk images (i.e. the HFiles)
From a correctness perspective that is all that is needed... But note that HBase would need to consider every memstore image ever written for sorting. Obviously that won't work. Each file needs to be seeked and read in order to find the next key in the sort. Hence eventually some of the HFiles need to be cleaned up and/or combined: compactions.

A compaction asynchronously reads two or more existing HFiles and rewrites the data into a single new HFile. The source HFiles are then deleted.

This reduces the work to be done at read time at the expense of rewriting the same data multiple times - this effect is called "write amplification". (There are some more nuances like major and minor compaction, which files to collect, etc, but that is besides the point for this discussion)

This can be tweaked to optimize either reads or writes.

If you let HBase accumulate many HFiles without compacting them, you'll achieve better write performance (the data is rewritten less frequently). If on the other hand you instruct HBase to compact many HFiles sooner you'll have better read performance, but now the same data is read and rewritten more often.

HBase allows to tweak when to start compacting HFiles and what is considered the maximum limit of HFiles to ensure acceptable read performance.

Generally flushes and compaction can commence in parallel. A scenario of particular interest is when clients write to HBase faster than the IO (disk and network) can absorb, i.e. faster than compactions can reduce the number of HFiles - manifested in an ever larger number of HFiles, eventually reaching the specified limit.
When this happens the memstores can continue to buffer the incoming data, but they cannot grow indefinitely - RAM is limited.

What should HBase do in this case? What can it do?
The only option is to disallow writes, and that is exactly what HBase does.

There are various knobs to tweak flushes and compactions:
  • hbase.hregion.memstore.flush.size
    The size a single memstore is allowed to reach before it is flushed to disk.
  • hbase.hregion.memstore.block.multiplier
    A memstore is temporarily allowed to grow to the maximum size times this factor.
  • hbase.regionserver.global.memstore.lowerLimit
    JVM global limit on aggregate memstore size before some of the memstore are force-flushed (in % of the heap).
  • hbase.regionserver.global.memstore.upperLimit
    JVM memstore size limit before writes are blocked (in % of the heap)
  • hbase.hstore.compactionThreshold
    When a store (region and column family) has reach this many HFiles, HBase will start compacting HFiles
  • hbase.hstore.blockingStoreFiles
    HBase disallows further flushes until compactions have reduced the number of HFiles at least to this value. That means that now the memstores need to buffer all writes and hence eventually are subject blocking clients if compactions cannot keep up.
  • hbase.hstore.compaction.max
    The maximum number of HFiles a single - minor - compaction will consider.
  • hbase.hregion.majorcompactionTime interval between timed - major - compactions. HBase will trigger a compaction with this frequency even when no changes occurred.
  • hbase.hstore.blockingWaitTime
    Maximum time clients are blocked. After this time writes will be allowed again.
So when hbase.hstore.blockingStoreFiles HFiles are reached and the memstores are full (reaching
hbase.hregion.memstore.flush.size *
hbase.hregion.memstore.block.multiplier or due their aggregate size reaching hbase.regionserver.global.memstore.upperLimit) writes are blocked for hbase.hstore.blockingWaitTime milliseconds.

Note that this is not a flaw of HBase but simply physics. When disks/network are too slow at some point clients needs to slowed down.

Tuesday, July 8, 2014

Key note at HBaseCon 2014

By Lars Hofhansl

HBaseCon 2014 in May was a blast again as usual.
My keynote is now online (together with Google's and Facebook's key note - Facebook's for some reason was not recorded).

(And that from somebody who just three years ago would quite literally rather have died than doing any public speaking - I think you can do and learn anything if you put your mind to it.)

Also check out the session that actually have some contents: http://hbasecon.com/archive.html.

Monday, March 3, 2014

HBase GC tuning observations

By Lars Hofhansl

Java garbage collection tuning depends heavily on the application.

HBase is somewhat special: It produces a lot of short lived objects that typically does not outlive a single RPC request; at the same time most of the heap is (and should be) used for the block cache and the memstores that hold objects that typically live for a while.
In most setups HBase also requires reasonably short response times, many seconds of garbage collection are not acceptable.

Quick GC primer 

There is a lot of (better informed) literature out there about this topic, so here just the very basics.
  • In typical applications most object "die young" (they are created and soon are no longer needed). This is certainly true for HBase. This is called the "weak generational hypothesis". Most modern garbage collectors accommodate this by organizing the heap into generations.
  • HotSpot: manages four generations:
    1. Eden for all new objects
    2. Survivor I and II where surviving objects are promoted when eden is collected
    3. Tenured space. Objects surviving a few rounds (16 by default) of eden/survivor collection are promoted into the tenured space
    4. Perm gen for classes, interned strings, and other more or less permanent objects.
  • The principle costs to the garbage collector are (1) tracing all objects from the "root" objects and (2) collecting all unreachable objects.
    Obviously #1 is expensive when many objects need to be traced, and #2 is expensive when objects have to be moved (for example to reduce memory fragmentation)

Back to HBase

So what does that mean for HBase?

Most of the heap should be set aside for the memstores and the blockcache as that is the principal data that HBase manages in memory.

HBase already attempts to minimize the number of objects used (see cost #1 above) and keeps them at the same size as much as possible (cost #2 above).

The memstore is allocated in chunks of 2mb (by default), and the block cache stores blocks of 64k (also by default).
With that HBase keeps the number of objects small and avoids heap fragmentation (since all objects are roughly the same size and hence new objects can fit into "holes" created by previously released objects).

Most other objects and datastructures do not outlive a single request and can (and should) be collected quickly.

That means we want to keep eden as small as possible - just big enough to handle the garbage from a few concurrent requests. That allows us to use most of the heap for the useful data and keep minor compactions short.
We also want to optimize for latency rather than throughput.
And we'd want GC settings where we do not move the objects in the tenured generation around and also collect new garbage as quickly as possible.
The young generation should be moving to avoid fragmentation (in fact all young gen collectors in Hotspot are moving).

TL;DR: With that all in mind here are the typical GC settings that I would recommend:

-Xmn256m - small eden space (maybe -Xmn512m, but not more than that)
-XX:+UseParNewGC - collect eden in parallel
-XX:+UseConcMarkSweepGC - use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70 - start collecting when 70% of the tenured gen are full to avoid collection under pressure
-XX:+UseCMSInitiatingOccupancyOnly - do not try to adjust CMS setting


Those should be good settings for starters. You should find average GC times around 50ms and even 99'ile GC times around 150ms, and absolute worst case 1500ms or so; all on contemporary hardware with heaps of 32GB or less.

There are efforts to move cached data off the Java to heap to reduce these pauses especially the worst case.

Wednesday, January 1, 2014

More HBase performance profiling

By Lars Hofhansl

Continuing my sporadic profiling sessions I identified a few more issues in HBase.

 

First, a note about profiling

"Instrumentating" profilers often distort the actual results and frequently lead to premature optimization.
For example in HBase that lead to caching of redundant information on each KeyValue wasting heap space. Using a sampler showed that none of the optimized methods were hotspots.
The instrumentation is still useful, but the result should always be validated by a "sampler" or, even better, real "performance counters".

I simply used the Sampler in jVisualVM, which ships with the JDK.

Here are some new issues I found and fixed: 

 

HBASE-9915 Performance: isSeeked() in EncodedScannerV2 always returns false
HBase optimizes seeks and ensures that (re)seeks that would land on a key on the current block continue scanning forward inside the block, rather than (1) consulting the HFile index again, (2) loading the block, and (3) scanning from the beginning of that block again.
There was a bug in HBase where this optimization did not happen for block-encoded HFiles. Nobody noticed that this never worked.
A performance improvement of 2-3x (200-300%) when block encoding is enabled.

HBASE-9807 block encoder unnecessarily copies the key for each reseek
Here we copied the entire row key just to compare it with the first key of the next index block when the HFile is block encoded.
10% improvement by itself. Much more together with HBASE-9915 mentioned above.

HBASE-10015 Replace intrinsic locking with explicit locks in StoreScanner
This one was somewhat surprising. The issue is that we're locking inside the StoreScanner on each operation of next/seek/etc - millions of times per second, just so that a parallel compaction/flush can invalidate the HFile readers, which happens a few times per hour.
According to conventional wisdom intrinsic locking (synchronized) should be preferred since the JVM does biased locking. In my tests, which were confirmed across JVMs and architectures, I found that by simply replacing intrinsic locks with ReentrantLock's there was 20-30% performance gain overall to be had.

HBASE-10117 Avoid synchronization in HRegionScannerImpl.isFilterDone
Another issue with synchronization. As described here the synchronization penalty is not just blocked threads, but also involves memory read and write fences when the lock is completely uncontended. It turns out that in the typical case the synchronization is not needed.
20-25% improvement

More to come.
Edit: Some spelling/grammar corrections.

Monday, July 29, 2013

HBase and data locality

By Lars Hofhansl

A quick note about data locality.

In distributed systems data locality is an important property. Data locality, here, refers to the ability to move the computation close to where the data is.

This is one of the key concepts of MapReduce in Hadoop. The distributed file system provides the framework with locality information, which is then used to split the computation into chunks with maximum locality.

When data is written in HDFS, one copy is written locally, another is written to another node in a different rack (if possible) and a third copy is written to another node in the same rack. For all practical purposes the two extra copies are written to random nodes in the cluster.

In typical HBase setups a RegionServer is co-located with an HDFS DataNode on the same physical machine. Thus every write is written locally and then to the two nodes as mentioned above. As long the regions are not moved between RegionServers there is good data locality: A RegionServer can serve most reads just from the local disk (and cache), provided short circuit reads are enabled (see HDFS-2246 and HDFS-347).

When regions are moved some data locality is lost and the RegionServers in question need to request the data over the network from remote DataNodes, until the data is rewritten locally (for example by running some compactions).

There are four basic events that can potentially destroy data locality:
  1. The HBase balancer decides to move a region to balance data sizes across RegionServers.
  2. A RegionServer dies. All its regions need to be relocated to another server.
  3. A table is disable and re-enabled.
  4. A cluster is stopped and restarted.
Case #1 is a trade off. Either a RegionServer hosts an over-proportional large portion of the data, or some of the data loses its locality.

In case #2 there is not much choice. That RegionServer and likely its machine died, so the regions need to be migrated to somewhere else where not all data is local (remember the "random" writes to the two other data nodes).

In case #4 HBase reads the previous locality information from the .META. table (whose regions - currently only one - are assigned to an RegionServer first), and then attempts to the assign the regions of all tables to the same RegionServers (fixed with HBASE-4402).

In case #3 the regions are reassigned in a round-robin fashion...

Wait what? If the a table is simply disabled and then re-enabled its regions are just randomly assigned?

Yep. In a large enough cluster you'll end up with close to 0% data locality, which means all data for all requests is pulled remotely from another machine. In some setups that means an outage, a time of unacceptably slow read performance.

This was brought up on the mailing lists again this week and we're fixing the old issue now (HBASE-6143, HBASE-9080).