Monday, May 6, 2013

HBase durability guarantees

Like most other databases HBase logs changes to a write ahead log (WAL) before applying them (i.e. making them visible).

I wrote in more detail about HDFS' flush and sync semantics here. Also check out HDFS-744 and HBASE-5954.

Recently I got back to this area in the code and committed HBASE-7801 to HBase 0.94.7+, which streamlines the durability API for the HBase client.
The existing API was a mess, with some settings configured only via the table on the server (such as deferred log flush) and some only via the client per Mutation (i.e. Put or Delete), such as Mutation.setWriteToWAL(...). In addition, some table settings could be overridden by a client and some could not.

HBASE-7801 includes the necessary changes on both server and client to support a new API: Mutation.setDurability(Durability).
A client can now control durability per Mutation - by passing a Durability enum value - as follows:
  1. USE_DEFAULT: do whatever the default is for the table. This is the client default.
  2. SKIP_WAL: do not write to the WAL
  3. ASYNC_WAL: write to the WAL asynchronously as soon as possible
  4. SYNC_WAL: write to the WAL synchronously
  5. FSYNC_WAL: write to the WAL synchronously and guarantee that the edit is on disk (not currently supported)
Prior to this change only the equivalent of USE_DEFAULT and SKIP_WAL was supported.
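
To make the list above concrete, here is a minimal, self-contained sketch of how the levels relate to each other. It is a toy model, not HBase code: the enum below is a stand-in that only mirrors the names from the list, and the helper class and its method names are hypothetical. The rule that the table default must itself be a concrete level is an assumption of this sketch.

```java
/** Stand-in mirroring the levels listed above; NOT the HBase Durability class. */
enum Durability { USE_DEFAULT, SKIP_WAL, ASYNC_WAL, SYNC_WAL, FSYNC_WAL }

/** Hypothetical helper that models the semantics described in the list. */
final class DurabilityModel {
    private DurabilityModel() {}

    /** USE_DEFAULT defers to the table's setting; any other level is taken as-is. */
    static Durability effective(Durability mutation, Durability tableDefault) {
        if (tableDefault == Durability.USE_DEFAULT) {
            // Assumption of this sketch: the table must name a concrete level.
            throw new IllegalArgumentException("table default must be a concrete level");
        }
        return mutation == Durability.USE_DEFAULT ? tableDefault : mutation;
    }

    /** True if an edit with this (already resolved) level is written to the WAL at all. */
    static boolean writesToWal(Durability resolved) {
        return resolved != Durability.SKIP_WAL;
    }

    /** True if the write waits for the WAL write to complete before returning. */
    static boolean waitsForWal(Durability resolved) {
        return resolved == Durability.SYNC_WAL || resolved == Durability.FSYNC_WAL;
    }
}
```

With a table default of SYNC_WAL, a Mutation left at USE_DEFAULT resolves to SYNC_WAL, while a Mutation that explicitly asks for ASYNC_WAL keeps ASYNC_WAL and does not wait for the WAL.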

The changes are backward compatible. A server prior to 0.94.7 will gracefully ignore any client settings that it does not support.

The Mutation.setWriteToWAL(boolean) API is deprecated in 0.94.7+ and was removed in 0.95+.

I filed a followup jira to do the same refactoring for the table descriptors: HBASE-8375.

Monday, April 22, 2013

HBaseCon 2013

I am lucky enough to be on the HBaseCon Program Committee again this year.

We received a great set of talks and we organized them into four tracks: Operations, Internals, Ecosystem, and Case Studies.

Today the Session List was announced:

http://blog.cloudera.com/blog/2013/04/hbasecon-2013-speakers-tracks-and-sessions-announced/

Obviously I am biased, but I think this will be a great event.

Friday, February 15, 2013

Managing HBase releases

Last year I (perhaps foolishly?) signed up to be the release manager for the 0.94 branch of HBase. I released 0.94.0 in May 2012. Since then I have learned a lot about the open source release process (which, in the end, is not that different from releasing proprietary software).

There are no defined responsibilities per se for such a role other than actually doing the release.

When I started, HBase had relatively infrequent releases, and there used to be many discussions about delaying a release to get some more "essential" features in.
The partial cure for this is twofold:
  1. Frequent releases on a somewhat strict schedule. If a feature or fix does not get in, it'll be in the next release a few weeks later.
    This reduces the pressure to push a feature into the next point release.
    The only discussions now are typically around serious bugs that have been discovered during the round of release candidates.
    This is the "release train" model. The train stops every few weeks, changes that are ready get on board, the other changes wait for the next train.
  2. A passing, comprehensive test suite, so that we can do the frequent releases with confidence. Problems are identified early (if the tests fail regularly, nobody will look into new test failures, or these failures just drown in the noise of failing tests).
We're now on HBase 0.94.5 (released today actually!), and the pattern that emerged after some initial adjustment is one point release (0.94.0, 0.94.1, ...) about every four to six weeks (depending on how many rounds of release candidates were needed), with a relatively constant rate of change of two fixes and improvements per day (hence a point release ends up having 60-80 changes).

As you can see HBase is pretty actively maintained!

So to me being the release manager includes all of the following:
  • Help decide what features or fixes should be included in the release.
  • Help channel the discussion about whether a feature in (unstable) trunk is important enough to be backported to 0.94.
  • Try to review all the changes that go into 0.94. Due to the rate of change I cannot have a detailed look at every fix (I have other responsibilities in my day job too), but I try to at least skim the changes to see if anything risky or incorrect sticks out.
  • Make sure the test suite passes reliably. This is a pet peeve of mine and has been especially challenging, but we're now at a pass rate of about 70% (up from 20-30% a few months back, but it still needs to be improved).
    (Note that many of the failures are due to timing issues in the virtual build machines, and not due to a bug in the HBase code base. A single failing test out of over 1800 tests will make the test suite fail. So 70% is not as bad as it sounds.)
  • Keep timely releases. This is my pet peeve number two.
    Releases should be frequent, on a semi-strict schedule, and backward compatible.
    That allows users to get features and fixes sooner and does not require cumbersome serial upgrades (where you need to upgrade from version 0.94.0 to 0.94.1 first in order to then upgrade from 0.94.1 to 0.94.2, and so on). Intermediary releases can be skipped (it is possible to upgrade from 0.94.0 to 0.94.5 directly).
    At the same time - as mentioned above - it allows developers to finish a feature or fix correctly rather than rushing it to "get it in", just because the next release will be 6 months from now.
  • (Sometimes) coordinate with vendors (such as Cloudera and Hortonworks) to time a release or a fix with their releases. This is on a best effort basis, the Apache release is independent of any vendor; but let's be honest, a significant fraction of our users run a release from these vendors.
  • Doing the actual release:
    • Tagging the release in SVN
    • Creating the release artifacts (currently we use the ones generated by the jenkins build for this).
    • Go through a round of one or more RCs and get other committers to test and vote for this RC. Here we need to improve with more automated integration tests.
    • Uploading the release to the official Apache mirrors.
    • Pushing the release to the Maven repository (which involves a lot of black voodoo).
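
As a rough sanity check on the 70% suite pass rate mentioned in the list above: if a single failing test fails the whole suite, the per-test reliability implied by that number can be estimated. The sketch below assumes that test failures are independent, which is a simplification (timing issues on a slow build machine tend to hit several tests at once), and the class and method names are made up for illustration.

```java
/** Back-of-the-envelope estimate; assumes independent test failures. */
final class SuitePassRate {
    private SuitePassRate() {}

    /** Per-test pass probability implied by an overall suite pass rate. */
    static double perTestPassRate(double suitePassRate, int testCount) {
        return Math.pow(suitePassRate, 1.0 / testCount);
    }

    /** Overall suite pass rate if every test passes with the given probability. */
    static double suitePassRate(double perTestPassRate, int testCount) {
        return Math.pow(perTestPassRate, testCount);
    }
}
```

Under that assumption, SuitePassRate.perTestPassRate(0.70, 1800) comes out at about 0.9998, i.e. an individual test fails only about once in 5,000 runs.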
0.94 is the current stable release branch of HBase. As long as the next version (0.96) does not have a stable release we will keep backporting new features to 0.94 and keep the frequent releases going.

So far this has been fun (with the occasional frustration about the flaky test suite in the past).

The HBase community is very friendly and invites outside patches and improvements. So download HBase 0.94.5, and start contributing :)

Wednesday, January 30, 2013

SQL on HBase

My Salesforce colleague and cubicle neighbor, James Taylor, just released Phoenix, a SQL layer on top of HBase, to the Open Source world.

Phoenix is implemented as a JDBC driver. It makes use of various HBase features such as coprocessors and filters to push predicates into the server as much as possible. Queries are parallelized across RegionServers.

Phoenix has a formal data model that includes making use of the row key structure for optimization.

Currently Phoenix is limited to single table operations.

Here's James' blog entry announcing Phoenix.

Friday, January 18, 2013

HBase region server memory sizing

Sizing a machine for HBase is somewhat of a black art.

Unlike a pure storage machine that would just be optimized for disk size and throughput, an HBase RegionServer is also a compute node.

Every byte of disk space needs to be matched with a fraction of a byte in the RegionServer's Java heap.

You can estimate the ratio of raw disk space to required Java heap as follows:

RegionSize / MemstoreSize *
ReplicationFactor * HeapFractionForMemstores

Or in terms of HBase/HDFS configuration parameters:

hbase.hregion.max.filesize /
hbase.hregion.memstore.flush.size *
dfs.replication *
hbase.regionserver.global.memstore.lowerLimit

Say you have the following parameters (these are the defaults in 0.94):
  • 10GB regions
  • 128MB memstores
  • HDFS replication factor of 3 
  • 40% of the heap used for the memstores

Then: 10GB/128MB*3*0.4 = 96.
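
The same arithmetic as a small Java sketch, in case you want to plug in your own numbers. The class and method names are made up for illustration; only the formula itself comes from the text above.

```java
/** Sketch of the disk-to-heap ratio formula; names are illustrative only. */
final class HeapSizing {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    private HeapSizing() {}

    /** RegionSize / MemstoreSize * ReplicationFactor * HeapFractionForMemstores. */
    static double diskToHeapRatio(long regionBytes, long memstoreBytes,
                                  int replicationFactor, double memstoreHeapFraction) {
        return (double) regionBytes / memstoreBytes * replicationFactor * memstoreHeapFraction;
    }

    /** Java heap (in bytes) needed to serve the given amount of raw disk space. */
    static double requiredHeapBytes(long rawDiskBytes, double diskToHeapRatio) {
        return rawDiskBytes / diskToHeapRatio;
    }
}
```

HeapSizing.diskToHeapRatio(10 * HeapSizing.GB, 128 * HeapSizing.MB, 3, 0.4) reproduces the ratio of 96 (up to floating point rounding).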

Now think about this. With the default settings this means that if you wanted to serve 10T worth of disk space per region server you would need a 107GB Java heap!
Or if you give a region server a 10G heap you can only utilize about 1T of disk space per region server machine.

Most people are surprised by this. I know I was.

Let's double check:
In order to serve 10T worth of raw disk space - 3.3T of effective space after 3-way replication - with 10GB regions, you'd need ~338 regions. At 128MB per memstore that's about 43GB. But by default only 40% of the heap is used for the memstores, so what you actually need is 43GB/0.4 ~ 107GB. Yep, it's right.

Maybe we can get away with a bit less by assuming that not all memstores are 100% full at all times. That is offset by the fact that not all regions will be exactly the same size or 100% filled.

Now. What can you do?
There are several options:
  1. Increase the region size. 20GB is about the maximum, although some people claim they have 200GB regions. (hbase.hregion.max.filesize)
  2. Decrease the memstore size. Depending on your write load you can go smaller, 64MB or even less. (hbase.hregion.memstore.flush.size).
    You can allow a memstore to grow beyond this size temporarily.
    (hbase.hregion.memstore.block.multiplier)
  3. Increase the HDFS replication factor. That does not really help per se, but if you have more disk space than you can utilize, increasing the replication factor would at least put your disks to good use.
  4. Fiddle with the heap fractions used for the memstores. If your load is write-heavy, maybe up that to 50% of the heap (hbase.regionserver.global.memstore.upperLimit, hbase.regionserver.global.memstore.lowerLimit)
These parameters (except the replication factor, which is an HDFS setting) are described in hbase-default.xml that ships with HBase.

Personally I would place the maximum disk space per machine that can be served exclusively with HBase around 6T, unless you have a very read-heavy workload.
For 6T the Java heap should be 32GB (20G regions, 128M memstores, the rest defaults). With MSLAB in 0.94 that works.
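
To see how the options above move the numbers, here is a small sketch that applies the ratio formula to a fixed amount of raw disk. The helper is hypothetical; the inputs are the region size, memstore size, replication factor, and memstore heap fraction discussed above.

```java
/** Sketch comparing tuning options; names are illustrative only. */
final class SizingOptions {
    private SizingOptions() {}

    /** Java heap in GB needed to serve diskTb terabytes of raw disk. */
    static double heapGb(double diskTb, double regionGb, double memstoreMb,
                         int replicationFactor, double memstoreHeapFraction) {
        double ratio = (regionGb * 1024.0 / memstoreMb) * replicationFactor * memstoreHeapFraction;
        return diskTb * 1024.0 / ratio;
    }
}
```

For 6T of raw disk this gives roughly 64GB with the defaults (10GB regions, 128MB memstores), 32GB with 20GB regions, and 16GB with 20GB regions plus 64MB memstores.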

Of course your needs may vary. You may have a mostly read-only load, in which case you can shrink the memstores. Or the disk space might be shared with other applications.
Maybe you need smaller regions or larger memstores. In that case the maximum disk space you can serve per machine would be less.

Future JVMs might support bigger heaps effectively (JDK7's G1 comes to mind).

In any case, the formula above provides a reasonable starting point.

Saturday, December 15, 2012

HBase Profiling

By Lars Hofhansl

Modern CPU cores can execute hundreds of instructions in the time it takes to reload the L1 cache. "RAM is the new disk" as a coworker at Salesforce likes to say. The L1 cache is the new RAM, I might add.

As we add more and more CPU cores, we can easily be memory IO bound unless we are careful.

Many common problems I have seen over the years were related to:
  1. concurrency problems
    Aside from safety and liveness considerations, a typical problem is too much synchronization limiting potential parallel execution.
  2. unneeded or unintended memory barriers
    Memory barriers are required in Java by the following language constructs:
    • synchronized - sets read and write barriers as needed (details depend on JVM, version, and settings)
    • volatile - sets a read barrier before a read to a volatile, and write barrier after a write
    • final - sets a write barrier after the assignment
    • AtomicInteger, AtomicLong, etc - uses volatiles and hardware CAS instructions
  3. unnecessary, unintended, or repeated memory copy or access
    Memory copying is often seen in Java, for example because of the lack of in-array pointers, or really just general unawareness and the expectation that "the garbage collector will clean up the mess." Well, it does, but not without a price.
(Entire collections of books are dedicated to each of these topics, so I won't embarrass myself by going into more detail.)

Like any software project of reasonable size, HBase has problems of all the above categories.

Profiling in Java has become extremely convenient. Just start jVisualVM, which ships with the Sun/Oracle JDK, pick the process to profile (in my case a local HBase regionserver) and start profiling.

Over the past few weeks I did some on and off profiling in HBase, which led to the following issues:

HBASE-6603 - RegionMetricsStorage.incrNumericMetric is called too often

Ironically, here it was the collection of a performance metric that caused a measurable slowdown of up to 15%(!) for very wide rows (> 10k columns).
The metric was maintained as an AtomicLong, which introduced a memory barrier in one of the hottest code paths in HBase.
The good folks at Facebook found the same issue at roughly the same time. (It turns out that they were also... uhm... the folks who introduced the problem.)
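
A generic sketch of the pattern, to illustrate why a per-cell AtomicLong update hurts. This is not the actual HBASE-6603 patch; the class and method names are made up, and batching per row is just one common way to take atomic updates off a hot path.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative metric holder; not taken from the HBase code base. */
final class ScanMetric {
    private final AtomicLong cellsRead = new AtomicLong();

    /** Hot-path version: one atomic update (and its memory barrier) per cell. */
    void countPerCell(int cellsInRow) {
        for (int i = 0; i < cellsInRow; i++) {
            cellsRead.incrementAndGet();
        }
    }

    /** Batched version: count in a plain local variable, publish once per row. */
    void countPerRow(int cellsInRow) {
        long local = 0;
        for (int i = 0; i < cellsInRow; i++) {
            local++; // stands in for the per-cell work of the scan
        }
        cellsRead.addAndGet(local);
    }

    long total() {
        return cellsRead.get();
    }
}
```

Both variants arrive at the same total. For a row with 10k columns the first performs 10,000 atomic updates, the second performs one.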

HBASE-6621 - Reduce calls to Bytes.toInt

A KeyValue (the data structure that represents "columns" in HBase) is currently backed by a single byte[]. The sizes of the various parts are encoded in this byte[] and have to be read and decoded, each time with an extra memory access. In many cases that can be avoided, leading to a slight performance improvement.
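
A minimal sketch of the idea, using a deliberately simplified cell layout that is assumed here for illustration only: a 4-byte key length, a 4-byte value length, then the key bytes and the value bytes. The class and its methods are hypothetical, not HBase code. The point is only that decoding the lengths once into locals avoids re-reading the array on every use.

```java
import java.nio.ByteBuffer;

/** Toy length-prefixed cell: [int keyLen][int valueLen][key][value]. Not HBase's KeyValue. */
final class LengthPrefixedCell {
    private LengthPrefixedCell() {}

    /** Decodes a big-endian int at the given offset. */
    static int toInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    static byte[] encode(byte[] key, byte[] value) {
        return ByteBuffer.allocate(8 + key.length + value.length)
                .putInt(key.length).putInt(value.length).put(key).put(value).array();
    }

    /** Re-decodes both lengths on every loop iteration. */
    static long valueSumNaive(byte[] cell) {
        long sum = 0;
        for (int i = 0; i < toInt(cell, 4); i++) {
            sum += cell[8 + toInt(cell, 0) + i];
        }
        return sum;
    }

    /** Decodes each length exactly once and reuses the result. */
    static long valueSumCached(byte[] cell) {
        final int keyLen = toInt(cell, 0);
        final int valueLen = toInt(cell, 4);
        final int valueOff = 8 + keyLen;
        long sum = 0;
        for (int i = 0; i < valueLen; i++) {
            sum += cell[valueOff + i];
        }
        return sum;
    }
}
```

Both methods return the same sum; the second one simply touches the length fields once instead of once per iteration.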

HBASE-6711 - Avoid local results copy in StoreScanner

All references pertaining to a single row (i.e. KeyValues with the same row key) were copied at the StoreScanner layer. Removing this led to another slight performance increase with wide rows.

HBASE-7180 - RegionScannerImpl.next() is inefficient

This introduces a mechanism for coprocessors to access RegionScanners at a lower level, thus allowing skipping of a lot of unnecessary setup for each next() call. In tight loops a coprocessor can make use of this new API to save another 10-15%.

HBASE-7279 - Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

The row key of a KeyValue was copied in the various scan related classes. To reduce that effect the row key was previously cached in the KeyValue class - leading to extra memory required for each KeyValue.
This change avoids all copying and hence also obviates the need for caching the row key.
A KeyValue now is hardly more than an array pointer (a byte[], an offset, and a length), and no data is copied any longer all the way from the block loaded from disk or cache to the RPC layer (unless the KeyValues are optionally encoded on disk, in which case they still need to be decoded in memory - we're working on improving that too).
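
To illustrate what "no copying" means for row keys, here is a small sketch that compares two row keys directly inside their backing arrays, next to a variant that copies them out first. It is a generic illustration with made-up names, not the code from HBASE-7279.

```java
import java.util.Arrays;

/** Illustrative row key comparison on (byte[], offset, length) slices. */
final class RowCompare {
    private RowCompare() {}

    /** Unsigned lexicographic comparison of two slices; allocates nothing. */
    static int compareInPlace(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen;
    }

    /** Same result, but creates two temporary arrays (garbage) on every call. */
    static int compareWithCopy(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        byte[] ca = Arrays.copyOfRange(a, aOff, aOff + aLen);
        byte[] cb = Arrays.copyOfRange(b, bOff, bOff + bLen);
        return compareInPlace(ca, 0, ca.length, cb, 0, cb.length);
    }
}
```

Called once per KeyValue over millions of KeyValues, the copying variant produces two short-lived arrays each time, while the in-place variant only reads memory that is already there.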

Previously the size of a KeyValue on the scan path was at least 116 bytes + the length of the rowkey (which can be arbitrarily long). Now it is ~60 bytes, flat and including its own reference.
(Remember that during the course of a large scan we might be creating millions or even billions of KeyValue objects.)

This is a nice improvement both in terms of scan performance (15-20% for small row keys of a few bytes, much more for large ones) and in terms of produced garbage.
Since all copying is avoided, scanning now scales almost linearly with the number of cores.

HBASE-6852 - SchemaMetrics.updateOnCacheHit costs too much while full scanning a table with all of its fields

Other folks have been busy too. Here Cheng Hao found another problem with a scan related metric that caused a noticeable slowdown (even though I did not believe him at first).
This removed another set of unnecessary memory barriers.

HBASE-7336 - HFileBlock.readAtOffset does not work well with multiple threads

This is a slightly different issue caused by bad synchronization of the FSReader associated with a Storefile. There is only a single reader per storefile. So if the file's blocks are not cached - possibly because the scan indicated that it wants no caching, because it expects to touch too many blocks - the scanner threads are now competing for read access to the store file. That led to outright terrible performance, such as scanners timing out even with just two scanners accessing the same file in a tight loop.
This patch is a stop gap measure: Attempt to acquire the lock on the reader, and if that fails switch to HDFS positional reads, which can read at an offset without affecting the state of the stream, and hence require no locking.
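
The try-lock-then-fall-back pattern can be sketched with plain Java NIO. This is an illustration under assumptions, not the HBase patch: a local FileChannel stands in for the HDFS stream, FileChannel.read(ByteBuffer, long) stands in for an HDFS positional read, and all class and method names are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantLock;

/** One shared reader per file; a local file stands in for an HDFS stream. */
final class SharedFileReader {
    private final FileChannel channel;
    private final ReentrantLock streamLock = new ReentrantLock();

    SharedFileReader(FileChannel channel) {
        this.channel = channel;
    }

    /** Demo helper: writes content to a temp file and opens it read-only. */
    static SharedFileReader overTempFile(byte[] content) {
        try {
            Path p = Files.createTempFile("shared-reader", ".bin");
            p.toFile().deleteOnExit();
            Files.write(p, content);
            return new SharedFileReader(FileChannel.open(p, StandardOpenOption.READ));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Uses the stateful stream if it is free, otherwise a positional read. */
    byte[] read(long offset, int length) {
        if (streamLock.tryLock()) {
            try {
                return seekAndRead(offset, length);
            } finally {
                streamLock.unlock();
            }
        }
        return positionalRead(offset, length);
    }

    /** Stateful: moves the channel position, so the caller must hold streamLock. */
    private byte[] seekAndRead(long offset, int length) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(length);
            channel.position(offset);
            while (buf.hasRemaining() && channel.read(buf) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            return buf.array(); // zero-padded if the file ended early
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stateless: reads at an absolute offset and leaves the channel position alone. */
    byte[] positionalRead(long offset, int length) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(length);
            while (buf.hasRemaining()) {
                if (channel.read(buf, offset + buf.position()) < 0) {
                    break; // EOF
                }
            }
            return buf.array(); // zero-padded if the file ended early
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A thread that finds the lock taken does not wait; it reads the same bytes through the positional path, so two scanners on the same file no longer serialize on the reader.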

Summary

Together these various changes can lead to a ~40-50% scan performance improvement when using a single core, and even more when using multiple cores on the same machine (as is the case with HBase).

An entirely unscientific benchmark

20m rows, with two column families of just a few dozen bytes each.
I performed two tests:
1. A scan that returns rows to the client
2. A scan that touches all rows via a filter but does not return anything to the client.
(This is useful to gauge the actual server side performance).

Further I tested with (1) no caching, all reads from disk (2) all data in the OS cache and (3) all data in HBase's block cache.

I compared 0.94.0 against the current 0.94 branch (what I will soon release as 0.94.4).

Results:
  • Scanning with scanner caching set to 10000:
    • 0.94.0
      no data in cache: 54s
      data in OS cache: 51s
      data in block cache: 35s

    • 0.94.4-snapshot
      no data in cache: 50s (IO bound between disk and network)
      data in OS cache: 43s
      data in block cache: 32s
      (limiting factor was shipping the results to the client)
  • all data filtered at the server (with a SingleColumnValueFilter that does not match anything, so each row is still scanned)
    • 0.94.0
      no data in cache: 31s
      data in OS cache: 25s
      data in block cache: 11s
    • 0.94.4-snapshot
      no data in cache: 22s
      data in OS cache: 17s
      data in block cache: 6.3s
I have not quantified the same with multiple concurrent scanners, yet.
So as you can see scan performance has significantly improved since 0.94.0.
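
For reading the numbers above, it helps to separate "time saved" from "throughput gained", since the two percentages differ for the same pair of timings. A tiny helper (names made up for illustration) does the arithmetic:

```java
/** Converts a before/after timing pair into relative improvements. */
final class ScanSpeedup {
    private ScanSpeedup() {}

    /** Fraction of wall-clock time saved, e.g. 0.25 means 25% less time. */
    static double timeSaved(double beforeSec, double afterSec) {
        return (beforeSec - afterSec) / beforeSec;
    }

    /** Relative throughput gain, e.g. 0.5 means 50% more rows per second. */
    static double throughputGain(double beforeSec, double afterSec) {
        return beforeSec / afterSec - 1.0;
    }
}
```

For the server-side filtered scan with all data in the block cache (11s down to 6.3s) this gives about 43% less time, or about 75% more throughput. For the same scan read from disk (31s down to 22s) it is about 29% less time, or about 41% more throughput.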

Salesforce just hired some performance engineers from a well known chip manufacturer, and I plan to get some of their time to analyze HBase in even more detail, to track down memory stalls, etc.