Sunday, September 16, 2012

KeyValue explicit timestamps vs memstoreTS

In this post I tried to summarize HBase's ACID implementation. This topic was also the main subject of my talk at HBaseCon.

I talked to a few folks afterwards, and it seems there is some general confusion about the memstoreTS (and MVCC) and how it relates to the explicit timestamp stored with each KeyValue.

The key difference is that the timestamp is part of the data model. Each KeyValue has a rowkey, a column identifier, a value... and a timestamp.
An application can retrieve and set the timestamp, decide as of which timestamp(s) it wants to see a certain value, etc. It should be considered as just another part of a column (a KeyValue pair to be specific) that happens to have special meaning.

The memstoreTS on the other hand is a construct purely internal to HBase. Even though it is called a timestamp (TS), it has nothing to do with time, but is just a monotonically increasing "transaction number". It is used to control visibility (MVCC) of a "transaction" in HBase (i.e. a Put or Delete operation), so that changes are not partially visible.

Remember that the explicit timestamp is part of the application data that gets changed by a transaction.

Say you have two concurrent threads that update the same columns on the same row (a row is the unit at which HBase provides transactions):
Thread1, Put: c1 = a, c2 =b
Thread2, Put: c1 = x, c2 =y

That's cool. One of the threads will be first, HBase makes sure (via the memstoreTS) that Puts are not partially visible, and also makes sure that all explicit timestamp are set to the same value for the same Put operation.
So any client (looking at the latest timestamp) will either see (a,b) or (x,y).

Things get more confusing when a client sets the explicit timestamp of (say) a Put. Again consider this example where two concurrent threads both issue a Put to the same columns of the same row, but set different timestamps:

Thread1, Put: c1(t1) = a, c2(t2) =b
Thread2, Put: c1(t2) = x, c2(t1) =y

So what will a client see?

The first observation is still that each Put is seen in its entirety, or not at all.

If a Get just retrieves the latest versions and does so after both Thread1 and Thread2 finished it will see:
c1(t2) = x, c2(t2) = b

This is where many people get confused.

It looks like we're seeing mixed data from multiple Puts. Well, in fact we do, but we asked for only the latest versions, which include values from different transactions; so that is OK and correct.

To understand this better let's perform a Get of all versions. (i.e. call setMaxVersions() on your Get). Now we'll potentially see this:
  • nothing (if the Get happens to be processed before either the Puts).
  • c1(t1) = a, c2(t2) = b (Thread1 was processed first, and Thread2 is not yet finished)
  • c1(t2) = x, c2(t1) = y (Thread2 was processed first, and Thread1 is not yet finished)
  • c1(t1) = a, c1(t2) = x, c2(t1) = y, c2(t2) = b (both Thread1 and Thread2 finished)
We see that each Put indeed is either completely visible or not at all, and that HBase correctly avoid visibility of a partial Put operation.
In addition we can confirm that the timestamp is part of the data model that HBase exposes to clients, and the transaction visibility has nothing to do with the application controlled/visible timestamps.

Lastly in this scenario we also see that it is impossible for a client to piece together the state of the row at precise transaction boundaries.
Assuming both Thread1 and Thread2 have finished, a client can request the state as of t1 (in which case it gets c1(t1) = a, c2(t1) = y) or as of t2 (in which case it'll see c1(t2) = x, c2(t2) = b).

This behavior is the same as you'd expect from a relational database - the current state is the result of many past transactions; with the extra mental leap that the timestamp is actually part of the data model in HBase.

Wednesday, September 12, 2012

HBase client timeouts

The HBase client is a somewhat jumbled mess of layers with unintended nested retries, nested connection pools, etc. among others. Mixed in are connections to the Zookeeper ensemble.

It is important to realize that the client directly handles all communication with the RegionServers, there is no proxy at the server side. Consequently the client needs to do the service discovery and caching as well as the connection and thread management necessary. And hence some of the complexity is understandable: The client is part of the cluster.

See also this blog post. Before HBASE-5682 a client would potentially never recover when it could not reach the cluster. And before HBASE-4805 and HBASE-6326, a client could not - with good conscience - be used in a long running ApplicationServer.

An important aspect of any client library is what I like to call "time to exception". If things go wrong the client should (at least as an option) fail fast and let the calling application - which has the necessary semantic context - decide how to handle this situation.

Unfortunately the HBase and Zookeeper clients were not designed with this in mind.

Among the various time outs are:
  • ZK session timeout (zookeeper.session.timeout)
  • RPC timeout (hbase.rpc.timeout)
  • RecoverableZookeeper retry count and retry wait (zookeeper.recovery.retry, zookeeper.recovery.retry.intervalmill)
  • Client retry count and wait (hbase.client.retries.number, hbase.client.pause)
In some error paths these retry loops are nested, so that in the default setting if both ZK and HBase are down a client will throw an Exception after a whooping 20 Minutes! The application has no chance to react to outages in any meaningful way.

HBASE-6326 fixes one issue, where .META. and -ROOT- lookups would be nested, each time causing a ZK timeout N^2 times (N being the client retry count, 10 by default), which itself would be retried by RecoverableZookeeper (3 by default).

The defaults for some of these settings are optimized for the various server side components. If the network "blips" for five seconds the RegionServers should not abort themselves. So a session timeout of 180s makes sense there.

For clients running inside a stateless ApplicationServer the design goals are different. Short timeouts of five seconds seem reasonable. A failure is quickly detected and the application can react (potentially by controlled retrying).

With the fixes in the various jiras mentioned above, it is now possible (in HBase 0.94+) to set the various retry counts and timeouts to low values and get reasonably short timespans after which the client would report a connection error to calling application thread.
And this is in fact what should done when the HBaseClient (HTable, etc) is used inside an ApplicationServer for HBase requests that are synchronous in the calling thread (for example a web server serving data from HBase).