Sunday, October 7, 2012

Musings on Secondary Indexes

By Lars Hofhansl

There has been a lot of discussion about 2ndary indexes in HBase recently.

Since HBase is a distributed system there are many naive ways to do this, but it is not trivial to get it correct. HBase is a consistent key value store, so you would expect the same guarantees from your indexes.

The choices mostly boil down to two solutions:
  1. "Local" (Region level) indexes. These are good for low latency (everything is handled at locally at a server). The trade off is that it is impossible to ask globally where the row matching an index is located. Instead all regions have to be consulted.
  2. Global indexes. These can be used for global index queries, but involve multi server updates for index maintenance.
Here I will focus on the latter: Global Indexes.

In turns out a correct solution can be implemented with relative simplicity without any changes to HBase.
First off there are three important observations when talking about index updates:
  1. An index update is like a transaction
  2. An index update transaction is idempotent
  3. With a bit of discipline partial index transactions can be detected at runtime. I.e. the atomicity can be guaranteed at read-time.
Naively we could just write data to the "main" table and to a special index table:
The index table would be keyed by <columnValue>|<rowkey>. The <rowkey> is necessary to keep the index entries unique.

This simple approach breaks down in the face of failures. Say a client writes to these two tables, but fails before one of the changes could be applied. Now we may have the main table update but missed the index update or vice versa.
Notice that it is OK to have a stale index entry (a false index match) but not to have a stale main table entry.

So here is a solution for consistent secondary indexes in HBase:
  • Update: If using TTL, setup the main table to expire slightly before the index table(s).
  • For a Put, first retrieve a local timestamp, apply all changes to the index table(s) first using that timestamp. Then apply the changes to the main table. Using the same timestamp.
    The timestamp can safely be created locally (with the same caveat that generally applies to HBase: Updates to the same KV in the same ms leads to more or less undefined results).
  • Upon delete, read back the row, then delete the main table entry, then remove all index entries (using the column values and timestamps read back from the row and using the same delete type - family, column, or version - that we used for the main table). Since we used the same timestamps this will be correctly.
    As Daniel Gómez Ferro points out, this in fact can only work for version deletes. I.e. only a specific version of a row can be deleted. In order to delete all versions of a column (a column delete marker) of a row or all versions of all columns (a family delete marker), all of these versions would need to be enumerated in the client and be deleted one by one.
Note that if there is a failure we might end up with an extra entry in the index, but we'll never have an entry in the main table and not in the index table.
  • Now upon reading along an index, we find the rows that the index entries point to. When we retrieve the rows we have to verify that a row we read has the expected timestamp for the index column. If not we simply ignore that row/index entry. It must be from a partial index update - either from a concurrent update that is still in progress, or from a partially failed update.
    This is how we assure atomicity at read-time; or in other words we can tolerate a slightly wrong index without affecting correctness.
Lastly we need a M/R job that can build an index from an existing table. Because of the idempotent nature of index transaction - stemming from the fact that everything in HBase is timestamped - we can enable the index on a table (i.e. clients will immediately start to write to both tables) and then run the M/R job. As soon as the M/R is finished the index is up to date and can be used (the M/R job would update some index metadata in the end to mark the index as available).

That's it. No server side code needed. No locking. And this will scale nicely. Just a little bit of discipline. Discipline that could easily be encapsulated in a client library.

Since partial failures will be rare, and we can filter those at read time, we need not clean the index up.

Because of the timestamps we can also still perform correct time range queries.


  1. Hi Lars,

    Very interesting post.

    I think there could be some corner cases, though. Since this technique doesn't provide snapshot isolation, I think it would be difficult to understand the guarantees it provides. For example, if you have some row set to some value and you 'atomically' update it to some other value, a concurrent reader of the index might miss both states OR it could see both of them, depending on timing constrains (if my understanding is correct, of course).

    Also, I'm not sure how you propose to handle deletes. For a family/column delete I think you need to fallback to a scan and a entry-by-entry version-delete both from the table and the index, no?

    I think it would be great to discuss what guarantees do Secondary Indexes implemented on bare HBase provide and whether or not you need global transactions to give useful guarantees.


  2. Hi Daniel,

    good point about family deletes, these would have to be translated to single deletes at the client (see HBASE-6942, which could help with that).

    As for your atomicity example. The guarantee would be the same as it in "straight" HBase (a scanner that started before a couple of updates to different rows, might see some of them but not all). The behavior will not be different from normal HBase operations (changes to the same row will either be seen in total or not at all).

    Anything that would need a lock (CheckAndPut, Increment, etc) could indeed not be supported that way.

    Is that what you meant?

  3. Hi Lars,

    Thanks for your response. HBASE-6942 looks interesting indeed.

    As for the atomicity case, I'm not sure we are talking about the same issue. Suppose we have a table with a secondary index on column 'project'. One of the rows for the table is 'dev0' -> 'apache hbase'.

    Client CS starts a scan over the index, reading every entry that starts with 'apache', for example.

    Just before the scan reaches 'apache hbase|dev0'', which points to our previously defined row, a second client CU updates 'dev0' to project 'apache hadoop'. If I understood correctly your proposal, when CS reads 'apache hbase|dev0' it's going to ignore it, missing the user 'dev0' completely, even though it was either 'apache base' or 'apache hadoop' the whole time.

    Does this case fit into your model? I think this issue could surprise people. In my opinion the sensible result would be to return either of 'dev0' -> 'apache hbase' or 'dev0' -> 'apache hadoop', but returning none (or both) is wrong.


  4. Lars can you mention a bit about the types of query patterns you would like to support through this? I can see how this approach could give you equality and range queries, though I'm not sure how efficient they would be and it looks like your client would probably need to filter keys out.

    Probably not very efficient for compound queries where hash indexing is the usual approach. Just wondering if you could share more about what use cases matter to you.

  5. @Carter: I am mostly interested in patterns that benefit from global indexes (just to state the obvious). I.e. highly selective indexes. The other dimension I am interested in, is that this should scale naturally with HBase in the number of machines.
    Latency will be high. Index scan that are not selective will need to bring back a lot of data to the client.

  6. @Daniel: Yes that is an interesting scenario. It would violate the principle that a scanner started after some data exists in HBase the scanner is guaranteed to see it. That would not be the case here as the a (logical) scan from the client could miss the maintable entry in the scenario you describe.
    One could retrieve the main row as of the TS of the index entry that matches, but that is not generally what you want (other columns could have changed after the indexed column, and you'd want those later changes).
    Or one could retrieve multiple versions of the main row and check that there is one version with a TS that matches the TS of the index entry.

  7. Lars
    bq.For a Put, first retrieve a local timestamp, apply all changes to the index table(s) first using that timestamp. Then apply the changes to the main table. Using the same timestamp.

    What if the Put kvs were having some time stamps? And it can be different also. Even I can make a Put with 2 versions for a cell. Just telling u may need to consider these also.