There has been a lot of discussion about secondary indexes in HBase recently.
Since HBase is a distributed system there are many naive ways to do this, but it is not trivial to get right. HBase is a consistent key-value store, so you would expect the same consistency guarantees from your indexes.
The choices mostly boil down to two solutions:
- "Local" (Region level) indexes. These are good for low latency (everything is handled at locally at a server). The trade off is that it is impossible to ask globally where the row matching an index is located. Instead all regions have to be consulted.
- Global indexes. These can be used for global index queries, but involve multi server updates for index maintenance.
It turns out a correct solution can be implemented relatively simply, without any changes to HBase.
First off, there are three important observations when talking about index updates:
- An index update is like a transaction
- An index update transaction is idempotent
- With a bit of discipline, partial index transactions can be detected at runtime. I.e. atomicity can be guaranteed at read-time.
The index table would be keyed by <columnValue>|<rowkey>. The <rowkey> is necessary to keep the index entries unique.
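To make the key layout concrete, here is a minimal sketch of how such an index key could be built, assuming HBase's Bytes utility and a "|" separator; the helper name is hypothetical:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class IndexKeys {
    private static final byte[] SEP = Bytes.toBytes("|");

    // Builds <columnValue>|<rowkey>. The rowkey suffix keeps entries
    // for rows sharing the same column value unique.
    public static byte[] indexKey(byte[] columnValue, byte[] rowKey) {
        return Bytes.add(columnValue, SEP, rowKey);
    }
}
```

(This assumes the separator does not occur inside the column values themselves.)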
This simple approach breaks down in the face of failures. Say a client writes to both the main table and the index table, but fails before one of the changes could be applied. Now we may have the main table update without the index update, or vice versa.
Notice that it is OK to have a stale index entry (a false index match, which can be filtered at read-time), but not a main table entry that is missing from the index (that row could never be found via the index).
So here is a solution for consistent secondary indexes in HBase:
- Update: If using TTL, set up the main table to expire slightly before the index table(s).
- For a Put, first retrieve a local timestamp, then apply all changes to the index table(s) using that timestamp, and then apply the changes to the main table, using the same timestamp.
The timestamp can safely be created locally (with the same caveat that generally applies to HBase: updates to the same KV in the same millisecond lead to more or less undefined results).
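For illustration, here is a minimal sketch of this write path, assuming the HBase client API (1.0 or later), a main table "main", an index table "idx", a single column family "d", and an indexed qualifier "col" - all names are illustrative, and error handling is omitted:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedPut {
    static final byte[] FAM = Bytes.toBytes("d");
    static final byte[] COL = Bytes.toBytes("col");
    static final byte[] SEP = Bytes.toBytes("|");

    public static void put(Connection conn, byte[] row, byte[] value) throws Exception {
        // One locally created timestamp, shared by both writes.
        long ts = System.currentTimeMillis();

        try (Table idx = conn.getTable(TableName.valueOf("idx"));
             Table main = conn.getTable(TableName.valueOf("main"))) {
            // 1. Index first, keyed by <columnValue>|<rowkey>, at ts.
            Put idxPut = new Put(Bytes.add(value, SEP, row));
            idxPut.addColumn(FAM, COL, ts, new byte[0]); // no payload needed
            idx.put(idxPut);

            // 2. Main table second, with the identical timestamp.
            Put mainPut = new Put(row);
            mainPut.addColumn(FAM, COL, ts, value);
            main.put(mainPut);
        }
    }
}
```

Writing the index first means a failure in between leaves at worst an extra index entry, which - as noted above - can be tolerated.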
- Upon delete, read back the row, then delete the main table entry, then remove all index entries (using the column values and timestamps read back from the row, and using the same delete type - family, column, or version - that we used for the main table). Since we used the same timestamps this works correctly.
As Daniel Gómez Ferro points out, this in fact can only work for version deletes, i.e. when only a specific version of a row is deleted. In order to delete all versions of a column of a row (a column delete marker) or all versions of all columns (a family delete marker), all of these versions would need to be enumerated in the client and deleted one by one.
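Under the same assumptions as above, the delete path could look like this sketch - one version delete per version read back, per the caveat just mentioned (readAllVersions requires the HBase 2.x client; use setMaxVersions on older versions):

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedDelete {
    static final byte[] FAM = Bytes.toBytes("d");
    static final byte[] COL = Bytes.toBytes("col");
    static final byte[] SEP = Bytes.toBytes("|");

    public static void delete(Connection conn, byte[] row) throws Exception {
        try (Table main = conn.getTable(TableName.valueOf("main"));
             Table idx = conn.getTable(TableName.valueOf("idx"))) {
            // 1. Read back all versions of the indexed column.
            Get get = new Get(row).addColumn(FAM, COL).readAllVersions();
            Result r = main.get(get);

            for (Cell cell : r.rawCells()) {
                long ts = cell.getTimestamp();
                byte[] value = CellUtil.cloneValue(cell);

                // 2. Version delete on the main table at exactly ts.
                main.delete(new Delete(row).addColumn(FAM, COL, ts));

                // 3. Matching version delete on the index entry.
                idx.delete(new Delete(Bytes.add(value, SEP, row)).addColumn(FAM, COL, ts));
            }
        }
    }
}
```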
- Now, upon reading along an index, we find the rows that the index entries point to. When we retrieve the rows we have to verify that each row we read has the expected timestamp for the index column. If not, we simply ignore that row/index entry. It must be from a partial index update - either from a concurrent update that is still in progress, or from a partially failed update.
This is how we ensure atomicity at read-time; or in other words, we can tolerate a slightly wrong index without affecting correctness.
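A sketch of such an index read, again under the assumed schema from above:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedRead {
    static final byte[] FAM = Bytes.toBytes("d");
    static final byte[] COL = Bytes.toBytes("col");
    static final byte[] SEP = Bytes.toBytes("|");

    public static List<Result> lookup(Connection conn, byte[] value) throws Exception {
        List<Result> rows = new ArrayList<>();
        byte[] prefix = Bytes.add(value, SEP);
        try (Table idx = conn.getTable(TableName.valueOf("idx"));
             Table main = conn.getTable(TableName.valueOf("main"))) {
            Scan scan = new Scan();
            scan.setRowPrefixFilter(prefix); // all index entries for this value
            try (ResultScanner scanner = idx.getScanner(scan)) {
                for (Result entry : scanner) {
                    // Recover <rowkey> from <columnValue>|<rowkey>.
                    byte[] indexRow = entry.getRow();
                    byte[] row = Bytes.copy(indexRow, prefix.length,
                                            indexRow.length - prefix.length);
                    long idxTs = entry.getColumnLatestCell(FAM, COL).getTimestamp();

                    // Verify: accept the row only if its timestamp matches the
                    // index entry; otherwise it is a partial or in-flight
                    // update and is skipped.
                    Result r = main.get(new Get(row).addColumn(FAM, COL));
                    Cell c = r.getColumnLatestCell(FAM, COL);
                    if (c != null && c.getTimestamp() == idxTs) {
                        rows.add(r);
                    }
                }
            }
        }
        return rows;
    }
}
```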
That's it. No server-side code needed. No locking. And this will scale nicely. Just a little bit of discipline - discipline that could easily be encapsulated in a client library.
Since partial failures will be rare, and we can filter them out at read-time, we need not even clean up the index.
Because index and main table entries share timestamps, we can also still perform correct time range queries.
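For example, a time-restricted index scan could be set up like this (again a sketch; since index and main entries share timestamps, the same range can be applied to the subsequent main table reads):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeIndexScan {
    // Restricts an index scan for one value to entries written
    // within [minTs, maxTs).
    public static Scan indexScan(byte[] value, long minTs, long maxTs) throws IOException {
        Scan scan = new Scan();
        scan.setRowPrefixFilter(Bytes.add(value, Bytes.toBytes("|")));
        scan.setTimeRange(minTs, maxTs);
        return scan;
    }
}
```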