This post last week on secondary indexes caused a lot of discussion - which I find interesting, because this is (IMHO) one of the less interesting posts here.
It appears that I did not provide enough context to frame the discussion.
So... What do people means when they say "secondary index". Turns out many things:
- Reverse Lookup
- Range Scans
- Index Covered Queries
I also was not clear about the consistency guarantees provided.
The "ragged edge" of current time is always a thorny issue in distributed systems. Without a centralized clock (or other generator of monotonically increasing ids) the notion of "at the same time" is at best a fuzzy one (see also Google's Spanner paper and it's notion of current time being an interval bounded by clock precision, rather than a single point).
The idea of last week's indexing is consistent when the index is scanned as of a time in the past (as long at the time is far enough in the past to cover clock skew between servers - since all RegionServers must be installed with NTP, a few dozen milliseconds should be OK). Note you need my HBASE-4536 (0.94+) for this to be correct with deletes.
Queries for current results lead to interesting anomalies pointed out by Daniel (rephrased here):
- a client could read along the index
- it just passed index entry X
- another client changes the indexed column's value to Y and hence adds index entry Y
- the first client now checks back with the main row (whose latest version of the indexed column was changed, so X != Y and the timestamps won't match)
- the first client would conclude that the indexentry resulted from a partial update, and hence ignore it (to be consistent it should return either the new or the old value)
If that is not good enough it can be addressed in two ways:
- Instead of retrieving the latest versions of the KVs of the main row, retrieve all versions. If there is a version of the indexed column that matches the timestamp of the index entry, then use the versions of all other columns less then next version of the indexed columns (those are all the column values that existed before the index was updated to the new value). If there is no matching version of the indexed column then ignore, if the matching version is the latest version, then use the latest versions of all columns.
- Have one extra roundtrip, retrieve all version of the indexed column(s).
If no version of the indexed column matched the timestamp... ignore, if is it the latest version retrieve the rest of the main at the same timestamp used to the retrieve the indexed column (to avoid another client who could have added later rows). Or if the matching value is not latest one retrieve the rest of the row as of the time of the next value (as in #1)
That way this should be correct, and makes use of the atomicity guarantees provided by HBase.
The last part that was not mentioned last week is how we should encode the values. All column values to be indexed have be encoded in a way that lends itself to lexicographic sorting of bytes. This is easy for string (ignoring locale specific rules), but harder to integer and even floating point numbers.
The excellent Lily library has code to perform this kind of encoding.
(Incidentally it is also a complete index solution using write ahead logging - into HBase itself - to provide idempotent index transactions).
Lastly, it is also the simplest solution one can possibly get away with that requires no changes to HBase and no write ahead logging.
This all turned out to to be bit more complex than anticipated, so it should definitely be encapsulated in a client side library.