Sunday, October 14, 2012

Secondary Indexes (Part II)

By Lars Hofhansl

Last week's post on secondary indexes caused a lot of discussion - which I find interesting, because it is (IMHO) one of the less interesting posts here.

It appears that I did not provide enough context to frame the discussion.

So... what do people mean when they say "secondary index"? Turns out, many things:
  • Reverse Lookup
  • Range Scans
  • Sorting
  • Index Covered Queries
  • etc
The idea outlined in last week's post is good for global Reverse Lookup, partially useful for Range Scans (as long as the range is reasonably small), not very useful for Sorting (all KVs would have to be brought back to the client), and entirely useless for Index Covered Queries (the index might not contain the whole truth).

I also was not clear about the consistency guarantees provided.
The "ragged edge" of current time is always a thorny issue in distributed systems. Without a centralized clock (or other generator of monotonically increasing ids) the notion of "at the same time" is at best a fuzzy one (see also Google's Spanner paper and its notion of current time being an interval bounded by clock precision, rather than a single point).

The indexing idea from last week is consistent when the index is scanned as of a time in the past (as long as that time is far enough in the past to cover clock skew between servers - since all RegionServers should run NTP, a few dozen milliseconds should be OK). Note that you need my HBASE-4536 (0.94+) for this to be correct with deletes.

Queries for current results lead to interesting anomalies pointed out by Daniel (rephrased here):
  1. a client could read along the index
  2. it just passed index entry X
  3. another client changes the indexed column's value to Y and hence adds index entry Y
  4. the first client now checks back with the main row (whose latest version of the indexed column was changed, so X != Y and the timestamps won't match)
  5. the first client would conclude that the index entry resulted from a partial update, and hence ignore it (to be consistent it should return either the new or the old value)
That is mostly in line with HBase's other operations, with the difference that it won't return a value that did exist when the logical scan operation started.
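The anomaly can be made concrete with a small simulation. Everything below is hypothetical (not HBase API code): assume an index entry carries (indexed value, timestamp) pointing back at a main row that stores versioned cells newest-first. The naive verification step sketched here is exactly what makes the first client drop entry X:

```python
# Hypothetical data layout: a main row is {column: [(timestamp, value), ...]},
# with versions sorted newest-first, as HBase returns them.

def verify_index_entry(entry_value, entry_ts, main_row, indexed_col="col"):
    """Naive check: accept the index entry only if the *latest* version of
    the indexed column still matches both value and timestamp. If another
    client has since written Y, entry X fails this check and is ignored."""
    latest_ts, latest_value = main_row[indexed_col][0]
    return latest_value == entry_value and latest_ts == entry_ts

# Y (ts=200) was written after the scan already passed index entry X (ts=100):
main_row = {"col": [(200, "Y"), (100, "X")]}
verify_index_entry("X", 100, main_row)  # entry X is dropped, although it
                                        # existed when the scan started
```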

If that is not good enough it can be addressed in two ways:
  1. Instead of retrieving only the latest versions of the KVs of the main row, retrieve all versions. If there is a version of the indexed column that matches the timestamp of the index entry, use the versions of all other columns that are older than the next (newer) version of the indexed column - those are exactly the column values that existed before the index was updated to the new value. If there is no matching version of the indexed column, ignore the entry; if the matching version is the latest version, use the latest versions of all columns.
  2. Use one extra roundtrip: first retrieve all versions of the indexed column(s) only.
    If no version of the indexed column matches the timestamp, ignore the entry. If the matching version is the latest, retrieve the rest of the main row as of the same timestamp used to retrieve the indexed column (to guard against another client having added newer versions in the meantime). If the matching version is not the latest, retrieve the rest of the row as of the time of the next (newer) value, as in #1.
Whether #1 or #2 is more efficient depends on how many versions of all columns have to be retrieved vs. the extra roundtrip for just the indexed column(s).
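Technique #1 can be sketched as follows, using the same hypothetical newest-first versioned-row layout as above. This is just the snapshot-selection logic, not real HBase client code:

```python
def resolve_with_versions(entry_value, entry_ts, main_row, indexed_col):
    """Technique #1 (sketch): given all versions of the main row, return a
    consistent snapshot relative to the index entry, or None if stale."""
    versions = main_row[indexed_col]  # [(timestamp, value), ...], newest first
    match = next(((ts, v) for ts, v in versions
                  if ts == entry_ts and v == entry_value), None)
    if match is None:
        return None          # no matching version: stale index entry, ignore
    if match == versions[0]:
        upper = None         # matching version is latest: use latest of all
    else:
        # the next newer version of the indexed column bounds the snapshot
        upper = versions[versions.index(match) - 1][0]
    snapshot = {}
    for col, cells in main_row.items():
        for ts, v in cells:  # newest first; take first version below the bound
            if upper is None or ts < upper:
                snapshot[col] = v
                break
    return snapshot

row = {"c": [(200, "Y"), (100, "X")], "d": [(250, "b"), (50, "a")]}
resolve_with_versions("X", 100, row, "c")  # old entry: pre-update snapshot
resolve_with_versions("Z", 300, row, "c")  # no matching version: ignored
```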

That way this should be correct, and it makes use of the atomicity guarantees provided by HBase.

The last part that was not mentioned last week is how we should encode the values. All column values to be indexed have to be encoded in a way that lends itself to lexicographic sorting of their bytes. This is easy for strings (ignoring locale-specific rules), but harder for integers and even floating point numbers.
The excellent Lily library has code to perform this kind of encoding.
(Incidentally, it is also a complete index solution that uses write ahead logging - into HBase itself - to provide idempotent index transactions.)
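For illustration, here is a minimal sketch of this kind of order-preserving encoding for signed longs and IEEE-754 doubles. The bit manipulations are the standard tricks; the function names are mine, not Lily's API:

```python
import struct

def encode_long(n):
    """Order-preserving encoding of a signed 64-bit int: offset by 2^63
    (i.e. flip the sign bit) so negatives sort before positives when the
    bytes are compared unsigned."""
    return struct.pack(">Q", (n + (1 << 63)) % (1 << 64))

def encode_double(d):
    """Order-preserving encoding of a double: flip the sign bit for
    positives; invert all bits for negatives (their raw IEEE-754 bit
    patterns sort in reverse magnitude order)."""
    bits = struct.unpack(">Q", struct.pack(">d", d))[0]
    if bits & (1 << 63):
        bits ^= (1 << 64) - 1   # negative: invert all bits
    else:
        bits |= 1 << 63         # positive: set the sign bit
    return struct.pack(">Q", bits)

# Encoded byte strings sort in the same order as the original numbers:
vals = [-2.5, -1.0, 0.0, 1.5, 10.0]
assert [encode_double(v) for v in vals] == sorted(encode_double(v) for v in vals)
```

The sign-bit flip maps the signed range onto the unsigned byte order; doubles additionally need the full bit inversion for negatives so that larger magnitudes sort lower.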

Lastly, it is also the simplest solution one can possibly get away with that requires no changes to HBase and no write ahead logging.

This all turned out to be a bit more complex than anticipated, so it should definitely be encapsulated in a client-side library.


  1. Thanks a lot for the followup - the explanation makes it much easier to understand the revised approach.

    I agree the solutions you propose solve the problem of ignoring an entry due to an update, but I think the complementary problem still exists, which is returning both values. This could easily be solved on the client side by making sure you only keep the most up-to-date entry for each row.

    There are some corner cases, though (if I understood your revisions correctly). Let me put an example:

    1. row R has an indexed column with value X, and a client updates it to value Y
    2. a long time later, the same client starts reading the index
    3. it just passed index entry X
    4. since there is an old version in the main row with value X, we return it for the time being (techniques #1 and #2)
    5. another client deletes row R, and all index entries including X and Y
    6. the first client continues reading the index until the end, skipping Y which was deleted

    The problem here is that the client received a really old value X instead of the updated Y or nothing at all (if the delete was "ordered" before the scan)

    Do you think this is a problem? If not addressed, this implementation could return arbitrarily old values, again violating the contract of the scan operation.

    I'm sorry for being tedious, but I'm really interested in this topic. I think this could help HBase as a whole if it were clear which guarantees can be provided by default and which need an external transactional implementation.


  2. Thanks Daniel. I am not claiming to have all the answers :)
    It's good to discuss all the corner cases.

    The scenario you describe seems OK, though. The value (X) existed when the scan started, so it seems OK to return it. HBase never had scan consistency (or maybe I misunderstand?)

    Also if you do bulk scans, you'd scan all values along the index first, and then go back for the actual rows. In that case both X and Y would have been removed from the main table.

  3. I think I didn't explain my concern properly. Suppose there was a step 1b in my scenario, where the client does a read and receives the updated value Y. Afterwards it would receive the old value X, which doesn't make sense since it should receive the updated value again or nothing at all.

    But doing bulk scans as you suggest would solve it anyway. Thanks for the insight!