Friday, January 27, 2012

Scanning in HBase

In HBASE-5268 I proposed a "prefix delete marker" (a delete marker that would mark a set of columns identified by a column prefix as deleted).

As it turns out, I didn't quite think through how scanning/seeking in HBase works, especially when delete markers are involved. So I thought I'd write about that here.
At the end of the article I hope you will understand why column prefix delete marker that are sorted with the KeyValues that they affect cannot work in HBase.

This blog entry describes how deletion works in HBase. One interesting piece of information not mentioned there is that version and column delete markers are ordered in line together with the KeyValues that they affect and family delete markers are always sorted to the beginning of their row.

Generally each column family is represented by a Store, which manages one or more StoreFiles. Scanning is a form of merge-sort performed by a RegionScanner, which merges results of one or more StoreScanners (one per family), who in turn merge results of one or more StoreFileScanners (one for each file for this family):

                                                    /                     \
                               StoreScanner                      StoreScanner
                               /                   \                      /                  \
        StoreFileScanner   StoreFileScanner   StoreFileScanner  StoreFileScanner
                   |                             |                            |                          |
             StoreFile                  StoreFile                StoreFile              StoreFile

Say you performed the following actions (T is time):
  1. put:            row1, family, col1, value1, T
  2. delete family:  row1, family, T+1
  3. put:            row1, family, col1, value2, T+2
  4. delete columns: row1, family, col1, T+3
  5. put:            row1, family, col1, value3, T+4

What we will find in the StoreFile for "family" is this:
family-delete row1, T+1
row1,col1,value3, T+4
column-delete row1,col1, T+3
row1,col1,value2, T+2
row1,col1,value1, T

KeyValues are ordered in reverse chronological order (within their row and column). The family delete marker, however, is always first on the row. That makes sense, because family delete marker affects potentially many columns in this row, so in order to allow scanners to scan forward-only, the family delete markers need to be seen by a scanner first.

That also means that
  1. even if we are only looking for a specific column, we always seek to the beginning of the row to check if there is a family delete with a timestamp that is greater of equal to the versions of column that we are interested in. After that the scanner seeks to the column.
  2. even if we are looking for a past version of a column we have to seek to the "beginning" of the column (i.e. the potential delete marker right before it), before we can scan forward to the version we're interested in.
My initial patch for HBASE-5268 would sort the prefix delete markers just like column delete markers.

By now it should be obvious why this does not work.

The beginning of a row is a known point, so it the "beginning" of a column. The beginning of a prefix of a column is not. So to find out whether a column is marked for deletion we would have to start at the beginning of the row and then scan forward to find all prefix delete markers. That clearly is not efficient.

My 2nd attempt placed the all prefix delete markers at the beginning of the row. That technically works. But notice that a column delete marker only has to be retained by the scanner for a short period of time (until after we scanned past all versions that it affects). For prefix delete markers we'd have to keep them into memory until we scanned past all columns that start with the prefix. In addition the number of prefix delete markers for a row is not principally limited.
Family delete markers do not have this problem because (1) the number of column families is limited for other reasons and (2) store files are per family, so all we have to remember for a family in a StoreScanner is a timestamp.


  1. Hello Lars,

    I don't understand what actualy is a StoreFile? How this is related with a HFile?

  2. Hi Zaharie,

    yes, a StoreFile is also referred to as HFile.

  3. Hello again Lars,

    Thank you for the answer.

    I was confused about this sentence in the "HBase The Definitive Guide" book:

    "The cells of each row are stored one after the other, in a separate storage file per column family. In other words, on disk you will have all cells of one family in a StoreFile, and all cells of another in a different file"

    So the answer is that there are more HFiles(StoreFiles) per column family?