Wednesday, December 21, 2011

HBase data retention options

Every key value (or cell) in HBase is versioned. Data is never changed in place, but a new version is created for every change.

HBase periodically cleans up old or expired versions when the memstore is flushed to disk (see HBASE-4241) and during periodic minor and major compactions. I will refer to all three events as "compaction" below.

HBase has two principal knobs to declare how much data you would like to retain.
  1. Number of versions
    When this number of versions (for a cell!) is reached, older versions will be deleted during the next compaction.
  2. Time To Live (TTL)
    When cells are older than the TTL, they will be removed during the next compaction.
Using the setMaxVersions(...) and setTimeRange(...) methods on the Get and Scan objects allows an application to decide what version it would like to see.

Now, what happens to cells that are in principle expired or beyond the maximum number of versions before HBase had a chance to collect them?

The answer is that in that case any Scan or Get issued by a client will automatically filter these cells. For all practical purposes they are gone.

Another interesting question is: What happens when TTL is enabled for a column family and the last version of a cell expires?

In that case that last version is also deleted, leaving no version of the cell in question.
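The read-time behavior described above can be sketched as a small model. This is only an illustration in plain Java, not HBase's actual code path: the class `VisibleVersions`, the method `visible`, and the exact boundary rule (a cell counts as expired once its timestamp is smaller than `now - ttl`) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of which versions of ONE cell a Get or Scan would still return.
class VisibleVersions {

    /** Returns the timestamps that are still visible, newest first. */
    public static List<Long> visible(List<Long> timestamps, long now,
                                     long ttlMillis, int maxVersions) {
        List<Long> sorted = new ArrayList<>(timestamps);
        sorted.sort(Comparator.reverseOrder()); // newest version first
        List<Long> result = new ArrayList<>();
        for (long ts : sorted) {
            if (result.size() >= maxVersions) {
                break; // beyond the maximum number of versions
            }
            if (ts < now - ttlMillis) {
                break; // older than the TTL; every remaining version is older still
            }
            result.add(ts);
        }
        return result;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long ttl = 10_000L;
        // Three versions inside the TTL and one outside, with at most two versions kept.
        System.out.println(visible(List.of(999_000L, 998_000L, 997_000L, 900_000L), now, ttl, 2));
        // prints [999000, 998000]

        // All versions are older than the TTL: even the last one is filtered out.
        System.out.println(visible(List.of(900_000L, 800_000L), now, ttl, 2));
        // prints []
    }
}
```

The second call shows the case from the question above: once the last version expires, no version of the cell remains visible, whether or not a compaction has physically removed it yet.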

For certain backup scenarios it would be useful to set a TTL, but still keep at least a certain number of versions around. Then, for cells within the TTL range, one can fully restore any previous state (provided enough versions are stored), and at the same time it is always possible to restore the last N versions.

HBASE-4071 provides such a feature. It adds the ability to declare a minimum number of versions to keep.
Together with HBASE-4536 (described here), it is possible to design a fairly elaborate data retention policy for primary and replicated HBase stores.

For example it is possible to say:
  • expire cells after one week
  • keep at least two versions around
  • but not more than 100 versions
  • (with HBASE-4536) also keep deleted cells until they expire
Together with HBase replication, this can be used as an effective way to keep backups of historical data.
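To make the interplay of the three settings concrete, here is a minimal sketch in plain Java. It is a simplified model under stated assumptions, not HBase's implementation: the class `RetentionPolicy` and its `retain` method are hypothetical names, the rule "the newest minimum-versions entries survive the TTL, and the maximum still caps the total" is my reading of the policy above, and keeping deleted cells (HBASE-4536) is not modeled at all.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a per-cell retention policy: TTL plus min and max versions.
class RetentionPolicy {
    final long ttlMillis;
    final int minVersions;
    final int maxVersions;

    RetentionPolicy(long ttlMillis, int minVersions, int maxVersions) {
        if (minVersions < 0 || minVersions > maxVersions) {
            throw new IllegalArgumentException("need 0 <= minVersions <= maxVersions");
        }
        this.ttlMillis = ttlMillis;
        this.minVersions = minVersions;
        this.maxVersions = maxVersions;
    }

    /** Returns the version timestamps of one cell that the policy retains, newest first. */
    public List<Long> retain(List<Long> timestamps, long now) {
        List<Long> sorted = new ArrayList<>(timestamps);
        sorted.sort(Comparator.reverseOrder()); // newest version first
        List<Long> kept = new ArrayList<>();
        for (long ts : sorted) {
            if (kept.size() >= maxVersions) {
                break; // never keep more than maxVersions
            }
            boolean expired = ts < now - ttlMillis;
            if (expired && kept.size() >= minVersions) {
                break; // expired, and the minimum is already satisfied
            }
            kept.add(ts);
        }
        return kept;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 100 * day;
        // Expire after one week, keep at least two versions, but not more than 100.
        RetentionPolicy policy = new RetentionPolicy(7 * day, 2, 100);

        // One fresh version and three older than a week: the fresh one and the
        // newest expired one are kept, so two versions remain.
        List<Long> kept = policy.retain(
                List.of(now - day, now - 10 * day, now - 20 * day, now - 30 * day), now);
        System.out.println(kept.size() + " versions retained");
    }
}
```

With the minimum set to zero, the model falls back to the plain TTL behavior, where the last version of a cell disappears once it expires.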

2 comments:

  1. HBase row keys are always unique

  2. The question was this:
    "I have a question on hbase. Can a hbase table have a key which contains duplicate values. Is it possible to have more than one column as a key in hbase table."

    Everything in HBase is versioned, stamped with a timestamp. (row-key, column, timestamp) together make the key, and that is unique. But you can have multiple KeyValues with the same row-key and column, but a different timestamp.

    I am deleting the original question because it was link bait, linking to other content from a comment here.
