HBase periodically cleans up old or expired versions when the memstore is flushed to disk (see HBASE-4241) and during periodic minor and major compactions. I will refer to all three events as "compaction" below.
HBase has two principal knobs to declare how much data you would like to retain.
- Number of versions
When this number of versions (for a cell!) is reached, older versions will be deleted during the next compaction.
- Time To Live (TTL)
When cells are older than the TTL, they will be removed during the next compaction.
Now, what happens to cells that are in principle expired or beyond the maximum number of versions before HBase had a chance to collect them?
The answer is that in that case any Scan or Get issued by a client will automatically filter out these cells. For all practical purposes they are gone.
Another interesting question is: What happens when TTL is enabled for a column family and the last version of a cell expires?
In that case that last version is also deleted, leaving no version of the cell in question.
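The behavior described so far can be sketched as a small model. This is not HBase code: the class `BasicRetentionSketch`, the method `visibleVersions`, and its parameters are hypothetical names chosen for illustration, and the logic is a simplified reading of the rules above (newest versions first, stop at the version limit or at the first expired version).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of the two retention knobs for a single cell.
// Not HBase code; all names and signatures here are illustrative only.
class BasicRetentionSketch {

    /**
     * Returns the version timestamps (newest first) that a client would
     * still see, given a maximum number of versions and a TTL in milliseconds.
     */
    static List<Long> visibleVersions(List<Long> timestamps, int maxVersions,
                                      long ttlMillis, long now) {
        List<Long> sorted = new ArrayList<>(timestamps);
        sorted.sort(Comparator.reverseOrder()); // newest version first

        List<Long> visible = new ArrayList<>();
        for (long ts : sorted) {
            if (visible.size() >= maxVersions) {
                break; // beyond the maximum number of versions
            }
            if (now - ts > ttlMillis) {
                break; // this version and all older ones are expired
            }
            visible.add(ts);
        }
        return visible;
    }

    public static void main(String[] args) {
        long now = 10_000L;
        // Three versions, all within a 1000 ms TTL, but at most 2 versions kept.
        System.out.println(
            visibleVersions(List.of(9_900L, 9_500L, 9_200L), 2, 1_000L, now));
        // Every version is older than the TTL, so no version remains.
        System.out.println(
            visibleVersions(List.of(8_000L, 7_000L), 2, 1_000L, now));
    }
}
```

The second call in `main` models the case where the last version of a cell expires: the result is an empty list.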
For certain backup scenarios it would be useful to set a TTL, but still keep at least a certain number of versions around. That way, for cells within the TTL range one can fully restore any previous state (provided enough versions are stored), and at the same time it is always possible to restore the last N versions.
HBASE-4071 provides such a feature. It adds the ability to declare a minimum number of versions to keep.
Together with HBASE-4536 (described here), it is possible to design a fairly elaborate data retention policy for primary and replicated HBase stores.
For example it is possible to say:
- expire cells after one week
- keep at least two versions around
- but not more than 100 versions
- (with HBASE-4536) also keep deleted cells until they expire
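To make the interplay of the first three rules concrete, here is a minimal sketch of such a policy applied to the versions of a single cell. It does not call the HBase API: `RetentionPolicySketch` and its members are hypothetical names, the model is my simplified reading of the rules (an expired version is still kept while fewer than the minimum number of versions has been retained), and the handling of deleted cells from HBASE-4536 is not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Simplified model of TTL + minimum versions + maximum versions.
// Not HBase code; all names here are illustrative only.
class RetentionPolicySketch {
    final long ttlMillis;
    final int minVersions;
    final int maxVersions;

    RetentionPolicySketch(long ttlMillis, int minVersions, int maxVersions) {
        this.ttlMillis = ttlMillis;
        this.minVersions = minVersions;
        this.maxVersions = maxVersions;
    }

    /** Returns the version timestamps (newest first) that the policy retains. */
    List<Long> retained(List<Long> timestamps, long now) {
        List<Long> sorted = new ArrayList<>(timestamps);
        sorted.sort(Comparator.reverseOrder()); // newest version first

        List<Long> kept = new ArrayList<>();
        for (long ts : sorted) {
            if (kept.size() >= maxVersions) {
                break; // never keep more than maxVersions
            }
            boolean expired = now - ts > ttlMillis;
            if (expired && kept.size() >= minVersions) {
                break; // expired, and the minimum is already satisfied
            }
            kept.add(ts);
        }
        return kept;
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        long now = 100 * day;
        // Expire after one week, keep at least 2 and at most 100 versions.
        RetentionPolicySketch policy =
            new RetentionPolicySketch(7 * day, 2, 100);
        // Versions written 1, 8, 9 and 30 days ago.
        List<Long> versions =
            List.of(now - day, now - 8 * day, now - 9 * day, now - 30 * day);
        // The 1-day-old version is live; the 8-day-old one is expired but
        // kept to satisfy the minimum of two versions; the rest are dropped.
        System.out.println(policy.retained(versions, now).size());
    }
}
```

With a minimum of zero versions the same model reduces to plain TTL behavior, where the last version of a cell disappears once it expires.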