Wednesday, April 18, 2012

(timestamp-) Consistent backups in HBase

The topic consistent backups in HBase comes up every now and then.
In this article I will outline a scheme that does provide timestamp-consistent backups.

Consistent backups are possible in HBase. With "consistent" I mean "consistent as of a specific timestamp" (this is limited by the HBase timestamp granularity.)

Over the past few months I have contributed various changes to HBase that now hopefully lead full circle to a more coherent story for "system of record" type data retention.

The basic setup is simple. You'll need
(HBase 0.94 will have all the relevant patches once it is releases - which should be any time now)

HBase's built-in Export tool can then be used to generate consistent snapshots.

Normally the data collected by an Export job is "smeared" over the time interval it takes to execute the job; an Export-Scan sees a row (cells of a row to be precise) as of the time when it happens to get to it, and rows can change while the Export is running.

That is problematic, because it is not possible to recreate a consistent view of the database from export generated this way.

The trick then is to keep all data in HBase long enough for the backup job to finish, and to only collect information before the start time of the export job.

Say the backup takes no more than T to complete (yes, it is hard to know ahead of time how long an Export is going to run). In that case the table's column families can be setup as follows:
  • set TTL to T + some headroom (so maybe 2T to be safe)
  • set VERSIONS to a very large number, (max int = 2147483647 for example, i.e. nothing is evicted from HBase due to VERSIONS)
  • set MIN_VERSIONS to how many versions you want to keep around, otherwise all versions could be removed if their TTL expired
  • set KEEP_DELETED_CELLS to true (this makes sure that deleted cell and delete markers are kept until they expire by the TTL)
In the shell (setting TTL to two hours):
create <table>, {NAME=><columnfamily>, VERSIONS=>2147483647, TTL=>7200, MIN_VERSIONS=>1, KEEP_DELETED_CELLS=>true}

Export can now we used as follows:
  • set versions to 2147483647 (i.e. all versions)
  • set startTime to -2147483648 (min int, i.e. everything is guaranteed to be included)
  • set endTime to the start time of the Export (i.e. the current time when the export starts. This is the crucial part)
  • enable support for deleted cell
hbase org.apache.hadoop.hbase.mapreduce.Export
-D hbase.mapreduce.include.deleted.rows=true
<tablename> <outputdir>
2147483647 -2147483648 <now in ms>

As long as the Export finishes within 2T, a consistent snapshot as of the time the Export was started is created. Otherwise some data might be missing, as it could have been compacted away before the Export had a chance to see it.

Since the backups also copied deleted rows and delete markers, a backup restored to an HBase instance can be queried using a time range (see Scan) to retrieve the state of the data at any arbitrary time.

Export is current limited to a single table, but given enough storage in your live cluster this can be extended to multiple table Exports, simply by setting the endTime of all Exports jobs to the start time of the first job.

This same trick can also be used for incremental backups. In that case the TTL has to be large enough to cover the interval between incremental backups.
If, for example, the incremental backups frequency is daily, the TTL above can be set to 2 days (TTL=>172800). Then use Export again:

hbase org.apache.hadoop.hbase.mapreduce.Export
-D hbase.mapreduce.include.deleted.rows=true
<tablename> <outputdir>
2147483647 <time of last backup in ms> <now in ms>

The longer TTL guarantees that there will be no gaps that are not covered by the incremental backups.

An example:
  1. A Put (p1) happens at T1
  2. Full backup starts at T2, time interval [0, T2)
  3. Another Put (p2) at T3
  4. full backup jobs finishes
  5. A Delete happens at T4 
  6. Incremental backup starts at T5, time interval [T2, T5)
  7. Yet another Put (p3)
  8. Incremental backup finishes
Note that in this scenario is does not matter when the backup jobs finish.

The full backup contains only p1. The incremental backup contains p2 and the Delete. p3 is not included in any backup, yet.
The state at T2 (p1) and T5 (p1, p2, delete) can be directly restored. Using time range Scans or Gets the state as of T4 and T3 can also be retrieved, once both backups have been restored into the same HBase instance (you need HBASE-4536 for this to work correctly with Deletes).

Finally, if keeping enough data to cover the time between two incremental backups in the live HBase cluster is problematic for your organization, it is also possible to archive HBase's Write Ahead Logs (WAL) and then replay with the built-in WALPlayer (HBASE-5604), but that is for another post.