I will tackle backups in another blog post; here I will write about replication.
Since I prefer to use built-in features rather than rolling our own internal add-ons, over the past few months I have added various features to HBase that I believe are needed for DR with HBase. These are:
- HBASE-4071 minimum versions with TTL
- HBASE-4536 option to keep deleted cells and markers
- HBASE-2195 cyclic replication (master <-> master is a special case of that)
- HBASE-2196 replication to multiple slave clusters
Disaster here means:
- The data center sinks into the ocean.
Note that replication is asynchronous. If an entire data center - along with all local backups - is lost, there will be a window during which some data is lost.
- The customer deletes some data by accident (this requires performing and confirming a soft deletion, then performing and confirming a hard deletion; still, it happens and has happened).
- The application software deletes some data by accident.
Replication can help as follows:
- Set up the main cluster as usual.
- Set up an identical slave cluster (same tables, same config).
- Enable replication between the two.
Or maybe you want to be able to restore the state of (say) the past 48 hours without expensive and time-consuming restores from (gasp) tape:
- Set up the main cluster as usual.
- Set up a slave in the same way, but set the TTL for all tables to 48h and minimum versions to 1 (HBASE-4071), along with a reasonably large setting for maximum versions (MAX_INT, to be precise).
- Enable replication.
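To make the effect of combining a TTL with minimum versions more concrete, here is a minimal sketch in plain Java. It is a toy model, not HBase code: the class `RetentionSketch` and the method `surviving` are hypothetical names, and the rule they encode (a version is kept if it is younger than the TTL or is among the newest `minVersions` versions) is my simplified reading of HBASE-4071.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of TTL + minimum-versions retention for one cell; not HBase code. */
final class RetentionSketch {

    /**
     * Returns the timestamps (in ms) of the versions that survive cleanup, newest first.
     * A version survives if it is younger than the TTL, or if it is among the
     * newest minVersions versions. At most maxVersions versions are kept overall.
     */
    static List<Long> surviving(List<Long> timestamps, long now, long ttlMs,
                                int minVersions, int maxVersions) {
        List<Long> newestFirst = new ArrayList<>(timestamps);
        newestFirst.sort(Comparator.reverseOrder());
        List<Long> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size() && kept.size() < maxVersions; i++) {
            long ts = newestFirst.get(i);
            boolean expired = now - ts > ttlMs;
            // Expired versions are still kept while fewer than minVersions are retained.
            if (!expired || i < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long now = 100 * hour;
        // Versions written 1h, 30h and 60h ago; the 60h-old one is past the 48h TTL.
        List<Long> active = List.of(now - hour, now - 30 * hour, now - 60 * hour);
        System.out.println(surviving(active, now, 48 * hour, 1, Integer.MAX_VALUE));
        // A cell last written 60h and 90h ago: both are expired, the newest is kept anyway.
        List<Long> idle = List.of(now - 60 * hour, now - 90 * hour);
        System.out.println(surviving(idle, now, 48 * hour, 1, Integer.MAX_VALUE));
    }
}
```

The point of minimum versions is the second case: a cell that has not been written for more than 48 hours keeps its latest value on the slave instead of disappearing entirely, while all intermediate states within the 48h window remain available.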
Or say you want to guard against accidental deletes as well:
- Set up the main cluster as usual.
- As before, set up TTL and min/max versions on the slave cluster's tables.
- Also enable keep deleted cells (see HBASE-4536) on all slave tables.
- Enable replication.
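The idea behind keeping deleted cells and their markers can be sketched with another toy model in plain Java. Again, this is not HBase code: `KeepDeletedSketch` and its methods are hypothetical, the "as of" read uses an inclusive upper bound for simplicity, and the only difference modelled between the two settings is what a major compaction throws away. Real HBase read paths and time-range semantics are more involved.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of one cell's history with delete markers; not HBase code. */
final class KeepDeletedSketch {
    private final TreeMap<Long, String> versions = new TreeMap<>();
    private final TreeSet<Long> deleteMarkers = new TreeSet<>();

    void put(long ts, String value) { versions.put(ts, value); }

    /** Records a marker that masks every version with a timestamp <= ts. */
    void delete(long ts) { deleteMarkers.add(ts); }

    /**
     * Simulates a major compaction. Without keepDeletedCells, masked versions
     * and all markers are physically dropped; with it, everything is retained.
     */
    void majorCompact(boolean keepDeletedCells) {
        if (keepDeletedCells || deleteMarkers.isEmpty()) return;
        versions.headMap(deleteMarkers.last(), true).clear();
        deleteMarkers.clear();
    }

    /**
     * Returns the newest value visible as of asOf (inclusive), or null.
     * Only markers at or before asOf are considered, so a read as of a point
     * in time before the delete still sees the data, if it was retained.
     */
    String readAsOf(long asOf) {
        Map.Entry<Long, String> newest = versions.floorEntry(asOf);
        if (newest == null) return null;
        Long marker = deleteMarkers.floor(asOf);
        boolean masked = marker != null && marker >= newest.getKey();
        return masked ? null : newest.getValue();
    }

    public static void main(String[] args) {
        KeepDeletedSketch slave = new KeepDeletedSketch();
        slave.put(10, "v1");
        slave.delete(20);          // the accidental delete, replicated to the slave
        slave.majorCompact(true);  // keep deleted cells is enabled
        System.out.println(slave.readAsOf(30)); // current view: the data is gone
        System.out.println(slave.readAsOf(15)); // view before the delete: still there
    }
}
```

In this model the slave behaves normally for current reads, but the state just before the accidental delete remains readable until the TTL and version limits eventually expire it.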
Maybe you even want multiple slave clusters: one to take over in case of failure, one to guard against accidental deletes, one with many cheap disks, etc.
Nice Post Lars :)
I am also interested in the different replication policies of HBase. In particular, I am looking forward to doing part of my PhD work by implementing new replication schemes for NoSQL systems like HBase. Could you kindly guide me through a few initial steps for getting started with the HBase replication mechanism (I mean the actual code, documentation, various deployment scenarios, etc.)? I am very new to HBase, Hadoop, HDFS and Zookeeper, so it would also be great if you could point out a few good end-to-end deployment guides as well :) Thanks again for the nice post. Now heading towards your other posts :)
Hi Joarder,
glad it was useful.
You can find some information here:
http://hbase.apache.org/replication.html
and here:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements
The best way to start is probably by looking at the code.
You could start with ReplicationSource.java and ReplicationSink.java and work your way out from there.