Monday, July 29, 2013

HBase and data locality

By Lars Hofhansl

A quick note about data locality.

In distributed systems data locality is an important property. Data locality, here, refers to the ability to move the computation close to where the data is.

This is one of the key concepts of MapReduce in Hadoop. The distributed file system provides the framework with locality information, which is then used to split the computation into chunks with maximum locality.

When data is written in HDFS, one copy is written locally, another is written to another node in a different rack (if possible) and a third copy is written to another node in the same rack. For all practical purposes the two extra copies are written to random nodes in the cluster.

In typical HBase setups a RegionServer is co-located with an HDFS DataNode on the same physical machine. Thus every write is written locally and then to the two nodes as mentioned above. As long the regions are not moved between RegionServers there is good data locality: A RegionServer can serve most reads just from the local disk (and cache), provided short circuit reads are enabled (see HDFS-2246 and HDFS-347).

When regions are moved some data locality is lost and the RegionServers in question need to request the data over the network from remote DataNodes, until the data is rewritten locally (for example by running some compactions).

There are four basic events that can potentially destroy data locality:
  1. The HBase balancer decides to move a region to balance data sizes across RegionServers.
  2. A RegionServer dies. All its regions need to be relocated to another server.
  3. A table is disable and re-enabled.
  4. A cluster is stopped and restarted.
Case #1 is a trade off. Either a RegionServer hosts an over-proportional large portion of the data, or some of the data loses its locality.

In case #2 there is not much choice. That RegionServer and likely its machine died, so the regions need to be migrated to somewhere else where not all data is local (remember the "random" writes to the two other data nodes).

In case #4 HBase reads the previous locality information from the .META. table (whose regions - currently only one - are assigned to an RegionServer first), and then attempts to the assign the regions of all tables to the same RegionServers (fixed with HBASE-4402).

In case #3 the regions are reassigned in a round-robin fashion...

Wait what? If the a table is simply disabled and then re-enabled its regions are just randomly assigned?

Yep. In a large enough cluster you'll end up with close to 0% data locality, which means all data for all requests is pulled remotely from another machine. In some setups that means an outage, a time of unacceptably slow read performance.

This was brought up on the mailing lists again this week and we're fixing the old issue now (HBASE-6143, HBASE-9080).

2 comments:

  1. You foregot about 5'th "data locality destroyer" case: HDFS balancer started. In large systems, you can't live long without balancing, which effectively shuffles blocks killing hbase locality.

    There is solution exists in https://issues.apache.org/jira/browse/HDFS-4420, but it's not in trunk so far.

    ReplyDelete
  2. Thanks @id. Very true.
    HBase distributes regions by size. Through compactions, and the fact that one block is written locally, the blocks should end up mostly distributed and mostly local. So this should in theory work without the HDFS balancer.
    I'd need more empirical data on it.

    ReplyDelete