Saturday, October 20, 2012

Coprocessor access to HBase internals

By Lars Hofhansl

Most folks familiar with HBase have heard of coprocessors.
Coprocessors come in two flavors: Observers and Endpoints.

An Observer is similar to a database trigger, an Endpoint can be likened to a stored procedure.
This analogy only goes that far, though.

While triggers and stored procedures are (typically) sandboxed and expressed in a highlevel language (typically SQL with procedural extensions), coprocessors are written in Java and are designed to extend HBase directly (in the sense of avoiding subclassing the HRegionServer class in order to extend it). Code in a coprocessor will happily shutdown a region server by calling System.exit(...)!

On the other hand coprocessors are strangely limited. Before HBASE-6522 they had no access to a RegionServer's locks and leases and hence it was impossible to implement check-and-set type as a coprocessor (because the row modified would need to be locked), or to time out expensive server side data structures (via leases).
HBASE-6522 makes some trivial changes to remedy that.

It was also hard to maintain any kind of share state in coprocessors.
Keep in mind that region coprocessors are loaded per region and there might be 100's of regions for a given region server.

Static members won't work reliably, because coprocessor classes are loaded by special classloaders.

HBASE-6505 fixes that too. Now the RegionCoprocessorEnvironment provides a getSharedData() method, which returns a ConcurrentMap, which is held by the coprocessor environment as a weak reference (in a special map with strongly referenced keys and weakly referenced values), and held strongly by the environment that manages each coprocessor.
That way if the coprocessor is blacklisted (due to throwing an unexpected exception) the coprocessors environment is removed, and any shared data is immediately available for garbage collection, thus avoiding ugly and error prone reference counting (maybe this warrants a separate post).

This shared data is per coprocessor class and per regionserver. As long as there is at least one region observer or endpoint active this shared data is not garbage collected and can be accessed to share state between the remaining coprocessors of the same class.

These changes allow coprocessor to be used for a variety of use cases.
State can be shared across them, allowing coordination between many regions, for example for coordinated queries.
Row locks can be created and released - allowing for check-and-set type operations.
And leases can be used to safely expire expensive data structures or to time out locks among other uses.

I should also mention that RegionObservers already have access to a region's MVCC.

1 comment:

  1. hello, your HBase Technology blog is very informative and sharable blog.