Thursday, December 29, 2011

Introduction to HBase

Sometimes people ask me: "What is HBase?" It's hard to give a concise answer.

There is a lot of information about HBase, but I have not been able to find a good and short introduction to HBase, yet. I find them either to short or too long.

So here's my attempt at YAITH (yet another introduction to HBase):

Definition
HBase is a key/value store. Specifically it is a
Sparse, Consistent, Distributed, Multidimensional, Sorted map.

Sounds extremely fancy... Let's explore what this means.

Map
HBase maintains maps of Keys to Values (key -> value). Each of these mappings is called a "KeyValue" or a "Cell". You can find a value by its key... That's it.

Sorted
These cells are sorted by the key. This is a very important property as it allows for searching ("give me all values for which the key is between X and Y"), rather than just retrieving a value for a known key.

Multidimensional
The key itself has structure. Each key consists of the following parts:
row-key, column family, column, and time-stamp.
So the mapping is actually:
(rowkey, column family, column, timestamp) -> value
rowkey and value are just bytes (column family needs to be printable), so you can store anything that you can serialize into a byte[] into a cell.

Sparse
This follows from the fact the HBase stores key -> value mappings and that a "row" is nothing more than a grouping of these mappings (identified by the rowkey mentioned above).
Unlike NULL in most relational databases, no storage is needed for absent information, there will be just no cell for a column that does not have any value.
It also means that every value carries all its coordinates with it.

Distributed
One key feature of HBase is that the data can be spread over 100s or 1000s of machines and reach billions of cells. HBase manages the load balancing automatically.

Consistent 
HBase makes two guarantees:
All changes the with the same rowkey (see Multidimensional above) are atomic. A reader will always read the last written (and committed) values.

Storage
HBase partitions the key space. Each partition is called a Table. Each table declares one or more column families. Column families define the storage properties for an arbitrary set of columns. Columns are not declared, they are essentially just an additional label for the value.

The principle operations supported by HBase are Put (add some data), Delete ("delete" some data), Scan (retrieve some cells), Get (which is just a special case of Scan). No queries, no secondary indexes...

That's it in a nutshell. If you looked for a quick summary you can stop here.
Below I will go into a bit more detail and list some implications stemming from the architecture.

Details
Let's drill deeper into the multidimensional aspect of HBase. As mentioned above HBase maps (rowkey, column family, column, timestamp) to a value.

The rowkey is defined by the application. As the combined key is prefixed by the rowkey this allows the application to define the desired sort order. Defining the right sort order is extremely important as scanning is the only way retrieve any value for which the key is not known a priori.

The rowkey also provides a logical grouping of cells; and HBase ensures that all cells with the same rowkey are co-located on the same server (called a RegionServer in HBase), which allows for ACID guarantees for updates with the same rowkey without complicated and slow two-phase-commit or paxos.

Column families are declared when a table is created. They define storage attributes such as compression, number of versions to maintain, time to live, and minimum number of versions - among others.

Columns are arbitrary names (or labels) assigned by the application.

The timestamp is a long identifying (by default) the creation time of the of the cell. Each cell (as opposed to row) is versioned, which makes it interesting to reason about consistency and ACID guarantees (more on that later). No data is ever overwritten or changed in place, instead every "update" creates a new version of the affected set of cells.

You should not think of a row like this:
(key, column value 1, column value 2, NULL, column value 4, ...)
But rather like this:
(rowkey, column family, column1, timestamp) -> column value 1
(rowkey, column family, column2, timestamp) -> column value 2
(rowkey, column family, column4, timestamp) -> column value 4
...

So what about consistency?

HBase ensures that all new versions created by single Put operation for a particular rowkey are either all seen by other clients or seen by none.
Furthermore a Get or Scan will only return a combination of versions of cells for a row that existed together at some point. This ensures that no client will ever see a partially completed update or delete.
HBase achieves this by using a variation of Multi Version Currency Control (mvcc).

What about storage size?

As said above, HBase stores cells. Cells are grouped by a rowkey into something that looks like a row. Since cells are stored individually the storage is sparse.

If your data is naturally sparse (i.e. your model has many columns, but each row only has a values for a few of those) that works great, there is no space wasted for for those columns for which a row has no value.

If your model is dense on the other hand (i.e. all columns have a value in each row) a lot of space it wasted as each column stores its full coordinates (which maybe many times larger than the actual column value).

Since coordinates are very repetitive, the supported compression mitigates the wasted space to some extent.

Of course, since the value of a cell is just an array of bytes, many dense columns can be serialized and stored into a single value of a single cell.

What about querying?

You can only Scan a range of cells or Get a set of cells.
A scan typically defines a start-rowkey and stop-rowkey (otherwise it would be a full table scan, with a runtime proportional to the size of the table).

In stores like HBase the data is typically denormalized at the time of writing, rather than sliced and diced at reading time.

How is the data physically stored?

Puts and Deletes are collected into an in-memory structure called the MemStore. Before the MemStore is update the changes are written to a Write Ahead Log (WAL) to enable recovery in case a server crashes.
When it reaches a certain size the MemStore is flushed to disk into StoreFile.

Periodically StoreFiles are compacted into fewer StoreFiles.

For reading and writing HBase employs Log Structured Mergetrees, which is just a fancy way of saying that reading and compacting in HBase is performing a merge sort (a scan looks at the heads of all StoreFiles and the Memstore and picks the smallest element first, in case of a Scan it is returned to the client, in case of a compacted it is written to the new StoreFile).

There is a lot more to say, but I set out saying that this will be short.


More information can be found in the HBase Book.

8 comments:

  1. Thank you Mohammad and Syed.
    If you have other topics of interest, please let me know.

    ReplyDelete
  2. What do i need to know to implement it in my website which is currently on MySQL...?? I just can't figure out how to implement this in a website. :(

    ReplyDelete
  3. HBase ensures that all cells with the same rowkey are co-located on the same server ?

    Does it mean if i has millions of column versions it will be stored on the same server ?

    ReplyDelete
  4. Edlebug, yes that is correct. That is the price to pay for ACID.

    ReplyDelete
  5. Thanks a lot, its really informative..

    ReplyDelete
  6. Lars Thanks for informative introduction, Can you define Region Server?

    ReplyDelete