Thursday, October 4, 2018

Apache HBase and Apache Phoenix, more on block encoding and compression

By Lars Hofhansl

2 1/2 years ago I blogged about HBase Compression vs Blockencoding.

Things have moved on since then. The Hadoop and HBase communities added new compression schemes like zSTD, and new block encoders like ROW_INDEX_V1.

zSTD promises to be fast and yield great compression ratios.
The ROW_INDEX_V1 encoder allows fast seeks into an HFile block by storing the offsets of all Cells in that block so a Cell can be found by binary search, instead of the usual linear search.

So let's do some tests. Like in the previous blog I'm setting up a single node HBase, on a single node HDFS, with a single node ZK.
(But note that these are different machines so do not compare these numbers to a two years old blog post)


Then I loaded 2^22 = 4194304 rows. Columns v1 and v2 are random values from [0,1).

Let's look at the size of the data. Remember this is just a single machine test. This is a qualitative test. We verified elsewhere that that this scales with adding machines.

112.1MB ROW_INDEX_v1 + GZ
556.9MB NONE
So far, nothing surprising:
  • The ROW_INDEX adds a bit more data (~3% for these small Cells).
  • zSTD offers the best compression, followed by GZ, and SNAPPY.
  • Uncompressed and unencoded HBase bloats the data a lot. 

Full scans:

Let's first do some scanning. First from the OS buffer cache (1), then from the HBase block cache (2).

(2) SELECT /*+ SERIAL */ COUNT(*) FROM <table>

I'm using the SERIAL hint in Phoenix in order to get consistent results independently of how Phoenix decides to parallelize the query (which is based on region sizes as well the current stats).

  • zSTD decompression rate is pretty good, closer to Snappy than to Gzip.
  • ROW_INDEX - as expected - does not help with full scans, but it also dos not seem hurt (variance was within the noise).
  • FAST_DIFF has the worst scan times, whether data is in the block cache or not. 

Where do we seek a lot with Phoenix?

ROW_INDEX_V1 helps with random seeks, but where do we do this in Phoenix. Some areas are obvious, some might surprise you:
  • SELECT on random PKs. Well Dah...
  • Reverse Scans
  • Scans through tables with lot of deleted rows (before they are compacted away)
  • Row lookup from uncovered indexes

Point Gets:

So let's start with random gets, just point queries like these. To avoid repeated meta-data lookup I re-configured the table with UPDATE_CACHE_FREQUENCY = NEVER. And then issued:

SELECT pk FROM <table> WHERE pk = <random pk>, 1000 times.

So now we see how ROW_INDEX_V1 helps, GETs are improved by 24% and remember that this is end-to-end (query planning/compilation, network overhead, and finally the HBase scan).
We also see the relative decompression cost the schemes are adding (the OS buffer cache case).

Yet, zSTD decompression, seems to be on par with FAST_DIFF, and almost as fast as SNAPPY, while providing vastly better compression ratios.

Also remember that default blocks are stored uncompressed in the block cache (so all ROW_INDEX block cache requests are the same performance).

Reverse Scans:

HBase offers reverse scans, Phoenix will automatically make use of those in queries like these:


(Everything is in the block cache here)

Reverse scanning involves a lot of seeking. For each Cell (i.e. each column in Phoenix) we need to seek the previous row, then skip forward again for the columns of that row. We see that ROW_INDEX_V1 is 2.5x faster than no encoding and over 6x faster compared to FAST_DIFF.

99.9% of cells deleted:

When HBase deletes data, it is not actually removed immediately but rather marked for deletion in the future by placing tombstones.

For this scenario I deleted 99.9% of the rows:
DELETE FROM <table> WHERE v1 < 0.999

Then issued:
SELECT /*+ SERIAL */ COUNT(*) FROM <table>

We see that ROW_INDEX_V1 helps with seeking past the delete markers. About a 24% improvement.

Now what about reverse scanning with lots of delete marker?

WOW... Reverse scanning with deleted Cells is somewhat of a worst case for HBase, we need to seek backwards Cell by Cell until we find the next non-deleted one.

ROW_INDEX_V1 makes a huge difference here. In fact without it, just scanning 100 row reverse is almost four times slower than scanning through all 4m rows plus almost 4m delete markers.

If you have lots of deletes in your data set, for example if it is churning set, you might want to switch the encoding to ROW_INDEX_V1.

Local Secondary (uncovered) Indexes:

Still 2^22 rows...

Now we have two decisions to make: (1) how to encode/compress the main column family, and (2) how to encode/compress the local index column family. In the interest of brevity I limited this to three cases:

170.9MB RI + zSTD, RI + zSTD

Let's first do a full scan:
SELECT /*+ SERIAL */ COUNT(*) FROM <table>

Looks like this is now purely driven by the size of the index, as expected, since Phoenix now scans along the index, rather than the main data.

Let's do some filtering.

SELECT /*+ SERIAL */ COUNT(*) FROM <table> WHERE v2 < p

Pretty much as expected. The index encoding is what counts, and differences are marginal.

Next: The indexes I created are uncovered, they include only the indexed column(s) but no other columns. So now when we include some columns not in the index Phoenix needs to retrieve those as well.

SELECT /*+ SERIAL */ COUNT(v1) FROM <table> WHERE v2 < p;

Again... Wow... We see a huge difference. Phoenix' default FAST_DIFF, FAST_DIFF is almost useless. Look back at the full scan times in the beginning. We can conclude that unless the index returns less than 0.5% of the rows a full scan is faster!

Why's that!? Well, for each row we need find the remainder of the row (at least a part of it). All we have for that in HBase is a GET. Each GET needs to SEEK into the relevant blocks and a SEEK with FAST_DIFF implies SEEKing the last full key before the key we're looking for and then forward to the actual key. That's quite expensive.

When we switch the encoding to ROW_INDEX_V1 the GET can now instead be served with a binary search this the Cell offset in the blocks are known. Now index queries that return up to 10% of the Cells are still faster than a full scan.

Since ROW_INDEX_V1 add so little storage overhead I did not test with no encoding.


  • ROW_INDEX_V1 does not add much overhead but can potentially very helpful.
  • If you use uncovered local indexes (which you should, since all other cases will ust blow up the data), you must configure the main table with ROW_INDEX_V1 or the related row lookup is simply too slow.
  • zSTD provides an excellent compromise between compression ratio and decompression speed (I didn't test compression speed). It should probably be the default.
  • ROW_INDEX_V1 together with zSTD compression make for an great default for a wide range of use cases.

1 comment:

  1. I want to use zSTD for Hbase on Horntonworks 3.0. But how to add zSTD to the hadoop/Hbase stack?