Wednesday, September 25, 2024

Closure and blocking comments

By Lars Hofhansl

My last post here was in 2019.

BigData has had an "interesting" journey: from the "thing" that everybody needed but did not quite understand, to lots of fragmented solutions, to some very large installations, to partial irrelevance due to the public clouds, and to some kind of return to "small" data.

HDFS, HBase and Phoenix were part of this ride, and the almost 10 years I worked on those were some of the best in my career. Open Source is fun, and I was lucky enough to contribute hundreds of changes and improvements to these projects.

Since 2020 I have worked a bit on Trino, Kafka, Spark, and some other projects, but have since come full circle and now work on (and with) more "traditional" databases. Some of these are internal, and I can't really talk about them... Think equivalents to PostgreSQL and DuckDB, respectively.

Since every single comment in the last few years was spam about one Hadoop training or another - and I am flattered that someone might think to reach anyone through "advertising" on my HBase blog - I have now disabled all comments.

Except for Trino, Kafka, and Spark (and perhaps a few more) the BigData boom is over or at best in maintenance or legacy mode.

It was a good run: I learned a lot, had a lot of fun, and deployed a few systems serving trillions of requests/transactions.

 

So Long, and Thanks for All the Fish

Wednesday, December 25, 2019

HBase, Phoenix - Local index optimizations with region pruning

By Lars Hofhansl
(Updated for clarity)

Why Local Indexes?

Local indexes are a powerful tool in the HBase/Phoenix toolbox. They (1) are naturally and cheaply transactional, (2) avoid creating extra index tables, (3) provide the best write performance, and (4) can stay small since they work uncovered in all cases (i.e. you do not have to include extra columns so that a query can be answered from the index alone).

Problems with Local Indexes?

Local indexes work by maintaining an index per HBase region: each region has its own local index.

At read time, a query along the index therefore needs to consult each region of the table. And even though the RPCs can be done in parallel and regions with no matching keys are ruled out quickly, for large tables with 1000's or 10000's of regions this can cause many unnecessary RPCs.

Say we have the following simple table:

CREATE TABLE T (pk1 INTEGER NOT NULL,
                pk2 INTEGER NOT NULL,
                v1 FLOAT,
                v2 FLOAT,
                CONSTRAINT pk PRIMARY KEY (pk1, pk2));

With the following index:

CREATE LOCAL INDEX L ON T(v2);

A query of the form:

SELECT ... FROM T WHERE v2 = ...

Would now need to check each and every region of the table to see if there are any rows matching the v2 = ... condition.

Region Pruning

This is where region pruning comes into the picture.

Can we rule out some of the regions by using other parts of the query? It turns out we can!

There are currently two methods to do that.

One is implemented in PHOENIX-3941 and requires declaring the index in a specific way (assuming the same simple table):

CREATE LOCAL INDEX L ON T(pk1, v2);

Notice that we included in the index definition the primary key prefix (namely pk1) that we'll use for extra pruning. Now a query of the form

> SELECT ... FROM T WHERE v2 = ...

will not be able to use the index!

But if we provide a value (or range) for pk1 the query compiler can now remove regions where this key is known not to be found, as in

> SELECT ... FROM T WHERE v2 = ... AND pk1 = ...

Phoenix does this by re-arranging the key structure of the local index.

Normally the key would look like this:
(INDEX_ID, v2, pk1, pk2)

The primary key (pk1, pk2) is included so that we can find the data row once a match is found. With the rearranged index the key now looks like this:
(INDEX_ID, pk1, v2, pk2)

Thus Phoenix can now use its algorithms for key range intersection along the (pk1, v2) key space to effectively limit how much scanning it needs to perform with the local index... Albeit at the expense that the index can only be used when querying along all declared index parts.
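
To make the trade-off concrete, here is a quick sketch against the table and index above (the literal values are just for illustration):

-- Index declared as: CREATE LOCAL INDEX L ON T(pk1, v2);

-- Can use the index: pk1 constrains the (INDEX_ID, pk1, v2, pk2) key space,
-- so only regions that may contain pk1 = 42 are consulted.
SELECT pk2 FROM T WHERE pk1 = 42 AND v2 = 1.5;

-- A range on the prefix works as well.
SELECT pk2 FROM T WHERE pk1 BETWEEN 40 AND 50 AND v2 = 1.5;

-- Cannot use this index: no condition on the leading pk1 part.
SELECT pk2 FROM T WHERE v2 = 1.5;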

Can we do better?

Again, it turns out we can. See PHOENIX-5096 (will be in 4.16.0).

(This was a long-standing problem, but it turns out adding more pruning was ridiculously simple once the code was understood. All I had to do was filter the regions through the non-index query plan, and voilà.)

Now we can declare the index naturally (same table definition):

CREATE LOCAL INDEX L ON T(v2);

And a query like:

SELECT ... FROM T WHERE v2 = ...

would correctly use the local index - but ping every region of the table.

But
SELECT ... FROM T WHERE v2 = ... AND pk1 = ...

can still use the pk1 part of the query to rule out any region where we know for sure a key with this prefix cannot be found.
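
A quick way to see the pruning in action is Phoenix's EXPLAIN. A sketch (the exact plan text varies by Phoenix version and region layout, but the number of chunks/regions scanned should drop once the pk1 predicate is added; the literal values are just for illustration):

EXPLAIN SELECT pk2 FROM T WHERE v2 = 1.5;
-- the plan touches all regions of T

EXPLAIN SELECT pk2 FROM T WHERE v2 = 1.5 AND pk1 = 42;
-- with PHOENIX-5096, regions that cannot contain pk1 = 42 are pruned from the plan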
 
So now we can define and use the index as expected, and still get the region pruning.

PHOENIX-5096 and PHOENIX-3941 complement each other. If you know that you will always include a certain prefix of the row key you should define your indexes with the rearranged key structure.

Conclusion

  • Local indexes are an effective and perhaps overlooked tool in Phoenix.
  • At read time performance is impacted by the fact that all regions need to be consulted.
  • Phoenix can automatically prune those regions by looking at other parts of the query.
  • If we know that we will always include certain parts of the PK, we should include those in the local index definition (as a prefix).
  • If we sometimes know part of the key and sometimes do not, we can now rely on Phoenix to prune unnecessary regions, which expands the areas where local indexes are useful.







Thursday, October 4, 2018

Apache HBase and Apache Phoenix, more on block encoding and compression

By Lars Hofhansl

2 1/2 years ago I blogged about HBase Compression vs Blockencoding.

Things have moved on since then. The Hadoop and HBase communities added new compression schemes like zSTD, and new block encoders like ROW_INDEX_V1.

zSTD promises to be fast and yield great compression ratios.
The ROW_INDEX_V1 encoder allows fast seeks into an HFile block by storing the offsets of all Cells in that block so a Cell can be found by binary search, instead of the usual linear search.

So let's do some tests. Like in the previous blog I'm setting up a single node HBase, on a single node HDFS, with a single node ZK.
(But note that these are different machines, so do not compare these numbers to those in the older post.)

CREATE TABLE <table> (pk INTEGER PRIMARY KEY, v1 FLOAT, v2 FLOAT, v3 INTEGER) DISABLE_WAL=true, SALT_BUCKETS=8
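
For reference, this is roughly how the encoding/compression variants can be declared: DATA_BLOCK_ENCODING and COMPRESSION are HBase column-family properties that Phoenix passes through in the DDL. A sketch, not the exact DDL I ran (the property value strings depend on the HBase version):

CREATE TABLE <table> (pk INTEGER PRIMARY KEY, v1 FLOAT, v2 FLOAT, v3 INTEGER)
    DATA_BLOCK_ENCODING='ROW_INDEX_V1', COMPRESSION='ZSTD',
    DISABLE_WAL=true, SALT_BUCKETS=8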

Then I loaded 2^22 = 4194304 rows. Columns v1 and v2 are random values from [0,1).
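
The load itself is not shown here; a minimal sketch of the statement shape, assuming the keys and random values are generated on the client and bound as JDBC parameters:

UPSERT INTO <table> (pk, v1, v2, v3) VALUES (?, ?, ?, ?)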

Let's look at the size of the data. Remember this is just a single machine test. This is a qualitative test. We verified elsewhere that this scales as machines are added.

 88.5MB ROW_INDEX_V1 + zSTD
112.1MB ROW_INDEX_v1 + GZ
184.0MB ROW_INDEX_V1 + SNAPPY
200.0MB FAST_DIFF
556.9MB NONE
572.9MB ROW_INDEX_V1
 
So far, nothing surprising:
  • The ROW_INDEX adds a bit more data (~3% for these small Cells).
  • zSTD offers the best compression, followed by GZ, and SNAPPY.
  • Uncompressed and unencoded HBase bloats the data a lot. 

Full scans:

Let's first do some scanning. First from the OS buffer cache (1), then from the HBase block cache (2).

(1) SELECT /*+ NO_CACHE SERIAL */ COUNT(*) FROM <table>
(2) SELECT /*+ SERIAL */ COUNT(*) FROM <table>

I'm using the SERIAL hint in Phoenix in order to get consistent results independently of how Phoenix decides to parallelize the query (which is based on region sizes as well as the current stats).

  • zSTD decompression rate is pretty good, closer to Snappy than to Gzip.
  • ROW_INDEX - as expected - does not help with full scans, but it also does not seem to hurt (variance was within the noise).
  • FAST_DIFF has the worst scan times, whether data is in the block cache or not. 

Where do we seek a lot with Phoenix?

ROW_INDEX_V1 helps with random seeks, but where does Phoenix actually seek a lot? Some areas are obvious, some might surprise you:
  • SELECT on random PKs. Well, duh...
  • Reverse scans
  • Scans through tables with lots of deleted rows (before they are compacted away)
  • Row lookup from uncovered indexes

Point Gets:

So let's start with random gets, i.e. simple point queries. To avoid repeated metadata lookups I re-configured the table with UPDATE_CACHE_FREQUENCY = NEVER, and then issued:

SELECT pk FROM <table> WHERE pk = <random pk>, 1000 times.
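
For reference, a sketch of the reconfiguration and the query shape (the pk value shown is hypothetical; the benchmark used a fresh random value for each of the 1000 queries, and the exact quoting of NEVER may differ by Phoenix version):

ALTER TABLE <table> SET UPDATE_CACHE_FREQUENCY = 'NEVER';

SELECT pk FROM <table> WHERE pk = 123456;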

So now we see how ROW_INDEX_V1 helps: GETs are improved by 24%, and remember that this is end-to-end (query planning/compilation, network overhead, and finally the HBase scan).
We also see the relative decompression cost each scheme adds (the OS buffer cache case).

Yet zSTD decompression seems to be on par with FAST_DIFF, and almost as fast as SNAPPY, while providing vastly better compression ratios.

Also remember that by default blocks are stored uncompressed in the block cache (so all ROW_INDEX block cache requests perform the same).

Reverse Scans:

HBase offers reverse scans, and Phoenix will automatically make use of them in queries like this:

SELECT * FROM <table> ORDER BY pk DESC LIMIT 100;


(Everything is in the block cache here)

Reverse scanning involves a lot of seeking. For each Cell (i.e. each column in Phoenix) we need to seek to the previous row, then skip forward again for the columns of that row. We see that ROW_INDEX_V1 is 2.5x faster than no encoding and over 6x faster compared to FAST_DIFF.

99.9% of cells deleted:

When HBase deletes data, it is not actually removed immediately but rather marked for deletion in the future by placing tombstones.

For this scenario I deleted 99.9% of the rows:
DELETE FROM <table> WHERE v1 < 0.999

Then issued:
SELECT /*+ SERIAL */ COUNT(*) FROM <table>




We see that ROW_INDEX_V1 helps with seeking past the delete markers. About a 24% improvement.


Now what about reverse scanning with lots of delete markers?
SELECT * FROM <table> ORDER BY pk DESC LIMIT 100


WOW... Reverse scanning with deleted Cells is somewhat of a worst case for HBase: we need to seek backwards Cell by Cell until we find the next non-deleted one.

ROW_INDEX_V1 makes a huge difference here. In fact, without it, just scanning 100 rows in reverse is almost four times slower than scanning through all 4M rows plus almost 4M delete markers.

If you have lots of deletes in your data set, for example if it is a churning data set, you might want to switch the encoding to ROW_INDEX_V1.

Local Secondary (uncovered) Indexes:

Still 2^22 rows...
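
The index itself was created the same way as in the local index examples above; a sketch (the index name is illustrative):

CREATE LOCAL INDEX <index> ON <table>(v2);
-- In recent Phoenix versions the local index data lives in a shadow column family
-- on the data table itself, which is why its encoding/compression can be chosen
-- separately from the main column family.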

Now we have two decisions to make: (1) how to encode/compress the main column family, and (2) how to encode/compress the local index column family. In the interest of brevity I limited this to three cases:

308.1MB  main: FAST_DIFF,   index: FAST_DIFF
195.8MB  main: RI + zSTD,   index: FAST_DIFF
170.9MB  main: RI + zSTD,   index: RI + zSTD

Let's first do a full scan:
SELECT /*+ SERIAL */ COUNT(*) FROM <table>
 

Looks like this is now purely driven by the size of the index, as expected, since Phoenix now scans along the index, rather than the main data.


Let's do some filtering.

SELECT /*+ SERIAL */ COUNT(*) FROM <table> WHERE v2 < p


Pretty much as expected. The index encoding is what counts, and differences are marginal.


Next: The indexes I created are uncovered: they include only the indexed column(s) but no other columns. So when we reference columns that are not in the index, Phoenix needs to retrieve those as well.

SELECT /*+ SERIAL */ COUNT(v1) FROM <table> WHERE v2 < p;


Again... Wow... We see a huge difference. Phoenix's default of FAST_DIFF for both families is almost useless here. Look back at the full scan times at the beginning. We can conclude that unless the index returns less than 0.5% of the rows, a full scan is faster!

Why's that!? Well, for each row we need to find the remainder of the row (or at least a part of it). All we have for that in HBase is a GET. Each GET needs to SEEK into the relevant blocks, and a SEEK with FAST_DIFF implies seeking to the last full key before the key we're looking for and then scanning forward to the actual key. That's quite expensive.

When we switch the encoding to ROW_INDEX_V1, the GET can instead be served with a binary search since the Cell offsets in the blocks are known. Now index queries that return up to 10% of the Cells are still faster than a full scan.

Since ROW_INDEX_V1 adds so little storage overhead I did not test with no encoding.

Takeaways:

  • ROW_INDEX_V1 does not add much overhead but can potentially be very helpful.
  • If you use uncovered local indexes (which you should, since all other cases will just blow up the data), you must configure the main table with ROW_INDEX_V1 or the related row lookup is simply too slow.
  • zSTD provides an excellent compromise between compression ratio and decompression speed (I didn't test compression speed). It should probably be the default.
  • ROW_INDEX_V1 together with zSTD compression makes for a great default for a wide range of use cases.