Wednesday, December 25, 2019

HBase, Phoenix - Local index optimizations with region pruning

By Lars Hofhansl
(Updated for clarity)

Why Local Indexes?

Local indexes are a powerful tool in the HBase/Phoenix toolbox. They are (1) naturally and cheaply transactional, they (2) avoid creating extra index tables, (3) provide the best write performance, and (4) they can small since they work uncovered in all cases (i.e. you do not have to include extra columns so that a query can be answered from only the index).

Problems with Local Indexes?

Local indexes work by maintaining an index per HBase region, each region has its own local index.

At read-time, for a query along the index that means that each region of a table needs to be consulted. And even though the RPCs can be done in parallel and regions with no matching keys will be ruled out quickly, for large tables for 1000's or 10000's of regions this can cause many unnecessary RPC.

Say we have the following simple table:

CREATE TABLE T (pk1 INTEGER NOT NULL,
                pk2 INTEGER NOT NULL,
                v1 FLOAT,
                v2 FLOAT
       CONSTRAINT pk PRIMARY KEY (pk1, pk2));

With the the following index:

CREATE LOCAL INDEX L T(v2);

A query for the form:

SELECT ... FROM T WHERE v2 = ...

Would now need to check each and every region of the table to see if there are any rows matching the v2 = ... condition.

Region Pruning

This is where region pruning comes into the picture.

Can we rule out some of the regions by using other parts of the query? It turns out we can!

There are currently two methods to do that.

One is implemented in PHOENIX-3941, and requires to declare the index in a specific way (assuming the same simple table):

CREATE LOCAL INDEX L T(pk1, v2);

Notice that we included the prefix of the pk (namely pk1) we'll use for extra pruning in the index definition. Now a query of the form 

> SELECT ... FROM T WHERE v2 = ...

will not be able to use the index!

But if we provide a value (or range) for pk1 the query compiler can now remove regions where this key is known not to be found, as in

> SELECT ... FROM T WHERE v2 = ... AND pk1 = ...

Phoenix does this by re-arranging the key structure of the local index.

Normally the key would look like this:
(INDEX_ID, v2, pk1, pk2)

The pk1 is included so that we can find the row once we found a match. With the rearrange index it now looks like this:
(INDEX_ID, pk1, v2, pk2)

Thus Phoenix can now use its algorithms for key range intersection along the (pk1, v2) key space to effectively limit how much scanning it needs to perform with the local index... Albeit at the expense that the index can only be used when querying along all declared index parts.

Can we do better?

Again, it turns out we can. See PHOENIX-5096 (will be in 4.16.0).

(This was a long standing problem, but it turns out adding more pruning was ridiculously simple once the code was understood. All I had to do was filter the regions through the non-index query plan, and voilá.)

Now we can declare the index naturally (same table definition):

CREATE LOCAL INDEX L T(v2);

And a query like:

SELECT ... FROM T WHERE v2 = ...

would correctly use the local index - but ping every region of the table.

But
SELECT ... FROM T WHERE v2 = ... AND pk1 = ...

can still use the pk1 part of the query to rule out any region where we know for sure a key with this prefix cannot be found.
 
So now the we can use and define the index as expected, and get the region pruning.

PHOENIX-5096 and PHOENIX-3941 complement each other. If you know that you will always include a certain prefix of the row key you should define your indexes with the rearranged key structure.

Conclusion

  • Local indexes are an effective and perhaps overlooked tool in Phoenix.
  • At read time performance is impacted by the fact that all region need to be consulted.
  • Phoenix can automatically prune those regions by looking at other parts of the query.
  • If we know that we will always include certain parts of the PK we should include those in the local index definition (as prefix).
  • If we want to use a local index sometimes with, sometimes without knowing anything about part of the key, we can now rely on Phoenix to prune unnecessary regions. And this expands the areas where local indexes are useful.