Monday, July 24, 2017

Poor man's sampling in HBase and Phoenix

By Lars Hofhansl

Sampling is an important feature when large amounts of data are involved.

In HBase this is traditionally tricky, as there is no notion of the n-th key without actually reading n key values from disk. IO is usually the bottleneck.

As it turns out, there is a simple trick involving a separate column family and HBase's "essential column family" feature.

The trick is to randomly write a value (any value) or nothing to a separate column family, then scan along that column family - which is possible as long as the filters apply only to columns in that column family - and let HBase/Phoenix load the other column families only for the rows that match.

CREATE TABLE <table> (mykey INTEGER NOT NULL PRIMARY KEY, A.cola INTEGER, B.colb INTEGER)
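Here column family A holds the actual data, while B holds nothing but the tiny sampling marker, so a scan along B alone touches very little data.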

Now you UPSERT with a certain randomness (java.util.Random r; r.nextFloat() < 0.0001 for 0.01% sampling):
UPSERT INTO <table> VALUES(?,?,?)

And otherwise this, leaving B.colb empty:
UPSERT INTO <table>(mykey, cola) VALUES(?,?)
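
Putting the write path together, here is a minimal JDBC sketch - the table name SAMPLED_TABLE, the jdbc:phoenix:localhost URL, and the payload are illustrative, not from the original post:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class SampledWriter {
    // 0.01% of rows get a marker in the sampling family B
    private static final float SAMPLE_RATE = 0.0001f;

    public static void main(String[] args) throws Exception {
        Random r = new Random();
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             PreparedStatement sampled = conn.prepareStatement(
                 "UPSERT INTO SAMPLED_TABLE VALUES(?,?,?)");
             PreparedStatement normal = conn.prepareStatement(
                 "UPSERT INTO SAMPLED_TABLE(mykey, cola) VALUES(?,?)")) {
            for (int key = 0; key < 1000000; key++) {
                int payload = key * 2; // stand-in for the real data
                if (r.nextFloat() < SAMPLE_RATE) {
                    // marked row: B.colb gets a (any) non-null value
                    sampled.setInt(1, key);
                    sampled.setInt(2, payload);
                    sampled.setInt(3, 1);
                    sampled.executeUpdate();
                } else {
                    // normal row: B.colb stays null
                    normal.setInt(1, key);
                    normal.setInt(2, payload);
                    normal.executeUpdate();
                }
                if (key % 10000 == 0) {
                    conn.commit(); // Phoenix batches mutations until commit
                }
            }
            conn.commit();
        }
    }
}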

So now, as long as you do not update existing values, you get rows in HBase where B.colb has a non-null value with a 0.01% chance. We can use that fact for our sampling by adding WHERE colb IS NOT NULL to our queries.

The queries are:
  1. SELECT COUNT(*) FROM <table>
  2. SELECT COUNT(*) FROM <table> WHERE colb IS NOT NULL
  3. SELECT COUNT(cola) FROM <table> WHERE colb IS NOT NULL
Query #1 performs a simple full scan. Query #2 scans along the sampled column family only. Query #3 scans along the sampled column family and then brings in the matching parts of the row from the other column family - this is likely the most common scenario.
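
Since query #2 counts a known fraction of the rows, you can also scale its result back up to estimate the total row count without the full scan of query #1. A sketch, again using the hypothetical SAMPLED_TABLE and the 0.01% write-time rate:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SampledCount {
    public static void main(String[] args) throws Exception {
        double sampleRate = 0.0001; // must match the rate used at write time
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement();
             // query #2: scans only the small sampling column family
             ResultSet rs = stmt.executeQuery(
                 "SELECT COUNT(*) FROM SAMPLED_TABLE WHERE colb IS NOT NULL")) {
            rs.next();
            long sampledCount = rs.getLong(1);
            // scale back up; accuracy depends on the number of sampled rows
            long estimatedTotal = Math.round(sampledCount / sampleRate);
            System.out.println(sampledCount + " sampled, ~" + estimatedTotal + " total");
        }
    }
}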

Now the results:

[Chart: runtime of the three queries at varying sampling selectivity]

As you can see, this kind of sampling is no longer useful when the selectivity approaches 20% or worse. In real-world scenarios 0.1% or less is probably prudent anyway.
Also note that a full scan across the table is not entirely free: the relevant parts of the sampled column family also need to be scanned, though the difference is minimal.

Zoomed into the more useful range:
[Chart: same data at low selectivity]

Note that we're approaching the timer resolution as well as the static planning cost of the Phoenix queries.

So here's another test with 10m rows and 0.1% and 0.01% sampling.
A full scan now takes 6.9s.


In conclusion... Sampling this way at 0.01% (or 1/10,000) can answer queries roughly 70x faster. For larger datasets in the billions of rows, one would probably sample even less.

Remember, this is poor man's sampling! The HBase and Phoenix communities are working on "real" solutions - see for example PHOENIX-153. Things need to be set up at write time, although you can retrofit existing data.
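
For the retrofit, a minimal sketch: scan just the primary keys and upsert a marker into B for a random subset. As before, SAMPLED_TABLE, the connection URL, and the rate are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Random;

public class RetrofitSampling {
    public static void main(String[] args) throws Exception {
        Random r = new Random();
        float sampleRate = 0.0001f;
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement scan = conn.createStatement();
             PreparedStatement mark = conn.prepareStatement(
                 "UPSERT INTO SAMPLED_TABLE(mykey, colb) VALUES(?,?)");
             // key-only scan: touches no payload data
             ResultSet keys = scan.executeQuery("SELECT mykey FROM SAMPLED_TABLE")) {
            while (keys.next()) {
                if (r.nextFloat() < sampleRate) {
                    mark.setInt(1, keys.getInt(1));
                    mark.setInt(2, 1); // any non-null marker value
                    mark.executeUpdate();
                }
            }
            conn.commit();
        }
    }
}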

