Sampling is an important feature when large amounts of data are involved.
In HBase this is traditionally tricky, as there is no notion of the n-th key without actually reading n key values from disk. IO is usually the bottleneck.
As it turns out there is a simple trick, involving a separate column family and HBase's "essential column family" feature.
The trick is to randomly write a value (any value) or nothing to a separate column family, and then scan along that column family - which is possible as long as the filters apply only to columns in that column family - letting HBase/Phoenix load the other column families only as needed.
CREATE TABLE <table> (mykey INTEGER NOT NULL PRIMARY KEY, A.cola INTEGER, B.colb INTEGER)
Now you UPSERT with a certain randomness (in Java: Random r = new Random(); r.nextFloat() < 0.0001f, for 0.01% sampling - see the full sketch below). For a sampled row you execute this:
UPSERT INTO <table>(mykey, cola, colb) VALUES(?,?,?)
And otherwise this:
UPSERT INTO <table>(mykey, cola) VALUES(?,?)
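Putting the pieces together, here is a minimal sketch of such a write path (my own illustration, not from the original setup; the table name T, the 1m-row loop, and the local JDBC URL are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class SampledWriter {
    public static void main(String[] args) throws Exception {
        Random r = new Random();
        // Assumption: Phoenix reachable via a local ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            PreparedStatement marked =
                conn.prepareStatement("UPSERT INTO T(mykey, cola, colb) VALUES(?,?,?)");
            PreparedStatement unmarked =
                conn.prepareStatement("UPSERT INTO T(mykey, cola) VALUES(?,?)");
            for (int key = 0; key < 1000000; key++) {
                if (r.nextFloat() < 0.0001f) {   // ~0.01% of rows get the sampling marker
                    marked.setInt(1, key);
                    marked.setInt(2, key);       // cola: the actual payload value
                    marked.setInt(3, 1);         // colb: any non-null value will do
                    marked.executeUpdate();
                } else {
                    unmarked.setInt(1, key);
                    unmarked.setInt(2, key);
                    unmarked.executeUpdate();
                }
                if (key % 10000 == 0) {
                    conn.commit();               // Phoenix buffers mutations until commit
                }
            }
            conn.commit();
        }
    }
}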
So now, as long as you do not update existing rows, you get rows in HBase where B.colb has a non-null value with a 0.01% chance. We can now use that fact for sampling by adding WHERE colb IS NOT NULL to our queries.
The queries are:
- SELECT COUNT(*) FROM <table>
- SELECT COUNT(*) FROM <table> WHERE colb IS NOT NULL
- SELECT COUNT(cola) FROM <table> WHERE colb IS NOT NULL
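The last two queries return counts over the sample only; to turn them into estimates for the whole table you scale by the inverse of the sampling rate, e.g. (my addition, with 10000 = 1/0.0001 for the 0.01% rate used above):
- SELECT COUNT(*) * 10000 FROM <table> WHERE colb IS NOT NULL
- SELECT COUNT(cola) * 10000 FROM <table> WHERE colb IS NOT NULL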
Now:
As you can see, this kind of sampling is no longer useful when the selectivity approaches 20% or worse. In real-world scenarios 0.1% or less is probably prudent anyway.
Also note that a full scan across the table is not entirely free: the relevant parts of the sampling column family also need to be scanned, though the difference is minimal.
Zoomed into the more useful range:
Note that we're approaching the timer resolution as well as the static planning cost of the Phoenix queries.
So here's another test with 10m rows and 0.1% and 0.01% sampling.
A full scan now takes 6.9s.
In conclusion... sampling this way at 0.01% (or 1/10000) can answer queries about 70x faster. For larger datasets in the billions of rows, one would probably sample even less.
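As a rough back-of-the-envelope check (my reasoning, not a measurement from above): at 0.01% a 10m-row table yields about 10,000,000 x 0.0001 = 1,000 sampled rows, giving a relative error on the order of 1/sqrt(1000), roughly 3%; a billion-row table reaches the same ~1,000 sampled rows at only 0.0001%, which is why larger datasets can afford even lower sampling rates.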
Remember, this is poor man's sampling! Things need to be set up at write time - although you can retrofit existing data. The HBase and Phoenix communities are working on "real" solutions, for example PHOENIX-153.