Warning: General topic ahead.
Over five years ago I posted HBase and data locality, summarizing how important data locality is and how HBase ensures it.
A lot has changed since then. We now typically see less than 0.1ms of network latency inside a site (a datacenter), and 10Ge and even 40Ge or 100Ge networks are commonplace - without oversubscription at the ToR and spine switches or at the fabric.
Meanwhile rotating disks still show 4 - 15ms latency and 120-140MB/s throughput.
Even SSDs typically have 0.08 to 0.16ms of "seek" time, and the drive-to-host data transfer is often limited to 300MB/s (8b/10b encoding over 3.0Gbit/s SATA, i.e. 10 bits on the wire per byte).
(The usual disclaimer: I'm not an H/W guy... But these numbers are ballpark correct.)
So a 10Ge link can comfortably serve from a remote machine with 10 HDDs or 4 SSDs at full throttle. With 40Ge networks or better we just do not need to think about this anymore.
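For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch in Python using only the ballpark figures quoted above. The function names are made up for illustration, and it assumes raw link capacity with no protocol overhead, so treat the result as an upper bound rather than a measurement.

```python
# Back-of-the-envelope check with the ballpark numbers from this post.
# Assumption: raw link capacity, ignoring Ethernet/TCP/HDFS overhead.

HDD_MB_PER_S = (120, 140)   # sequential throughput range of a rotating disk
SSD_MB_PER_S = 300          # SATA 3.0Gbit/s limit after 8b/10b encoding


def link_capacity_mb_per_s(gbit_per_s):
    """Convert a link speed in Gbit/s to MB/s (1 Gbit = 1000 Mbit, 8 bits per byte)."""
    return gbit_per_s * 1000 / 8


def aggregate_mb_per_s(drives, per_drive_mb_per_s):
    """Combined throughput of several drives all streaming at full speed."""
    return drives * per_drive_mb_per_s


for gbit in (10, 40):
    link = link_capacity_mb_per_s(gbit)
    hdd_low = aggregate_mb_per_s(10, HDD_MB_PER_S[0])
    hdd_high = aggregate_mb_per_s(10, HDD_MB_PER_S[1])
    ssd = aggregate_mb_per_s(4, SSD_MB_PER_S)
    print(f"{gbit}Ge link: {link:.0f} MB/s")
    print(f"  10 HDDs: {hdd_low}-{hdd_high} MB/s ({hdd_low / link:.0%}-{hdd_high / link:.0%} of link)")
    print(f"  4 SSDs:  {ssd} MB/s ({ssd / link:.0%} of link)")
```

With these figures, 10 HDDs at the low end and 4 SATA SSDs both land right around what a single 10Ge link can carry, while 10 HDDs at the high end slightly exceed it. On a 40Ge link all of them fit with plenty of room to spare.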
So... I think it's safe to say that Data Locality No Longer Matters!
What still matters is cache locality: bandwidth between RAM and CPU, at 40GB/s or more, is still better than the network. That is why it still matters that HBase has its block cache local to the RegionServers.
Instead we should focus on untangling storage from compute going forward. Unlike storage, (stateless) compute is elastic; we can grow and shrink it on demand. Only storage (HDFS in our case, or S3 on AWS or GCS on GCP) needs to be carefully managed.
That allows us to manage required storage in new ways (pick storage dense machines or VMs) or even outsource storage to the likes of S3 or GCS, while allowing us to choose CPU/RAM heavy machines or VMs for our compute needs. Note that both HBase and Phoenix are compute. So are MapReduce and Spark.
The Spark community saw this trend years ago and Spark comfortably works over purely remote storage.
The older MapReduce as a programming paradigm is still relevant. As a means to bring the compute to the data, it is outmoded and no longer needed or desired.
For HBase it means that instead of running the HBase RegionServers and the HDFS DataNodes on the same machines/VMs, and having HBase try to maintain data locality - as in the five year old post above - we should pick our storage and compute options separately, optimize each independently, and rely on a fast network between them.
My team and I will soon look into how we can increase the elasticity of HBase and make it easier to grow and shrink the set of RegionServers dynamically and quickly, with low overhead.