Unlike a pure storage machine that would just be optimized for disk size and throughput, an HBase RegionServer is also a compute node.
Every byte of disk space needs to be matched with a fraction of a byte in the RegionServer's Java heap.
You can estimate the ratio of raw disk space to required Java heap as follows:
RegionSize / MemstoreSize *
ReplicationFactor * HeapFractionForMemstores
Or in terms of HBase/HDFS configuration parameters:
Say you have the following parameters (these are the defaults in 0.94):
- 10GB regions
- 128MB memstores
- HDFS replication factor of 3
- 40% of the heap use for the memstores
Then: 10GB/128MB*3*0.4 = 96.
Now think about this. With the default setting this means that if you wanted to serve 10T worth of disks space per region server you would need a 107GB Java heap!
Or if you give a region server a 10G heap you can only utilize about 1T of disk space per region server machine.
Most people are surprised by this. I know I was.
Let's double check:
In order to serve 10T worth of raw disk space - 3.3T of effective space after 3-way replication - with 10GB regions, you'd need ~338 regions. @128MB that's about 43GB. But only 40% is by default used for the memstores so what you actually need is 43GB/0.4 ~ 107GB. Yep it's right.
Maybe we can get away with a bit less by assuming that not all memstores are 100% full at all times. That is offset by the fact that not all region will be exactly the same size or 100% filled.
Now. What can you do?
There are several options:
- Increase the region size. 20GB is about the maximum. Although some people claim they have 200GB regions. (hbase.hregion.max.filesize)
- Decrease the memstore size. Depending on your write load you can go smaller, 64MB or even less. (hbase.hregion.memstore.flush.size).
You can allow a memstore to grow beyond this size temporarily. (hbase.hregion.memstore.block.multiplier)
- Increase the HDFS replication factor. That does not really help per se, but if you have more disk space than you can utilize, increasing the replication factor would at least put your disks to good use.
- Fiddle with the heap fractions used for the memstores. If you load is write-heave maybe up that 50% of the heap (hbase.regionserver.global.memstore.upperLimit, hbase.regionserver.global.memstore.lowerLimit)
Personally I would place the maximum disk space per machine that can be served exclusively with HBase around 6T, unless you have a very read-heavy workload.
In that case the Java heap should be 32GB (20G regions, 128M memstores, the rest defaults). With MSLAB in 0.94 that works.
Of course your needs may vary. You may have mostly readonly load, in which case you can shrink the memstores. Or the disk space might be shared with other applications.
Maybe you need smaller regions or larger memstores. In that case he maximum disk space you can serve per machine would be less.
Future JVMs might support bigger heap effectively (JDK7's G1 comes to mind).
In any case. The formula above provides a reasonable starting point.
Update Monday, June 10, 2013:
Kevin O'dell pointed out on the mailing list that the overall WAL size should match the overall memstore size. So here's an update adding this to the sizing considerations.
HBase uses three config option to size the HLog per Regionserver:
- hbase.regionserver.hlog.blocksize, defaults to the HDFS block size
- hbase.regionserver.logroll.multiplier, defaults to 0.95
- hbase.regionserver.maxlogs, default: 32
Hence with a 64MB default HDFS blocksize the HLogs would capped less than 2GB.
hbase.regionserver.global.memstore.lowerLimit should be <=
I would only change hbase.regionserver.maxlogs. Individual HLogs should remain smaller than a single HDFS block.
Thanks Lars. Very informative.ReplyDelete
Great post, Lars. A lot of the documentation out there recommends staying below 16 gb of regionserver heap size in fear of extend gc time. Have you used these larger (heap) sizes in prod envs before?ReplyDelete
As you noted,if we stick with Max heaps of 16gb, we either need massive regions or really small memstore flush sizes.
sorry I missed your comment. You probably moved on, but maybe for other folks interested let me reply anyway.
Currently we use 12GB heaps at Salesforce, but we are investigating larger heaps. With current JVMs you probably want to stay below 32GB in order to make use of compressed oops.
There's also work undergoing in the HBase community to move the block cache off heap.
I was wondering if the math changes when using compression ?
That would tilt imbalance even further. Blocks are compressed on disk, but the data in the memstores is not compressed.ReplyDelete
Hi, this is really informative.ReplyDelete
I was wondering : is it a good idea to have several RegionSever running on the same DataNode ? I inherited a Hadoop cluster with quite large DataNodes (48T Disk, 256G RAM) and i was thinking this might help getting the most of it for HBase...
It's hard to recommend an absolutely right setup.ReplyDelete
If these nodes run nothing but HBase (no Map/Reduce, no Yarn, etc), you can think about running multiple region servers, but note that these servers will know nothing about each other. They'll compete for CPU and IOPS.
Maybe if you isolate them in containers it would work, I have not tried.
Hi Lars, What is the impact of increasing WAL log size beyond HDFS block size ?ReplyDelete
Very nice article!!!ReplyDelete