
Performance Testing / Benchmarking an HBase Cluster
So you have set up a new HBase cluster and want to 'take it for a spin'. Here is how, without writing a lot of code yourself.

BEFORE WE START

I like to have the hbase command available in my PATH. I put the following in my ~/.bashrc file:

export HBASE_HOME=/hadoop/hbase
export PATH=$PATH:$HBASE_HOME/bin
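
To sanity-check the setup, run a quick command (output will vary with your install):

# hbase version

If this prints the HBase version string, the PATH is set correctly.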

A) HBASE PERFORMANCE EVALUATION

class : org.apache.hadoop.hbase.PerformanceEvaluation
jar : hbase-*-tests.jar

This is a handy class that ships with the distribution. It can do reads and writes against HBase, and it spawns a map-reduce job to perform them in parallel. There is also an option to run the operations in threads instead of map-reduce.

Let's find out the usage:

# hbase org.apache.hadoop.hbase.PerformanceEvaluation

Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
  [--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>
....
[snipped]
...

So let's run a randomWrite test:

# time hbase org.apache.hadoop.hbase.PerformanceEvaluation  randomWrite 5

We are running 5 clients. By default this runs in map-reduce mode (a thread-based variant is shown just below).
Each client inserts 1 million rows (the default), about 1 GB of data (1000 bytes per row), so the total data size is 5 GB (5 x 1 GB).
Typically there will be 10 maps per client, so we will see 50 (5 x 10) map tasks.
You can watch the progress on the console and also at the task tracker UI (http://task_tracker:50030).
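
If you want to skip the map-reduce overhead (say, on a small test setup), the --nomapred flag from the usage above runs the clients as threads instead, for example:

# time hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 5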

Once this test is complete, it will print out summaries:

... <output clipped>
....
Hbase Performance Evaluation
     Row count=5242850
     Elapsed Time in millisconds = 1789049
.....

real    3m21.829s
user    0m2.944s
sys     0m0.232s

I actually like to look at the elapsed REAL time (which I measure using the Unix 'time' command), and then do this calculation:

~5 million rows = 5242850
total time = 3 min 21 sec = 201 secs

write throughput
= 5242850 rows / 201 seconds = 26083.8 rows / sec
= 5 GB data / 201 seconds = 5 * 1000 MB / 201 sec = 24.87 MB / sec
insert time = 201 seconds / 5242850 rows = 0.038 ms / row
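
If you run these tests often, the same arithmetic is easy to script. Here is a small awk sketch using the figures measured above (ROWS, SECS and DATA_MB are taken from this run, not something the tool prints for you):

ROWS=5242850     # row count from the PerformanceEvaluation summary
SECS=201         # elapsed real time from 'time' (3 min 21 sec)
DATA_MB=5000     # 5 clients x 1 million rows x 1000 bytes = ~5 GB
awk -v r=$ROWS -v s=$SECS -v mb=$DATA_MB 'BEGIN {
  printf "write throughput = %.1f rows / sec\n", r / s          # ~26083.8
  printf "write throughput = %.2f MB / sec\n",  mb / s          # ~24.87
  printf "insert time      = %.3f ms / row\n", (s * 1000) / r   # ~0.038
}'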


This should give you a good idea of the cluster throughput.


Now, let's do a READ benchmark:

# time hbase org.apache.hadoop.hbase.PerformanceEvaluation  randomRead 5

and you can calculate the read throughput the same way.

B) YCSB


YCSB is a performance testing tool released by Yahoo. It has an HBase mode, which we will use.

First, read the excellent tutorial by Lars George on using YCSB with HBase, and follow his instructions for setting up HBase and YCSB (I won't repeat them here).


YCSB ships with a few 'workloads'. I am going to run 'workloada' - it is a 50/50 mix of reads and writes.
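
One reminder before loading (Lars George's tutorial above covers this): the target table must already exist in HBase with the column family you pass to YCSB. Assuming YCSB's default table name 'usertable' and the columnfamily=family used below, the hbase shell can create it:

# hbase shell
hbase> create 'usertable', 'family'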

Step 1) Setting up the workload:
java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p recordcount=10000000  -threads 10 -s > load.dat
-load : we are loading the data
-P workloads/workloada : we are using workloada
-p recordcount=10000000 : 10 million rows
-threads 10 : use 10 threads to parallelize the inserts
-s : print progress to stderr (the console) every 10 secs
> load.dat : save the output into this file

Examine the file 'load.dat'. Here are the first few lines:

YCSB Client 0.1
Command line: -load -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p recordcount=10000000 -threads 10 -s
[OVERALL], RunTime(ms), 786364.0
[OVERALL], Throughput(ops/sec), 12716.757125199018
[INSERT], Operations, 10000000
[INSERT], AverageLatency(ms), 0.5551727
[INSERT], MinLatency(ms), 0
[INSERT], MaxLatency(ms), 34580
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 1
[INSERT], Return=0, 10000000
[INSERT], 0, 9897989
[INSERT], 1, 99298

The important numbers here are the overall throughput (how many ops were performed each second) and the run time in ms (~786 secs).


Step 2) Running the workload
The previous step loaded the data. Now let's run the workload.

java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=10000000 -s -threads 10 > a.dat

The differences are:
-t : transaction mode (read/write)
-p operationcount : specifies how many ops to run
Now let's examine a.dat:

YCSB Client 0.1
Command line: -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=10000000 -threads 10 -s
[OVERALL], RunTime(ms), 2060800.0
[OVERALL], Throughput(ops/sec), 4852.484472049689
[UPDATE], Operations, 5002015
[UPDATE], AverageLatency(ms), 0.6575520065413638
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 28364
[UPDATE], 95thPercentileLatency(ms), 0
[UPDATE], 99thPercentileLatency(ms), 0
[UPDATE], Return=0, 5002015
[UPDATE], 0, 4986514
[UPDATE], 1, 15075
[UPDATE], 2, 0
[UPDATE], 3, 2
....
....[snip]
....
[READ], Operations, 4997985
[READ], AverageLatency(ms), 3.3133978993534394
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 2868
[READ], 95thPercentileLatency(ms), 13
[READ], 99thPercentileLatency(ms), 24
[READ], Return=0, 4997985
[READ], 0, 333453
[READ], 1, 1866771
[READ], 2, 1197919

Here is how to read it:

Overall details are printed at the top, then the UPDATE stats, followed by many lines of UPDATE latency counts. Scroll down further (or search for READ) to find the READ stats: the average read latency is about 3.31 ms.
The percentiles are interesting too: we can satisfy 95% of read requests within 13 ms. Pretty good - almost as fast as an RDBMS.
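
If you do several runs, a quick way to pull just the headline numbers out of these output files is a plain grep over the labels shown above, for example:

# grep -E 'OVERALL|AverageLatency|95thPercentile|99thPercentile' a.dat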

