## Big Data on the back of a napkin

If you’re thinking about using Hadoop to perform analytics, you are probably considering using Hive. Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Most analysts use Hive because HiveQL is a reasonable subset of SQL-92, so there isn’t a steep learning curve when building queries for your reports. Since HiveQL looks so much like SQL, you may be tempted just to build your tables and queries the same way you built your existing Oracle or SQL Server artifacts. Unfortunately, this surface similarity masks deep differences and a failure to understand these differences can short-circuit your new Big Data initiative. As a case in point, I had a client who wanted to move their historical data from DB2 to Hive for cost savings purposes. Their requirement was to have their same queries work on the same datasets at the same level of performance, which was 20 seconds. Normally, I recommend against these types of projects since Hadoop is meant to provide insight into data sets that are beyond the scope of current systems to bring new value to the enterprise, not provide an incremental savings on storage and licensing fees. However, the current performance they were getting was 8,528 seconds, so I felt there was an opportunity for improvement. So I went ablout estimating the job.

When engineers are building distributed systems, they use something called *asymptotic analysis* to estimate performance of very large systems. The asymptotic behavior of a function f(*n*) refers to the growth of f(*n*) as *n* gets large. In other words, asymptotic analysis is a back of the napkin calculation. Programmers call it Big O analysis and now you can too. The following refers to the upper bound of performance, fastest to slowest:

- O(1) constant
- O(log
*n*) logarithmic - O(
*n*) linear - O(
*n*log*n*) - O(
*n*^{2}) quadratic - O(
*n*^{3}) cubic - n
^{O(1}) polynomial - 2
^{O(n) }exponential

Why is this important? Relational databases, like Oracle and MySQL, all use relational algebra, expressed as SQL, against data stored in a B+tree in main memory and B+ trees provide logarithmic, or O(log *n*), performance. This is why relational database systems are so fast; the underlying mathematical model is extremely effective. There aren’t a lot of algorithms faster at searching than a B+tree. However, it is possible to construct a query in Hive to find a record in O(log *n*) assuming the binary search is done on a sorted array of elements. Hive uses the following process to generate map reduce jobs from HiveQL:

HiveQL -> [parser] -> Abstract Syntax Tree -> [semantic analyzer] -> Query Block -> [logical plan generator] -> Operator Tree -> [logical optimizer] -> Operator Tree -> [physical plan generation] -> Task Tree -> [physical optimizer] -> Task Tree -> [execution] -> Map Reduce

MapReduce, the underlying distributed programming model used by Hadoop and ultimately Hive, uses a divide and conquer algorithm, which is O(*n* log *n*). To calculate an estimate for a map reduce job, run EXPLAIN against the SELECT statement. From the ABSTRACT SYNTAX TREE, figure out the nesting depth by counting the number of indentation of the TOK_*JOIN statements. If there is an aggregation, multiply the JOIN result by ten to the power of the number of aggregations. For example, the converted DB2 stored procedure gave:

| ABSTRACT SYNTAX TREE:

| TOK_LEFTOUTERJOIN

| TOK_LEFTOUTERJOIN

| TOK_LEFTOUTERJOIN

| TOK_LEFTOUTERJOIN

| TOK_LEFTOUTERJOIN

| TOK_LEFTOUTERJOIN

| TOK_LEFTOUTERJOIN

| TOK_JOIN

| TOK_GROUPBY

The query as constructed runs at a factor of (*n*^{8 }* 10)^{ }. O(*n*^{8 }) is not good. The first step was to remove the joins and create a single, large, denormalized table. This is considered bad practice in the relational database world because there is not a substantial performance hit in a B+ tree based system when you have joins. As long as the data fits in main memory, a B+ tree will be able to retrieve the data very rapidly. However, you can see that the impact is substantial in Hadoop, although there is no requirement that the data reside in main memory.

Once the data is denormalized and the Abstract Syntax Tree is down to one level, we are at O(*n* log *n*). This is good, but not as good as O(log *n*). There exists a special case of the divide a conquer algorithm called the natural form that relies on the data being sorted. The fact that the data is sorted pretty much removes the divide out of divide and conquer so we are left with a speed of O(log *n*). This data happened to be archival, so a one time flattening and sorting of the data was a very easy design decision to make.

So by using some knowledge about how Hive works under the hood and using that to estimate queries, I was pretty comfortable going so far as to say I could probably get the query to run in around twenty seconds since I could estimate n. If (*n*^{8 }* 10) = 8,528 seconds, then *n* is around 2.3 seconds. This calculation really only covers the time spent in the Abstract Syntax Tree and doesn’t cover real world issue like launching a JVM on a node (which is a fixed time cost), but I felt pretty comfortable being within a single order of magnitude of my goal. After all, I was just looking for a back of the napkin estimate.