The next few blogs will try to evaluate the different approaches to
integrating Hadoop and a standard RDBMS… so the first thing I’ll try in
this post is to suggest a criteria based on some architectural choices
for making the evaluation. Further, I’ll inject a little surprise and
make the point by using the criteria to say something about a product
that is not an integration of an RDBMS and Hadoop.
For the purposes of this let me clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that queries using the full set of SQL supported by the RDBMS must be available for processing queries that refer to data across the Hadoop-RDBMS divide.
Since we’ve assumed that all SQL functionality is supported the architectural issue left to solve is performance and this issue revolves on one topic: how do we minimize the cost of moving data between the two partners for a given query?
Now to get on with it…
The easiest, but not all that easy, problem involves using parallelism to move data from one system to the other… so the first criteria we will evaluate for each product will consider how parallel is their movement of data.
The next criteria involves intelligence in the RDBMS to push down some execution operators to the data layer. Of course the RDBMS must scan remote data… so in this part of the evaluation we will grade each product’s ability to push processing down to apply predicates and project the minimal amount of data up to the RDBMS.
Finally, a most intelligent product would push more than just predicates down… it would push down joins and aggregation… and the decisions around splitting processing would be fully optimized. A most intelligent product would fully federate the HDFS data into the RDBMS.
So there you have it… I will start evaluating RDBMS-Hadoop architecture by three criteria:
Now for the surprise… in the next blog we’ll look at how Exadata’s architecture maps to these criteria… since it is a two-tiered architecture with an RDBMS tied to a parallel file system… to be continued...
For the purposes of this let me clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that queries using the full set of SQL supported by the RDBMS must be available for processing queries that refer to data across the Hadoop-RDBMS divide.
Since we’ve assumed that all SQL functionality is supported the architectural issue left to solve is performance and this issue revolves on one topic: how do we minimize the cost of moving data between the two partners for a given query?
Now to get on with it…
The easiest, but not all that easy, problem involves using parallelism to move data from one system to the other… so the first criteria we will evaluate for each product will consider how parallel is their movement of data.
The next criteria involves intelligence in the RDBMS to push down some execution operators to the data layer. Of course the RDBMS must scan remote data… so in this part of the evaluation we will grade each product’s ability to push processing down to apply predicates and project the minimal amount of data up to the RDBMS.
Finally, a most intelligent product would push more than just predicates down… it would push down joins and aggregation… and the decisions around splitting processing would be fully optimized. A most intelligent product would fully federate the HDFS data into the RDBMS.
So there you have it… I will start evaluating RDBMS-Hadoop architecture by three criteria:
- how parallel is the data movement between the RDBMS and Hadoop;
- is there intelligence to minimize data movement by pushing the least data and the associated query plan to one system or another… this requires parallel pipes in both directions; and
- is there intelligence to build an optimal query plan that splits steps across both systems to completely minimize the movement of data and/or optimize the compute.
- If we imagine a 10-node Hadoop cluster talking to a 10-node RDBMS with 10 parallel pipes and compared it to the same setup with only 1 pipe (not parallel) then we might suggest that the parallel pipes provide a 10X performance increase.
- If we imagine intelligence that moved 100K rows rather than 10M then we might suggest that intelligent push down might provide a 100X performance increase…
- If we had even more intelligence and further optimized processing then another 10X-100X might be possible.
Now for the surprise… in the next blog we’ll look at how Exadata’s architecture maps to these criteria… since it is a two-tiered architecture with an RDBMS tied to a parallel file system… to be continued...
Comments
Post a Comment