Skip to main content

Approaches to integrating Hadoop and a standard RDBMS Part 1

The next few blogs will try to evaluate the different approaches to integrating Hadoop and a standard RDBMS… so the first thing I’ll try in this post is to suggest a criteria based on some architectural  choices for making the evaluation. Further, I’ll inject a little surprise and make the point by using the criteria to say something about a product that is not an integration of an RDBMS and Hadoop.

For the purposes of this let me clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that queries using the full set of SQL supported by the RDBMS must be available for processing queries that refer to data across the Hadoop-RDBMS divide.

Since we’ve assumed that all SQL functionality is supported the architectural issue left to solve is performance and this issue revolves on one topic: how do we minimize the cost of moving data between the two partners for a given query?

Now to get on with it…
The easiest, but not all that easy, problem involves using parallelism to move data from one system to the other… so the first criteria we will evaluate for each product will consider how parallel is their movement of data.

The next criteria involves intelligence in the RDBMS to push down some execution operators to the data layer. Of course the RDBMS must scan remote data… so in this part of the evaluation we will grade each product’s ability to push processing down to apply predicates and project the minimal amount of data up to the RDBMS.

Finally, a most intelligent product would push more than just predicates down… it would push down joins and aggregation… and the decisions around splitting processing would be fully optimized. A most intelligent product would fully federate the HDFS data into the RDBMS.
So there you have it… I will start evaluating RDBMS-Hadoop architecture by three criteria:
  • how parallel is the data movement between the RDBMS and Hadoop;
  • is there intelligence to minimize data movement by pushing the least data and the associated query plan to one system or another… this requires parallel pipes in both directions; and
  • is there intelligence to build an optimal query plan that splits steps across both systems to completely minimize the movement of data and/or optimize the compute.
And a final word on the relative strength of each criteria:
  • If we imagine a 10-node Hadoop cluster talking to a 10-node RDBMS with 10 parallel pipes and compared it to the same setup with only 1 pipe (not parallel) then we might suggest that the parallel pipes provide a 10X performance increase.
  • If we imagine intelligence that moved 100K rows rather than 10M then we might suggest that intelligent push down might provide a 100X performance increase…
  • If we had even more intelligence and further optimized processing then another 10X-100X might be possible.
So all three criteria are not equal… intelligent query planning trumps wide pipes…
Now for the surprise… in the next blog we’ll look at how Exadata’s architecture maps to these criteria… since it is a two-tiered architecture with an RDBMS tied to a parallel file system… to be continued...

Comments

Popular posts from this blog

How to construct a File System that lives in Shared Memory.

Shared Memory File System Goals 1. MOUNTED IN SHARED MEMORY The result is a very fast, real time file system. We use Shared Memory so that the file system is public and not private. 2. PERSISTS TO DISK When the file system is unmounted, what happens to it? We need to be able to save the file system so that a system reboot does not destroy it. A great way to achieve this is to save the file system to disk. 3. EXTENSIBLE IN PLACE We want to be able to grow the file system in place. 4. SUPPORTS CONCURRENCY We want multiple users to be able to access the file system at the same time. In fact, we want multiple users to be able to access the same file at the same time. With the goals now in mind we can now talk about the major design issues: FAT File System & Design Issues The  FAT File System  has been around for quite some time. Basically it provides a pretty good file structure. But I have two problems with it: 1. FAT IS NOT EXTENSIBLE IN PLAC...

Common Sense Identification of the Security Problems

Organizations make key information security mistakes, which leads to inefficient and ineffective control environment. High profile data breaches and cyber-attacks drive the industry to look for more comprehensive protection measures since many organizations feel that their capability to withstand persistent targeted attacks is minimal. But at the same time, these organizations make some key information security mistakes, that jeopardize their efforts towards control robustness. Although many firms invest in security technologies and people, no one has the confidence that the measures taken are good enough to protect their data from compromises. Below are the 10 worst mistakes which are common to find, and important to address in the path of mature information security posture. If you analyze the cyber security scenarios, and organizational capabilities, the prevailing trend is a vendor-driven approach. In many cases, security professionals adopt the attitude of procuring...

Ingesting IoT Sensor Data Into S3 With an RPI3

StreamSets Data Collector Edge is a lightweight agent used to create end-to-end data flow pipelines. We'll use it help stream data collected from a sensor. Due to the increasing amount of data produced from outside source systems, enterprises are facing difficulties in reading, collecting, and ingesting data into a desired, central database system. An edge pipeline runs on an edge device with limited resources, receives data from another pipeline or reads the data from the device, and controls the device based on the data. StreamSets Data Collector (SDC) Edge, an ultra-lightweight agent, is used to create end-to-end data flow pipelines in StreamSets Data Collector and to run the pipelines to read and export data in and out of systems. In this blog, StreamSets Data Collector Edge is used to read data from an air pressure sensor (BMP180) from an IoT device (Raspberry Pi3). Meanwhile, StreamSets Data Collector is used to load the data into Amazon's Simple Storage Service ...