
Posts

A Review of Processing Push-down Part 5:

I want to be sure that I’ve conveyed the concepts behind these criteria properly… I may have rushed it in the early parts of this series. Let’s imagine a query that joins a 2,000,000-row table with a 1,000-row dimension table, where both live in HDFS. If all of the data has to be moved from HDFS to the RDBMS, then 2,001,000 rows must be read and moved in order to apply a predicate or any other processing. For fun, let’s say that the cost of moving this data is 2001K. If there are 10 parallel pipes, then the data movement is completed in one-tenth the time… so the cost is 200K. If a predicate is included that selects only 5% of the data from the big table, and the predicate is pushed down, the cost is reduced to 101K. Add in parallel pipes and the cost is 10K. Now imagine a query where there is a join between the two tables with a predicate on one side and predicate push-down… then you have to pay 101K to pull the projected data up and do the join in the RDBMS. If there is...
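To make the arithmetic concrete, here is a minimal sketch (plain Python using the figures from the example above, not any vendor’s costing model) that reproduces these numbers:

```python
# A minimal sketch of the cost arithmetic above. The "cost" unit is simply
# thousands of rows moved; the selectivity and pipe counts are the figures
# from the example, not measurements of any product.

FACT_ROWS = 2_000_000
DIM_ROWS = 1_000

def movement_cost(selectivity=1.0, pipes=1):
    """Thousands of rows moved per pipe when a predicate with the given
    selectivity is (or is not) pushed down to the file system."""
    rows_moved = FACT_ROWS * selectivity + DIM_ROWS
    return rows_moved / pipes / 1_000

print(movement_cost())                  # 2001K: no push-down, one pipe
print(movement_cost(pipes=10))          # ~200K: ten parallel pipes
print(movement_cost(selectivity=0.05))  # 101K:  predicate pushed down
print(movement_cost(0.05, pipes=10))    # ~10K:  push-down plus parallel pipes
```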

How Hadooped is Teradata? Part 4

Consider the Teradata SQL-H implementation using these criteria. First, Teradata has effective parallel pipes to move data from HDFS to the Teradata database, with one pipe per node. There does not seem to be any intra-node IO parallelism. Still, this is a solid feature. There is a limited ability to push down predicates… SQL-H does allow data to be partitioned on the HDFS side, and it will perform partition elimination if the query explicitly calls out a predicate within a partitionfilter() keyword. In addition, there is an ability to project out columns using a columns() keyword to explicitly specify the columns to be returned. These features are klunky but effective. You would expect partitions to be eliminated when the partitioning column is referenced with a predicate in the query, like any other query… and you would expect columns that are not referenced to be projected away automatically. Normal SQL predicates are applied after the data is moved over the network but before every record ...
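To see what these keywords buy, here is a rough, hypothetical illustration (toy Python, not SQL-H syntax) of the effect of partition elimination and column projection on the data that actually crosses the network:

```python
# A toy, hypothetical illustration of partition elimination and column
# projection. It mimics the *effect* of the SQL-H keywords described above,
# not their syntax: only the named partitions and listed columns ever move.

hdfs_table = {                       # partition key -> rows of (date, cust, amt)
    "2014-01": [("2014-01-03", "c1", 100.0), ("2014-01-09", "c2", 40.0)],
    "2014-02": [("2014-02-11", "c1", 75.0)],
    "2014-03": [("2014-03-20", "c3", 12.5)],
}

def scan(table, wanted_partitions, wanted_columns):
    """Yield only the requested partitions and columns; everything else
    stays on the HDFS side and is never moved."""
    col_index = {"date": 0, "cust": 1, "amt": 2}
    keep = [col_index[c] for c in wanted_columns]
    for part in wanted_partitions:                  # partition elimination
        for row in table[part]:
            yield tuple(row[i] for i in keep)       # column projection

# One of three partitions and two of three columns cross the network:
print(list(scan(hdfs_table, ["2014-02"], ["cust", "amt"])))   # [('c1', 75.0)]
```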

Evaluating Exadata… Does it stack up with RDBMS-Hadoop systems? Part 2

In my earlier blog I suggested that we could evaluate RDBMS-Hadoop integration architecture using three criteria: how parallel are the pipes that move data between the RDBMS and the parallel file system; is there intelligence to push down predicates; and is there more intelligence to push down joins and other relational operators? But Exadata is a split RDBMS with a parallel file system backing it… how does it measure up by these criteria? There are effective parallel pipes between the Oracle RAC RDBMS and the Exadata Storage Subsystem… so Exadata passes the first test. Further, Exadata is smart about pushing both scan and projection down to the Storage layer. Unfortunately, there is a fairly severe imbalance between the number of nodes on the RAC side and the number of nodes on the Storage side, and this creates a bottleneck. We cannot give Exadata full marks here… but as far as parallel pipes go, it stacks up pretty well. The ability to push down predicates goes...
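A back-of-the-envelope way to see why that imbalance matters: once scan and projection are pushed down, the end-to-end rate is bounded by whichever tier is smaller. The sketch below uses purely illustrative numbers, not Exadata measurements:

```python
# Purely illustrative numbers -- not Exadata measurements. The only point is
# that, after push-down, end-to-end throughput is capped by the smaller tier.

def effective_rate(storage_nodes, rdbms_nodes,
                   scan_rate_per_storage_node, ingest_rate_per_rdbms_node):
    """Rows/sec delivered into the RDBMS layer after scan and projection
    have been pushed down to the storage tier."""
    storage_side = storage_nodes * scan_rate_per_storage_node
    rdbms_side = rdbms_nodes * ingest_rate_per_rdbms_node
    return min(storage_side, rdbms_side)   # the smaller tier is the bottleneck

# Many storage nodes feeding few RDBMS nodes: the RDBMS side caps throughput.
print(effective_rate(storage_nodes=14, rdbms_nodes=2,
                     scan_rate_per_storage_node=1_000_000,
                     ingest_rate_per_rdbms_node=2_000_000))   # -> 4000000
```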

Approaches to integrating Hadoop and a standard RDBMS Part 1

The next few blogs will try to evaluate the different approaches to integrating Hadoop and a standard RDBMS… so the first thing I’ll try in this post is to suggest criteria, based on some architectural choices, for making the evaluation. Further, I’ll inject a little surprise and make the point by using the criteria to say something about a product that is not an integration of an RDBMS and Hadoop. For the purposes of this, let me be clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that the full set of SQL supported by the RDBMS must be available for queries that refer to data across the Hadoop-RDBMS divide. Since we’ve assumed that all SQL functionality is supported, the architectural issue left to solve is performance, and this issue revolves around one topic: how do ...

Using Teradata’s Appliance for Hadoop to Reduce TCO

Teradata has recently announced a very complete Teradata database-to-Hadoop integration. In this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system. The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles. There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments). I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider...

AWS Redshift

The shared-nothing architecture has, from the beginning, offered the promise of using hardware to solve performance problems rather than applying staff and tuning. By this I mean… if you can add nodes and scale out to improve query response, then why not throw hardware at performance problems rather than build a fragile infrastructure of aggregate tables, cubes, pre-joined/de-normalized marts, materialized views, indexes, etc.? Each of these performance workarounds is both expensive to build and expensive to operate. There are several reasons, I think, why tuning has been more popular than scaling. Not in any particular order: First, hardware vendors made it too hard to order/provision new nodes. You could not just press a button and buy capacity. Vendors wanted to charge you for terabytes when all you wanted might be CPU and memory to fix the problem (see here, sigh). You had to negotiate a deal with a rep, work through your procurement group, and wait weeks for delivery. The...

Column Compression

There are three forms of columnar orientation currently deployed by database systems. Each builds upon the last. The simplest form uses column-orientation to provide better data compression. The next level of maturity stores columnar data in separate structures to support columnar projection. The most mature implementations support a columnar database engine that performs relational algebra on column-oriented data. Let me explain… Imagine a simple table with 1M rows… with the schema and the first several rows depicted in Figure 1. Conceptually, a row-orientation deploys data on disk and in memory as depicted in Figure 2, and a column-orientation deploys data on disk and in memory as depicted in Figure 3. The actual deployment may be significantly different, as we will see. Note that I am going to throw out some indicative numbers around compression. I will suggest that applying compression to rows will provide from 1.5X to 3.5X compression with an averag...
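To make the first form concrete, here is a small sketch (toy data, with Python’s zlib standing in for a real database codec) of why grouping a column’s values together tends to compress better than storing them row by row:

```python
# Toy demonstration of columnar compression: values within one column resemble
# each other, so storing a column contiguously gives the codec more repetition
# to exploit. zlib stands in here for a real database compression scheme.

import zlib

# A toy 100,000-row table of (order_id, product, region, amount).
rows = [(i, f"PROD_{i % 50}", ("EAST", "WEST", "NORTH", "SOUTH")[i % 4],
         round((i % 500) * 0.25, 2)) for i in range(100_000)]

# Row orientation: each row's values are stored together.
row_blob = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column orientation: each column's values are stored together.
columns = list(zip(*rows))
col_blob = b"\x00".join("\n".join(map(str, c)).encode() for c in columns)

# Compressed size as a fraction of the original (lower = better compression).
print("row-oriented:   ", len(zlib.compress(row_blob)) / len(row_blob))
print("column-oriented:", len(zlib.compress(col_blob)) / len(col_blob))
```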