A Review of Processing Push-down Part 5:

I want to be sure that I’ve conveyed the concepts behind these criteria properly… I may have rushed things in the early parts of this series.
Let’s imagine a query that joins a 2,000,000-row table with a 1,000-row dimension table, where both live in HDFS.
If all of the data has to be moved from HDFS to the RDBMS, then 2,001,000 rows must be read and moved before a predicate or any other processing can be applied. For fun, let’s say that the cost of moving this data is 2001K.
If there are 10 parallel pipes, then the data movement is completed in one-tenth the time… so the cost is 200K.
If a predicate is included that selects only 5% of the data from the big table, and that predicate is pushed down, the cost is reduced to 101K. Add in the parallel pipes and the cost is about 10K.
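To make the arithmetic concrete, here is a minimal sketch in Python; the variable names are mine, and the values are just the figures from the example above, with costs expressed in thousands of rows moved.

```python
# Data-movement cost in "K rows moved", using the example's figures.
fact_rows = 2_000_000   # big table in HDFS
dim_rows = 1_000        # dimension table in HDFS
pipes = 10              # parallel data-movement pipes
selectivity = 0.05      # predicate keeps 5% of the big table

full_move = (fact_rows + dim_rows) / 1_000             # 2001K: move everything
full_move_parallel = full_move / pipes                 # ~200K: ten parallel pipes
pushed = (fact_rows * selectivity + dim_rows) / 1_000  # 101K: predicate pushed down
pushed_parallel = pushed / pipes                       # ~10K: push-down plus pipes

print(full_move, full_move_parallel, pushed, pushed_parallel)
# 2001.0 200.1 101.0 10.1
```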
Now imagine a query with a join between the two tables, predicates on one side, and predicate push-down… you still have to pay 101K to pull the projected data up and do the join in the RDBMS. If there is a join predicate that reduces the final answer set by another 95%, then after the join you return 6K rows. Since every system returns the same 6K rows as the answer, we won’t add that in.
But if you can push the join down as well as the predicates, then only 6K rows are moved up… so you can see how 2001K shrinks to 6K through effective push-down of processing.
Further, you can build arbitrarily complex queries and model them pretty well knowing that most of the cost is in data movement.
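As one hedged illustration of such a model, here is a small Python function of my own devising (not anything from Teradata or HANA) that estimates rows moved out of HDFS for the two-table example, with and without join push-down; the 5% join selectivity is just the figure assumed above.

```python
def rows_moved(fact_rows, dim_rows, pipes=1,
               predicate_selectivity=1.0,
               join_pushed=False, join_selectivity=1.0):
    """Very rough cost model: effective rows pulled out of HDFS.

    Assumes, as in the example, that data movement dominates query cost.
    """
    if join_pushed:
        # The join runs in Hadoop, so only the joined (and filtered) result moves up.
        moved = fact_rows * predicate_selectivity * join_selectivity
    else:
        # Both inputs (after any pushed-down predicate) move up for the RDBMS to join.
        moved = fact_rows * predicate_selectivity + dim_rows
    return moved / pipes

# Predicate pushed down, join done in the RDBMS, one pipe: 101,000 rows
print(rows_moved(2_000_000, 1_000, predicate_selectivity=0.05))
# Predicate and join both pushed down: a few thousand rows,
# on the order of the 6K final answer described above
print(rows_moved(2_000_000, 1_000, predicate_selectivity=0.05,
                 join_pushed=True, join_selectivity=0.05))
```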
So think about how Teradata processes these two tables in Hadoop when you use its specialized SQL constructs, and then again when you build the query from a BI tool. And stay tuned, as I’ll show you how HANA processes the data next… and then talk about several others.
