I want to be sure that I’ve conveyed the concepts behind these criteria properly… I may have rushed it in the early parts of this series.

Let’s imagine a query that joins a 2,000,000-row table with a 1,000-row dimension table, where both live in HDFS. If all of the data has to be moved from HDFS to the RDBMS, then 2,001,000 rows must be read and moved in order to apply a predicate or any other processing. For fun, let’s say that the cost of moving this data is 2001K. If there are 10 parallel pipes, then the data movement is completed in one tenth the time… so the cost is about 200K.

If a predicate is included that selects only 5% of the data from the big table, and the predicate is pushed down, the cost is reduced to 101K. Add in parallel pipes and the cost is about 10K.

Now imagine a query where there is a join between the two tables with predicates on one side and predicate pushdown… then you have to pay 101K to pull the projected data up and do the join in the RDBMS. If there is...
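
To make the arithmetic concrete, here is a minimal sketch of this back-of-the-envelope cost model in Python. The function name, the unit (thousands of rows moved), and the `selectivity`/`pipes` parameters are my own illustration of the reasoning above, not part of any product or API.

```python
def movement_cost(fact_rows, dim_rows, selectivity=1.0, pipes=1):
    """Rough cost of moving rows from HDFS up to the RDBMS.

    Cost is expressed in thousands of rows moved (K), divided evenly
    across parallel pipes. `selectivity` models a predicate pushed down
    to the big (fact) table; 1.0 means no push-down at all.
    """
    rows_moved = fact_rows * selectivity + dim_rows
    return rows_moved / 1000 / pipes

FACT, DIM = 2_000_000, 1_000

print(movement_cost(FACT, DIM))                              # 2001.0 -> 2001K
print(movement_cost(FACT, DIM, pipes=10))                    # 200.1  -> ~200K
print(movement_cost(FACT, DIM, selectivity=0.05))            # 101.0  -> 101K
print(movement_cost(FACT, DIM, selectivity=0.05, pipes=10))  # 10.1   -> ~10K
```

The four calls reproduce the four scenarios in order: full scan, full scan over 10 pipes, 5% predicate pushed down, and pushdown plus 10 pipes.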