Big Data and Cloud Computing

Posts

Hadoop sharp edges annoy

1. Pig vs. Hive

You cannot use Hive UDFs in Pig. You have to use HCatalog to access Hive tables in Pig. You cannot use Pig UDFs in Hive. Whether it's one little bit of extra functionality I need while in Hive but don’t really feel like writing a full-on Pig script for, or it's the “gee, I could easily do this if I were just in Hive” while I’m writing Pig scripts, I frequently think, “Tear down this wall!” when I’m writing in either.

2. Being forced to store all my shared libraries in HDFS

This is a recurring theme in Hadoop. If you store your Pig script on HDFS, then Pig automatically assumes any JAR files will be there as well (I’m working on fixing that myself). This general theme repeats in Oozie and other tools. It's usually sensible, but at times, having an organization-wide forced shared library version is painful. Besides, more than half the time, these are the same JAR files you installed everywhere you installed the client, so why store them twice?

3. Oozi...

How Hadooped is SQL Server PDW with Polybase? Part 8

Before we start, I will suggest a fourth criterion that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS? Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute the data across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Polybase manages this as part of our first criterion. Also note… since I started this series, Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP. Now on to Microsoft… Fir...

How Hadooped is HANA? Part 6

As you will see, HANA may well have the best RDBMS-Hadoop integration in the market. I try hard not to blow foam about HANA in this blog… and I hope that the objective criteria I have devised to evaluate all of the products will keep this post credible… but please look at this post harder than most and push back if you think that I overstep. First… surprisingly, HANA’s first release has only a single pipe to the Hadoop side. This is worrisome but easily fixed. It will negatively impact performance when large tables/files have to be moved up for processing. But HANA includes Hadoop as a full partner in a federated data architecture using the Smart Data Access (SDA) engine inside the HANA address space. As a result, HANA not only pushes predicates down but also uses cost-based optimization to determine what to push down and what to pull up. HANA interrogates the Hadoop system to gather statistics and uses the HANA optimizer to develop smart execution plans with awareness of both the s...