Teradata has recently announced a
very complete Teradata database-to-Hadoop integration. In this note
we’ll consider how a Teradata shop might effectively use these features
to significantly reduce the TCO of any Teradata system.
The Teradata Appliance for Hadoop (here)
offering is quite well thought out and complete… including a Teradata
appliance, a Hadoop appliance, and the new QueryGrid capability to
seamlessly connect the two… so hardware, software, support, and services
are all available in very easy-to-consume bundles.
There is little published on the details
of the QueryGrid feature… so I cannot evaluate where it stands on the
query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).
I believe that there is some instant
financial gratification to be had by existing Teradata customers from
this Hadoop mashup. Let’s consider this…
Before the possibility of a Hadoop annex
to Teradata, Teradata customers had no choice but to store cold, old
data in the Teradata database. If, on occasion, you wanted to perform
year-by-year comparisons over ten years of data, then you needed to keep
ten years of data in the database at a rough cost of $50K/TB (see here) …
even if these queries were rarely executed and were not expected to run
against a high performance service level requirement. If you wanted to
perform some sophisticated predictive analysis against this data it had
to be online. In fact, the Teradata mantra… one which I wholeheartedly
agree with… suggests that you really should keep the details online
forever as the business will almost always find a way to glean value
from this history.
This mantra is the basis of what the
Hadoop vendors call a data lake. A data warehouse expert would quickly
recognize a data lake as a staging area for un-scrubbed detailed data…
with the added benefit that a Hadoop-based data lake can store and
process data at a $1K/TB price point… and this makes it
cost-effective to persist the staged data online forever.
So what does this mean to a Teradata EDW owner? Teradata has published numbers (here)
suggesting that 92% of the queries in an EDW only touch 20% of the
data. I would suggest that a similar ratio holds for the remaining 8% of
queries… perhaps 90% of them touch only another 40% of the data. That
leaves the final 40% of the data online to service the last 0.8%… less
than 1%… of the queries, and I suggest that these queries can be
effectively serviced from the $1K/TB Hadoop annex.
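To make the arithmetic explicit, here is a minimal sketch of the tiering argument. The 92%/20% figure is Teradata's published number; the 90%-of-the-remainder split and the 40% data tiers are my assumptions:

```python
# A quick sanity check of the query/data distribution argument above.
# The 92%/20% figure is Teradata's; the 90%/40% splits are assumptions.

total_queries = 100.0

hot_queries = 0.92 * total_queries       # touch the hot 20% of the data
remaining = total_queries - hot_queries  # 8.0% of queries left
warm_queries = 0.90 * remaining          # touch the next 40% of the data
cold_queries = remaining - warm_queries  # touch the coldest 40% of the data

print(f"Hot tier  (20% of data): {hot_queries:.1f}% of queries")
print(f"Warm tier (40% of data): {warm_queries:.1f}% of queries")
print(f"Cold tier (40% of data): {cold_queries:.1f}% of queries")  # 0.8%
```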
In other words, almost every Teradata
shop can immediately benefit from Teradata’s new product announcements
by moving 40% of their Teradata database data to Hadoop. Such a move
would free Teradata disk space and likely relieve the pressure to upgrade
the cluster. Further, when an upgrade is required, users can reduce the
disk footprint of the Teradata side of the system, add a Hadoop annex,
and significantly reduce the TCO of the overall configuration.
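To see the scale of the opportunity, here is a rough back-of-the-envelope sketch using the $/TB figures cited above; the 100 TB warehouse size is hypothetical, chosen only for illustration:

```python
# A back-of-the-envelope TCO sketch using the $/TB figures cited above.
# The warehouse size is a hypothetical chosen only for illustration.

TERADATA_COST_PER_TB = 50_000  # rough figure cited above
HADOOP_COST_PER_TB = 1_000     # data lake price point cited above

edw_tb = 100          # hypothetical 100 TB warehouse
cold_fraction = 0.40  # data serving less than 1% of queries

before = edw_tb * TERADATA_COST_PER_TB
after = (edw_tb * (1 - cold_fraction) * TERADATA_COST_PER_TB
         + edw_tb * cold_fraction * HADOOP_COST_PER_TB)

print(f"All-Teradata:     ${before:,.0f}")
print(f"Teradata + annex: ${after:,.0f}")
print(f"Savings:          ${before - after:,.0f} "
      f"({(before - after) / before:.0%})")
```

Even in this rough sketch the savings approach 40% of the system cost, which is why the economics are hard to ignore.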
Some time back I suggested that Teradata would be squeezed by Hadoop (here and here).
To their credit, Teradata is going to try to mitigate the squeeze. But
the economics remain… and Teradata customers should seriously consider
how to leverage the low $/TB of Teradata’s Hadoop offering to reduce
costs. Data needs to reside in the lowest cost infrastructure that still
provides the required level of service… and the Teradata Hadoop
integration provides an opportunity to leverage a new, low-cost,
infrastructure.