
Posts

Showing posts from May, 2014

Storm – the “Hadoop of realtime”

Storm is a distributed realtime computation system. The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing. However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a “Hadoop of realtime” has been the biggest hole in the data processing ecosystem, and Storm fills that hole. Storm has two basic units of processing: Spouts and Bolts. Spouts are the elements that generate the data to be processed; they may get that data from external sources or generate it themselves, but their mission is to introduce it into the cluster. Bolts are pr...
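To make the Spout/Bolt split concrete, here is a minimal sketch of a Storm topology in Java, using the pre-1.0 backtype.storm API that was current in 2014. The spout and bolt here (NumberSpout, PrinterBolt) are illustrative names, not from the original post: the spout introduces a stream of numbers into the cluster and the bolt processes them.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class NumberTopology {

    // Spout: introduces data into the cluster (here, a simple counter).
    public static class NumberSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private long n = 0;

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);                 // pace emission for the demo
            collector.emit(new Values(n++));  // emit the next number
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("number"));
        }
    }

    // Bolt: consumes and processes the tuples the spout emits.
    public static class PrinterBolt extends BaseRichBolt {
        @Override
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) { }

        @Override
        public void execute(Tuple tuple) {
            System.out.println("got: " + tuple.getLong(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("numbers", new NumberSpout(), 1);
        builder.setBolt("printer", new PrinterBolt(), 2).shuffleGrouping("numbers");

        // Run in-process for testing; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
    }
}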

Integrity in the Cloud

All communication (SSH, and HTTP(S) using HTTP Signature) is verified as accurate and legitimate. For all data I upload to my CSP, I can attach a signature (MAC) to verify whether the contents have changed. For all data I upload to my CSP, I am sure it will be the same in the future. To have mutual auditability, I now want to expand that list to include the following:

- For all data I upload to my CSP, I can ensure I have the latest revision of my data. Currently, this can’t happen without significant work on either the customer’s or the CSP’s end.
- I can ensure I am charged the correct amount by my CSP for data at rest and data in motion.

But what about data fingerprints on my data? Keyless Signature (KS) is a system based on the building blocks of MACs, high-resolution timestamping, hash chains, and hash trees. KS provides proof of integrity, proof of time, and proof of signing authority. It is considered keyless as it is based entirely on formal methods of hash functions and n...
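To illustrate the MAC building block, here is a minimal Java sketch using HMAC-SHA256 from the standard JCA; the key and file contents are hypothetical stand-ins. The tag is computed before upload and recomputed after download: if the two tags match, the contents have not changed.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

public class IntegrityTag {
    // Compute an HMAC-SHA256 tag over the data with a shared secret key.
    static byte[] tag(byte[] key, byte[] data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "demo-shared-key".getBytes("UTF-8");    // hypothetical key
        byte[] contents = "file contents".getBytes("UTF-8"); // hypothetical blob

        byte[] before = tag(key, contents); // tag attached at upload time
        byte[] later  = tag(key, contents); // tag recomputed after download
        // Constant-time comparison; a mismatch means the contents changed.
        System.out.println("unchanged: " + MessageDigest.isEqual(before, later));
    }
}

Note that a MAC alone proves only that the data changed or not; it says nothing about freshness or billing, which is exactly the gap the expanded list above and Keyless Signatures aim to close.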

Task Configuration (Hadoop 2.2.0)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

TOPICS
Task JVM Memory Settings (AMI 3.0.0)
Avoiding Cluster Slowdowns (AMI 3.0.0)

Task JVM Memory Settings (AMI 3.0.0)

Hadoop 2.2.0 uses two parameters to configure JVM memory for map and reduce tasks: mapreduce.map.java.opts and mapreduce.reduce.java.opts, respectively. These replace the single configuration option from previous Hadoop versions: mapred.child.java.opts. The defaults for these settings per instance type are shown in the following tables.

m1.medium

Configuration Option                  Default Value
mapreduce.map.java.opts               -Xmx768m
mapreduce.reduce.java.opts            -Xmx768m
mapreduce.map.memory.mb               1024
mapreduce.reduce.memory.mb            1024
yarn.scheduler.minimum-allocation-mb  512
yarn.scheduler.maximum-allocation-mb  2048
yarn.nodemanager.re...
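These settings can also be applied per job through the Hadoop Java API. A minimal sketch follows, mirroring the m1.medium defaults above; the job name is hypothetical and the mapper/reducer wiring is elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JVM heap given to each map and reduce task attempt.
        conf.set("mapreduce.map.java.opts", "-Xmx768m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx768m");
        // Memory YARN reserves for each task container; these must fall
        // within yarn.scheduler.minimum/maximum-allocation-mb on the cluster.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.reduce.memory.mb", 1024);
        Job job = Job.getInstance(conf, "memory-config-demo"); // hypothetical name
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}

Keeping the container size (mapreduce.*.memory.mb) larger than the JVM heap (-Xmx) leaves headroom for non-heap memory; here 1024 MB containers hold 768 MB heaps.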