

Showing posts from April, 2014

Setting up a development machine on Ubuntu

These steps will guide you through setting up an Ubuntu machine as a development machine. Here is what I use the machine for:

- Ruby on Rails development
- PHP development
- Java development
- Photo management

All these commands are executed in a TERMINAL.

Basic Setup

    # make sure everything is up to date
    sudo apt-get update
    sudo apt-get upgrade

    # this will install adobe-flash, sun-jre
    # http://packages.ubuntu.com/jaunty/ubuntu-restricted-extras

    # if you are using KUBUNTU
    sudo apt-get install -y kubuntu-restricted-extras
    # if you are using UBUNTU
    sudo apt-get install -y ubuntu-restricted-extras

    sudo apt-get install -y firefox
    sudo apt-get install -y vim-gtk
    sudo apt-get install -y synaptic

Development Stuff

    # will install compilers (gcc and dev libraries)
    sudo apt-get install -y build-essential
    sudo apt-get install -y pkg-config

Apache

    sudo apt-get install -y apache2
    sudo apt-get install -y apache2-prefork-dev

MySQL & PHP

    sudo apt-get...

Amazon Elastic Map Reduce (EMR) Beyond Basics

ENVIRONMENT

I run these commands from a Linux EC2 instance. It doesn't have to be a 'powerful' instance, as it doesn't do much work, so an M1.SMALL type is fine. The following needs to be installed:

- EC2 tools: we use these to launch and monitor EMR jobs. Follow the guides from https://help.ubuntu.com/community/EC2StartersGuide and http://aws.amazon.com/developertools/351?_encoding=UTF8&jiveRedirect=1
- s3cmd: to copy files to and from S3. Get it from http://s3tools.org/s3cmd

INPUT PATHS

For testing MR jobs on the local Hadoop instance, we might use an input path like 'hdfs://localhost:9000/input'. For running on EMR, we can use S3 as input: 's3://my_bucket/input'. Since Hadoop supports reading from S3 natively, an S3 input path works just like an HDFS URL.

So how do we do this without hard-coding the path into the code? We pass it as a command line argument. The following example illustrates how to pass two argument...
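The post's own example is cut off in this excerpt. As a rough sketch of the idea (the class name MyJobDriver and the argument order are my own assumptions, not the post's), a driver that takes its input and output paths from the command line instead of hard-coding them might look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = input path, args[1] = output path
            // e.g.  hdfs://localhost:9000/input   (local testing)
            //       s3://my_bucket/input          (on EMR)
            Job job = new Job(new Configuration(), "my-emr-job");
            job.setJarByClass(MyJobDriver.class);
            // ... set mapper / reducer / output key and value classes here ...
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same jar can then be pointed at an HDFS path when testing locally and at an S3 path when submitted to EMR, with no code changes.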

Hadoop Useful Utility Classes

Some handy classes for using Hadoop / Map Reduce / HBase.

IDENTITYMAPPER / IDENTITYREDUCER

    org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    jar : hadoop-core.jar

If your mappers and reducers write inputs straight to outputs, then use these guys. No need to recreate them.

SHELL / SHELLCOMMANDEXECUTOR

    org.apache.hadoop.util.Shell
    org.apache.hadoop.util.Shell.ShellCommandExecutor
    jar : hadoop-core.jar

Handy for executing commands on the local machine and inspecting their output.

    import org.apache.hadoop.util.Shell.ShellCommandExecutor;

    String[] cmd = {"ls", "/usr"};
    ShellCommandExecutor shell = new ShellCommandExecutor(cmd);
    shell.execute();
    System.out.println("* shell exit code : " + shell.getExitCode());
    System.out.println("* shell output...
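Since the excerpt is truncated, here is a separate minimal sketch (my own, not the post's code; the class name IdentityPassThroughJob is hypothetical) of wiring the base Mapper / Reducer classes into a job so records pass through unchanged:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IdentityPassThroughJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "identity-pass-through");
            job.setJarByClass(IdentityPassThroughJob.class);
            job.setMapperClass(Mapper.class);    // base Mapper = identity map
            job.setReducerClass(Reducer.class);  // base Reducer = identity reduce
            // default TextInputFormat yields (LongWritable offset, Text line) pairs
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }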

Performance Testing / Benchmarking a Hbase Cluster

So you have set up a new HBase cluster and want to 'take it for a spin'. Here is how, without writing a lot of code on your own.

BEFORE WE START

I like to have the hbase command available in my PATH. I put the following in my ~/.bashrc file:

    export HBASE_HOME=/hadoop/hbase
    export PATH=$PATH:$HBASE_HOME/bin

A) HBASE PERFORMANCEEVALUATION

    class : org.apache.hadoop.hbase.PerformanceEvaluation
    jar : hbase-*-tests.jar

This is a handy class that comes with the distribution. It can do reads/writes to HBase. It spawns a map-reduce job to do the reads/writes in parallel. There is also an option to do the operations in threads instead of map-reduce.

Let's find out the usage:

    # hbase org.apache.hadoop.hbase.PerformanceEvaluation
    Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
      [--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>
    .... [snippe...

Hadoop / HBase and DNS

Hadoop and HBase (especially HBase) are very picky about DNS entries. When setting up a Hadoop cluster one doesn't always have access to a DNS server, so here is a 'poor developer's' guide to getting DNS correct. Following these simple steps can avoid a few thorny issues down the line:

- set the hostname
- verify hostname --> IP address resolution is working (DNS resolution)
- verify IP address --> hostname resolution is working (reverse DNS)
- DNS verification tool

1) HOSTNAME

I like to set these to FULLY QUALIFIED NAMES.

So 'hadoop1.lab.mycompany.com' is good; just 'hadoop1' is not.

On CENTOS, set this in '/etc/sysconfig/network':

    HOSTNAME=hadoop1.lab.mycompany.com

On UBUNTU, set this in '/etc/hostname':

    hadoop1.lab.mycompany.com

Just reboot the host for the hostname settings to take effect (to be safe). Do this at every node.

2) DNS ENTRIES WHEN YOU DON'T HAVE DNS SERV...
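The post's DNS verification tool is not included in this excerpt. As an illustrative stand-in (the class name DnsCheck is my own, not the post's tool), a quick forward-then-reverse lookup check with java.net.InetAddress could look like this:

    import java.net.InetAddress;

    public class DnsCheck {
        public static void main(String[] args) throws Exception {
            String hostname = args.length > 0 ? args[0]
                    : InetAddress.getLocalHost().getCanonicalHostName();

            // hostname --> IP address (forward DNS)
            InetAddress addr = InetAddress.getByName(hostname);
            System.out.println(hostname + " resolves to " + addr.getHostAddress());

            // IP address --> hostname (reverse DNS)
            InetAddress byIp = InetAddress.getByAddress(addr.getAddress());
            System.out.println(addr.getHostAddress() + " reverse-resolves to "
                    + byIp.getCanonicalHostName());
        }
    }

Run it on every node with the node's fully qualified hostname; if the forward and reverse results don't match, Hadoop / HBase will likely have trouble too.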