
Task Configuration (Hadoop 2.2.0)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.
TOPICS
Task JVM Memory Settings (AMI 3.0.0)
Avoiding Cluster Slowdowns (AMI 3.0.0)

Task JVM Memory Settings (AMI 3.0.0)

Hadoop 2.2.0 uses two parameters to configure JVM memory for map and reduce tasks: mapreduce.map.java.opts and mapreduce.reduce.java.opts, respectively. These replace the single configuration option used in previous Hadoop versions, mapred.child.java.opts.
The default values for these settings for each instance type are shown in the following tables.
m1.medium
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx768m
mapreduce.reduce.java.opts                    -Xmx768m
mapreduce.map.memory.mb                       1024
mapreduce.reduce.memory.mb                    1024
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          2048
yarn.nodemanager.resource.memory-mb           2048

m1.large
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx864m
mapreduce.reduce.java.opts                    -Xmx1536m
mapreduce.map.memory.mb                       1024
mapreduce.reduce.memory.mb                    2048
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          3072
yarn.nodemanager.resource.memory-mb           5120

m1.xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx512m
mapreduce.reduce.java.opts                    -Xmx1536m
mapreduce.map.memory.mb                       768
mapreduce.reduce.memory.mb                    2048
yarn.scheduler.minimum-allocation-mb          256
yarn.scheduler.maximum-allocation-mb          8192
yarn.nodemanager.resource.memory-mb           12288

m2.xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          7168
yarn.nodemanager.resource.memory-mb           14336

m2.2xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          8192
yarn.nodemanager.resource.memory-mb           30720

m3.xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1515m
mapreduce.reduce.java.opts                    -Xmx1792m
mapreduce.map.memory.mb                       1904
mapreduce.reduce.memory.mb                    2150
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          3788
yarn.nodemanager.resource.memory-mb           5273

m3.2xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx2129m
mapreduce.reduce.java.opts                    -Xmx2560m
mapreduce.map.memory.mb                       2826
mapreduce.reduce.memory.mb                    3072
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          5324
yarn.nodemanager.resource.memory-mb           9113

m2.4xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          8192
yarn.nodemanager.resource.memory-mb           61440

c1.medium
Configuration Option                          Default Value
io.sort.mb                                    100
mapreduce.map.java.opts                       -Xmx288m
mapreduce.reduce.java.opts                    -Xmx288m
mapreduce.map.memory.mb                       512
mapreduce.reduce.memory.mb                    512
yarn.scheduler.minimum-allocation-mb          256
yarn.scheduler.maximum-allocation-mb          512
yarn.nodemanager.resource.memory-mb           1024

c1.xlarge
Configuration Option                          Default Value
io.sort.mb                                    150
mapreduce.map.java.opts                       -Xmx864m
mapreduce.reduce.java.opts                    -Xmx1536m
mapreduce.map.memory.mb                       1024
mapreduce.reduce.memory.mb                    2048
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          2048
yarn.nodemanager.resource.memory-mb           5120

c3.large
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx768m
mapreduce.reduce.java.opts                    -Xmx768m
mapreduce.map.memory.mb                       921
mapreduce.reduce.memory.mb                    921
yarn.scheduler.minimum-allocation-mb          499
yarn.scheduler.maximum-allocation-mb          1920
yarn.nodemanager.resource.memory-mb           1920

c3.xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1177m
mapreduce.reduce.java.opts                    -Xmx1356m
mapreduce.map.memory.mb                       1413
mapreduce.reduce.memory.mb                    1628
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          2944
yarn.nodemanager.resource.memory-mb           3302

c3.2xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1515m
mapreduce.reduce.java.opts                    -Xmx1792m
mapreduce.map.memory.mb                       1904
mapreduce.reduce.memory.mb                    2150
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          3788
yarn.nodemanager.resource.memory-mb           5273

c3.4xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx2129m
mapreduce.reduce.java.opts                    -Xmx2560m
mapreduce.map.memory.mb                       2826
mapreduce.reduce.memory.mb                    3072
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          5324
yarn.nodemanager.resource.memory-mb           9113

c3.8xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx4669m
mapreduce.reduce.java.opts                    -Xmx4915m
mapreduce.map.memory.mb                       4669
mapreduce.reduce.memory.mb                    4915
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          8396
yarn.nodemanager.resource.memory-mb           16793

cc1.4xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          5120
yarn.nodemanager.resource.memory-mb           20480

cg1.4xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          5120
yarn.nodemanager.resource.memory-mb           20480

cc2.8xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          8192
yarn.nodemanager.resource.memory-mb           56320

cr1.8xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx10895m
mapreduce.reduce.java.opts                    -Xmx13516m
mapreduce.map.memory.mb                       15974
mapreduce.reduce.memory.mb                    16220
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          27238
yarn.nodemanager.resource.memory-mb           63897

g2.2xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx512m
mapreduce.reduce.java.opts                    -Xmx1536m
mapreduce.map.memory.mb                       768
mapreduce.reduce.memory.mb                    2048
yarn.scheduler.minimum-allocation-mb          256
yarn.scheduler.maximum-allocation-mb          8192
yarn.nodemanager.resource.memory-mb           12288

hi1.4xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx3379m
mapreduce.reduce.java.opts                    -Xmx4121m
mapreduce.map.memory.mb                       4700
mapreduce.reduce.memory.mb                    4945
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          8448
yarn.nodemanager.resource.memory-mb           16921

hs1.8xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx1280m
mapreduce.reduce.java.opts                    -Xmx2304m
mapreduce.map.memory.mb                       1536
mapreduce.reduce.memory.mb                    2560
yarn.scheduler.minimum-allocation-mb          512
yarn.scheduler.maximum-allocation-mb          8192
yarn.nodemanager.resource.memory-mb           56320

i2.xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx2150m
mapreduce.reduce.java.opts                    -Xmx2585m
mapreduce.map.memory.mb                       2856
mapreduce.reduce.memory.mb                    3102
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          5376
yarn.nodemanager.resource.memory-mb           9241

i2.2xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx3399m
mapreduce.reduce.java.opts                    -Xmx4147m
mapreduce.map.memory.mb                       4730
mapreduce.reduce.memory.mb                    4976
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          8499
yarn.nodemanager.resource.memory-mb           17049

i2.4xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx5898m
mapreduce.reduce.java.opts                    -Xmx7270m
mapreduce.map.memory.mb                       8478
mapreduce.reduce.memory.mb                    8724
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          14745
yarn.nodemanager.resource.memory-mb           32665

i2.8xlarge
Configuration Option                          Default Value
mapreduce.map.java.opts                       -Xmx10895m
mapreduce.reduce.java.opts                    -Xmx13516m
mapreduce.map.memory.mb                       15974
mapreduce.reduce.memory.mb                    16220
yarn.scheduler.minimum-allocation-mb          532
yarn.scheduler.maximum-allocation-mb          27238
yarn.nodemanager.resource.memory-mb           63897
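
In every table above, the -Xmx heap size in mapreduce.*.java.opts is smaller than the corresponding mapreduce.*.memory.mb container size, leaving room for the JVM's non-heap overhead inside the YARN container. If these defaults do not suit your job, you can override them cluster-wide at creation time with the same configure-hadoop bootstrap action used elsewhere in this section, or per job with Hadoop's generic -D options (the latter requires a job driver that uses ToolRunner). The following is only a sketch; the sizes, cluster name, my-job.jar, and the input/output paths are illustrative placeholders, not recommendations.

    Cluster-wide override at cluster creation:

      ./elastic-mapreduce --create --alive --name "Custom task memory" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Override map task memory" \
      --args "-m,mapreduce.map.memory.mb=2560,-m,mapreduce.map.java.opts=-Xmx2048m"

    Per-job override at submission time:

      hadoop jar my-job.jar -D mapreduce.map.memory.mb=2560 -D mapreduce.map.java.opts=-Xmx2048m input/ output/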

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, which lowers framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose not to reuse the JVM to ensure that all memory is freed for subsequent tasks.
Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.
To modify JVM using a bootstrap action
  • In the directory where you installed the Amazon EMR CLI, run the following from the command line.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "JVM infinite reuse" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Configuring infinite JVM reuse" \
      --args "-m,mapred.job.reuse.jvm.num.tasks=-1"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "JVM infinite reuse" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Configuring infinite JVM reuse" --args "-m,mapred.job.reuse.jvm.num.tasks=-1"
Note
Amazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20 by default, but you can override it with a bootstrap action. A value of -1 means infinite JVM reuse within a single job, and a value of 1 means the JVM is not reused (one JVM per task).
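
To confirm that the bootstrap action took effect, you can inspect the generated Hadoop configuration on the master node. A minimal sketch, assuming the configuration lives in /home/hadoop/conf as on the 3.x AMIs (verify the path on your AMI):

      grep -A 1 mapred.job.reuse.jvm.num.tasks /home/hadoop/conf/mapred-site.xml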

Avoiding Cluster Slowdowns (AMI 3.0.0)

In a distributed environment, you will experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your cluster. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue: as a job nears completion, some machines finish their tasks while others are still running, and Hadoop schedules redundant copies of the remaining tasks on the nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. MapReduce algorithms are generally designed so that the processing of map tasks is idempotent. However, if you are running a job where the task execution has side effects (for example, a zero-reducer job that calls an external resource), it is important to disable speculative execution.
You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for both mappers and reducers in AMI 3.0.0. You can override these settings with a bootstrap action; for more information about using bootstrap actions, see the Amazon EMR documentation.
Speculative Execution Parameters
Parameter                                     Default Setting
mapred.map.tasks.speculative.execution        true
mapred.reduce.tasks.speculative.execution     true

To disable reducer speculative execution using a bootstrap action
  • In the directory where you installed the Amazon EMR CLI, run the following from the command line. 
    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Reducer speculative execution" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Disable reducer speculative execution" \
      --args "-m,mapred.reduce.tasks.speculative.execution=false"
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Reducer speculative execution" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "Disable reducer speculative execution" --args "-m,mapred.reduce.tasks.speculative.execution=false"
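
The example above disables speculation only for reducers. If your map tasks also have side effects, both parameters can be set in a single bootstrap action. A sketch along the same lines (the cluster and bootstrap names are arbitrary):

      ./elastic-mapreduce --create --alive --name "No speculative execution" \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --bootstrap-name "Disable speculative execution" \
      --args "-m,mapred.map.tasks.speculative.execution=false,-m,mapred.reduce.tasks.speculative.execution=false"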
