The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Installing and configuring a Hadoop cluster, and adding new DataNodes to it, takes time. Deploying Hadoop nodes as containers can be much quicker. On a CentOS 7 VM in VirtualBox, it took 9 seconds to bring up 1 YarnMaster, 1 NameNode and 3 DataNodes. Yes, just 9 seconds. The steps below demonstrate how to quickly deploy a Hadoop cluster using Docker.

Pull Docker Images

These Docker images are built on CentOS 7 and Cloudera CDH 5.9.

~# docker pull swapnillinux/cloudera-hadoop-namenode

~# docker pull swapnillinux/cloudera-hadoop-yarnmaster

~# docker pull swapnillinux/cloudera-hadoop-datanode
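
The three pulls above can also be scripted. A minimal sketch, assuming a POSIX shell and the image names used in this post; the function name `pull_hadoop_images` is made up for illustration and is not part of the images:

```shell
# Hypothetical helper: pull the three images named in this post,
# stopping at the first pull that fails.
pull_hadoop_images() {
    for role in namenode yarnmaster datanode; do
        docker pull "swapnillinux/cloudera-hadoop-$role" || return 1
    done
}
```

Calling `pull_hadoop_images` once on the Docker host should leave all three images available locally.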

Create a Docker Bridged Network

This step is optional, but I strongly recommend creating a separate network for the Hadoop cluster. The docker run commands below attach each container to it with --net hadoop.

~# docker network create hadoop
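
`docker network create` reports an error when a network with that name already exists, so a setup script that may be run more than once could check first. A small sketch: `docker network inspect` and `docker network create` are standard Docker CLI subcommands, while the function name `ensure_network` is an assumption introduced here:

```shell
# Hypothetical helper: create the network only if it is missing.
ensure_network() {
    name="${1:-hadoop}"
    # `docker network inspect` exits non-zero when the network does not exist.
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        docker network create "$name" >/dev/null && echo "network $name created"
    fi
}
```

`ensure_network` with no argument targets the `hadoop` network used in the rest of this post.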

Create Yarnmaster

~# docker run -d --net hadoop --net-alias yarnmaster  --name yarnmaster -h yarnmaster -p 8032:8032 -p 8088:8088 swapnillinux/cloudera-hadoop-yarnmaster

Create Namenode

~# docker run -d --net hadoop --net-alias namenode --name namenode -h namenode -p 8020:8020 swapnillinux/cloudera-hadoop-namenode

Create First Datanode

~# docker run -d --net hadoop --net-alias datanode1  -h datanode1 --name datanode1 --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode

Creating additional Datanodes

You can keep adding DataNodes; just change the number N in --net-alias datanodeN -h datanodeN --name datanodeN. For example, for the second DataNode:

~# docker run -d --net hadoop --net-alias datanode2  -h datanode2 --name datanode2 --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode
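
Since only the node number changes between DataNodes, the command can be wrapped in a small function. This is a sketch that assumes the `hadoop` network and the `namenode` and `yarnmaster` containers from the earlier steps already exist; `add_datanode` is a name invented here:

```shell
# Hypothetical helper: start datanodeN with the same options as above.
add_datanode() {
    n="$1"
    # Reject an empty argument or anything containing a non-digit.
    case "$n" in
        ''|*[!0-9]*) echo "usage: add_datanode N" >&2; return 2 ;;
    esac
    docker run -d --net hadoop --net-alias "datanode$n" \
        -h "datanode$n" --name "datanode$n" \
        --link namenode --link yarnmaster \
        swapnillinux/cloudera-hadoop-datanode
}
```

For example, `for i in 3 4 5; do add_datanode "$i"; done` would start datanode3 through datanode5.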


Open your browser at http://docker-host-ip:8088 and click on Nodes.

Replace docker-host-ip with the IP address of the Linux box where you are running these containers.
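
The same check can be done from a terminal. The sketch below assumes the ResourceManager's REST endpoint `/ws/v1/cluster/nodes` is reachable on the published port 8088 and that each node entry in its JSON reply carries a `state` field; the function name `count_running_nodes` is hypothetical:

```shell
# Hypothetical helper: count nodes reported as RUNNING in a
# /ws/v1/cluster/nodes JSON reply read from standard input.
count_running_nodes() {
    grep -o '"state" *: *"RUNNING"' | wc -l | tr -d ' '
}
# Usage (docker-host-ip is a placeholder, as above):
#   curl -s http://docker-host-ip:8088/ws/v1/cluster/nodes | count_running_nodes
```

With the two DataNodes started above, this would be expected to print 2 once both have registered.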

Run A Test

Log in to the NameNode

[root@centos ~]# docker exec -it namenode bash
[root@namenode /]# hadoop version
Hadoop 2.6.0-cdh5.9.0
Subversion http://github.com/cloudera/hadoop -r 1c8ae0d951319fea693402c9f82449447fd27b07
Compiled by jenkins on 2016-10-21T08:10Z
Compiled with protoc 2.5.0
From source with checksum 5448863f1e597b97d9464796b0a451
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.9.0.jar
[root@namenode /]#

Run a Pi calculation test

[root@namenode /]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
17/01/12 07:00:49 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/
17/01/12 07:00:50 INFO input.FileInputFormat: Total input paths to process : 10
17/01/12 07:00:50 INFO mapreduce.JobSubmitter: number of splits:10
17/01/12 07:00:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484201581901_0001
17/01/12 07:00:51 INFO impl.YarnClientImpl: Submitted application application_1484201581901_0001
17/01/12 07:00:51 INFO mapreduce.Job: The url to track the job: http://yarnmaster:8088/proxy/application_1484201581901_0001/
17/01/12 07:00:51 INFO mapreduce.Job: Running job: job_1484201581901_0001
17/01/12 07:01:04 INFO mapreduce.Job: Job job_1484201581901_0001 running in uber mode : false
17/01/12 07:01:04 INFO mapreduce.Job:  map 0% reduce 0%
17/01/12 07:01:36 INFO mapreduce.Job:  map 10% reduce 0%
17/01/12 07:01:38 INFO mapreduce.Job:  map 20% reduce 0%
17/01/12 07:01:58 INFO mapreduce.Job:  map 20% reduce 7%
17/01/12 07:02:05 INFO mapreduce.Job:  map 50% reduce 7%
17/01/12 07:02:06 INFO mapreduce.Job:  map 60% reduce 7%
17/01/12 07:02:07 INFO mapreduce.Job:  map 100% reduce 20%
17/01/12 07:02:09 INFO mapreduce.Job:  map 100% reduce 100%
17/01/12 07:02:09 INFO mapreduce.Job: Job job_1484201581901_0001 completed successfully
17/01/12 07:02:10 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=226
		FILE: Number of bytes written=1300574
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2620
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=43
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=10
		Launched reduce tasks=1
		Data-local map tasks=10
		Total time spent by all maps in occupied slots (ms)=528262
		Total time spent by all reduces in occupied slots (ms)=30384
		Total time spent by all map tasks (ms)=528262
		Total time spent by all reduce tasks (ms)=30384
		Total vcore-seconds taken by all map tasks=528262
		Total vcore-seconds taken by all reduce tasks=30384
		Total megabyte-seconds taken by all map tasks=540940288
		Total megabyte-seconds taken by all reduce tasks=31113216
	Map-Reduce Framework
		Map input records=10
		Map output records=20
		Map output bytes=180
		Map output materialized bytes=280
		Input split bytes=1440
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=280
		Reduce input records=20
		Reduce output records=0
		Spilled Records=40
		Shuffled Maps =10
		Failed Shuffles=0
		Merged Map outputs=10
		GC time elapsed (ms)=6213
		CPU time spent (ms)=4740
		Physical memory (bytes) snapshot=2266828800
		Virtual memory (bytes) snapshot=28556414976
		Total committed heap usage (bytes)=1741398016
	Shuffle Errors
	File Input Format Counters 
		Bytes Read=1180
	File Output Format Counters 
		Bytes Written=97
Job Finished in 81.108 seconds
Estimated value of Pi is 3.20000000000000000000
[root@namenode /]# 
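
To run the same test from the Docker host without an interactive shell, the final estimate can be filtered out of the job output. A minimal sketch, assuming the result line has the form shown above; `extract_pi` is a name chosen for this example:

```shell
# Hypothetical helper: print only the number from the
# "Estimated value of Pi is ..." line read from standard input.
extract_pi() {
    sed -n 's/^Estimated value of Pi is \([0-9.]*\)$/\1/p'
}
# Usage, run on the Docker host:
#   docker exec namenode hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10 | extract_pi
```

The two arguments to `pi` are the number of maps and the samples per map. With only 10 of each the estimate is coarse (3.2 in the run above); larger values should give a closer result at the cost of a longer job.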

Enjoy :)
