QUICKLY DEPLOY HADOOP CLUSTER USING DOCKER

The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Installing, configuring and adding new DataNodes to a Hadoop cluster takes time. Deploying the Hadoop nodes as containers can be really quick: on a CentOS 7 VM in VirtualBox, it took just 9 seconds to bring up 1 YarnMaster, 1 NameNode and 3 DataNodes. The steps below demonstrate how to quickly deploy a Hadoop cluster using Docker.

PULL DOCKER IMAGES
These Docker images are built on CentOS 7 and Cloudera CDH 5.9.
~# docker pull swapnillinux/cloudera-hadoop-namenode
~# docker pull swapnillinux/cloudera-hadoop-yarnmaster
~# docker pull swapnillinux/cloudera-hadoop-datanode
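Once the pulls finish, you can confirm the three images are available locally:
~# docker images | grep swapnillinux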
CREATE A DOCKER BRIDGED NETWORK
This step is optional, but I strongly recommend creating a separate network for the Hadoop cluster.
~# docker network create hadoop
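If you want to confirm the bridge network was created, Docker can list and inspect it:
~# docker network ls | grep hadoop
~# docker network inspect hadoop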
CREATE YARNMASTER
~# docker run -d --net hadoop --net-alias yarnmaster --name yarnmaster -h yarnmaster -p 8032:8032 -p 8088:8088 swapnillinux/cloudera-hadoop-yarnmaster
CREATE NAMENODE
~# docker run -d --net hadoop --net-alias namenode --name namenode -h namenode -p 8020:8020 swapnillinux/cloudera-hadoop-namenode
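Both containers should now show up as running; a quick check from the host:
~# docker ps --format '{{.Names}}\t{{.Status}}'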
CREATE FIRST DATANODE
~# docker run -d --net hadoop --net-alias datanode1 -h datanode1 --name datanode1 --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode
CREATING ADDITIONAL DATANODES
You can keep adding DataNodes; for each new node just change --net-alias datanodeN, -h datanodeN and --name datanodeN. For example, to add a second DataNode:
~# docker run -d --net hadoop --net-alias datanode2 -h datanode2 --name datanode2 --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode
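If you want several more DataNodes in one go, a small shell loop works as well (a minimal sketch; the node numbers 3 to 5 here are just an example):
~# for N in 3 4 5; do docker run -d --net hadoop --net-alias datanode$N -h datanode$N --name datanode$N --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode; done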
VERIFY
Open your browser and point it to http://docker-host-ip:8088, then click on Nodes. Replace docker-host-ip with the IP address of the Linux box where you are running these containers.
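If you prefer the command line, the ResourceManager exposes its node list over the YARN REST API, and the NameNode can report its registered DataNodes directly (the number of live nodes will match however many DataNodes you started):
~# curl http://docker-host-ip:8088/ws/v1/cluster/nodes
~# docker exec namenode hdfs dfsadmin -report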

RUN A TEST
Log in to the NameNode container:
[root@centos ~]# docker exec -it namenode bash
[root@namenode /]# hadoop version
Hadoop 2.6.0-cdh5.9.0
Subversion http://github.com/cloudera/hadoop -r 1c8ae0d951319fea693402c9f82449447fd27b07
Compiled by jenkins on 2016-10-21T08:10Z
Compiled with protoc 2.5.0
From source with checksum 5448863f1e597b97d9464796b0a451
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.9.0.jar
[root@namenode /]#
Run a Pi calculation test:
[root@namenode /]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10
Number of Maps = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
17/01/12 07:00:49 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.18.0.3:8032
17/01/12 07:00:50 INFO input.FileInputFormat: Total input paths to process : 10
17/01/12 07:00:50 INFO mapreduce.JobSubmitter: number of splits:10
17/01/12 07:00:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484201581901_0001
17/01/12 07:00:51 INFO impl.YarnClientImpl: Submitted application application_1484201581901_0001
17/01/12 07:00:51 INFO mapreduce.Job: The url to track the job: http://yarnmaster:8088/proxy/application_1484201581901_0001/
17/01/12 07:00:51 INFO mapreduce.Job: Running job: job_1484201581901_0001
17/01/12 07:01:04 INFO mapreduce.Job: Job job_1484201581901_0001 running in uber mode : false
17/01/12 07:01:04 INFO mapreduce.Job: map 0% reduce 0%
17/01/12 07:01:36 INFO mapreduce.Job: map 10% reduce 0%
17/01/12 07:01:38 INFO mapreduce.Job: map 20% reduce 0%
17/01/12 07:01:58 INFO mapreduce.Job: map 20% reduce 7%
17/01/12 07:02:05 INFO mapreduce.Job: map 50% reduce 7%
17/01/12 07:02:06 INFO mapreduce.Job: map 60% reduce 7%
17/01/12 07:02:07 INFO mapreduce.Job: map 100% reduce 20%
17/01/12 07:02:09 INFO mapreduce.Job: map 100% reduce 100%
17/01/12 07:02:09 INFO mapreduce.Job: Job job_1484201581901_0001 completed successfully
17/01/12 07:02:10 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=1300574
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2620
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=528262
                Total time spent by all reduces in occupied slots (ms)=30384
                Total time spent by all map tasks (ms)=528262
                Total time spent by all reduce tasks (ms)=30384
                Total vcore-seconds taken by all map tasks=528262
                Total vcore-seconds taken by all reduce tasks=30384
                Total megabyte-seconds taken by all map tasks=540940288
                Total megabyte-seconds taken by all reduce tasks=31113216
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1440
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=6213
                CPU time spent (ms)=4740
                Physical memory (bytes) snapshot=2266828800
                Virtual memory (bytes) snapshot=28556414976
                Total committed heap usage (bytes)=1741398016
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 81.108 seconds
Estimated value of Pi is 3.20000000000000000000
[root@namenode /]#
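If you also want a quick HDFS smoke test while you are still inside the NameNode container, writing and reading back a small file is enough (the /tmp/smoke path below is just an arbitrary example):
[root@namenode /]# hadoop fs -mkdir -p /tmp/smoke
[root@namenode /]# echo 'hello hadoop' | hadoop fs -put - /tmp/smoke/hello.txt
[root@namenode /]# hadoop fs -cat /tmp/smoke/hello.txt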

Enjoy :)