Quickly deploy a Hadoop cluster using Docker

The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Installing, configuring and adding new datanodes to a Hadoop cluster takes time. Deploying Hadoop nodes as containers can be really quick: on a CentOS 7 VM in VirtualBox, it took just 9 seconds to bring up 1 YarnMaster, 1 NameNode and 3 DataNodes. The steps below demonstrate how to quickly deploy a Hadoop cluster using Docker.

Pull Docker Images

These Docker images are built on CentOS 7 and Cloudera CDH 5.9.

~# docker pull swapnillinux/cloudera-hadoop-namenode

~# docker pull swapnillinux/cloudera-hadoop-yarnmaster

~# docker pull swapnillinux/cloudera-hadoop-datanode
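The three pulls above can also be scripted as a loop. The sketch below only prints the commands; pipe its output to `sh` to actually run them:

```shell
# Print the three docker pull commands; pipe to `sh` to execute them.
for img in namenode yarnmaster datanode; do
  echo "docker pull swapnillinux/cloudera-hadoop-$img"
done
```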

Create a Docker Bridged Network

This step is optional, but I strongly recommend creating a separate network for the Hadoop cluster.

~# docker network create hadoop

Create Yarnmaster

~# docker run -d --net hadoop --net-alias yarnmaster  --name yarnmaster -h yarnmaster -p 8032:8032 -p 8088:8088 swapnillinux/cloudera-hadoop-yarnmaster

Create Namenode

~# docker run -d --net hadoop --net-alias namenode --name namenode -h namenode -p 8020:8020 swapnillinux/cloudera-hadoop-namenode

Create First Datanode

~# docker run -d --net hadoop --net-alias datanode1  -h datanode1 --name datanode1 --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode

Create Additional Datanodes

You can keep adding datanodes; for each new node, just change --net-alias datanodeN -h datanodeN --name datanodeN accordingly:

~# docker run -d --net hadoop --net-alias datanode2  -h datanode2 --name datanode2 --link namenode --link yarnmaster swapnillinux/cloudera-hadoop-datanode
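To script this, a loop can generate the run command for as many datanodes as you need. The sketch below prints the commands for datanode2 through datanode4 (same flags as datanode1 above); pipe the output to `sh` to start them:

```shell
# Print the docker run command for datanode2..datanode4;
# pipe the output to `sh` to actually start the containers.
for n in 2 3 4; do
  echo "docker run -d --net hadoop --net-alias datanode$n" \
       "-h datanode$n --name datanode$n" \
       "--link namenode --link yarnmaster" \
       "swapnillinux/cloudera-hadoop-datanode"
done
```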


Open your browser to http://docker-host-ip:8088 and click on Nodes.

Replace docker-host-ip with the IP address of the Linux box where you are running these containers.
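If you prefer the command line, the ResourceManager also exposes the same node list through its REST API. This assumes the cluster above is running; the sketch only prints the command, and the IP is a placeholder for your Docker host's address:

```shell
# Query YARN's node list via the ResourceManager REST API.
# DOCKER_HOST_IP is a placeholder; set it to your Docker host's address.
DOCKER_HOST_IP="192.168.56.101"
echo "curl -s http://${DOCKER_HOST_IP}:8088/ws/v1/cluster/nodes"
```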

Run A Test

Log in to the Namenode

[root@centos ~]# docker exec -it namenode bash
[root@namenode /]# hadoop version
Hadoop 2.6.0-cdh5.9.0  
Subversion http://github.com/cloudera/hadoop -r 1c8ae0d951319fea693402c9f82449447fd27b07  
Compiled by jenkins on 2016-10-21T08:10Z  
Compiled with protoc 2.5.0  
From source with checksum 5448863f1e597b97d9464796b0a451  
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.9.0.jar  
[root@namenode /]#

Run a Pi calculation test

[root@namenode /]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10
Number of Maps  = 10  
Samples per Map = 10  
Wrote input for Map #0  
Wrote input for Map #1  
Wrote input for Map #2  
Wrote input for Map #3  
Wrote input for Map #4  
Wrote input for Map #5  
Wrote input for Map #6  
Wrote input for Map #7  
Wrote input for Map #8  
Wrote input for Map #9  
Starting Job  
17/01/12 07:00:49 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/  
17/01/12 07:00:50 INFO input.FileInputFormat: Total input paths to process : 10  
17/01/12 07:00:50 INFO mapreduce.JobSubmitter: number of splits:10  
17/01/12 07:00:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484201581901_0001  
17/01/12 07:00:51 INFO impl.YarnClientImpl: Submitted application application_1484201581901_0001  
17/01/12 07:00:51 INFO mapreduce.Job: The url to track the job: http://yarnmaster:8088/proxy/application_1484201581901_0001/  
17/01/12 07:00:51 INFO mapreduce.Job: Running job: job_1484201581901_0001  
17/01/12 07:01:04 INFO mapreduce.Job: Job job_1484201581901_0001 running in uber mode : false  
17/01/12 07:01:04 INFO mapreduce.Job:  map 0% reduce 0%  
17/01/12 07:01:36 INFO mapreduce.Job:  map 10% reduce 0%  
17/01/12 07:01:38 INFO mapreduce.Job:  map 20% reduce 0%  
17/01/12 07:01:58 INFO mapreduce.Job:  map 20% reduce 7%  
17/01/12 07:02:05 INFO mapreduce.Job:  map 50% reduce 7%  
17/01/12 07:02:06 INFO mapreduce.Job:  map 60% reduce 7%  
17/01/12 07:02:07 INFO mapreduce.Job:  map 100% reduce 20%  
17/01/12 07:02:09 INFO mapreduce.Job:  map 100% reduce 100%  
17/01/12 07:02:09 INFO mapreduce.Job: Job job_1484201581901_0001 completed successfully  
17/01/12 07:02:10 INFO mapreduce.Job: Counters: 49  
    File System Counters
        FILE: Number of bytes read=226
        FILE: Number of bytes written=1300574
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2620
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=43
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters 
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=528262
        Total time spent by all reduces in occupied slots (ms)=30384
        Total time spent by all map tasks (ms)=528262
        Total time spent by all reduce tasks (ms)=30384
        Total vcore-seconds taken by all map tasks=528262
        Total vcore-seconds taken by all reduce tasks=30384
        Total megabyte-seconds taken by all map tasks=540940288
        Total megabyte-seconds taken by all reduce tasks=31113216
    Map-Reduce Framework
        Map input records=10
        Map output records=20
        Map output bytes=180
        Map output materialized bytes=280
        Input split bytes=1440
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=280
        Reduce input records=20
        Reduce output records=0
        Spilled Records=40
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=6213
        CPU time spent (ms)=4740
        Physical memory (bytes) snapshot=2266828800
        Virtual memory (bytes) snapshot=28556414976
        Total committed heap usage (bytes)=1741398016
    Shuffle Errors
    File Input Format Counters 
        Bytes Read=1180
    File Output Format Counters 
        Bytes Written=97
Job Finished in 81.108 seconds  
Estimated value of Pi is 3.20000000000000000000  
[root@namenode /]# 
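The 3.20 above is rough because the job drew only 10 × 10 = 100 samples. The idea behind the estimator can be sketched locally with awk (plain pseudo-random sampling here, whereas the Hadoop example uses a quasi-random sequence): sample points in the unit square and count the fraction that land inside the quarter circle.

```shell
# Monte Carlo estimate of Pi: fraction of random points inside the
# quarter circle, times 4. More samples give a tighter estimate.
awk 'BEGIN {
  srand(1)                      # fixed seed so the run is repeatable
  n = 100000
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1) inside++
  }
  printf "Estimated value of Pi is %.5f\n", 4 * inside / n
}'
```

With 100,000 samples the result lands close to 3.14; with only 100 samples, values like 3.20 are expected.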

Enjoy :)


Swapnil Jain

RHCA Level IX, RHCI ♦ Solution Architect ♦ DevOps Trainer & Consultant