Installing CDH5 from the tarball distribution is not really difficult, but getting the pseudo-distributed configuration right is anything but straightforward. And since there are a few bugs that need fixing and some configuration that needs to be done, I automated it.
Automating the steps
All the steps that need to be automated are described in my previous blog post: Local and Pseudo-distributed CDH5 Hadoop on your laptop
All I needed to do was write some Ansible configuration scripts to perform these steps. For now I have automated the steps to download and install CDH5, Spark, Hive, Pig and Mahout. Any extra packages are left as an exercise for the reader. I welcome your pull requests.
Configuration
Ansible needs some information from the user about the directory to install the software into. I first tried Ansible's vars_prompt; this kind of works, but the variable is scoped to the yml file it is defined in, and I need it to be a global variable. After testing several of Ansible's ways to provide variables, I decided on using a bash script to get the user's input and pass that information to Ansible through the --extra-vars command line option.
In addition, we want to use Ansible to run a playbook, which means the ansible-playbook command needs to be available. We assume ansible-playbook is on the PATH and working.
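As a rough sketch of what such a wrapper looks like (the playbook name site.yml and the variable name install_dir are assumptions; the real names are in the repository):

#!/usr/bin/env bash
# Hypothetical wrapper: prompt for the install directory and hand it to
# ansible-playbook as an extra variable. The playbook name (site.yml) and the
# variable name (install_dir) are assumptions; check the repository for the real ones.
read -r -p "Directory to install the software into: " INSTALL_DIR
# --extra-vars makes the variable available globally, in every yml file of the playbook.
ansible-playbook site.yml --extra-vars "install_dir=${INSTALL_DIR}"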
Getting the install scripts
Getting the install scripts is done by issuing a git clone command:
$ git clone git@github.com:krisgeus/ansible_local_cdh_hadoop.git
Install
Installing the software has become a single command:
$ start-playbook.sh
The script will ask the user for a directory to install the software into. Then it will download the packages into the $HOME/.ansible-downloads directory and unpack them into the install directory the user provided.
In the install directory the script will create a bash_profile add-on to set the correct aliases.
$ source ${INSTALL_DIR}/.bash_profile_hadoop
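To give an idea of what this add-on does (a sketch only, not the generated file; the directory names and exact mechanism are assumptions), the mode-switching aliases boil down to pointing Hadoop at a different configuration directory:

# Hypothetical sketch of what ${INSTALL_DIR}/.bash_profile_hadoop could contain;
# the file generated by the playbook bakes in the chosen install directory and
# the exact paths and names may differ.
export HADOOP_PREFIX=/path/to/install-dir/hadoop-2.3.0-cdh5.0.2   # placeholder path
export PATH="${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin:${PATH}"
# Switching modes essentially means pointing HADOOP_CONF_DIR at another config directory.
alias switch_local_cdh5='export HADOOP_CONF_DIR=${HADOOP_PREFIX}/etc/hadoop-local'
alias switch_pseudo_cdh5='export HADOOP_CONF_DIR=${HADOOP_PREFIX}/etc/hadoop-pseudo'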
Testing Hadoop in local mode
$ switch_local_cdh5
Now all the familiar hadoop commands should work. There is no notion of HDFS other than your local filesystem, so the hadoop fs -ls / command will show you the same output as ls /:
$ hadoop fs -ls /
drwxrwxr-x - root admin 2686 2014-04-18 09:47 /Applications
drwxr-xr-x - root wheel 2210 2014-02-26 02:46 /Library
drwxr-xr-x - root wheel 68 2013-08-25 05:45 /Network
drwxr-xr-x - root wheel 136 2013-10-23 03:05 /System
drwxr-xr-x - root admin 204 2013-10-23 03:09 /Users
drwxrwxrwt - root admin 136 2014-04-18 12:34 /Volumes
[...]
$ ls -l /
drwxrwxr-x+ 79 root admin 2.6K Apr 18 09:47 Applications
drwxr-xr-x+ 65 root wheel 2.2K Feb 26 02:46 Library
drwxr-xr-x@ 2 root wheel 68B Aug 25 2013 Network
drwxr-xr-x+ 4 root wheel 136B Oct 23 03:05 System
drwxr-xr-x 6 root admin 204B Oct 23 03:09 Users
drwxrwxrwt@ 4 root admin 136B Apr 18 12:34 Volumes
Running a MapReduce job should also work out of the box.
$ cd $HADOOP_PREFIX
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
2014-04-19 18:05:01.596 java[74281:1703] Unable to load realm info from SCDynamicStore
14/04/19 18:05:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
....
Job Finished in 1.587 seconds
Estimated value of Pi is 3.14800000000000000000
Testing Hadoop in pseudo-distributed mode
$ switch_pseudo_cdh5
$ hadoop namenode -format
$ start-dfs.sh
$ hadoop fs -ls /
$ hadoop fs -mkdir /bogus
$ hadoop fs -ls /
2014-04-19 19:46:32.233 java[78176:1703] Unable to load realm info from SCDynamicStore
14/04/19 19:46:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - user supergroup 0 2014-04-19 19:46 /bogus
OK, HDFS is working; now on to a MapReduce job.
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /cdh5.0.0/hadoop-2.3.0-cdh5.0.2/logs/yarn-user-resourcemanager-localdomain.local.out
Password:
localhost: starting nodemanager, logging to /cdh5.0.0/hadoop-2.3.0-cdh5.0.2/logs/yarn-user-nodemanager-localdomain.local.out
$ cd $HADOOP_PREFIX
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
2014-04-20 10:21:56.696 java[80777:1703] Unable to load realm info from SCDynamicStore
14/04/20 10:22:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/04/20 10:22:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/04/20 10:22:12 INFO input.FileInputFormat: Total input paths to process : 10
14/04/20 10:22:12 INFO mapreduce.JobSubmitter: number of splits:10
14/04/20 10:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1397969462544_0001
14/04/20 10:22:13 INFO impl.YarnClientImpl: Submitted application application_1397969462544_0001
14/04/20 10:22:13 INFO mapreduce.Job: The url to track the job: http://localdomain.local:8088/proxy/application_1397969462544_0001/
14/04/20 10:22:13 INFO mapreduce.Job: Running job: job_1397969462544_0001
14/04/20 10:22:34 INFO mapreduce.Job: Job job_1397969462544_0001 running in uber mode : false
14/04/20 10:22:34 INFO mapreduce.Job: map 0% reduce 0%
14/04/20 10:22:53 INFO mapreduce.Job: map 10% reduce 0%
14/04/20 10:22:54 INFO mapreduce.Job: map 20% reduce 0%
14/04/20 10:22:55 INFO mapreduce.Job: map 30% reduce 0%
14/04/20 10:22:56 INFO mapreduce.Job: map 40% reduce 0%
14/04/20 10:22:57 INFO mapreduce.Job: map 50% reduce 0%
14/04/20 10:22:58 INFO mapreduce.Job: map 60% reduce 0%
14/04/20 10:23:12 INFO mapreduce.Job: map 70% reduce 0%
14/04/20 10:23:13 INFO mapreduce.Job: map 80% reduce 0%
14/04/20 10:23:15 INFO mapreduce.Job: map 90% reduce 0%
14/04/20 10:23:16 INFO mapreduce.Job: map 100% reduce 100%
14/04/20 10:23:16 INFO mapreduce.Job: Job job_1397969462544_0001 completed successfully
...
Job Finished in 64.352 seconds
Estimated value of Pi is 3.14800000000000000000
Testing Spark in local mode
$ switch_local_cdh5
$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
...
2014-04-20 09:48:25,238 INFO [main] spark.HttpServer (Logging.scala:logInfo(49)) - Starting HTTP Server
2014-04-20 09:48:25,302 INFO [main] server.Server (Server.java:doStart(266)) - jetty-7.6.8.v20121106
2014-04-20 09:48:25,333 INFO [main] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SocketConnector@0.0.0.0:62951
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 0.9.0
/_/
Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_15)
Type in expressions to have them evaluated.
Type :help for more information.
...
Created spark context..
Spark context available as sc.
scala>
And we’re in!!
Testing Spark in pseudo-distributed mode
Now, as a final test, we check whether Spark will work on our pseudo-distributed Hadoop configuration.
$ switch_pseudo_cdh5
$ start-dfs.sh
$ start-yarn.sh
$ hadoop fs -mkdir /sourcedata
$ hadoop fs -put somelocal-textfile.txt /sourcedata/sometext.txt
$ spark-shell
scala> val file = sc.textFile("/sourcedata/sometext.txt")
scala> file.take(5)
res1: Array[String] = Array("First", "five lines", "of", "the", "textfile" )
The current version of the Ansible scripts is set to install the CDH 5.0.2 packages. When a new version becomes available, this is easily changed by updating the vars/common.yml YAML file.
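As a sketch only (the key name cdh_version is an assumption; the real variable name is whatever vars/common.yml uses), upgrading would be a one-line change followed by a rerun of the playbook:

$ # Hypothetical key name; check vars/common.yml for the actual variable.
$ sed -i '' 's/^cdh_version:.*/cdh_version: 5.0.3/' vars/common.yml
$ start-playbook.sh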
If you have created Ansible files to add other packages, I welcome your pull requests.