This tutorial will cover setting up a Hadoop cluster on DigitalOcean. The Hadoop software library is an Apache framework that lets you process large data sets in a distributed way across server clusters through leveraging basic programming models. The scalability provided by Hadoop allows you to scale up from single servers to thousands of machines. It also provides failure detection at the application layer, so it can detect and handle failures as a high-availability service.
There are 4 important modules that we will be working with in this tutorial:
In this tutorial, we will be setting up and running a Hadoop cluster on four DigitalOcean Droplets.
This tutorial will require the following:
Four Ubuntu 16.04 Droplets with non-root sudo users set up. If you do not have this set up, follow along with steps 1-4 of the Initial Server Setup with Ubuntu 16.04. This tutorial will assume that you are using an SSH key from a local machine. Per Hadoop’s language, we’ll refer to these Droplets by the following names:
hadoop-master
hadoop-worker-01
hadoop-worker-02
hadoop-worker-03
Additionally, you may want to use DigitalOcean Snapshots after the initial server set up and the completion of Steps 1 and 2 (below) of your first Droplet.
With these prerequisites in place, you will be ready to begin setting up a Hadoop cluster.
We’re going to be installing Java and Hadoop on each of our four Droplets. If you don’t want to repeat each step on each Droplet, you can use DigitalOcean Snapshots at the end of Step 2 in order to replicate your initial installation and configuration.
First, we’ll update Ubuntu with the latest software patches available:
sudo apt-get update && sudo apt-get -y dist-upgrade
Next, let’s install the headless version of Java for Ubuntu on each Droplet. “Headless” refers to the software that is capable of running on a device without a graphical user interface.
sudo apt-get -y install openjdk-8-jdk-headless
To install Hadoop on each Droplet, let’s make the directory where Hadoop will be installed. We can call it my-hadoop-install
and then move into that directory.
mkdir my-hadoop-install && cd my-hadoop-install
Once we’ve created the directory, let’s install the most recent binary from the Hadoop releases list. At the time of this tutorial, the most recent is Hadoop 3.0.1
.
Note: Keep in mind that these downloads are distributed via mirror sites, and it is recommended that it be checked first for tampering using either GPG or SHA-256.
When you are satisfied with the download you have selected, you can use the wget
command with the binary link you have chosen, such as:
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz
Once your download is complete, unzip the file’s contents using tar
, a file archiving tool for Ubuntu:
tar xvzf hadoop-3.0.1.tar.gz
We’re now ready to start our initial configuration.
For each Droplet node, we’ll need to set up JAVA_HOME
. Open the following file with nano or another text editor of your choice so that we can update it:
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hadoop-env.sh
Update the following section, where JAVA_HOME
is located:
hadoop-env.sh
...
###
# Generic settings for HADOOP
###
# Technically, the only required environment variable is JAVA_HOME.
# All others are optional. However, the defaults are probably not
# preferred. Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
# export JAVA_HOME=
# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
...
To look like this:
hadoop-env.sh
...
###
# Generic settings for HADOOP
###
# Technically, the only required environment variable is JAVA_HOME.
# All others are optional. However, the defaults are probably not
# preferred. Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
...
We’ll also need to add some environment variables to run Hadoop and its modules. They should be added to the bottom of the file so it looks like the following, where sammy
would be your sudo non-root user’s username.
Note: If you are using a different username across your cluster Droplets, you will need to edit this file in order to reflect the correct username for each specific Droplet.
hadoop-env.sh
...
#
# To prevent accidents, shell commands be (superficially) locked
# to only allow certain users to execute certain subcommands.
# It uses the format of (command)_(subcommand)_USER.
#
# For example, to limit who can execute the namenode command,
export HDFS_NAMENODE_USER="sammy"
export HDFS_DATANODE_USER="sammy"
export HDFS_SECONDARYNAMENODE_USER="sammy"
export YARN_RESOURCEMANAGER_USER="sammy"
export YARN_NODEMANAGER_USER="sammy"
At this point, you can save and exit the file. Next, run the following command to apply our exports:
source ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hadoop-env.sh
With the hadoop-env.sh
script updated and sourced, we need to create a data directory for the Hadoop Distributed File System (HDFS) to store all relevant HDFS
files.
sudo mkdir -p /usr/local/hadoop/hdfs/data
Set the permissions for this file with your respective user. Remember, if you have different usernames on each Droplet, be sure to allow your respective sudo user to have these permissions:
sudo chown -R sammy:sammy /usr/local/hadoop/hdfs/data
If you would like to use a DigitalOcean Snapshot to replicate these commands across your Droplet nodes, you can create your Snapshot now and create new Droplets from this image. For guidance on this, you can read An Introduction to DigitalOcean Snapshots.
When you have completed the steps above across all four Ubuntu Droplets, you can move on to completing this configuration across nodes.
At this point, we need to update the core_site.xml
file for all 4 of your Droplet nodes. Within each individual Droplet, open the following file:
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/core-site.xml
You should see the following lines:
core-site.xml
...
<configuration>
</configuration>
Change the file to look like the following XML so that we include each Droplet’s respective IP inside of the property value, where we have server-ip
written. If you are using a firewall, you’ll need to open port 9000.
core-site.xml
...
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://server-ip:9000</value>
</property>
</configuration>
Repeat the above writing in the relevant Droplet IP for all four of your servers.
Now all of the general Hadoop settings should be updated for each server node, and we can continue onto connecting our nodes via SSH keys.
In order for Hadoop to work properly, we need to set up passwordless SSH between the master node and the worker nodes (the language of master
and worker
is Hadoop’s language to refer to primary
and secondary
servers).
For this tutorial, the master node will be hadoop-master
and the worker nodes will be collectively referred to as hadoop-worker
, but you’ll have three of them in total (referred to as -01
, -02
, and -03
). We first need to create a public-private key-pair on the master node, which will be the node with the IP address belonging to hadoop-master
.
While on the hadoop-master
Droplet, run the following command. You’ll press enter
to use the default for the key location, then press enter
twice to use an empty passphrase:
ssh-keygen
For each of the worker nodes, we need to take the master node’s public key and copy it into each of the worker nodes’ authorized_keys
file.
Get the public key from the master node by running cat
on the id_rsa.pub
file located in your .ssh
folder, to print to console:
cat ~/.ssh/id_rsa.pub
Now log into each worker node Droplet, and open the authorized_keys
file:
[environment fourth]
nano ~/.ssh/authorized_keys
You’ll copy the master node’s public key — which is the output you generated from the cat ~/.ssh/id_rsa.pub
command on the master node — into each Droplet’s respective ~/.ssh/authorized_keys
file. Be sure to save each file before closing.
When you are finished updating the 3 worker nodes, also copy the master node’s public key into its own authorized_keys
file by issuing the same command:
[environment second]
nano ~/.ssh/authorized_keys
On hadoop-master
, you should set up the ssh
configuration to include each of the hostnames of the related nodes. Open the configuration file for editing, using nano:
[environment second]
nano ~/.ssh/config
You should modify the file to look like the following, with relevant IPs and usernames added.
[environment second]
[label config]
Host hadoop-master-server-ip
HostName hadoop-example-node-server-ip
User sammy
IdentityFile ~/.ssh/id_rsa
Host hadoop-worker-01-server-ip
HostName hadoop-worker-01-server-ip
User sammy
IdentityFile ~/.ssh/id_rsa
Host hadoop-worker-02-server-ip
HostName hadoop-worker-02-server-ip
User sammy
IdentityFile ~/.ssh/id_rsa
Host hadoop-worker-03-server-ip
HostName hadoop-worker-03-server-ip
User sammy
IdentityFile ~/.ssh/id_rsa
Save and close the file.
From the hadoop-master
, SSH into each node:
[environment second]
ssh sammy@hadoop-worker-01-server-ip
Since it’s your first time logging into each node with the current system set up, it will ask you the following:
[environment second]
[secondary_label Output]
are you sure you want to continue connecting (yes/no)?
Reply to the prompt with yes
. This will be the only time it needs to be done, but it is required for each worker node for the initial SSH connection. Finally, log out of each worker node to return to hadoop-master
:
[environment fourth]
logout
Be sure to repeat these steps for the remaining two worker nodes.
Now that we have successfully set up passwordless SSH for each worker node, we can now continue to configure the master node.
For our Hadoop cluster, we need to configure the HDFS properties on the master node Droplet.
While on the master node, edit the following file:
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml
Edit the configuration
section to look like the XML below:
[environment second]
[label hdfs-site.xml]
...
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
Save and close the file.
We’ll next configure the MapReduce
properties on the master node. Open mapred.site.xml
with nano or another text editor:
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/mapred-site.xml
Then update the file so that it looks like this, with your current server’s IP address reflected below:
[environment second]
[label mapred-site.xml]
...
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>hadoop-master-server-ip:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Save and close the file. If you are using a firewall, be sure to open port 54311.
Next, set up YARN on the master node. Again, we are updating the configuration section of another XML file, so let’s open the file:
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/yarn-site.xml
Now update the file, being sure to input your current server’s IP address:
[environment second]
[label yarn-site.xml]
...
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master-server-ip</value>
</property>
</configuration>
Finally, let’s configure Hadoop’s point of reference for what the master and worker nodes should be. First, open the masters
file:
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/masters
Into this file, you’ll add your current server’s IP address:
[environment second]
[label masters]
hadoop-master-server-ip
Now, open and edit the workers
file:
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/workers
Here, you’ll add the IP addresses of each of your worker nodes, underneath where it says localhost
.
[environment second]
[label workers]
localhost
hadoop-worker-01-server-ip
hadoop-worker-02-server-ip
hadoop-worker-03-server-ip
After finishing the configuration of the MapReduce
and YARN
properties, we can now finish configuring the worker nodes.
We’ll now configure the worker nodes so that they each have the correct reference to the data directory for HDFS.
On each worker node, edit this XML file:
[environment fourth]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml
Replace the configuration section with the following:
[label hdfs-site.xml]
[environment fourth]
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
Save and close the file. Be sure to replicate this step on all three of your worker nodes.
At this point, our worker node Droplets are pointing to the data directory for HDFS, which will allow us to run our Hadoop cluster.
We have reached a point where we can start our Hadoop cluster. Before we start it up, we need to format the HDFS on the master node. While on the master node Droplet, change directories to where Hadoop is installed:
[environment second]
cd ~/my-hadoop-install/hadoop-3.0.1/
Then run the following command to format HDFS:
[environment second]
sudo ./bin/hdfs namenode -format
A successful formatting of the namenode will result in a lot of output, consisting of mostly INFO
statements. At the bottom you will see the following, confirming that you’ve successfully formatted the storage directory.
[environment second]
[secondary_label Output]
...
2018-01-28 17:58:08,323 INFO common.Storage: Storage directory /usr/local/hadoop/hdfs/data has been successfully formatted.
2018-01-28 17:58:08,346 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/hdfs/data/current/fsimage.ckpt_0000000000000000000 using no compression
2018-01-28 17:58:08,490 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/hdfs/data/current/fsimage.ckpt_0000000000000000000 of size 389 bytes saved in 0 seconds.
2018-01-28 17:58:08,505 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-01-28 17:58:08,519 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-example-node/127.0.1.1
************************************************************/
Now, start the Hadoop cluster by running the following scripts (be sure to check scripts before running by using the less
command):
[environment second]
sudo ./sbin/start-dfs.sh
You’ll then see output that contains the following:
[environment second]
[secondary_label Output]
Starting namenodes on [hadoop-master-server-ip]
Starting datanodes
Starting secondary namenodes [hadoop-master]
Then run YARN, using the following script:
[environment second]
./sbin/start-yarn.sh
The following output will appear:
[environment second]
[secondary_label Output]
Starting resourcemanager
Starting nodemanagers
Once you run those commands, you should have daemons running on the master node and one on each of the worker nodes.
We can check the daemons by running the jps
command to check for Java processes:
[environment second]
jps
After running the jps
command, you will see that the NodeManager
, SecondaryNameNode
, Jps
, NameNode
, ResourceManager
, and DataNode
are running. Something similar to the following output will appear:
[environment second]
[secondary_label Output]
9810 NodeManager
9252 SecondaryNameNode
10164 Jps
8920 NameNode
9674 ResourceManager
9051 DataNode
This verifies that we’ve successfully created a cluster and verifies that the Hadoop daemons are running.
In a web browser of your choice, you can get an overview of the health of your cluster by navigating to:
http://hadoop-master-server-ip:9870
If you have a firewall, be sure to open port 9870. You’ll see something that looks similar to the following:
From here, you can navigate to the Datanodes
item in the menu bar to see the node activity.
In this tutorial, we went over how to set up and configure a Hadoop multi-node cluster using DigitalOcean Ubuntu 16.04 Droplets. You can also now monitor and check the health of your cluster using Hadoop’s DFS Health web interface.
To get an idea of possible projects you can work on to utilize your newly configured cluster, check out Apache’s long list of projects powered by Hadoop.