Step-by-Step Guide to Setting Up Hadoop on Ubuntu: Installation and Configuration Walkthrough
Apache Hadoop is an open-source, Java-based software platform used to manage, store, and process large datasets across clusters of computers. It’s designed for handling big data and employs the Hadoop Distributed File System (HDFS) for data storage and MapReduce for data processing.
Hadoop’s significance lies in its ability to manage vast volumes of data effectively across many interconnected servers. It is widely used across industries and has become the standard framework for big data workloads, distributing storage and processing over a cluster of machines that work collectively.
The platform also includes a suite of tools that complement its core components, allowing it to tackle diverse data challenges efficiently. Deployed across a network of servers, Hadoop provides the infrastructure for storing, managing, and analyzing data. Understanding its fundamental architecture (HDFS for storage, YARN for resource management, and MapReduce for processing) is crucial for configuring and optimizing it, and will make the setup steps below easier to follow.
Let's Begin.
Step 1: Install Java Development Kit
To start, you’ll need to install the Java Development Kit (JDK) on your Ubuntu system. The default Ubuntu repositories offer Java 8 and Java 11, but it’s recommended to use Java 8 for compatibility with Hive. You can use the following command to install it.
sudo apt update && sudo apt install openjdk-8-jdk
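If you want to confirm where the JDK was installed (you will need this path later when setting JAVA_HOME), you can resolve the ‘java’ binary; on Ubuntu the OpenJDK 8 package is typically installed under /usr/lib/jvm/java-8-openjdk-amd64.
readlink -f $(which java)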
Step 2: Verify Java Version
Once the Java Development Kit is successfully installed, you should check the version to ensure it’s working correctly.
java -version
The output should show the installed OpenJDK 1.8 version string, confirming that Java is working.
Step 3: Install SSH
SSH (Secure Shell) is crucial for Hadoop, as it facilitates secure communication between nodes in the Hadoop cluster. This is essential for maintaining data integrity and confidentiality and enabling efficient distributed data processing across the cluster.
sudo apt install ssh
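On Ubuntu, the OpenSSH server normally starts automatically after installation. If you want to confirm that it is running, a quick check (assuming systemd) is
sudo systemctl status ssh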
Step 4: Create the Hadoop User
You need to create a user specifically for running Hadoop components. This user will also be used to log in to Hadoop’s web interface. Run the following command to create the user and set a password.
sudo adduser hadoop
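Optionally, if you want the ‘hadoop’ user to be able to run administrative commands later, you can add it to the sudo group; this is not required by the rest of this guide.
sudo usermod -aG sudo hadoop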
Step 5: Switch User
Switch to the newly created ‘hadoop’ user using the following command:
su - hadoop
Step 6: Configure SSH
Next, you should set up password-less SSH access for the ‘hadoop’ user to streamline the authentication process. You’ll generate an SSH key pair for this purpose, which avoids the need to enter a password or passphrase each time you access the Hadoop system.
ssh-keygen -t rsa
The command prints the location and fingerprint of the generated key pair. Press Enter at each prompt to accept the default location and leave the passphrase empty.
Step 7: Set Permissions
Copy the generated public key to the ‘authorized_keys’ file and set the proper permissions.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
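If the ‘.ssh’ directory itself was created with looser permissions, you may also want to restrict it, since the SSH daemon can reject keys stored in a directory that other users can read.
chmod 700 ~/.ssh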
Step 8: SSH to Localhost
Verify if the password-less SSH is functional.
ssh localhost
The first time you connect, you will be asked to confirm the host’s authenticity and add its key to the known hosts. Type yes and hit Enter; you should then be logged in without being prompted for a password.
Step 9: Switch User
Switch to the ‘hadoop’ user again using the following command.
su - hadoop
Step 10: Install Hadoop
To begin, download Hadoop version 3.3.6 using the ‘wget’ command
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once the download is complete, extract the contents of the downloaded file using the ‘tar’ command. Then rename the extracted folder to ‘hadoop’ to keep the paths used in the rest of this guide simple.
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop
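Optionally, once the extraction has succeeded, you can remove the downloaded archive to free up disk space.
rm hadoop-3.3.6.tar.gz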
Next, you must set up environment variables for Java and Hadoop in your system. Open the ‘~/.bashrc’ file in your preferred text editor. If you’re using ‘nano,’ you can paste with ‘Ctrl+Shift+V’ and save and exit with ‘Ctrl+X,’ then ‘Y,’ then ‘Enter.’
nano ~/.bashrc
Append the following lines to the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Load the above configuration into the current environment
source ~/.bashrc
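To confirm that the new variables are picked up by the current shell, you can check that the ‘hadoop’ command is now on your PATH.
hadoop version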
Additionally, you should configure the ‘JAVA_HOME’ in the ‘hadoop-env.sh’ file. Edit this file with a text editor
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the ‘export JAVA_HOME’ line, uncomment it, and set it as follows
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
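As a quick sanity check, you can grep the file to confirm that the line was saved as expected.
grep -n "export JAVA_HOME" $HADOOP_HOME/etc/hadoop/hadoop-env.sh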
Step 11: Configure Hadoop
Create the name node and data node directories within the ‘hadoop’ user’s home directory using the following commands
cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
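A quick listing confirms the expected directory layout.
ls -R ~/hadoopdata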
Next, edit the ‘core-site.xml’ file to set the default filesystem URI
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Set the ‘fs.defaultFS’ property as shown below; ‘localhost’ works for a single-node setup, or you can replace it with your system hostname
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
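If you would rather bind HDFS to the machine’s hostname than to ‘localhost,’ you can look the name up first and substitute it into the value above.
hostname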
Save and close the file. Then, edit the ‘hdfs-site.xml’ file
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Modify the ‘dfs.replication,’ ‘dfs.namenode.name.dir,’ and ‘dfs.datanode.data.dir’ properties as shown below
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Next, edit the ‘mapred-site.xml’ file
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes so that MapReduce jobs run on YARN and can locate the Hadoop installation
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
</configuration>
Finally, edit the ‘yarn-site.xml’ file
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Make the following changes
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Save the file and close it.
Step 12: Start the Hadoop Cluster
Before starting the Hadoop cluster, you must format the Namenode as the ‘hadoop’ user. Format the Hadoop Namenode with the following command
hdfs namenode -format
Once the Namenode is formatted, the output ends with a message similar to “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.” Start the Hadoop cluster using the following command.
start-all.sh
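The ‘start-all.sh’ script is a convenience wrapper; if you prefer, you can start the HDFS and YARN daemons separately with the following commands.
start-dfs.sh
start-yarn.sh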
You can check the status of all Hadoop services using the command
jps
The output should list the running Hadoop processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, each with its process ID.
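For a more detailed view of HDFS health, including live datanodes and available capacity, you can also run
hdfs dfsadmin -report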
Step 13: Access Hadoop Namenode and Resource Manager
First, determine your server’s IP address. The ‘ifconfig’ command is provided by the ‘net-tools’ package, so install that package if it isn’t already present
sudo apt install net-tools
Then run the following command and note the address of your network interface
ifconfig
To access the Namenode, open your web browser and visit http://your-server-ip:9870. Replace ‘your-server-ip’ with your actual IP address. You should see the Namenode web interface.
To access the Resource Manager, open your web browser and visit http://your-server-ip:8088. You should see the Resource Manager web interface.
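If you are browsing from another machine and the ‘ufw’ firewall is enabled on the server, you may need to open these ports first (run this from an account with sudo privileges).
sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp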
Step 14: Verify the Hadoop Cluster
The Hadoop cluster is now installed and configured. Next, we will create directories in the HDFS filesystem to test it. Create the directories using the following commands
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
Next, run the following command to list the directories created above
hdfs dfs -ls /
The output should list the /test1 and /logs directories you just created.
Next, put some files into the Hadoop file system. For example, copy the log files from the host machine into HDFS.
hdfs dfs -put /var/log/* /logs/
You can verify these files and directories in the Hadoop web interface. Go to the Namenode web interface, click Utilities => Browse the file system, and you should see the directories you created earlier.
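As an additional end-to-end check, you can run one of the example MapReduce jobs bundled with the distribution; the jar path below assumes the Hadoop 3.3.6 layout used in this guide. The job estimates the value of pi and prints the estimate when it finishes.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10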
Step 15: Stop Hadoop Services
To stop the Hadoop services, run the following command as the ‘hadoop’ user
stop-all.sh
The output shows the HDFS and YARN daemons being stopped in turn.
Having successfully installed Hadoop on your Ubuntu system, you’re now ready to start exploring big data analytics.
Conclusion
Through this guide, you’ve set up a single-node Hadoop cluster in pseudo-distributed mode and confirmed that it works by creating directories and copying files into HDFS. To deepen your understanding of writing your own MapReduce programs, consider exploring Apache Hadoop’s MapReduce tutorial, which walks through the code behind the bundled examples.
When you’re prepared to establish a cluster environment, I recommend consulting the Apache Foundation’s comprehensive guide on Hadoop Cluster Setup. This resource will provide the necessary steps and insights to create a robust and scalable Hadoop cluster environment.