Step-by-Step Guide to Setting Up Hadoop on Ubuntu: Installation and Configuration Walkthrough
Apache Hadoop is an open-source, Java-based software platform used to manage, store, and process large datasets across clusters of computers. It’s designed for handling big data and employs the Hadoop Distributed File System (HDFS) for data storage and MapReduce for data processing.
Hadoop’s significance lies in its ability to manage vast volumes of data effectively across many interconnected servers. It is widely used across industries and has become the standard framework for big data workloads, distributing storage and processing over a cluster of machines that work collectively.
The platform also includes a suite of tools that complement its core components, allowing it to tackle diverse data challenges efficiently. Deployed across a network of servers, Hadoop provides the infrastructure for storing, managing, and analyzing data. Understanding its fundamental architecture (HDFS for storage, YARN for resource management, and MapReduce for processing) is crucial for configuring and optimizing it, and will make the setup steps below easier to follow.
Let's Begin.
Step 1: Install Java Development Kit
To start, you’ll need to install the Java Development Kit (JDK) on your Ubuntu system. The default Ubuntu repositories offer Java 8 and Java 11, but it’s recommended to use Java 8 for compatibility with Hive. You can use the following command to install it.
sudo apt update && sudo apt install openjdk-8-jdk
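If you want to confirm where the JDK was installed (you will need this path later when setting JAVA_HOME), you can resolve the ‘java’ binary; on Ubuntu the OpenJDK 8 package is typically installed under /usr/lib/jvm/java-8-openjdk-amd64.
readlink -f $(which java)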
Step 2: Verify Java Version
Once the Java Development Kit is successfully installed, you should check the version to ensure it’s working correctly.
java -version
The output should show the installed OpenJDK 1.8 version string, confirming that Java is working.
Step 3: Install SSH
SSH (Secure Shell) is crucial for Hadoop, as it facilitates secure communication between nodes in the Hadoop cluster. This is essential for maintaining data integrity and confidentiality and enabling efficient distributed data processing across the cluster.
sudo apt install ssh
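On Ubuntu, the OpenSSH server normally starts automatically after installation. If you want to confirm that it is running, a quick check (assuming systemd) is
sudo systemctl status ssh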
Step 4: Create the Hadoop User
You need to create a user specifically for running Hadoop components. This user will also be used to log in to Hadoop’s web interface. Run the following command to create the user and set a password.
sudo adduser hadoop
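Optionally, if you want the ‘hadoop’ user to be able to run administrative commands later, you can add it to the sudo group; this is not required by the rest of this guide.
sudo usermod -aG sudo hadoop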
Step 5: Switch User
Switch to the newly created ‘hadoop’ user using the following command:
su - hadoop
Step 6: Configure SSH
Next, you should set up password-less SSH access for the ‘hadoop’ user to streamline the authentication process. You’ll generate an SSH key pair for this purpose, which avoids the need to enter a password or passphrase each time you access the Hadoop system.
ssh-keygen -t rsa
The command prints the location and fingerprint of the generated key pair. Press Enter at each prompt to accept the default location and leave the passphrase empty.
Step 7: Set Permissions
Copy the generated public key to the ‘authorized_keys’ file and set the proper permissions.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
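If the ‘.ssh’ directory itself was created with looser permissions, you may also want to restrict it, since the SSH daemon can reject keys stored in a directory that other users can read.
chmod 700 ~/.ssh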
Step 8: SSH to Localhost
Verify if the password-less SSH is functional.
ssh localhost
The first time you connect, you will be asked to confirm the host’s authenticity and add its key to the known hosts. Type yes and hit Enter; you should then be logged in without being prompted for a password.
Step 9: Switch User
Switch to the ‘hadoop’ user again using the following command.
su - hadoop
Step 10: Install Hadoop
To begin, download Hadoop version 3.3.6 using the ‘wget’ command
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once the download is complete, extract the contents of the downloaded file using the ‘tar’ command. Then rename the extracted folder to ‘hadoop’ to keep the paths used in the rest of this guide simple.
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop
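Optionally, once the extraction has succeeded, you can remove the downloaded archive to free up disk space.
rm hadoop-3.3.6.tar.gz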
Next, you must set up environment variables for Java and Hadoop in your system. Open the ‘~/.bashrc’ file in your preferred text editor. If you’re using ‘nano,’ you can paste with ‘Ctrl+Shift+V’ and save and exit with ‘Ctrl+X,’ then ‘Y,’ then ‘Enter.’
nano ~/.bashrc
Append the following lines to the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Load the above configuration into the current environment
source ~/.bashrc
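To confirm that the new variables are picked up by the current shell, you can check that the ‘hadoop’ command is now on your PATH.
hadoop version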
Additionally, you should configure the ‘JAVA_HOME’ in the ‘hadoop-env.sh’ file. Edit this file with a text editor
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the ‘export JAVA_HOME’ line, uncomment it, and set it as follows
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
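As a quick sanity check, you can grep the file to confirm that the line was saved as expected.
grep -n "export JAVA_HOME" $HADOOP_HOME/etc/hadoop/hadoop-env.sh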
Step 11: Configure Hadoop
Create the name node and data node directories within the ‘hadoop’ user’s home directory using the following commands
cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
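A quick listing confirms the expected directory layout.
ls -R ~/hadoopdata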
Next, edit the ‘core-site.xml’ file to set the default filesystem URI
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Set the ‘fs.defaultFS’ property as shown below; ‘localhost’ works for a single-node setup, or you can replace it with your system hostname
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
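If you would rather bind HDFS to the machine’s hostname than to ‘localhost,’ you can look the name up first and substitute it into the value above.
hostname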
Save and close the file. Then, edit the ‘hdfs-site.xml’ file
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Modify the ‘dfs.replication,’ ‘dfs.namenode.name.dir,’ and ‘dfs.datanode.data.dir’ properties as shown below
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Next, edit the ‘mapred-site.xml’ file
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes so that MapReduce jobs run on YARN and can locate the Hadoop installation
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
</configuration>
Finally, edit the ‘yarn-site.xml’ file
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Make the following changes
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Save the file and close it.
Step 12: Start the Hadoop Cluster
Before starting the Hadoop cluster, you must format the Namenode as the ‘hadoop’ user. Format the Hadoop Namenode with the following command
hdfs namenode -format
Once the Namenode is formatted, the output ends with a message similar to “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.” Start the Hadoop cluster using the following command.
start-all.sh
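The ‘start-all.sh’ script is a convenience wrapper; if you prefer, you can start the HDFS and YARN daemons separately with the following commands.
start-dfs.sh
start-yarn.sh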
You can check the status of all Hadoop services using the command
jps
The output should list the running Hadoop processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, each with its process ID.
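For a more detailed view of HDFS health, including live datanodes and available capacity, you can also run
hdfs dfsadmin -report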
Step 13: Access Hadoop Namenode and Resource Manager
First, determine your server’s IP address. The ‘ifconfig’ command is provided by the ‘net-tools’ package, so install that package if it isn’t already present
sudo apt install net-tools
Then run the following command and note the address of your network interface
ifconfig
To access the Namenode, open your web browser and visit http://your-server-ip:9870. Replace ‘your-server-ip’ with your actual IP address. You should see the Namenode web interface.
To access the Resource Manager, open your web browser and visit http://your-server-ip:8088. You should see the Resource Manager web interface.
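If you are browsing from another machine and the ‘ufw’ firewall is enabled on the server, you may need to open these ports first (run this from an account with sudo privileges).
sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp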
Step 14: Verify the Hadoop Cluster
The Hadoop cluster is now installed and configured. Next, we will create directories in the HDFS filesystem to test it. Create the directories using the following commands
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
Next, run the following command to list the directories created above
hdfs dfs -ls /
The output should list the /test1 and /logs directories you just created.
Next, put some files into the Hadoop file system. For example, copy the log files from the host machine into HDFS.
hdfs dfs -put /var/log/* /logs/
You can verify these files and directories in the Hadoop web interface. Go to the Namenode web interface, click Utilities => Browse the file system, and you should see the directories you created earlier.
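As an additional end-to-end check, you can run one of the example MapReduce jobs bundled with the distribution; the jar path below assumes the Hadoop 3.3.6 layout used in this guide. The job estimates the value of pi and prints the estimate when it finishes.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10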
Step 15: Stop Hadoop Services
To stop the Hadoop services, run the following command as the ‘hadoop’ user
stop-all.sh
The output shows the HDFS and YARN daemons being stopped in turn.
Having successfully installed Hadoop on your Ubuntu system, you’re now ready to start exploring big data analytics.
Conclusion
Through this guide, you’ve set up a single-node Hadoop cluster in pseudo-distributed mode and confirmed that it works by creating directories and copying files into HDFS. To deepen your understanding of writing your own MapReduce programs, consider exploring Apache Hadoop’s MapReduce tutorial, which walks through the code behind the bundled examples.
When you’re prepared to establish a cluster environment, I recommend consulting the Apache Foundation’s comprehensive guide on Hadoop Cluster Setup. This resource will provide the necessary steps and insights to create a robust and scalable Hadoop cluster environment.