How to Install and Configure Apache Hadoop on a Single Node in CentOS 7

Apache Hadoop is an open-source framework built for distributed storage and processing of big data across computer clusters. The project is based on the following components:

  1. Hadoop Common – contains the Java libraries and utilities needed by other Hadoop modules.
  2. HDFS (Hadoop Distributed File System) – a Java-based, scalable file system distributed across multiple nodes.
  3. MapReduce – a YARN-based framework for parallel processing of large data sets.
  4. Hadoop YARN – a framework for cluster resource management and job scheduling.

This article will guide you through installing Apache Hadoop on a single-node cluster in CentOS 7 (the steps also work on RHEL 7 and Fedora 23+). This type of configuration is also referred to as Hadoop Pseudo-Distributed Mode.

Step 1: Install Java on CentOS 7

1. Before proceeding with the Java installation, first log in as root (or a user with root privileges) and set your machine's hostname with the following command.

# hostnamectl set-hostname master

Also, add a new record to the hosts file with your machine's FQDN pointing to your system's IP address.

# vi /etc/hosts

Add the following line:

192.168.1.41 master.hadoop.lan

Replace the above hostname and FQDN records with your own settings.
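To confirm that the new record resolves correctly, you can ping the FQDN you just added (substitute your own value):

# ping -c 2 master.hadoop.lan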

2. Next, go to the Oracle Java download page and grab the latest version of the Java SE Development Kit 8 for your system with the help of the curl command:

# curl -LO -H "Cookie: oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.rpm"

3. After the Java binary download finishes, install the package by issuing the below command:

# rpm -Uvh jdk-8u92-linux-x64.rpm
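You can verify that the JDK was installed correctly by checking the reported Java version:

# java -version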

Step 2: Install Hadoop Framework in CentOS 7

4. Next, create a new user account on your system without root privileges, which we’ll use for the Hadoop installation path and working environment. The new account’s home directory will reside in /opt/hadoop.

# useradd -d /opt/hadoop hadoop
# passwd hadoop
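As a quick sanity check, you can confirm that the account exists and that its home directory points to /opt/hadoop:

# getent passwd hadoop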

5. Next, visit the Apache Hadoop page to get the link for the latest stable version and download the archive to your system.

# curl -O http://apache.javapipe.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz 

6. Extract the archive and copy the directory contents to the hadoop account’s home path. Also, make sure you change the ownership of the copied files accordingly.

# tar xfz hadoop-2.7.2.tar.gz
# cp -rf hadoop-2.7.2/* /opt/hadoop/
# chown -R hadoop:hadoop /opt/hadoop/
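At this point you can verify that the Hadoop files are in place and owned by the hadoop user:

# ls -l /opt/hadoop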

7. Next, log in as the hadoop user and configure the Hadoop and Java environment variables on your system by editing the .bash_profile file.

# su - hadoop
$ vi .bash_profile

Append the following lines at the end of the file:

## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

8. Now, load the environment variables into the current session and verify them by issuing the commands below:

$ source .bash_profile
$ echo $HADOOP_HOME
$ echo $JAVA_HOME
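If the variables were loaded correctly, the Hadoop binaries should now be on your PATH, which you can confirm by printing the installed version:

$ hadoop version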

9. Finally, configure SSH key-based authentication for the hadoop account by running the commands below (replace the hostname or FQDN passed to the ssh-copy-id command accordingly).

Also, leave the passphrase field blank in order to log in automatically via SSH.

$ ssh-keygen -t rsa
$ ssh-copy-id master.hadoop.lan
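You can now test the passwordless login by connecting to the host via SSH; it should drop you into a shell without asking for a password (again, replace the FQDN with your own):

$ ssh master.hadoop.lan
$ exit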