Configure Hadoop and Start cluster services using Ansible Playbook.

Harsh Agrawal
6 min read · Feb 8, 2021


Hello everyone! In this blog we are going to configure a Hadoop master node (NameNode) and a slave node (DataNode) via Ansible.

Before we start our practical, let me briefly introduce Hadoop and Ansible.

About Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The NameNode works as the Master in a Hadoop cluster. Listed below are the main functions performed by the NameNode:

1. Stores metadata about the actual data, e.g. filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the Slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and renaming files and directories.
6. As the NameNode keeps metadata in memory for fast retrieval, a large amount of memory is required for its operation. It should be hosted on reliable hardware.

The DataNode works as a Slave in a Hadoop cluster. Listed below are the main functions performed by a DataNode:

1. Actually stores the business data.
2. This is the actual worker node, where read/write/data processing is handled.
3. Upon instruction from the Master, it performs creation/replication/deletion of data blocks.
4. As all the business data is stored on DataNodes, a large amount of storage is required for their operation. Commodity hardware can be used for hosting a DataNode.

About Ansible

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems as well as Microsoft Windows.

For more info, you can refer to my previous blog, Use case of Ansible in automating today’s industries…

Now let’s start!

First, let’s set up everything needed to perform the task.

We will perform our practical on Oracle VirtualBox, so we launch three Red Hat Enterprise Linux 8 (RHEL8) machines.

Our rhel8_arth machine has Ansible installed; from it we will configure the NameNode and DataNode machines.

  • rhel8_arth — 192.168.29.202 — machine where Ansible is installed
  • rhel-exp-2 — 192.168.29.225 — NameNode machine
  • rhel8 hadoop slave — 192.168.29.195 — DataNode machine

Here, rhel8_arth (192.168.29.202) is the Control Node and the other two are Managed Nodes.

Configuring the Control Node

  • Set up the inventory file so that Ansible can connect with the managed nodes.
Setting inventory file
  • Now that the inventory file is configured, Ansible can ping the managed nodes with the command:
ansible <ip/name of managed nodes> -m ping
Checking the connectivity
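The inventory file itself only appears in the screenshot, so here is a minimal sketch of what it might look like. The group names and IPs come from this setup; the file path and the credential variables are assumptions:

```ini
; minimal Ansible inventory sketch (path and credentials are assumptions)
[namenode]
192.168.29.225

[datanode]
192.168.29.195

[all:vars]
ansible_user=root
ansible_ssh_pass=redhat
ansible_connection=ssh
```

With something like this in place, `ansible all -m ping` should return pong from both managed nodes.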

Namenode Configuration

To configure the rhel-exp-2 (192.168.29.225) machine as the NameNode of the Hadoop cluster via Ansible from rhel8_arth (192.168.29.202), we need to configure some files.

Create a directory and add the files below:

  1. core-site.xml
  2. hdfs-site.xml
  3. hadoop.yml
  4. nam_var.yml

These are —

  • core-site.xml
core-site.xml
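The file contents only appear in the screenshot. For Hadoop 1.2.1, the NameNode’s core-site.xml typically looks like the sketch below; fs.default.name is the Hadoop 1.x property, and the port 9001 is an assumption:

```xml
<!-- core-site.xml on the NameNode (a sketch; the port 9001 is an assumption) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```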
  • hdfs-site.xml
hdfs-site.xml
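Again, the screenshot is not reproduced here; for Hadoop 1.x, hdfs-site.xml on the NameNode typically sets the metadata directory via dfs.name.dir. The directory /nn is an assumed example:

```xml
<!-- hdfs-site.xml on the NameNode (a sketch; the directory /nn is an assumption) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```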
  • nam_var.yml file

This is the variable file containing the variables we use in our playbook.

hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
  • Playbook for Namenode
- hosts: namenode
  vars_files:
    - nam_var.yml
  tasks:
    - name: "copying jdk file on namenode"
      copy:
        dest: "/root"
        src: "{{ jdk_path }}"
    - name: "copying hadoop file on namenode"
      copy:
        dest: "/root"
        src: "{{ hadoop_path }}"
    - name: "installing jdk"
      command: "rpm -i /root/{{ jdk_software }}"
      ignore_errors: yes
    - name: "installing hadoop"
      command: "rpm -i /root/{{ hadoop_software }} --force"
      ignore_errors: yes
    - name: "configuring core-site.xml file"
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "core-site.xml"
    - name: "configuring hdfs-site.xml file"
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "hdfs-site.xml"
    - name: "formatting master"
      command: "hadoop namenode -format"
    - name: "starting hadoop service"
      command: "hadoop-daemon.sh start namenode"

Now put all four files above (core-site.xml, hdfs-site.xml, nam_var.yml, and the playbook) in one directory and run the playbook with the command:

ansible-playbook <name_of_YML_file>

Run the above command from the same directory where you kept your YAML files.

Running playbook to configure Namenode

Now go to the rhel-exp-2 (192.168.29.225) machine, which you configured as the NameNode, and run the jps command. If it shows NameNode (as below), then your NameNode has been successfully configured by Ansible.

Datanode Configuration

Now we need to configure the DataNode machine via Ansible, which will provide its storage to the NameNode.

To configure the rhel8 hadoop slave (192.168.29.195) machine as a DataNode of the Hadoop cluster via Ansible from rhel8_arth (192.168.29.202), we configure all the above files as below.

  • core-site.xml
core-site.xml
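On the DataNode, core-site.xml must point at the NameNode’s address rather than 0.0.0.0, so the daemon knows which master to register with. A sketch, with the port 9001 as an assumption:

```xml
<!-- core-site.xml on the DataNode (a sketch; the port 9001 is an assumption) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.29.225:9001</value>
  </property>
</configuration>
```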
  • hdfs-site.xml
hdfs-site.xml
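On the DataNode, hdfs-site.xml typically sets the directory the node contributes as block storage via dfs.data.dir (the Hadoop 1.x property). The directory /dn is an assumed example:

```xml
<!-- hdfs-site.xml on the DataNode (a sketch; the directory /dn is an assumption) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```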
  • nam_var.yml file
hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
  • Playbook for Datanode
- hosts: datanode
  vars_files:
    - nam_var.yml
  tasks:
    - name: "copying jdk file on datanode"
      copy:
        dest: "/root"
        src: "{{ jdk_path }}"
    - name: "copying hadoop file on datanode"
      copy:
        dest: "/root"
        src: "{{ hadoop_path }}"
    - name: "installing jdk"
      command: "rpm -i /root/{{ jdk_software }}"
      ignore_errors: yes
    - name: "installing hadoop"
      command: "rpm -i /root/{{ hadoop_software }} --force"
      ignore_errors: yes
    - name: "configuring core-site.xml file"
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "core-site.xml"
    - name: "configuring hdfs-site.xml file"
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "hdfs-site.xml"
    - name: "starting hadoop service on datanode"
      command: "hadoop-daemon.sh start datanode"

Run the playbook with the command:

ansible-playbook <name_of_yml_file>
Running playbook to configure Datanode

Now go to the DataNode machine and run the jps command. If you see DataNode listed (as below), then your DataNode has been successfully configured and is ready to provide storage to your NameNode machine.

Again, go to your NameNode machine and run the command:

hadoop dfsadmin -report

You can clearly see the IP address of the DataNode (192.168.29.195), which has contributed 46.81 GB of storage to the NameNode machine.

So that’s how we can set up the Hadoop cluster using Ansible.

If you have any queries, you can contact me via LinkedIn: Harsh Agrawal.

You can also refer to my scripts on GitHub.

Thank you

Harsh Agrawal
