Configure Hadoop and Start cluster services using Ansible Playbook.

Harsh Agrawal
6 min read · Feb 8, 2021


Hello everyone! In this blog we are going to configure a Hadoop master node (NameNode) and a slave node (DataNode) via Ansible.

Before we start our practical, let me briefly introduce Hadoop and Ansible.

About Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The NameNode works as the Master in a Hadoop cluster. Listed below are the main functions performed by the NameNode:

1. Stores metadata about the actual data, e.g. filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the Slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and renaming files and directories.
6. As the NameNode keeps metadata in memory for fast retrieval, a large amount of memory is required for its operation. It should be hosted on reliable hardware.

The DataNode works as a Slave in a Hadoop cluster. Listed below are the main functions performed by a DataNode:

1. Actually stores the business data.
2. This is the actual worker node, where read/write/data processing is handled.
3. Upon instruction from the Master, it performs creation/replication/deletion of data blocks.
4. As all the business data is stored on DataNodes, a large amount of storage is required for their operation. Commodity hardware can be used for hosting a DataNode.

About Ansible

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems as well as Microsoft Windows.

For more info, you can refer to my previous blog, Use case of Ansible in automating today’s industries…

Now let’s start!

First, let’s set up everything needed to perform the task.

We will perform our practical on Oracle VirtualBox, so we launch three Red Hat Enterprise Linux 8 (RHEL8) machines.

Our rhel8_arth machine has Ansible installed; from it we will configure the NameNode and DataNode machines.

  • rhel8_arth — 192.168.29.202 — machine where Ansible is installed
  • rhel-exp-2 — 192.168.29.225 — NameNode machine
  • rhel8 hadoop slave — 192.168.29.195 — DataNode machine

Here, rhel8_arth (192.168.29.202) is the Control Node and the other two are Managed Nodes.

Configuring the Control Node

  • Set up the inventory file so that Ansible can connect with the managed nodes.
Setting inventory file
  • Now that the inventory file is configured, Ansible can ping the managed nodes with the command:
ansible <ip/name of managed nodes> -m ping
Checking the connectivity
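The inventory file itself only appears in the screenshot, so here is a minimal sketch of what it might look like. The group names and IPs come from this setup; the file path and the credential variables are assumptions:

```ini
; minimal Ansible inventory sketch (path and credentials are assumptions)
[namenode]
192.168.29.225

[datanode]
192.168.29.195

[all:vars]
ansible_user=root
ansible_ssh_pass=redhat
ansible_connection=ssh
```

With something like this in place, `ansible all -m ping` should return pong from both managed nodes.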

Namenode Configuration

To configure the rhel-exp-2 (192.168.29.225) machine as the NameNode of the Hadoop cluster via Ansible from rhel8_arth (192.168.29.202), we need to configure some files.

Create a directory and add the files below:

  1. core-site.xml
  2. hdfs-site.xml
  3. hadoop.yml
  4. nam_var.yml

These are —

  • core-site.xml
core-site.xml
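The file contents only appear in the screenshot. For Hadoop 1.2.1, the NameNode’s core-site.xml typically looks like the sketch below; fs.default.name is the Hadoop 1.x property, and the port 9001 is an assumption:

```xml
<!-- core-site.xml on the NameNode (a sketch; the port 9001 is an assumption) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```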
  • hdfs-site.xml
hdfs-site.xml
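Again, the screenshot is not reproduced here; for Hadoop 1.x, hdfs-site.xml on the NameNode typically sets the metadata directory via dfs.name.dir. The directory /nn is an assumed example:

```xml
<!-- hdfs-site.xml on the NameNode (a sketch; the directory /nn is an assumption) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```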
  • nam_var.yml file

This is the variable file containing the variables we use in our playbook.

hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
  • Playbook for Namenode
- hosts: namenode
  vars_files:
    - nam_var.yml
  tasks:
    - name: "copying jdk file on namenode"
      copy:
        dest: "/root"
        src: "{{ jdk_path }}"
    - name: "copying hadoop file on namenode"
      copy:
        dest: "/root"
        src: "{{ hadoop_path }}"
    - name: "installing jdk"
      command: "rpm -i /root/{{ jdk_software }}"
      ignore_errors: yes
    - name: "installing hadoop"
      command: "rpm -i /root/{{ hadoop_software }} --force"
      ignore_errors: yes
    - name: "configuring core-site.xml file"
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "core-site.xml"
    - name: "configuring hdfs-site.xml file"
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "hdfs-site.xml"
    - name: "formatting master"
      command: "hadoop namenode -format"
    - name: "starting hadoop service"
      command: "hadoop-daemon.sh start namenode"

Now put all four files above (core-site.xml, hdfs-site.xml, nam_var.yml, and the playbook) in one directory and run the playbook with the command:

ansible-playbook <name_of_YML_file>

Run the above command from the same directory where you kept your YAML files.

Running playbook to configure Namenode

Now go to the rhel-exp-2 (192.168.29.225) machine, which you configured as the NameNode, and run the jps command. If it shows NameNode (as below), then your NameNode has been successfully configured by Ansible.

Datanode Configuration

Now we need to configure the DataNode machine via Ansible, which will provide its storage to the NameNode.

To configure the rhel8 hadoop slave (192.168.29.195) machine as a DataNode of the Hadoop cluster via Ansible from rhel8_arth (192.168.29.202), we configure all the above files as below.

  • core-site.xml
core-site.xml
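On the DataNode, core-site.xml must point at the NameNode’s address rather than 0.0.0.0, so the daemon knows which master to register with. A sketch, with the port 9001 as an assumption:

```xml
<!-- core-site.xml on the DataNode (a sketch; the port 9001 is an assumption) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.29.225:9001</value>
  </property>
</configuration>
```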
  • hdfs-site.xml
hdfs-site.xml
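On the DataNode, hdfs-site.xml typically sets the directory the node contributes as block storage via dfs.data.dir (the Hadoop 1.x property). The directory /dn is an assumed example:

```xml
<!-- hdfs-site.xml on the DataNode (a sketch; the directory /dn is an assumption) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```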
  • nam_var.yml file
hadoop_path: "/root/hadoop-1.2.1-1.x86_64.rpm"
jdk_path: "/root/jdk-8u171-linux-x64.rpm"
hadoop_software: "hadoop-1.2.1-1.x86_64.rpm"
jdk_software: "jdk-8u171-linux-x64.rpm"
  • Playbook for Datanode
- hosts: datanode
  vars_files:
    - nam_var.yml
  tasks:
    - name: "copying jdk file on datanode"
      copy:
        dest: "/root"
        src: "{{ jdk_path }}"
    - name: "copying hadoop file on datanode"
      copy:
        dest: "/root"
        src: "{{ hadoop_path }}"
    - name: "installing jdk"
      command: "rpm -i /root/{{ jdk_software }}"
      ignore_errors: yes
    - name: "installing hadoop"
      command: "rpm -i /root/{{ hadoop_software }} --force"
      ignore_errors: yes
    - name: "configuring core-site.xml file"
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "core-site.xml"
    - name: "configuring hdfs-site.xml file"
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "hdfs-site.xml"
    - name: "starting hadoop service on datanode"
      command: "hadoop-daemon.sh start datanode"

Run the playbook with the command:

ansible-playbook <name_of_yml_file>
Running playbook to configure Datanode

Now go to the DataNode machine and run the jps command. If you see DataNode listed (as below), then your DataNode has been successfully configured and is ready to provide storage to your NameNode machine.

Again, go to your NameNode machine and run the command:

hadoop dfsadmin -report

You can clearly see the IP address of the DataNode (192.168.29.195), which has contributed 46.81 GB of storage to the NameNode machine.

So that’s how we can set up the Hadoop cluster using Ansible.

If you have any queries, you can contact me via LinkedIn: Harsh Agrawal.

You can also refer to my scripts on GitHub.

Thank you

Harsh Agrawal
