Yogesh-R | March 7th, 2022
Difference between Hadoop 1 & Hadoop 2

Hadoop V1:  

CHALLENGES:

  1.  Supports batch processing only, i.e. only MapReduce jobs can be run.
  2.  The NameNode is a single point of failure: if it goes down, the whole cluster is unavailable.
  3.  External data stores are needed for real-time processing or graph analysis.
  4.  Doesn't support multi-tenancy (multiple processing frameworks can't share the cluster at the same time).
  5.  Can't scale beyond roughly 4,000 nodes per cluster with acceptable performance.

Hadoop V2:  

  1.  HDFS federation
  2.  Multiple NameNodes
  3.  YARN takes over resource management from MapReduce:
  • Better control over processing and resource allocation
  • Support for non-MapReduce workloads
  • Support for multi-tenancy
Figure: HDFS Federation (https://3.bp.blogspot.com/-tEpt1A2uGoU/WM-1tErdLSI/AAAAAAAAEbo/LQZ54IIaNzce0ikGuSJbu2eS5pkJPxngwCLcB/s1600/HDFS_Federation-528x235.jpg)

HDFS Architecture: Hadoop 1 v/s Hadoop 2

Figure: Hadoop 2 with YARN (https://4.bp.blogspot.com/-H8kz6A51P6k/WS74mmPNRRI/AAAAAAAAE68/wn4xjZoOkhw46ZxfBBSzH5sbk0m_PSwRACLcB/s400/Hadoop-2.0-yarn.jpg)

Hadoop 2 & YARN Architecture


1.  Resource Manager

  • Generally one ResourceManager per cluster.
  • You may have one active ResourceManager and one standby ResourceManager for high availability.

2.  Node Manager

  • One NodeManager runs per worker node; it launches containers and monitors their resource usage.

3.  Application Master

  • It manages and schedules the tasks of a single job (e.g. a MapReduce job).
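Once the cluster built later in this post is running, you can see these components from the command line; a quick sketch using the standard YARN CLI:

[root@namenode ~]# yarn node -list          # NodeManagers registered with the ResourceManager
[root@namenode ~]# yarn application -list   # running applications; each has its own Application Master
[root@namenode ~]# jps                      # Hadoop/YARN daemons running on the local node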

Prerequisite: conceptual knowledge of Big Data and the Hadoop framework is required. The steps to configure a cluster are given below.

For the multinode cluster I am assuming that we have four different hardware machines (one NameNode and three DataNodes) with RedHat/CentOS 7 installed.

Important:   If your lab or virtual machine doesn't have a DNS server, follow the steps given below.

Note:   For the web portals to work, every node must be able to reach every other node by both IP address and hostname.

Note:   You can use the hostnamectl command to set the hostname of any system, as shown below.

Step 1:   Set the hostname accordingly

[root@namenode ~]# hostnamectl set-hostname namenode.cluster1.com

On CentOS 7 this change is already persistent; it is stored in /etc/hostname:

[root@namenode ~]# cat /etc/hostname
namenode.cluster1.com
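Repeat this step with the matching name on each DataNode. To confirm a node's name took effect:

[root@namenode ~]# hostnamectl status   # shows the static hostname
[root@namenode ~]# hostname -f          # prints the fully qualified name once /etc/hosts is set up in Step 2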

Step 2 :   Make your local DNS entries. On every system, /etc/hosts will look like this:

[root@namenode ~]# cat /etc/hosts

192.168.0.254         namenode.cluster1.com
192.168.0.201         datanode1.cluster1.com
192.168.0.202         datanode2.cluster1.com
192.168.0.203         datanode3.cluster1.com
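Before going further, it is worth confirming that every node can resolve and reach the others; a minimal check using the hostnames above, run from each node:

[root@namenode ~]# for h in namenode datanode1 datanode2 datanode3; do ping -c 1 $h.cluster1.com; done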


Step 3 :   Flush the firewall and disable SELinux on each node

[root@namenode ~]# setenforce 0
[root@namenode ~]# iptables -F
OR
[root@namenode ~]# systemctl disable --now firewalld   # for production, add proper firewall rules instead
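Note that setenforce 0 only lasts until reboot. To make SELinux permissive across reboots, assuming the stock /etc/selinux/config layout:

[root@namenode ~]# sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config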

Step 4:   Make sure you have yum configured for software installation, or download the packages from the Apache and Oracle websites on each node.

System 1:   (NameNode)

IP Address :     192.168.0.254
Hostname   :       namenode.cluster1.com

System 2:     (DataNode)

IP Address :     192.168.0.201
Hostname   :       datanode1.cluster1.com

System 3:     (DataNode)

IP Address :     192.168.0.202
Hostname   :       datanode2.cluster1.com

System 4:      (DataNode)

IP Address :     192.168.0.203
Hostname   :       datanode3.cluster1.com


Note:   To build the Hadoop cluster you need to install Java JDK 1.8 (or higher) and Hadoop 2 on each node.

Important:    I am assuming that you already have a yum client configured; alternatively, you can download the software from the Oracle and Apache websites.



HADOOP V2 CLUSTER SETUP:

Software required: 

1. Download Apache Hadoop version 2.7 or later from the link below:
https://archive.apache.org/dist/hadoop/core/


2.  Download JDK 1.8 from the Oracle website:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
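For example, the Hadoop 2.7.3 tarball used below can be fetched directly from the Apache archive (the Oracle JDK download requires accepting a license on their site, so it is easiest to download in a browser):

[root@namenode ~]# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz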

3.  Setting the JDK path

[root@namenode ~]# cat /root/.bashrc
# .bashrc

# User specific aliases and functions
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

JAVA_HOME=/usr/java/jdk1.8.0_121
export JAVA_HOME
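Assuming the JDK RPM was installed under /usr/java/jdk1.8.0_121, you can confirm it is usable:

[root@namenode ~]# source /root/.bashrc
[root@namenode ~]# $JAVA_HOME/bin/java -version    # should report java version "1.8.0_121"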

4.  Installing Apache Hadoop from the tarball

[root@namenode hadoop]# tar -xvzf hadoop-2.7.3.tar.gz

Move it to / (renaming it to /hadoop2) for permission and security reasons:

[root@namenode hadoop]# mv hadoop-2.7.3 /hadoop2

Important: after adding HADOOP_HOME, the file must look like this:

[root@namenode hadoop]# cat /root/.bashrc
# .bashrc

# User specific aliases and functions
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
JAVA_HOME=/usr/java/jdk1.8.0_121
HADOOP_HOME=/hadoop2
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export JAVA_HOME HADOOP_HOME PATH
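After reloading .bashrc, the Hadoop binaries should resolve from anywhere:

[root@namenode hadoop]# source /root/.bashrc
[root@namenode hadoop]# hadoop version    # should report Hadoop 2.7.3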

Note :    Under /hadoop2/etc/hadoop you will find files something like those given below.

[root@namenode hadoop]# cd /hadoop2/etc/hadoop/
[root@namenode hadoop]# ls
capacity-scheduler.xml      hadoop-policy.xml        kms-log4j.properties        slaves
configuration.xsl           hdfs-site.xml            kms-site.xml                ssl-client.xml.example
container-executor.cfg      httpfs-env.sh            log4j.properties            ssl-server.xml.example
core-site.xml               httpfs-log4j.properties  mapred-env.cmd              yarn-env.cmd
hadoop-env.cmd              httpfs-signature.secret  mapred-env.sh               yarn-env.sh
hadoop-env.sh               httpfs-site.xml          mapred-queues.xml.template  yarn-site.xml
hadoop-metrics2.properties  kms-acls.xml             mapred-site.xml
hadoop-metrics.properties   kms-env.sh               mapred-site.xml.template

Now set up some important files for the HDFS cluster:

1.   In hadoop-env.sh, set the JDK path; it will then look like this:

[root@namenode hadoop]# cat hadoop-env.sh

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

export JAVA_HOME=/usr/java/jdk1.8.0_121

2.   hdfs-site.xml will look like this 

[root@namenode hadoop]# cat hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/nnnhhh22</value>
    </property>
</configuration>
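The metadata directory named in dfs.namenode.name.dir must exist and be writable; the format step below will create it if it is missing, but creating it up front avoids surprises:

[root@namenode hadoop]# mkdir -p /nnnhhh22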

3.   core-site.xml will look like this

[root@namenode hadoop]# cat core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.0.254:10002</value>
    </property>
</configuration>

FORMAT NAMENODE AND START SERVICE

[root@namenode hadoop]# hdfs namenode -format

Start the NameNode service:

[root@namenode hadoop]# hadoop-daemon.sh start namenode
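A quick way to confirm the daemon came up (the NameNode web UI on port 50070 is another option in Hadoop 2):

[root@namenode hadoop]# jps                    # should list a NameNode process
[root@namenode hadoop]# hdfs dfsadmin -report  # cluster summary; DataNodes appear after the next section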



SETUP DATANODE

Note:    Repeat the same JDK and Hadoop v2 installation steps on each DataNode.

Then configure the DataNode, starting with hdfs-site.xml:

[root@datanode1 hadoop]# cat hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/nnnhhh22ddnn</value>
    </property>
</configuration>

 

[root@datanode1 hadoop]# cat core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.0.254:10002</value>
    </property>
</configuration>

[root@datanode1 hadoop]# hadoop-daemon.sh start datanode
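Repeat the same configuration on datanode2 and datanode3, then verify from the NameNode that all three DataNodes have registered:

[root@namenode hadoop]# hdfs dfsadmin -report    # "Live datanodes" should show 3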

Now it is time for the YARN cluster setup:

MR2 supports three frameworks, selected via the mapreduce.framework.name property:

local   : runs the job locally in a single JVM; no cluster daemons are required
classic : runs the MR1 framework (JobTracker/TaskTracker)
yarn    : runs on a multinode cluster; requires a NodeManager on each slave, with the
          mapreduce_shuffle auxiliary service enabled on the NodeManager side

RESOURCE MANAGER & CLIENT ON THE SAME NODE

[root@namenode hadoop]# cat mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

[root@namenode hadoop]# cat yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>resource_masterip:8025</value>
    </property>

    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>resource_masterip:8030</value>
    </property>

    <property>
        <name>yarn.resourcemanager.address</name>
        <value>resource_masterip:8032</value>
    </property>
</configuration>

(Replace resource_masterip with the ResourceManager's IP address, here 192.168.0.254.)


[root@namenode hadoop]# yarn-daemon.sh start resourcemanager
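jps should now show a ResourceManager process; the ResourceManager web UI is served on port 8088 by default:

[root@namenode hadoop]# jps
[root@namenode hadoop]# yarn node -list    # NodeManagers will appear here once the slaves below are configured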

FOR NODE MANAGER:  (SLAVE)

[root@datanode1 hadoop]# vim yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>resource_masterip:8025</value>
    </property>
</configuration>