Ajith Vijayan Tech Blog

Best Practices in Cluster Provisioning and Management - An Overview

Before creating a cluster, it helps to have a clear idea of how the cluster should be provisioned and managed. Here I am noting down some points for the cluster creation and management process.

Platform Requirements

  • The Cloudera distribution is a good option for creating a Hadoop cluster since it has a well-structured repository and a well-defined documentation set (advanced users may go for the builds from the Apache community).
  • Cloudera Manager is designed to make administration of Hadoop simple and straightforward at any scale. With the help of Cloudera Manager, you can easily deploy and centrally operate the complete Hadoop stack. The application automates the installation process, reducing deployment time from weeks to minutes.
  • CentOS is a good choice of OS since it is built on the RHEL architecture and supports all RHEL add-ons.
  • yum install <packages> is the command used frequently for installing packages from a remote repository. yum install picks the repository URL from /etc/yum.repos.d, downloads the packages, and installs them on the machine. Normally yum works on machines that have internet access, but if we want to install packages in an isolated environment, a normal yum install will not work because the remote repository may not be accessible there. In that situation, we create a local yum repository (a minimal sketch is shown after this list).
  • It's better to turn off the graphical user interface on all the host machines for more efficient utilization of memory.
  • For each installation, add the required environment variables to /etc/bashrc or /etc/profile so that they are available to all users.
  • To apply environment variables updated in /etc/bashrc or /etc/profile to the current shell, use the 'source' command.
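
As mentioned in the yum point above, here is a minimal sketch of setting up a local yum repository. The /opt/local-repo path and the repository name are illustrative assumptions, and the RPM packages are presumed to have been copied there beforehand.

[code lang="shell"]
# Build repository metadata for the RPMs already copied to /opt/local-repo
sudo yum install createrepo        # createrepo must be available (install it while online)
sudo createrepo /opt/local-repo

# Register the local repository with yum
cat <<'EOF' | sudo tee /etc/yum.repos.d/local.repo
[local-repo]
name=Local Repository
baseurl=file:///opt/local-repo
enabled=1
gpgcheck=0
EOF

# Subsequent installs now resolve from the local directory
sudo yum clean all
sudo yum install <packages>
[/code]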

Required Services

  • Ensure the sshd service is running on every node so that Secure Shell access is available.
  • Ensure the iptables service is stopped.
  • Oracle JDK 1.6+ should be used (instead of OpenJDK); it provides the JVM Process Status tool (jps), which is used to display the currently running Hadoop daemons. (The commands below show one way to check these prerequisites.)
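
As referenced in the last point, the following commands offer one way to check these prerequisites on a CentOS/RHEL node (sysvinit-style service names are assumed):

[code lang="shell"]
service sshd status          # sshd should be running on every node
sudo service iptables stop   # stop the firewall for the current session
sudo chkconfig iptables off  # keep it disabled across reboots
java -version                # should report the Oracle JDK, not OpenJDK
jps                          # lists the Hadoop daemons currently running on this node
[/code]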

Generic Points

  • For tarball or source-build installations, the '/opt' location is preferred.
  • Rebooting a Linux machine to apply configuration changes is bad practice and may negatively affect the overall cluster balance.
  • For network connection issues, restart the network service rather than rebooting the host machine (see the command below).
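
For example, on a CentOS/RHEL host the network service can be restarted without a reboot:

[code lang="shell"]
sudo service network restart
[/code]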

How to run multiple Tomcat instances in one server

Last week, I installed multiple Tomcat instances on one server. Usually a service listens on a single port, which means that on a machine one Tomcat instance will listen on one port. So to run multiple Tomcat instances on a single server, we have to change the ports. The steps are given below.

1.) Download three copies of the Tomcat distribution and unpack them into three different folders.

2.) Then edit the WEBSERVER_HOME/conf/server.xml file of each instance individually.

For example, the first Tomcat is in the location /opt/Ajith/Tomcat (this is the WEBSERVER_HOME).

Edit WEBSERVER_HOME/conf/server.xml.

When we unpack Tomcat, the HTTP Connector entry in server.xml will look like this:

[code lang="shell"]
<Connector port="10090" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />
[/code]

As I explained earlier, no two Tomcat servers (or any other servers) can listen on the same port.

So we have to change the ports accordingly.

[code lang="shell"]
<Connector port="10030" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8553" />
[/code]

3.) Then restart Tomcat.

4.) In most cases it will still not start, because we have to change some more ports in server.xml.

So we also have to change the shutdown port and the AJP port.

[code lang="shell"]
<Server port="8005" shutdown="SHUTDOWN">
.................
<Connector port="8010" protocol="AJP/1.3" redirectPort="8553" />
[/code]

Note: Make sure that the redirect port is the same throughout one Tomcat's server.xml (in this case 8553).

5.) Restart Tomcat.

Repeat these steps for the other Tomcat instances, using different ports that are not already in use on that machine.
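
Before choosing ports for a new instance, it helps to confirm which ports are already taken, and after startup, that each instance is listening where expected. A minimal sketch follows, assuming the first instance lives under /opt/Ajith/Tomcat and a second copy under /opt/Ajith/Tomcat2 (the second path is only an illustration):

[code lang="shell"]
netstat -tlnp | grep java              # ports already used by running Java processes

/opt/Ajith/Tomcat/bin/startup.sh       # start the first instance
/opt/Ajith/Tomcat2/bin/startup.sh      # start the second instance (illustrative path)

netstat -tlnp | grep -E '10090|10030'  # each HTTP connector should now be listening
[/code]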

Hadoop Single Node Installation

Step 1: Extracting the Hadoop tarball

We are creating a user home for the Hadoop installation. Here we are using '/home/hduser/Utilities' as the user home. You need to extract the tarball in this location and change the permissions recursively on the extracted directory.

Here we are using hadoop-1.0.3.tar.gz.

mkdir -p /home/hduser/Utilities

cd /home/hduser/Utilities

sudo tar -xzvf hadoop-1.0.3.tar.gz

sudo chown -R hduser:hadoop hadoop-1.0.3

Step 2: Configuring Hadoop environment variables

We are adding HADOOP_HOME as an environment variable in the /etc/bash.bashrc file. By doing this, the Hadoop commands become accessible to every user.

sudo nano /etc/bash.bashrc

Append the following lines to add HADOOP_HOME to PATH.

#set HADOOP_HOME

export HADOOP_HOME=/home/hduser/Utilities/hadoop-1.0.3

export PATH=$HADOOP_HOME/bin:$PATH
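
After appending these lines, reload the file in the current shell and check that the hadoop command is now on the PATH:

[code lang="shell"]
source /etc/bash.bashrc
hadoop version
[/code]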

Step 3: Configuring Java for Hadoop

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/hadoop-env.sh

JAVA_HOME will be commented out by default. Edit the value of JAVA_HOME with your installation path and uncomment the line. The JAVA_HOME path should point to the Java installation directory itself, not to its bin folder.

# The Java implementation to use.

export JAVA_HOME=<absolute path to java directory>
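
For example, if the Oracle JDK were installed under /usr/java/jdk1.6.0_45 (an illustrative path; use your actual installation directory), the line would read:

[code lang="shell"]
# Illustrative path only - substitute your own Java installation directory
export JAVA_HOME=/usr/java/jdk1.6.0_45
[/code]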

Step 4: Configuring Hadoop properties

In Hadoop, we have three configuration files, core-site.xml, mapred-site.xml, and hdfs-site.xml, present in the HADOOP_HOME/conf directory.

  Editing the Configuration files

1. core-site.xml

'hadoop.tmp.dir': the directory specified by this property is used to store file system metadata by the namenode and block data by the datanode. By default, two directories named 'name' and 'data' will be created under this tmp directory. We need to ensure that 'hduser' has sufficient permissions on the newly provided 'hadoop.tmp.dir'. We are configuring it to '/home/hduser/Utilities/app/hadoop/tmp'.

The property 'fs.default.name' is required to provide the hostname and port of the namenode. Create the directory and change the ownership and permissions for 'hduser':

sudo mkdir -p /home/hduser/Utilities/app/hadoop/tmp

sudo chown hduser:hadoop /home/hduser/Utilities/app/hadoop/tmp

sudo chmod 755 /home/hduser/Utilities/app/hadoop/tmp

Setting the ownership and permissions is very important. If you forget this, you will run into exceptions while formatting the namenode.

Open core-site.xml; you can see empty configuration tags. Add the following lines between the configuration tags.

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/core-site.xml

[code lang="shell"]
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/Utilities/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://<hostname/IP address of the system where the namenode is installed>:54310</value>
<description>The name of the default file system</description>
</property>
</configuration>
[/code]

2. hdfs-site.xml

This file holds the settings for the file system and storage. In hdfs-site.xml, add the following property between the configuration tags.

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/hdfs-site.xml

[code lang="shell"]
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>
</configuration>
[/code]

3. mapred-site.xml

This file is used for MapReduce processing. In mapred-site.xml, we need to provide the hostname and port of the jobtracker, as the tasktrackers use this for their communication.

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/mapred-site.xml

[code lang="shell"]
<configuration>
<property>
<name>mapred.job.tracker</name>
<value><hostname/IP address of the system where the jobtracker is installed>:54311</value>
<description>The host and port that the MapReduce jobtracker runs at</description>
</property>
</configuration>
[/code]
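
For a single-node setup where the namenode and jobtracker run on the same machine, the two host/port values above could, for instance, be filled in with the local hostname (here 'localhost' is only an illustrative choice):

[code lang="shell"]
<!-- core-site.xml -->
<value>hdfs://localhost:54310</value>

<!-- mapred-site.xml -->
<value>localhost:54311</value>
[/code]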

Step 5: Formatting Namenode

Before starting the HDFS daemons (such as the namenode) for the first time, it is mandatory that you format the namenode. This is only for the first run; formatting the namenode again on subsequent runs will destroy all data in HDFS. Be careful not to format an already running cluster, even if you only need to restart the namenode daemon.

Namenode can be formatted as

/home/hduser/Utilities/hadoop-1.0.3/bin/hadoop namenode -format

Step 6: Starting Hadoop Daemons

/home/hduser/Utilities/hadoop-1.0.3/bin/start-all.sh

This will start all the Hadoop daemons: namenode, datanode, secondarynamenode, jobtracker and tasktracker.
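
To verify that everything came up, jps should list all five daemons; a typical listing looks like the following (the process IDs are placeholders):

[code lang="shell"]
$ jps
2287 NameNode
2398 DataNode
2511 SecondaryNameNode
2634 JobTracker
2757 TaskTracker
2860 Jps
[/code]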

For stopping the Hadoop daemons, we use the command

 /home/hduser/Utilities/hadoop-1.0.3/bin/stop-all.sh

This will stop all the Hadoop daemons. After they have stopped, jps should show only the Jps process itself.