Ajith Vijayan

Tech Blog


Hadoop on Windows using HDP

We all know that Hadoop is a framework that enables the distributed processing of large datasets across clusters of commodity hardware. HDP (Hortonworks Data Platform) is an enterprise-grade, hardened Hadoop distribution. HDP bundles the most useful and stable versions of Apache Hadoop and its related projects into a single tested and certified package. Using HDP, we can install Hadoop on Windows.


The following software is required:

  • Microsoft Visual C++
  • Microsoft .NET Framework 4.0
  • Java 1.6
  • Python 2.7

Note: After installing Java and Python, set the environment variables accordingly.

Firewall Configuration

HDP uses multiple ports for the communication between clients and service components, so we have to open those ports accordingly.

[code lang="shell"]
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=$PORT_NUMBER

Pre-Installation steps

Create a text file that describes the Hadoop components. This text file contains the hostname and other properties of the Hadoop components.


  • Ensure that all properties in the text file are separated by the newline character.
  • Ensure that the directory paths do not contain any whitespace characters.

An example text file (here clusterProperties.txt) is shown below.

[code lang="shell"]
#Log directory
HDP_LOG_DIR=<Path to Log directory>
#Data directory
HDP_DATA_DIR=<Path to Data directory>
NAMENODE_HOST=<Hostname of namenode machine>
SECONDARY_NAMENODE_HOST=<Hostname of secondary namenode machine>
JOBTRACKER_HOST=<Hostname of Jobtracker machine>
HIVE_SERVER_HOST=<Hostname of Hive server machine>
OOZIE_SERVER_HOST=<Hostname of Oozie server machine>
WEBHCAT_HOST=<Hostname of server machine>
FLUME_HOSTS=<Hostname of server machine>
HBASE_MASTER=<Hostname of Hbase master>
HBASE_REGIONSERVERS=<Hostname of Hbaseregionserver>
ZOOKEEPER_HOSTS=<Hostname of zookeeper machines>
SLAVE_HOSTS=<Hostname of slave machines>
#Database host
DB_HOSTNAME=<Hostname of server machine>
#Hive properties
HIVE_DB_NAME=<Hive Database name>
HIVE_DB_USERNAME=<Database username>
HIVE_DB_PASSWORD=<Database password>
#Oozie properties
OOZIE_DB_NAME=<oozie database name>
OOZIE_DB_USERNAME=<oozie database username>
OOZIE_DB_PASSWORD=<oozie database password>



At the command prompt, we have to execute the installation command for HDP:

[code lang="shell"]
msiexec /i "<$PATH_to_MSI_file>" /lv "<$PATH_to_Installer_log_file>" HDP_LAYOUT="<$PATH_to_clusterproperties.txt>" HDP_DIR="<$PATH_to_HDP_Install_Dir>" DESTROY_DATA="<Yes_OR_No>"

for example

[code lang="shell"]
msiexec /i "D:\HDPINSTALLATION\hdp-\hdp-\hdp-" /lv "D:\HDP\log\installer.log" HDP_LAYOUT="D:\HDPINSTALLATION\hdp-\hdp-\clusterproperties.txt" HDP_DIR="D:\HDP\hdp_wd" DESTROY_DATA="no"

After the installation, a pop-up window will appear to indicate the installation completion.

Start HDP components

  • cd <Path to HDP directory>
  • start_local_hdp_services

Alternatively, we can start these daemons from the Services console of the Windows operating system.

Execute a Mapreduce job

To run a MapReduce job, we have to execute the command ‘Run-SmokeTests.cmd’:

  • Run-SmokeTests.cmd

Cloudera Sentry – Authorization Mechanisms in Hadoop

Cloudera Sentry is an authorization mechanism for Hadoop. Many people confuse authorization with authentication.

So what is the difference between authentication and authorization?

Authentication verifies “who you are”. It is the mechanism by which a system identifies its users.

Authorization means “what you are authorized to do”. In other words, authorization is the mechanism by which a system determines what level of access a particular (authenticated) user should have to the resources controlled by the system.

Authorization happens after a successful authentication.
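The distinction can be shown with a toy shell sketch (the user name and the allow-rule here are made up for illustration; real systems such as Sentry express rules in policy files, not case statements):

```shell
# Authentication has already established WHO the user is (hypothetical value).
user="alice"

# Authorization now decides WHAT that identity may do.
case "$user" in
  alice) decision="allow" ;;   # rule: alice may read the resource
  *)     decision="deny"  ;;
esac

echo "$user: $decision"
```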

Cloudera Sentry is a Cloudera-backed project used for authorization in Hadoop. Currently, Sentry provides fine-grained authorization for HiveServer2 and Impala.

Features of Sentry:

1.) Secure authorization – Using Cloudera Sentry, we can control access to data and grant privileges on data to authenticated users.

2.) Fine-grained authorization – Using Cloudera Sentry, we can grant or restrict access to databases, tables, and views, and we can set permissions on particular rows or columns of a table.

3.) Role-based authorization – Authorization is based on functional roles: a normal user can access only a limited set of files, while a super user or admin user can access many more files or databases.
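To sketch how such role-based rules are wired up, a Sentry policy file (the sentry-provider.ini format used with HiveServer2) maps groups to roles and roles to privileges. The group, role, server, and database names below are made-up examples:

```ini
[groups]
# OS/LDAP group -> Sentry role
analysts = analyst_role

[roles]
# role -> privilege: SELECT on one table of the hypothetical 'sales' database
analyst_role = server=server1->db=sales->table=orders->action=select
```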

How to run multiple Tomcat instances in one server

Last week, I installed multiple Tomcat instances on one server. Usually each service listens on a single port, which means that on one machine, one Tomcat instance listens on one port. So to run multiple Tomcat instances on a single server, we have to change the ports. The steps are given below.

1.) Download three Tomcat instances and unpack them into three different folders.

2.) Then edit each instance's WEBSERVER_HOME/conf/server.xml file individually.

For example, the first Tomcat is in the location /opt/Ajith/Tomcat (this is the WEBSERVER_HOME).

Edit WEBSERVER_HOME/conf/server.xml.

When we unpack Tomcat, the HTTP Connector in server.xml will look like this:

[code lang="xml"]

<Connector port="10090" protocol="HTTP/1.1" redirectPort="8443" />


As I explained earlier, no two Tomcat servers (or any other servers) can listen on the same port.

So we have to change the ports accordingly.

[code lang="xml"]

<Connector port="10030" protocol="HTTP/1.1" redirectPort="8553" />


3.) Then restart Tomcat.

4.) In most cases it will not start, because we still have to change some more ports in server.xml.

So we have to change the “shutdown” port and the “AJP” port.

[code lang="xml"]

<Server port="8005" shutdown="SHUTDOWN">

<Connector port="8010" protocol="AJP/1.3" redirectPort="8553" />


Note: Make sure that the redirect port is the same throughout one Tomcat's server.xml (in this case 8553).

5.) Restart Tomcat.

Repeat the steps for the other Tomcat instances, using ports that are not already in use on that machine.
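Putting the steps together, a second instance's server.xml ends up with a non-conflicting port set like the following sketch (the structure is abbreviated and the port numbers are only examples; any free ports work):

```xml
<!-- Second Tomcat instance: shutdown (8015), HTTP (10030), and AJP (8010)
     ports all differ from the first instance; redirectPort is shared. -->
<Server port="8015" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="10030" protocol="HTTP/1.1" redirectPort="8553" />
    <Connector port="8010" protocol="AJP/1.3" redirectPort="8553" />
    <Engine name="Catalina" defaultHost="localhost">
      <Host name="localhost" appBase="webapps" />
    </Engine>
  </Service>
</Server>
```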

HBase Installation on Linux Machine

Apache HBase is a distributed, scalable big data store; it is, in effect, the Hadoop database. Apache HBase is a subproject of the Apache Hadoop project and is used to provide real-time read and write access to your big data. HBase runs on top of HDFS (Hadoop Distributed File System).

Before installing HBase, we have to install Hadoop and ZooKeeper. The ZooKeeper cluster acts as a coordination service for the entire HBase cluster. For a single-node HBase installation, ZooKeeper is installed and started automatically.

For HBase deployment, selecting the Hadoop version is critical: the Hadoop version must be compatible with the HBase version. Here we are using HBase-0.94.x, which requires hadoop-1.0.3 at a minimum.

Basic Prerequisites

This section lists required services and some required system configuration.

Software required

  • Java JDK

As of now, the recommended and tested versions of Java for Hadoop and HBase installation include:

Oracle JDK 1.6 (u20, u21, u26, u28, u31)

Hadoop requires Java 1.6+. It is built and tested on Oracle Java, which is the only “supported” JVM.

  • The latest stable version of Hadoop 1.x.x (here we are using the current stable version, hadoop-1.0.3)
  • The latest stable version of HBase 0.9x.x (here we are using the current stable version, HBase-0.94.x)

Before Installing HBase, we have to install Hadoop and Zookeeper.

Installing HBase

HBase is a distributed, non-relational, open-source database. It is a key-value store that runs over the Hadoop architecture and the HDFS file system.

Step 1: Copy the HBase tarball to a particular location and untar it. We are using hbase-0.94.x.tar.gz.

cd /home/hduser/Utilities

tar -xzvf hbase-0.94.x.tar.gz

Step 2: Edit /home/hduser/Utilities/hbase/conf/hbase-env.sh and define $JAVA_HOME.

nano /home/hduser/Utilities/hbase/conf/hbase-env.sh

HBASE_MANAGES_ZK needs to be set to ‘true’ so that HBase manages its own ZooKeeper instance. If it is set to false, HBase expects an externally managed ZooKeeper.

export HBASE_MANAGES_ZK=true

Step 3: Edit the hbase-site.xml

A typical hbase-site.xml is given below:

[code lang="xml"]

<configuration>

<property>
<name>hbase.rootdir</name>
<value>hdfs://<hostname of the namenode machine>:54310/hbase</value>
</property>

<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

<property>
<name>hbase.zookeeper.quorum</name>
<value><ipaddress of the zookeeper installed system></value>
</property>

<property>
<name>hbase.master</name>
<value><hostname of the HBase master machine>:60000</value>
<description>The host and port that HBase master runs</description>
</property>

</configuration>

The value of ‘hbase.rootdir’ contains the hostname and port number of the system where the namenode is running. (The port number should be the same as in core-site.xml on the namenode machine.)

We have to create a folder in HDFS to store the HBase data (this folder should be mentioned in the ‘hbase.rootdir’ value):

hadoop fs -ls /

hadoop fs -mkdir /hbase

For a multinode HBase installation, the property ‘hbase.zookeeper.quorum’ is important; it is used to identify the systems where ZooKeeper is installed.

  • To start HBase daemons

bin/start-hbase.sh

  • To stop HBase daemons

bin/stop-hbase.sh
The default UI port of HBase is 60010. The HBase UI can be viewed at:

http://<IP address of the system>:60010

Hadoop Single Node Installation

Step 1: Extracting the Hadoop tarball

We are creating a user home for the Hadoop installation. Here we are using ‘/home/hduser/Utilities’ as the user home. You need to extract the tarball in this location and change the permissions recursively on the extracted directory.

Here we are using hadoop-1.0.3.tar.gz.

mkdir -p /home/hduser/Utilities

cd /home/hduser/Utilities

sudo tar -xzvf hadoop-1.0.3.tar.gz

sudo chown -R hduser:hadoop hadoop-1.0.3

Step 2: Configuring Hadoop environment variables

We are adding HADOOP_HOME as an environment variable in the /etc/bash.bashrc file. By doing this, every user can access the Hadoop commands.

sudo nano /etc/bash.bashrc

Append the following lines to add HADOOP_HOME to PATH.

export HADOOP_HOME=/home/hduser/Utilities/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
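As a quick sanity check (a sketch assuming the article's layout), the variable and PATH addition can be exercised directly in a shell:

```shell
# The lines appended to /etc/bash.bashrc, run here for illustration
export HADOOP_HOME=/home/hduser/Utilities/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin

# The variable now resolves to the install directory
echo "$HADOOP_HOME"   # prints /home/hduser/Utilities/hadoop-1.0.3
```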

Step 3: Configuring Java for Hadoop

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/hadoop-env.sh

JAVA_HOME will be commented out by default. Edit the value of JAVA_HOME with your installation path and uncomment the line. The JAVA_HOME path should not include the bin folder.

# The Java implementation to use.

export JAVA_HOME=<absolute path to java directory>
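For example, with an Oracle JDK 6 installed under /usr/lib/jvm, the line might read as below. The path is an assumption; substitute your own JDK directory, and note that it must not end in /bin:

```shell
# The Java implementation to use (hypothetical Oracle JDK 6 location;
# point at the JDK root, not its bin subdirectory)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
echo "$JAVA_HOME"   # prints /usr/lib/jvm/java-6-oracle
```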

Step 4: Configuring Hadoop properties

In Hadoop, we have three configuration files (core-site.xml, mapred-site.xml, and hdfs-site.xml) present in the HADOOP_HOME/conf directory.

  Editing the Configuration files

1. core-site.xml

The directory specified by the property ‘hadoop.tmp.dir’ is used to store file system meta information by the namenode and block information by the datanode. By default, two directories named ‘name’ and ‘data’ will be created in the tmp dir. We need to ensure that ‘hduser’ has sufficient permission on the newly provided ‘hadoop.tmp.dir’. We are configuring it to ‘/home/hduser/app/hadoop/tmp’.

The property ‘fs.default.name’ provides the hostname and port of the namenode. Create the directory and change its ownership and permissions to ‘hduser’:

sudo mkdir -p /home/hduser/app/hadoop/tmp

sudo chown hduser:hadoop /home/hduser/app/hadoop/tmp

sudo chmod 755 /home/hduser/app/hadoop/tmp

Setting the ownership and permissions is very important. If you forget this, you will run into exceptions while formatting the namenode.

Open core-site.xml; you will see empty configuration tags. Add the following lines between the configuration tags.

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/core-site.xml

[code lang="xml"]
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://<hostname/Ip address of the system where namenode is installed>:54310</value>
<description>The name of the default file system</description>
</property>

2. hdfs-site.xml

It is used for file system and storage settings. In hdfs-site.xml, add the following property between the configuration tags.

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/hdfs-site.xml

[code lang="xml"]
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>


3. mapred-site.xml

This is used for processing. In mapred-site.xml, we need to provide the hostname and port of the jobtracker, as the tasktrackers use this for their communication.

sudo nano /home/hduser/Utilities/hadoop-1.0.3/conf/mapred-site.xml

[code lang="xml"]
<property>
<name>mapred.job.tracker</name>
<value><hostname/ipaddress of the system where jobtracker is installed>:54311</value>
<description>The host and port that the mapreduce jobtracker runs</description>
</property>

Step 5: Formatting Namenode

Before starting the HDFS daemons (such as the namenode) for the first time, it is mandatory that you format the namenode. This is only for the first run: formatting the namenode on subsequent runs will lose all data. Be careful not to format an already running cluster, even if you need to restart the namenode daemon.

Namenode can be formatted as

/home/hduser/Utilities/hadoop-1.0.3/bin/hadoop namenode -format

Step 6: Starting Hadoop Daemons

/home/hduser/Utilities/hadoop-1.0.3/bin/start-all.sh

This will run all the Hadoop daemons: namenode, datanode, secondarynamenode, jobtracker, and tasktracker.

For stopping the Hadoop daemons, we use the command

/home/hduser/Utilities/hadoop-1.0.3/bin/stop-all.sh

This will stop all the Hadoop daemons. After they are stopped, jps shows only the Jps process itself.
