Tech Blog

Ajith Vijayan Tech Blog

Tag: cloudera

Best Practices in Cluster provisioning and Management- An Overview

Before provisioning a cluster, it helps to have a clear picture of how the cluster should be created and managed. Here I am noting down some points on the cluster creation and management process.

Platform Requirements

  • The Cloudera distribution is a good option for creating a Hadoop cluster since it has a well-structured repository and a well-defined documentation set (advanced users may go for the builds from the Apache community).
  • Cloudera Manager is designed to make administration of Hadoop simple and straightforward at any scale. With the help of Cloudera Manager, you can easily deploy and centrally operate the complete Hadoop stack. The application automates the installation process, reducing deployment time from weeks to minutes.
  • CentOS is a good option as the OS since it is built on the RHEL architecture and supports all RHEL add-ons.
  • yum install <package> is the command used most frequently for installing packages from a remote repository. Yum picks the repository URL from the files under /etc/yum.repos.d, downloads the packages, and installs them on the machine. Normally yum works on a machine with internet access; in an isolated environment a plain yum install will not work, because the remote repository may not be reachable. In that situation, we create a local yum repository.
  • It is better to turn off the graphical user interface on all host machines, for efficient utilization of memory.
  • For each installation, add the required environment variables to /etc/bashrc or /etc/profile so they are available to all users.
  • To load updated environment variables from /etc/bashrc or /etc/profile into the current shell, use the ‘source’ command.
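The local-repository setup mentioned above can be sketched as follows. The directory name and repo id are illustrative assumptions, and scratch paths are used so the snippet runs without root; on a real node the .repo file belongs in /etc/yum.repos.d/.

```shell
# Sketch: wiring up a local yum repository for an isolated cluster.
repo_dir=./local-repo            # directory holding the downloaded RPMs
mkdir -p "$repo_dir"

# On the real host, generate the metadata yum needs:
#   createrepo "$repo_dir"

# Point yum at the local directory via a .repo file
# (real location: /etc/yum.repos.d/cdh-local.repo):
cat > ./cdh-local.repo <<EOF
[cdh-local]
name=Local CDH repository
baseurl=file://$(cd "$repo_dir" && pwd)
enabled=1
gpgcheck=0
EOF
```

Once the .repo file is in place and createrepo has been run, yum install resolves packages from the local directory instead of the internet.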

Required Services

  • Ensure the sshd service is running on each node so that Secure Shell access is active.
  • Ensure the iptables service is stopped.
  • The Oracle JDK 1.6+ should be used (instead of OpenJDK) because it ships with the JVM Process Status tool (jps), which is used for displaying the currently running Hadoop daemons.
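The service checks above can be sketched as a small pre-flight script. The service/chkconfig lines assume a SysV-init CentOS node and need root, so they appear as comments; the pgrep check runs anywhere.

```shell
# Sketch: pre-flight checks before installing Hadoop on a node.
preflight() {
  if pgrep -x sshd >/dev/null 2>&1; then
    echo "sshd: running"
  else
    echo "sshd: NOT running (start it with: service sshd start)"
  fi
  # The firewall should be stopped so cluster ports are reachable:
  #   service iptables stop && chkconfig iptables off
  # With the Oracle JDK on the PATH, 'jps' lists running Hadoop daemons.
}
preflight
```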

Generic Points

  • For tarball or source-build installations, the /opt location is preferred.
  • Rebooting a Linux machine to pick up configuration changes is bad practice and may negatively affect overall cluster balance.
  • For network connection issues, restart the network service rather than rebooting the host machine.

Hadoop on Windows using HDP

We all know that Hadoop is a framework that enables the distributed processing of large datasets across clusters of commodity hardware. HDP (Hortonworks Data Platform) is an enterprise-grade, hardened Hadoop distribution. HDP packages the most useful and stable versions of Apache Hadoop and its related projects into a single tested and certified bundle. Using HDP, we can install Hadoop on Windows. The prerequisites for the installation are:


  • Microsoft Visual C++
  • Microsoft .NET Framework 4.0
  • Java 1.6
  • Python 2.7

Note: After installing Java and Python, set the environment variables accordingly.

Firewall Configuration

HDP uses multiple ports for the communication between clients and service components, so we have to open these ports in the firewall.

[code lang="shell"]
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=$PORT_NUMBER
[/code]

Pre-Installation steps

Create a text file that describes the Hadoop components. This text file contains the hostnames and other properties of the Hadoop components.


  • Ensure that all properties in the text file are separated by the newline character.
  • Ensure that the directory paths do not contain any whitespace characters.

An example of such a text file (here clusterProperties.txt) is shown below.

[code lang="shell"]
#Log directory
HDP_LOG_DIR=<Path to Log directory>
#Data directory
HDP_DATA_DIR=<Path to Data directory>
NAMENODE_HOST=<Hostname of namenode machine>
SECONDARY_NAMENODE_HOST=<Hostname of secondary namenode machine>
JOBTRACKER_HOST=<Hostname of Jobtracker machine>
HIVE_SERVER_HOST=<Hostname of Hive server machine>
OOZIE_SERVER_HOST=<Hostname of Oozie server machine>
WEBHCAT_HOST=<Hostname of server machine>
FLUME_HOSTS=<Hostname of server machine>
HBASE_MASTER=<Hostname of HBase master>
HBASE_REGIONSERVERS=<Hostname of HBase regionserver>
ZOOKEEPER_HOSTS=<Hostname of zookeeper machines>
SLAVE_HOSTS=<Hostname of slave machines>
#Database host
DB_HOSTNAME=<Hostname of server machine>
#Hive properties
HIVE_DB_NAME=<Hive database name>
HIVE_DB_USERNAME=<Database username>
HIVE_DB_PASSWORD=<Database password>
#Oozie properties
OOZIE_DB_NAME=<Oozie database name>
OOZIE_DB_USERNAME=<Oozie database username>
OOZIE_DB_PASSWORD=<Oozie database password>
[/code]



At the command prompt, we have to execute the installation command for HDP:

[code lang="shell"]
msiexec /i "<Path to HDP msi file>" /lv "<Path to installer log file>" HDP_LAYOUT="<Path to clusterproperties.txt>" HDP_DIR="<Path to HDP install directory>" DESTROY_DATA="<Yes_OR_No>"
[/code]

For example:

[code lang="shell"]
msiexec /i "D:\HDPINSTALLATION\hdp-\hdp-\hdp-" /lv "D:\HDP\log\installer.log" HDP_LAYOUT="D:\HDPINSTALLATION\hdp-\hdp-\clusterproperties.txt" HDP_DIR="D:\HDP\hdp_wd" DESTROY_DATA="no"
[/code]

After the installation, a pop-up window will appear to indicate that the installation is complete.

Start HDP components

  • cd <Path to HDP directory>
  • start_local_hdp_services

Alternatively, we can start these daemons from the Services console of the Windows operating system.

Execute a Mapreduce job

To run a sample MapReduce job, we have to execute one command, ‘Run-SmokeTests.cmd’:

  • Run-SmokeTests.cmd

Cloudera Sentry – Authorization Mechanisms in Hadoop

Cloudera Sentry is an authorization mechanism for Hadoop. Many people confuse authorization with authentication.

So what is the difference between authentication and authorization?

Authentication verifies “who you are”. Authentication is the mechanism by which a system identifies its users.

Authorization means “what you are authorized to do”. In other words, authorization is the mechanism by which a system determines what level of access a particular (authenticated) user should have to resources controlled by the system.

Authorization happens only after a successful authentication.

Cloudera Sentry is a Cloudera-backed project used for authorization in Hadoop. Currently, Sentry provides fine-grained authorization for HiveServer2 and Impala.

Features of Sentry:

1.) Secure authorization – Using Cloudera Sentry, we can control access to data and grant privileges on data to authenticated users.

2.) Fine-grained authorization – Using Cloudera Sentry, we can grant or restrict access to databases, tables and views. We can even set permissions on particular rows/columns of a particular table.

3.) Role-based authorization – Authorization is based on functional roles, which means a normal user can access only a limited set of files, while a super user or admin user can access many files or databases.
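Early Sentry releases express these roles through a file-based policy provider. A minimal sketch is shown below; the group names, role names, server id, and database/table names are all illustrative assumptions.

```ini
# Sketch of a Sentry policy file (file-based provider).
[groups]
# OS/LDAP group -> Sentry roles
analysts = analyst_role
admins   = admin_role

[roles]
# Fine-grained: SELECT only, on a single table
analyst_role = server=server1->db=sales->table=orders->action=select
# Broad: all privileges on the server
admin_role = server=server1
```

Here, members of the analysts group can only run SELECT on one table, while admins hold all privileges on the server; this is the role-based, fine-grained model described above.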