What is HDFS Architecture?
The Hadoop Distributed File System (HDFS) is a distributed file system in which every file is divided into blocks of a predetermined size, and the blocks are stored across a cluster of machines. HDFS follows a master/slave architecture: a single NameNode acts as the master node, and the other nodes, called DataNodes, are the slave nodes. HDFS can be installed on any machine capable of running Java.
Although multiple DataNodes can run on a single machine, in practice they are distributed across many machines.
NameNode:
The NameNode is the master node in the HDFS architecture. It maintains and manages the blocks stored on the slave nodes, the DataNodes. It also manages the file system namespace and controls client access to files. HDFS is designed so that user data is never stored on the NameNode, only on the DataNodes.
Here is a list of functions of NameNode
- It is the MasterNode and manages the DataNodes.
- It stores the metadata of all files in the cluster, including size, location, etc.
The metadata is kept in two files:
- FsImage: It contains the complete state of the file system namespace.
- EditLogs: It contains all the recent changes made to the file system since the most recent FsImage.
- It records every change made to the file system metadata. For example, if a file is deleted in HDFS, the NameNode records the deletion in the EditLog.
- To confirm that the DataNodes are alive, it periodically receives a Heartbeat and a block report from every DataNode in the cluster.
- It keeps track of all of the blocks that have been added and where they are located on the network.
- It also looks after the replication factor.
- If a DataNode fails, it selects new DataNodes for the new replicas and manages the communication traffic between the DataNodes and the rest of the network. It also manages disk usage.
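The liveness check and replica tracking in the list above can be sketched in a few lines of Python. This is a toy in-memory model, not Hadoop code: the class name `ToyNameNode`, its method names, and the 30-second timeout are assumptions made purely for illustration.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model: tracks heartbeats and block locations in memory."""

    def __init__(self, replication_factor=3, timeout_s=30.0):
        self.replication_factor = replication_factor
        self.timeout_s = timeout_s  # hypothetical value, not HDFS's real timeout
        self.last_heartbeat = {}    # DataNode id -> time of last heartbeat
        self.block_locations = defaultdict(set)  # block id -> DataNodes holding it

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids, now):
        # A block report also shows that the DataNode is alive.
        self.heartbeat(datanode, now)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def live_datanodes(self, now):
        return {dn for dn, seen in self.last_heartbeat.items()
                if now - seen <= self.timeout_s}

    def under_replicated(self, now):
        """Return {block id: missing replica count}, counting live DataNodes only."""
        live = self.live_datanodes(now)
        missing = {}
        for block_id, holders in self.block_locations.items():
            shortfall = self.replication_factor - len(holders & live)
            if shortfall > 0:
                missing[block_id] = shortfall
        return missing
```

In this sketch, a DataNode that stops sending heartbeats drops out of `live_datanodes`, and every block it held shows up in `under_replicated` so that new replicas could be scheduled elsewhere.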
DataNodes:
DataNodes run on commodity hardware and are the slave nodes in the HDFS architecture. Compared with the NameNode, this hardware is inexpensive and offers lower quality and availability. A DataNode is essentially a block server that stores data in a local file system such as ext3 or ext4.
Functions:
- It is a process that runs on every slave machine.
- It stores data.
- It serves read and write requests from the file system's clients and sends heartbeats to the NameNode.
Secondary NameNode:
The Secondary NameNode is a helper node that works alongside the primary NameNode. It is not a backup node. It is also known as the CheckpointNode. Its functions are listed below.
Functions:
- It reads the file system metadata from the RAM of the NameNode and writes it to the hard disk.
- It combines the EditLogs with the FsImage from the NameNode.
- It periodically downloads the EditLogs and applies them to the FsImage.
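The checkpointing described above, where the EditLogs are applied to the FsImage, can be illustrated with a minimal sketch. The data layout here is an assumption: the FsImage is modeled as a plain dictionary and each EditLog entry as a tuple, which is not how Hadoop stores them on disk. The operation names `create`, `delete`, and `rename` are also hypothetical.

```python
def apply_edits(fsimage, edit_log):
    """Return a new namespace snapshot with the logged edits applied in order."""
    namespace = dict(fsimage)  # copy so the previous checkpoint stays untouched
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = {"size": args[0]}
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the edit log into the image and start a fresh, empty log."""
    return apply_edits(fsimage, edit_log), []
```

For example, starting from an image that holds `/a.txt` and applying a log that creates `/b.txt`, deletes `/a.txt`, and renames `/b.txt` to `/c.txt` yields an image that holds only `/c.txt`, together with an empty log.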
Blocks:
A block is the smallest contiguous location on a hard drive where data is stored. In HDFS, each file is stored as blocks that are distributed throughout the cluster.
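As a rough illustration of how a file maps onto blocks, the sketch below computes the block lengths for a given file size. The text does not state a block size, so the 128 MB default used here is an assumption (it matches the default in recent Hadoop releases), and the function name is hypothetical.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the list of block lengths (in bytes) for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes
```

Under these assumptions, a 300 MB file occupies two full 128 MB blocks and a final 44 MB block.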
Replication Management:
HDFS can store a huge amount of data in a cluster as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, but it can be changed as needed.
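The cost of replication is simple arithmetic, sketched below. Both helper names are hypothetical; the only value taken from the text is the default factor of 3.

```python
def raw_storage_bytes(file_size, replication_factor=3):
    """Total bytes consumed across the cluster when every block is replicated."""
    if file_size < 0 or replication_factor < 1:
        raise ValueError("file_size must be >= 0 and replication_factor >= 1")
    return file_size * replication_factor


def tolerated_replica_losses(replication_factor=3):
    """How many replicas of a block can be lost while one copy still survives."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return replication_factor - 1
```

With the default factor, a 1 GB file consumes 3 GB of raw storage, and each block survives the loss of two of its replicas.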
Rack Awareness:
The NameNode ensures that the replicas of a block are not all stored on the same rack. The rack awareness algorithm is designed to reduce latency while providing fault tolerance. With a replication factor of 3, the first replica is stored on the local rack, and the other two are stored on a different rack, on two different DataNodes within that rack. If there are more replicas, they are stored on random DataNodes.
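One reading of the placement rule above can be sketched as follows: the first replica goes to the local rack, the next two go to two different DataNodes on a single remote rack, and any further replicas go to random unused DataNodes. This is an illustrative simplification, not Hadoop's actual placement policy; the function name and the `topology` dictionary (rack name to DataNode names) are assumptions.

```python
import random


def place_replicas(topology, local_rack, replication_factor=3, rng=None):
    """Pick (rack, datanode) targets for one block.

    topology maps a rack name to the list of DataNode names on that rack.
    """
    rng = rng or random.Random()
    remote_racks = [rack for rack in topology
                    if rack != local_rack and len(topology[rack]) >= 2]
    if replication_factor < 3 or not topology.get(local_rack) or not remote_racks:
        raise ValueError("sketch needs factor >= 3, a local rack, "
                         "and a remote rack with two DataNodes")

    # Replica 1: a DataNode on the writer's local rack.
    targets = [(local_rack, rng.choice(topology[local_rack]))]

    # Replicas 2 and 3: two different DataNodes on one remote rack.
    remote = rng.choice(remote_racks)
    targets += [(remote, dn) for dn in rng.sample(topology[remote], 2)]

    # Any further replicas: random DataNodes that do not hold a replica yet.
    unused = [(rack, dn) for rack, nodes in topology.items()
              for dn in nodes if (rack, dn) not in targets]
    extra = replication_factor - len(targets)
    if extra > len(unused):
        raise ValueError("not enough DataNodes for the requested replication factor")
    targets += rng.sample(unused, extra)
    return targets
```

Because two of the three replicas share a rack, only one copy of the block crosses between racks during a write, while losing either rack still leaves at least one replica available.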
Benefits:-
- Improves network performance
Communication between nodes on different racks goes through switches. Network bandwidth is generally greater between machines on the same rack than between machines on different racks. Keeping part of the replication traffic within one rack therefore gives better write performance and reduces traffic between racks. Read performance also improves because reads can use the bandwidth of multiple racks.
- Protects against data loss
Data is not lost if an entire rack fails because of a power or switch failure, since replicas are stored on different racks.
HDFS Write Architecture:
The copying of data takes place in three stages:-
- Pipeline Setup
First, the client confirms that the DataNodes are ready to receive the data. It then connects to the DataNodes in that block's list to create the pipeline.
- Streaming of Data
Once the pipeline is set up, the client pushes the data into it. The data is then replicated along the pipeline according to the replication factor.
- Pipeline Shutdown
When the block has been replicated, a series of acknowledgments confirms to the client and the DataNodes that the data has been written successfully. The client then closes the pipeline to end the TCP session.
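The three stages can be mimicked with a small in-memory simulation. This is a sketch under simplifying assumptions: there is no networking, the forwarding from one DataNode to the next is modeled as a plain loop, and the function name and the tiny 4-byte packet size are hypothetical choices for illustration.

```python
def write_block(data, pipeline, packet_size=4):
    """Stream data through the DataNodes in pipeline and collect acknowledgments.

    Returns (storage, acks): storage maps each DataNode to the bytes it holds,
    and acks lists the packet sequence numbers acknowledged to the client.
    """
    if not pipeline:
        raise ValueError("pipeline needs at least one DataNode")

    # Stage 1, pipeline setup: every DataNode gets an empty buffer.
    storage = {dn: bytearray() for dn in pipeline}

    # Stage 2, data streaming: each packet is passed along the pipeline in order.
    packets = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    acks = []
    for seq, packet in enumerate(packets):
        for dn in pipeline:
            storage[dn] += packet
        # The packet is acknowledged once every DataNode has stored it.
        acks.append(seq)

    # Stage 3, pipeline shutdown: freeze the buffers and hand back the result.
    return {dn: bytes(buf) for dn, buf in storage.items()}, acks
```

Writing `b"hello hdfs"` through a pipeline of three DataNodes leaves an identical copy on each of them and produces one acknowledgment per packet.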
HDFS Read Architecture:
In response to a client's read request, the NameNode selects the replica that is nearest to the client. This reduces bandwidth consumption and latency.
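Selecting the nearest replica can be sketched with a simple distance function. The distance values used here (0 for the same node, 2 for the same rack, 4 for a different rack) are an assumption chosen for illustration, and both function names are hypothetical.

```python
def distance(client, replica):
    """Distance between two (rack, node) locations in a two-level topology."""
    if client == replica:
        return 0  # same node
    if client[0] == replica[0]:
        return 2  # same rack, different node
    return 4      # different rack


def nearest_replica(client, replicas):
    """Return the replica location closest to the client."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda replica: distance(client, replica))
```

For a client on `("rack1", "dn1")` and replicas on `("rack2", "dn3")`, `("rack1", "dn2")`, and `("rack2", "dn4")`, the sketch picks `("rack1", "dn2")` because it shares the client's rack.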