upGrad KnowledgeHut SkillFest Sale!-mobile

Bootcamps

Enterprise

Resources

Home
Blog
Big Data
What is HDFS? Architecture, Features, Benefits, and Examples

HomeBlogBig DataWhat is HDFS? Architecture, Features, Benefits, and Examples

What is HDFS? Architecture, Features, Benefits, and Examples

Blog Author

Dr. Manish Kumar Jain

Published

26th Apr, 2024

Views

Read TimeRead it in

23 Mins

In this article

HDFS Stands for Hadoop Distributed File System. Before knowing about it first, understand what is Hadoop. Hadoop is an open source, non-relational, distributed computing platform that provides the infrastructure for Big Data and analytics. Hadoop is a framework used for processing large amounts of data by breaking it down into smaller chunks and distributing those chunks across multiple machines. The data is then processed by these machines. The result of this is that multiple machines can be involved in the process and that it is more scalable. Hadoop has its file system, HDFS, and it provides a distributed computing framework called MapReduce. MapReduce is a programming paradigm.

What is HDFS?

HDFS is a distributed file system that is part of the Hadoop ecosystem. It offers a number of functions that can be used to provide greater flexibility to applications that run on Hadoop clusters, including file copy, replication, fault tolerance and backup.

HDFS is written in Java and is as open source system developed using Google file System(GFS) as model. GFS is a distributed file system developed for storing information on webpages crawled by Google. HDFS is built on top of the HDFS library, which provides a Java API for accessing HDFS. HDFS is a distributed file system designed to run on commodity hardware. It has a master/slave architecture. The master node is called the Namenode and manages the file system metadata. The slave nodes are called Datanodes and they store the actual data.

HDFS is highly scalable and can be used to store very large files. It is also fault tolerant and can continue to operate even if some of the nodes fail. Hadoop is a file system that allows for the processing of large data sets. Each file has a name and is stored on a specific node. The files are stored in directories. It has several layers of abstraction to facilitate the processing of the data. HDFS operates on top of base file systems using their APIs to do file operations. OS-level file systems like FAT, NTFS, EXT3, and HFS; native file systems like CIFS, SMB, NFS. Some of the native file systems are also used at the OS level. Thus one can say that HDFS is not a real file system.

To summarize, HDFS is a distributed filesystem that can be store and process data. It is a collection of servers called HDFS nodes, which store and manage files. These nodes run on commodity hardware and are optimized for speed and high throughput. They can be configured to provide high availability, scalability, and durability.

What is the Map-Reduce?

A centralized server is typically used by traditional enterprise systems to store and process data. Standard database servers cannot handle processing large volumes of scalable data using the traditional methodology. Additionally, the centralized system overburdens itself while handling several files at once. Google used the MapReduce algorithm to address this bottleneck issue. MapReduce breaks a problem into manageable pieces and distributes them among several machines. The results are afterward gathered in one location and combined to create the outcome dataset.

Let’s see the jargons of HDFS Architecture

Name Node: The GNU/Linux operating system and the namenode software are installed on the generic hardware known as a "namenode". The namenode-equipped system serves as the master server and performs the following duties:
- controls the namespace of the file system.
- controls the access of clients to files.
- Additionally, it performs operations on the file system including renaming, shutting, and opening files and directories.
Data Node: The datanode also runs on the GNU/Linux operating system. There will be a datanode installed on every node (commodity hardware/system) in a cluster. These nodes oversee the system's data storage.
- As requested by the client, datanodes execute read-write operations on the file systems.
- Additionally, they carry out tasks including block creation, deletion, and replication as directed by the namenode.
Block

Typically, HDFS files are where user data is kept. In a HDFS cluster the file will be split into one or more segments and/or kept in separate data nodes. Blocks are the name given to these file chunks. In other words, a block is the smallest unit of data that HDFS can read or write. The HDFS configuration allows for an increase in block size from the default value of 128 MB.

HDFS Architecture in Big Data

The distributed file system is organized into a number of machines called hosts (datanodes) which is handled by single node called namenode. The host machines are interconnected (Network) and share the same filesystem. This file system is managed by a master node called Kubernetes master node. We will see how to setup HDFS in modern days by using docker containers or Kubernetes.

Traditionally in HDFS, each machine has two parts, a file system and a filesystem engine which manages how the file system works with the filesystem. The file system and filesystem engine are responsible for performing three main functions. HDFS stores files in the storage pool reads from the storage pool, and it writes data to the filesystem.

A file system is a collection of files in a storage pool. Each file has its own unique identifier. A file system is a part of a cluster of machines because it is a single entity that can be used by multiple machines. It does not have a storage pool. The data in the file system is stored directly on the disk, whereas each file in the filesystem engine has an indirect representation on the filesystem.

The architecture concentrates on NameNodes and DataNodes, as we can see. The hardware that contains the GNU/Linux operating system and software is known as the NameNode. The Hadoop distributed file system serves as the master server and controls file management operations including renaming, opening, and closing as well as client access to files.

A DataNode is a hardware that runs the GNU/Linux operating system in addition to the DataNode application. You will find a DataNode for each node in an HDFS cluster. These nodes assist in managing the system's data storage since they can operate on file systems at the client's request and create, replicate, and block files at the NameNode's request.

In the modern world, we can use Docker Container or Kubernetes to setup HDFS on the cloud or on-premise servers instead of direct installation of a Server or VM. ON Dockerhub there are many images by verified publishers and sponsored OSS of Hadoop and HDFS. We can use Kubernetes, but in the case of on-premises or custom management, it is not advisable because any container or pod can go down. Kubernetes is a container orchestration and management tool. It doesn’t come with persistent storage. You have to add storage for each and every pod if the pod failed then the routing table has to update with the latest IP. due to this, it is not advisable to use a database engine in Kubernetes. But in the case of managed cloud services like AKS (Azure Kubernetes Service) or GKE (Google Kubernetes Engine) you don’t have to worry about it. Also, Google Cloud has DataProc services that are fully managed and scalable services which running apache and 30+ more open source projects and frameworks or . and Microsoft Azure Blob Storage gives you the liberty to the user to choose the file system they want and it supports native HDFS. You can deep dive into big data by attaining Big Data Certification.

NameNodes and DataNodes

1. The File System Namespace

Traditional hierarchical file organization is supported by HDFS. Directories can be made by a user or an application, and files can be stored there. Similar to the majority of other current file systems, the file system namespace structure allows for the creation and removal of files as well as the movement of files between directories and file renaming. User quotas are not yet implemented in HDFS. Hard links and soft links are not supported by HDFS. However, implementing these capabilities is not prohibited by the HDFS architecture.

The file system namespace is maintained by the NameNode. The NameNode keeps track of any modifications made to the file system namespace or its characteristics. The number of duplicate copies of a file that HDFS should keep up-to-date can be specified by an application. The replication factor of a file is the number of copies of that file. The NameNode keeps this data on file.

2. Data Replication

A big cluster of machines can safely store very large files thanks to HDFS. Each file is kept as a series of identically sized blocks, with the exception of the final block, which varies in size. For fault tolerance, blocks in a file are replicated. Block size and replication factor can be customized for each file. The number of copies of a file that an application can create. When creating a file, the replication factor can be provided and modified later. In HDFS, files are write-once and can only have one writer at a time.

All selections about block replication are made by the NameNode. Each DataNode in the cluster sends it a Heartbeat and a Blockreport regularly. Receiving a heartbeat indicates that the data node is operating normally. The blocks on a DataNode are listed in a Block report.

3. The Persistence of File System Metadata

The NameNode stores the HDFS namespace. Every modification to the file system metadata is persistently recorded by the NameNode using a transaction log called the EditLog. For instance, when a new file is created in HDFS, the NameNode inserts a record into the EditLog to record the event. Similar to this, updating a file's replication factor results in the addition of a new record to the EditLog. The EditLog is kept by the NameNode as a file in its native host OS file system. A file called the FsImage stores the whole file system namespace, including the mapping of blocks to files and file system characteristics. The local file system of the NameNode also contains a file containing the FsImage.

The NameNode maintains a copy of the file Blockmap and the whole namespace of the file system in memory. A NameNode with 4 GB of RAM is more than enough to support a large number of files and directories thanks to the compact architecture of this important piece of metadata. The FsImage and EditLog are both read from the disc by the NameNode at startup. The NameNode then applies all of the transactions from the EditLog to the FsImage's in-memory representation and flushes out this updated version into a new FsImage on disc. Once its transactions have been applied to the permanent FsImage, it can truncate the previous EditLog. A checkpoint is where this procedure is at. A checkpoint only takes place when the NameNode starts up in the current implementation.

4. The Communication Protocols

The TCP/IP protocol serves as the foundation for all HDFS communication methods. A client connects to the NameNode computer over a customizable TCP port. It converses with the NameNode via the ClientProtocol. Using the DataNode Protocol, the DataNodes communicate with the NameNode. Both the Client Protocol and the DataNode Protocol are encapsulated in the Remote Procedure Call (RPC) abstraction. The NameNode never starts any RPCs by design. Rather, it only reacts to RPC queries sent by clients or DataNodes.

5. Robustness

HDFS's main goal is to reliably store data even when there are problems. NameNode failures, DataNode failures, and network partitions are the three most frequent failure types.

Data Disk Failure, Heartbeats and Re-Replication

Periodically, each DataNode notifies the NameNode with a Heartbeat message. A portion of the DataNodes may lose connectivity to the NameNode due to a network partition. The absence of a Heartbeat message allows the NameNode to identify this circumstance. Without a recent Heartbeat, the NameNode declares DataNodes as dead and stops forwarding incoming IO requests to them. HDFS no longer has access to any data that was registered to a defunct DataNode. When a data node dies, the replication factor of some blocks may drop below the desired level. The NameNode continuously monitors which blocks require replication and starts it when necessary. Replication may be required for various reasons, including when a DataNode is no longer available, a replica is damaged, a DataNode's hard drive fails, or a file's replication factor is increased.

Cluster Rebalancing

Data rebalancing strategies are compatible with the HDFS design. If the amount of free space on a DataNode drops below a predetermined level, a method might automatically relocate data from one DataNode to another. A strategy might dynamically add extra replicas and rebalance other data in the cluster if there is an unexpectedly high demand for a specific file. These data rebalancing strategies have not yet been put into practice.

Data Integrity

A block of data that was obtained from a DataNode can arrive tainted. This corruption may be the result of software bugs, network issues, or storage device malfunctions. Checksum checking is implemented on the contents of HDFS files by the HDFS client program. Each block of an HDFS file that is created by a client has a checksum that is calculated and stored in a separate hidden file inside the same HDFS namespace. When a client downloads a file's contents, it checks to see if the information it got from each DataNode matches the checksum that's kept in the file that corresponds to that DataNode. If not, the client has the option of requesting that block from a different DataNode that has a copy of it.

Metadata Disk Failure

The two primary data structures in HDFS are the FsImage and the EditLog. These files could become corrupt, rendering the HDFS instance unusable. Because of this, it is possible to set the NameNode to support keeping multiple copies of the FsImage and EditLog. Each of the FsImages and EditLogs updates synchronously whenever either the FsImage or EditLog is updated. The number of namespace transactions per second that a NameNode can support may be reduced as a result of this simultaneous updating of multiple copies of the FsImage and EditLog. Although HDFS applications are relatively data-intensive by nature, they are not metadata-intensive, therefore this degradation is acceptable. A NameNode chooses the most recent consistent FsImage and EditLog to utilise when it restarts.

A single point of failure for an HDFS cluster is the NameNode computer. Manual intervention is required in the event that the NameNode machine malfunctions. The NameNode program cannot be automatically restarted or switched to a different system at this time.

Snapshot

Snapshots enable the storage of a copy of the data at a certain point in time. A corrupted HDFS instance may be restored to a known good point in time using the snapshot capability.

6. Data Organization

Data Blocks

Large file sizes are supported by HDFS by design. Applications that work with huge data sets are compatible with HDFS. These apps only write data once, but they read it once or more, and they demand that these reads be completed quickly enough to stream. Write-once-read-many semantics for files are supported by HDFS. HDFS typically uses blocks that are 128 MB in size. As a result, an HDFS file is divided into 128 MB chunks, with each chunk preferably residing on a distinct DataNode.

Staging

The NameNode does not receive a client request to create a file right away. In actuality, a temporary local file is where the HDFS client initially caches the file data. This local temporary file receives application writes and is transparently redirected there. The client contacts the NameNode once the local file has accumulated data worth more than one HDFS block size. The NameNode allots a data block for the file name and inserts it into the file system hierarchy. The DataNode's name and the target data block are returned by the NameNode in response to a client request.

The client then flushes the block of data to the designated DataNode from the local temporary file. The temporary local file's unflushed data is transferred to the DataNode when a file is closed. When the file is closed, the client notifies the NameNode. The NameNode now commits the operation to create the file into a persistent store. The file is lost if the NameNode perishes before it is closed.

Replication Pipelining

As was mentioned in the section before, before a client writes data to an HDFS file, its data is first written to a local file. Consider a three-fold replication factor for the HDFS file. The client requests a list of DataNodes from the NameNode once a full block of user data has been accumulated in the local file. The DataNodes that will host a replica of that block are listed in this list. The data block is then flushed by the client to the primary DataNode. Small (4 KB) chunks of data are first received by the first DataNode, which then writes each chunk to its local repository and sends it to the second DataNode in the list.

Each piece of the data block is then received by the second DataNode, which writes it to its repository before flushing it to the third DataNode. The third DataNode then writes the information to a local repository. As a result, a DataNode may be both receiving data from the pipeline's preceding node and sending data to the node after it. The data is consequently pipelined from one DataNode to the next.

7. Accessibility

There are numerous ways for apps to access HDFS. Applications can use the Java API that HDFS natively offers. There is also a C language wrapper for this Java API. The files of an HDFS instance can also be browsed using an HTTP browser. It is being worked on to make HDFS accessible via the WebDAV protocol. You can access it from the browser interface, FS shell, and DFS Admin as well.

8. Space Reclamation

File Deletes and Undeletes

A file is not instantly erased from HDFS when it is deleted by a user or program. HDFS does a rename to a file in the /trash directory first. If the file is still in the trash (/trash), it can be swiftly recovered. A file can be configured for how long it stays in the trash. The NameNode removes the file from the HDFS namespace once its life in /trash has expired. The blocks related to a file are released when the file is deleted. Keep in mind that there can be a noticeable lag between when a user deletes a file and when HDFS's free space increases in response.

As long as a file is still there in the /trash directory after deletion, the user may undelete it. A user can access the /trash directory and retrieve a file they have deleted if they want to undelete it. Only the most recent version of the deleted file is present in the /trash location. The /trash directory is identical to other directories except for one unique feature: Files from this directory are automatically deleted by HDFS according to predetermined policies. Files older than 6 hours are currently deleted from the trash per default policy. This policy will eventually be programmable through a clear interface.

Decrease Replication Factor

The NameNode chooses extra copies that can be removed when the replication factor of a file decreases. This data is transmitted to the DataNode on the subsequent Heartbeat. The relevant free space subsequently emerges in the cluster once the DataNode removes the corresponding blocks. Once more, there may be a delay between the setReplication API call's successful conclusion and the availability of free space in the cluster.

Features of HDFS

1. Cost Effective

Storage costs are decreased because DataNodes, which house the real data, are made of affordable commodity hardware.

2. Fault Tolerance

High fault tolerance is a feature of HDFS. Fault tolerance is the ability of a system to continue operating without any data loss even when part of its hardware components fail. When a single node in a cluster fails, the entire system crashes. Fault tolerance's main responsibility is to eliminate any failed nodes that disrupt the system's overall regular operation. Every data block in HDFS is replicated by default across three data nodes. A Hadoop cluster achieves fault tolerance by having data duplicated over two data nodes, which means that even if one data node fails, the client can still easily access the data.

With little work, the HDFS is adaptable enough to add and remove data nodes. The three methods listed above—data replication, heartbeat messages, checkpoints, High availability, and recovery—are how HDFS can achieve fault tolerance.

3. Scalable

Since HDFS distributes large-scale data across numerous nodes, the number of nodes in a cluster can be scaled up or scaled down depending on the amount of data that needs to be stored. The two primary strategies that can enable scalability in the cluster are vertical and horizontal scalability. Vertical scalability refers to expanding existing cluster nodes' RAM and disc space. On the other hand, horizontal scaling allows us to expand the number of nodes in the cluster and is more advantageous because it allows the cluster to have hundreds of nodes.

4. Large Dataset with Different Types

HDFS can store data in any format and in any quantity (from megabytes to petabytes) (structured, unstructured).

5. Data Integrity

Data accuracy is referred to as data integrity. By routinely comparing the data to the checksum generated during file writing, HDFS ensures data integrity.

When reading a file, the data is said to be corrupted if the checksum does not match the original checksum. After that, the client decides to request the data block from a different DataNode that contains a copy of it. The NameNode deletes the damaged block and builds a second replica.

6. High Availability

Hadoop's High Availability functionality assures data accessibility even in the event of NameNode or DataNode failure. Because HDFS makes copies of data blocks, if one of the DataNodes fails, the user can still access his or her data from the other DataNodes that have a copy of the same data block. Additionally, if configured, the passive node assumes control of the active NameNode in the event of its failure. As a result, even in the event of a machine crash, data will be available and accessible to the user.

7. High Throughput

Instead of offering low latency interactive usage, HDFS is intended to be a High Throughput batch processing solution. The Write Once Read Many (WORM) paradigm is always used by HDFS. Because the data is immutable, it cannot be modified after it has been recorded. Data is consistent throughout the network as a result. As a result, it can handle a lot of data quickly and has high throughput.

8. Replication

Data replication is one of HDFS's most significant and distinctive characteristics. In HDFS, data replication is done to address the issue of data loss in unfavorable circumstances, such as node corruption, hardware failure, etc. By making block replicas, the data is distributed among a number of nodes in the cluster. HDFS continuously creates replicas of user data on many clustered machines, maintaining the replication process at regular intervals of time. Therefore, the user can access their data from other machines that have the blocks of that data whenever any system in the cluster crashes. Therefore, there is less chance that user data would be lost.

Benefits of HDFS

1. Fast recovery from hardware failure

HDFS is designed to recognize errors and self-recover them automatically.

2. Portability

HDFS is portable across all hardware platforms, and it is compatible with several operating systems, including Windows, Linux, and Mac OS/X.

3. Streaming data access

All hardware platforms could use HDFS, and it supports Windows, Linux, and Mac OS X among many other operating systems.

Use case and Example of HDFS

1. Electric Companies

To keep track of the health of smart grids, the power sector installs phasor measurement units (PMUs) throughout transmission networks. At specific transmission stations, these high-speed sensors monitor current and voltage by amplitude and phase. These businesses examine PMU data to find network segment system issues and alter the grid as necessary. For instance, they could modify the load or switch to a backup power source. Power businesses can gain from using low-cost, highly accessible file systems like HDFS since PMU networks clock hundreds of records every second.

2. Marketing

Highly targeted projects require marketers to have a thorough understanding of their target markets. CRM systems, direct mail responses, point-of-sale systems, Facebook, and Twitter are a few places where marketers might find this data. The most economical place to store data before analysis is an HDFS cluster because a lot of this data is unstructured.

3. Oil and gas providers

Oil and gas firms work with very big data sets and a range of data types, such as videos, 3D earth models, and data from machine sensors. A suitable platform for the required big data analytics can be provided by an HDFS cluster.

4. Research

Data analysis is a crucial component of research, thus HDFS clusters again offer an affordable means to store, process, and analyze massive volumes of data.

5. Game Development

Overall game engine that handles the file system, loading and saving files, etc. The filesystem engine is responsible for providing an interface between the game engine and the operating system's file system.

Most game engines will have a specific file format that they use for storing data. The filesystem engine is responsible for reading and writing this data to the file system. In some cases, the filesystem engine may also be responsible for compressing and decompressing data as it is read from or written to the file system.

Unlock the Power of Data with Our Applied Data Science Program. Gain In-Demand Skills and Propel Your Career Forward. Enroll Today!

Conclusion

On common hardware, the distributed file system HDFS manages enormous data sets. A single Apache Hadoop cluster can be scaled up to hundreds or even thousands of nodes using this technique. One of Apache Hadoop's key parts, along with MapReduce and YARN, is HDFS. Apache HBase, a column-oriented non-relational database management system that sits on top of HDFS and can better serve real-time data needs with its in-memory processing engine, should not be mistaken with HDFS or utilized in its place. It is used in many industries and has great features ahead. To learn more about it please visit KnowledgeHut Big Data Certification gives you training from Industrial experts and provides you with a certification course, which has great demand in the market.

Frequently Asked Questions(FAQs)

1. What is the difference between Hadoop and HDFS?

Hadoop is an open source framework that helps to store, process, and analyze a large volume of data while the HDFS is the distributed file system of Hadoop that provides high throughput access to application data.

2. Is HDFS a database?

No it is the methodology of storing the large files/big data in the Hadoop ecosystem. HBase is a database built top on Hadoop & HDFS principles.

3. Why is HDFS needed?

It is a sustainable solution for distributed file storage in the Hadoop ecosystem. It provides a command line interface with Hadoop. Also provides Streaming access to file system data, status checking of the cluster, provides files permission, and authentication.

4. How is data stored in HDFS?

Each file is stored on HDFS as Blocks.

Dr. Manish Kumar Jain

International Corporate Trainer

Dr. Manish Kumar Jain is an accomplished author, international corporate trainer, and technical consultant with 20+ years of industry experience. He specializes in cutting-edge technologies such as ChatGPT, OpenAI, generative AI, prompt engineering, Industry 4.0, web 3.0, blockchain, RPA, IoT, ML, data science, big data, AI, cloud computing, Hadoop, and deep learning. With expertise in fintech, IIoT, and blockchain, he possesses in-depth knowledge of diverse sectors including finance, aerospace, retail, logistics, energy, banking, telecom, healthcare, manufacturing, education, and oil and gas. Holding a PhD in deep learning and image processing, Dr. Jain's extensive certifications and professional achievements demonstrate his commitment to delivering exceptional training and consultancy services globally while staying at the forefront of technology.

Share This Article

Ready to Master the Skills that Drive Your Career?

Avail your free 1:1 mentorship session.

Upcoming Big Data Batches & Dates

Name	Date	Fee	Know more

Course Advisor