The Power of HDFS

A Scalable, Distributed File System for Big Data

Ezzaldeen Mousa
4 min read · Dec 29, 2022

Overview

Hadoop is an open-source software framework that is used for storing, processing, and analyzing large volumes of data. It is designed to be scalable, fault-tolerant, and easy to use, and it is widely used in organizations to store and analyze large data sets in a distributed and efficient manner.

Hadoop consists of two main components:

  • Hadoop Distributed File System (HDFS) — our focus in this article!
  • MapReduce programming model.

In simple words, HDFS is a distributed file system that allows users to store large amounts of data on a cluster of commodity hardware.

Why HDFS?

There are several reasons why organizations might choose to use Hadoop and the Hadoop Distributed File System (HDFS) for storing, processing, and analyzing large volumes of data:

  1. Scalability: Hadoop is designed to scale out, meaning it can handle very large volumes of data by adding inexpensive commodity servers rather than requiring expensive hardware upgrades. This makes it an attractive option for organizations that need to store and process large amounts of data.
  2. Fault tolerance: HDFS is designed to be fault-tolerant, which means that it can continue to operate even if some of the servers in the cluster fail. This makes it an attractive option for organizations that need to ensure the availability and reliability of their data processing systems.
  3. Cost-effectiveness: Hadoop is an open-source software framework, which means that it is free to use and can be customized and extended as needed. This makes it a cost-effective option for organizations.

HDFS Architecture

Figure: Master-slave architecture of HDFS (source: https://www.oreilly.com/library/view/data-lake-for/9781787281349/5983829d-2c2c-4649-a92f-9e72f7a67f15.xhtml)

HDFS follows a master-slave architecture. An HDFS cluster consists of a single NameNode, the master server that manages the filesystem namespace and regulates clients’ access to files.
There is also a set of DataNodes, the slaves that serve read and write requests from clients; they are also responsible for internal operations such as block creation, deletion, and replication, performed on instructions from the NameNode.
Internally, a file is split into one or more blocks, and these blocks are stored across a set of DataNodes. You can think of a block as a segment of a file: files are divided into one or more segments, and these segments live on separate DataNodes. The block size is 64 MB by default in older Hadoop releases (128 MB since Hadoop 2.x), and it can be changed via the dfs.blocksize property in the HDFS configuration.
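As a quick sketch of how this looks in practice, the configured block size can be inspected, and even overridden for a single upload, from the command line; the path and the 256 MB value below are hypothetical examples:

    # print the cluster’s default block size, in bytes
    hdfs getconf -confKey dfs.blocksize

    # upload a file with a 256 MB block size (268435456 bytes) for this copy only
    hadoop fs -D dfs.blocksize=268435456 -put /local/path/to/bigfile /user/data/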

Figure: Hadoop Distributed File System architecture (source: https://www.techtarget.com/searchdatamanagement/definition/Hadoop-Distributed-File-System-HDFS)

Play with HDFS

The basic Hadoop filesystem commands:

The hadoop fs command is used to manage files and directories in HDFS. You can use it to perform various tasks, such as creating directories, copying files, and deleting files.
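Two of those operations, sketched here with hypothetical paths:

    # create a directory on HDFS, including any missing parent directories
    hadoop fs -mkdir -p /user/data/raw

    # delete a file from HDFS
    hadoop fs -rm /user/data/raw/old.csv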

1. List the files

The hadoop fs -ls command lists the files and directories in a given directory on HDFS.
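For example, assuming a hypothetical /user/data directory:

    # list the contents of a directory
    hadoop fs -ls /user/data

    # list recursively, including subdirectories
    hadoop fs -ls -R /user/data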

2. Copy a file

The hadoop fs -put command copies a file from the local filesystem to HDFS, for example from /local/path/to/file to a specified destination path. The file itself is created at the destination, but the parent directory must already exist (create it first with hadoop fs -mkdir if needed).
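A minimal sketch, with both paths hypothetical:

    # copy a local file into an HDFS directory
    hadoop fs -put /local/path/to/file /user/data/

    # copy the same file but give it a new name on HDFS
    hadoop fs -put /local/path/to/file /user/data/renamed.csv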

3. Display a file’s contents

The hadoop fs -cat command prints the contents of one or more files to the console.
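For example (the file names are hypothetical):

    # print a single file to the console
    hadoop fs -cat /user/data/file.txt

    # print several files at once using a glob pattern
    hadoop fs -cat /user/data/part-*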

4. Get a file

The hadoop fs -get command copies a file from HDFS to the local filesystem.
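For example (both paths are hypothetical):

    # copy a file from HDFS to the local filesystem
    hadoop fs -get /user/data/file.txt /local/path/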

Note that this is just a selection of the available hadoop fs commands. You can use hadoop fs -help to see a full list of all the available commands.

Limitations of HDFS:

  1. Limited random access: HDFS is optimized for storing and processing large files in a sequential manner, rather than for random access to small amounts of data. This can make it less efficient for applications that require frequent access to small amounts of data.
  2. Lack of real-time processing: HDFS is not designed for real-time processing of data, which means that it may not be suitable for applications that require rapid access to up-to-date data.
  3. Complexity: HDFS can be complex to set up and maintain, especially for organizations that are new to distributed systems. It requires careful planning and configuration to ensure that it is set up correctly and can scale effectively as the amount of data grows.

References:

HDFS Architecture:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS Operations:
https://www.tutorialspoint.com/hadoop/hadoop_hdfs_operations.htm
