Hadoop Interview Questions

Ready to face your next Hadoop interview? Be interview-ready with this list of Hadoop interview questions and answers, carefully curated by industry experts. Get ready to answer questions on Hadoop applications, how Hadoop is different from other parallel processing engines, and the difference between NameNode, Checkpoint NameNode, and Backup Node. We have put together a detailed list of big data Hadoop interview questions that will help you become the Hadoop developer, Java developer, or big data engineer the industry is looking for.


Advanced

The replication factor in HDFS can be modified/overwritten in two ways:

  • Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the below command:

$ hadoop fs -setrep -w 2 /my/sample.xml

sample.xml is the file whose replication factor will be set to 2.

  • Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the below command:

$ hadoop fs -setrep -w 6 /my/sample_dir

sample_dir is the name of the directory; all files in this directory will have their replication factor set to 6.
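
To verify the change, -stat with the %r format specifier prints a file's current replication factor (the path below is just an illustration):

$ hdfs dfs -stat %r /my/sample.xml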

touchz creates a zero-length file in HDFS, as the 0-byte size of /hadoop/sample in the listing below shows:

ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -touchz /hadoop/sample
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop

Found 2 items
-rw-r--r--   2 ubuntu supergroup          0 2018-11-08 00:57 /hadoop/sample
-rw-r--r--   2 ubuntu supergroup         16 2018-11-08 00:45 /hadoop/test

fsck is a utility to check the health of the file system and to find missing files, over-replicated blocks, under-replicated blocks, and corrupt blocks.

Command for finding the blocks of a file:

$ hadoop fsck <path> -files -blocks -racks
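
To check the overall health of the entire file system, fsck can simply be run against the root path:

$ hdfs fsck /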

Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files across a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files.

HDFS stores data reliably even in the case of hardware failure. It provides high-throughput access to applications by serving data in parallel. Components of HDFS:

  • NameNode – It is also known as the Master node. The NameNode stores metadata, i.e. the number of blocks, their replicas, and other details.
  • DataNode – It is also known as a Slave node. In Hadoop HDFS, the DataNode is responsible for storing the actual data. DataNodes perform read and write operations as requested by clients.

Steps to add a new node to the cluster:

  • Update the network addresses in the dfs.include and mapred.include files.
  • Run $ hadoop dfsadmin -refreshNodes and $ hadoop mradmin -refreshNodes.
  • Update the slaves file.
  • Start the DataNode and NodeManager on the added node (a sketch of this step follows below).
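
A minimal sketch of the last step, assuming a Hadoop 2.x installation with the sbin/ scripts on the PATH (script names and locations vary across versions):

$ hadoop-daemon.sh start datanode
$ yarn-daemon.sh start nodemanager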

  • The client connects to the NameNode to register a new file in HDFS.
  • The NameNode creates some metadata about the file (using either the default block size or a value configured for the file).
  • For each block of data to be written, the client queries the NameNode for a block ID and a list of destination DataNodes. Data is then written to each of those DataNodes.
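
Once a file is written, the block IDs and DataNode placements handed out by the NameNode can be inspected with fsck; the path below is just an illustration:

$ hdfs fsck /hadoop/test -files -blocks -locations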

By default, the HDFS block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later).

The default replication factor is 3.
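
Both defaults can be overridden in hdfs-site.xml. A minimal sketch (dfs.blocksize is the Hadoop 2.x property name; Hadoop 1.x uses dfs.block.size, and 134217728 bytes = 128 MB):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>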

It displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.

ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getfacl /hadoop
# file: /hadoop
# owner: ubuntu
# group: supergroup
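
ACLs are modified with the companion setfacl command; for example, to grant read/write access to an extra user (the user name here is just an illustration):

$ hdfs dfs -setfacl -m user:hive:rw- /hadoop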

This exception means there is no communication between the NameNode and the DataNode, due to any of the reasons below:

  • The block size is negative in the hdfs-site.xml file.
  • Disk usage on the DataNode is 100% and no space is available.
  • Due to poor network connectivity between the NameNode and the DataNode, the primary DataNode is down while the write operation is in progress.
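
When debugging this, a quick first check is the admin report, which shows each DataNode's capacity, remaining space, and liveness:

$ hdfs dfsadmin -report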

You can provide dfs.block.size on the command line:

  • copying from HDFS to HDFS

hadoop fs -D dfs.block.size=<blocksizeinbytes> -cp /source /destination

  • copying from local to HDFS

hadoop fs -D dfs.block.size=<blocksizeinbytes> -put /source /destination 
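For example, to load a local file with a 128 MB block size (134217728 bytes; the paths are just an illustration):

$ hadoop fs -D dfs.block.size=134217728 -put /home/ubuntu/data.txt /hadoop/data.txt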

Below command is used to enter Safe Mode manually:

$ hdfs dfsadmin -safemode enter

Once Safe Mode is entered manually, it must also be left manually.

Below command is used to leave Safe Mode manually:

$ hdfs dfsadmin -safemode leave
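
To check whether the NameNode is currently in Safe Mode, use:

$ hdfs dfsadmin -safemode get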

The two popular utilities or commands to find the HDFS space consumed are:

  • hdfs dfs -du
  • hdfs dfsadmin -report

HDFS provides reliable storage by copying data to multiple nodes. The number of copies it creates is referred to as the replication factor, which is usually greater than one.

  • hdfs dfs -du – This command shows the space consumed by the data without replication.
  • hdfs dfsadmin -report – This command shows the real disk usage, since it also counts data replication. Therefore, the value reported by hdfs dfsadmin -report will always be greater than the output of hdfs dfs -du.
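
For example, to compare the two views (the path is just an illustration; -h prints human-readable sizes):

$ hdfs dfs -du -h /hadoop
$ hdfs dfsadmin -report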

Beginner

$ hadoop fs -copyToLocal <hdfs_source> <local_destination>
$ hadoop fs -copyFromLocal <local_source> <hdfs_destination>
$ hadoop fs -put <local_source> <hdfs_destination>
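
For example (the paths are just an illustration):

$ hadoop fs -copyFromLocal /home/ubuntu/sample.txt /hadoop/
$ hadoop fs -copyToLocal /hadoop/sample.txt /home/ubuntu/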

Below are the main tasks of JobTracker:

  • Accept jobs from the client.
  • Communicate with the NameNode to determine the location of the data.
  • Locate TaskTracker Nodes with available slots.
  • Submit the work to the chosen TaskTracker node and monitor the progress of each task.

Following are the three configuration files in Hadoop:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml
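
As an illustration, core-site.xml typically carries the NameNode address; the host and port below are a common single-node example, not required values:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>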

NameNode – It is also known as the Master node. It maintains the file system tree and the metadata for all the files and directories present in the system. The NameNode is a highly available server that manages the file system namespace and controls access to files by clients. It records the metadata of all the files stored in the cluster, i.e. the location of blocks, the size of the files, hierarchy, permissions, etc.

NameNode is the master daemon that manages and maintains all the DataNodes (slave nodes).

There are two files associated with the metadata:

FsImage: It is the snapshot of the file system when Name Node is started.

EditLogs: It is the sequence of changes made to the file system after the Name Node is started.

Checkpoint node – The Checkpoint node is the new implementation of the Secondary NameNode. It creates periodic checkpoints of the file system metadata by merging the edits file with the fsimage file, and finally uploads the new image back to the active NameNode.

Its directory layout is the same as the NameNode's, and it stores the latest checkpoint.

Backup Node - Backup Node is an extended checkpoint node that performs checkpointing and also supports online streaming of file system edits.

Its main role is to act as a dynamic backup for the file system namespace (metadata) held by the primary NameNode.

The Backup node keeps an in-memory, up-to-date copy of the file system namespace which is always synchronized with the active NameNode state.

The Backup node does not need to download the fsimage and edits files from the active NameNode to create a checkpoint, as it already has an up-to-date state of the namespace in its own main memory. So, creating a checkpoint on the Backup node is just a matter of saving a copy of the file system metadata (namespace) from main memory to its local file system.
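
Per the HDFS documentation, these two daemons are started with dedicated NameNode startup options (assuming the standard bin/ layout):

$ hdfs namenode -checkpoint
$ hdfs namenode -backup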

ubuntu@ubuntu-VirtualBox:~$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/Hadoop

Description

Hadoop is an open-source framework widely adopted by organizations to store and process large amounts of structured and unstructured data using the MapReduce programming model. Many top-rated companies use the Apache Hadoop framework to deal with data volumes that grow continuously, every minute. Among the largest Hadoop clusters, Yahoo tops the list with around 4,500 nodes, followed by LinkedIn and Facebook.
 

Here are some of the world's most popular and top-rated organizations using Hadoop for production and research: Adobe, AOL, Alibaba, eBay, and Fox Audience Network, among others.
 

If you are looking to build your career in the field of big data, start by learning big data Hadoop. You can also take up a Hadoop training program and start a career as a big data Hadoop professional, solving large-scale data problems.
 

Here are the top Hadoop interview questions, frequently asked and scenario-based. You will also see how to explain a Hadoop project in an interview, something that carries a lot of weight with interviewers.
 

These Hadoop developer interview questions have been designed specifically to familiarize you with the nature of questions you might face during your interview, and they will help you crack the Hadoop interview easily and acquire your dream career as a Hadoop developer. These top big data Hadoop interview questions will boost your confidence and prepare you to answer your interviewer's questions in the best manner. These interview questions on Hadoop are suggested by experts.
 

Turn yourself into a Hadoop Developer. Live your dream career!
