Top 81 Data Engineer Interview Questions and Answers (2022)

Whether you’re new to the world of big data or looking to break into a more senior data engineering role, we are sure you have this question in mind – how do you prepare for a Data Engineer interview? Here, we have put together a list of Data Engineer interview questions and answers for beginner and experienced candidates. The questions are categorized for quick browsing before the interview or for use as a guide to different topics in data engineering. They will broaden your knowledge of the field, sharpen your core interview skills, and help you perform better in interviews related to data engineering. To get your concepts up to speed, you can also start with our big data and Hadoop training course. These interview questions will test your skills on various topics like Big Data, Hive, Hadoop, Python, SQL, databases, etc., so let us get started.


Beginner

This may seem like a pretty basic question, but regardless of your skill level, it is one of the most common questions that can come up during your interview. So, what is it? Briefly, data engineering is the practice, within big data, of transforming raw data (data generated from various sources) into useful information that can be used for various purposes.

Data modelling is the process of documenting complex software data systems by breaking them up into simple diagrams that are easy to understand, making the design independent of any particular implementation. You can describe any prior experience with data modelling, if you have some, in the form of concrete scenarios.

Companies can ask you questions about design schemas in order to test your knowledge of the fundamentals of data engineering. Data modelling mainly uses two types of schemas:

  • Star schema: dimension tables surround a central fact table.
  • Snowflake schema: similar to the star schema, but the dimension tables are further normalized into additional dimension tables, forming a snowflake shape.

The difference between structured and unstructured data is as follows:

| Parameter | Structured Data | Unstructured Data |
| --- | --- | --- |
| Storage | DBMS | Unmanaged file structures |
| Standard | ODBC, ADO.NET, and SQL | XML, SMTP, CSV, and SMS |
| Integration tool | ETL (Extract, Transform, Load) | Batch processing or manual data entry |
| Scaling | Schema scaling is difficult | Schema scaling is very easy |
| Version management | Versioning over tuples, rows, and tables | Versioning is possible only over the data as a whole |
| Example | An ordered text dataset file | Images, video files, audio files, etc. |

In today’s world, the majority of big applications generate big data that requires vast storage space and a large amount of processing power; Hadoop plays a significant role in meeting these needs. Its main components are:

  • HDFS: HDFS stands for Hadoop Distributed File System. While working with Hadoop, all the data gets stored in The Hadoop Distributed File System. It is fault-tolerant and provides a distributed file system with very high bandwidth.
  • Hadoop Common: It consists of a set of all common utilities and libraries that are utilized by Hadoop.
  • Hadoop YARN: It is used for managing resources in the Hadoop system. Task scheduling for users can also be performed using YARN.
  • Hadoop MapReduce: It is based on the MapReduce programming model and provides for large-scale processing of data.

NameNode is the master node in the Hadoop HDFS architecture. It keeps track of the various files in all clusters; however, the NameNode doesn’t store the actual data of HDFS, only its metadata. The actual data gets stored in the DataNodes.

Hadoop streaming is one of the widely used utilities that comes with the Hadoop distribution. It allows users to create and run Map/Reduce jobs written in various programming languages like Ruby, Perl, Python, C++, etc., which can then be submitted to a specific cluster for execution.
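As an illustrative sketch (file names and submission flags vary by Hadoop version), a streaming mapper in Python can be as small as the word-count example below; it would be paired with a reducer script and submitted through the hadoop-streaming jar.

```python
#!/usr/bin/env python3
# A minimal word-count mapper sketch for Hadoop Streaming. Streaming mappers
# read raw text lines on stdin and emit tab-separated key/value pairs on stdout.
import sys

def map_words(lines):
    """Yield (word, 1) pairs for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def run_mapper(stdin=None, stdout=None):
    """Entry point: Hadoop Streaming feeds lines on stdin and reads stdout."""
    stdin = stdin if stdin is not None else sys.stdin
    stdout = stdout if stdout is not None else sys.stdout
    for word, count in map_words(stdin):
        stdout.write(f"{word}\t{count}\n")
```

When executed under Hadoop Streaming, the script would simply call run_mapper(); a matching reducer would read the sorted pairs from stdin and sum the counts per word.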

Some of the important features of Hadoop are as below:

  • Hadoop is an open-source framework that can be used free of cost by users.
  • Data processing is very fast because Hadoop supports parallel processing of data.
  • In order to avoid data loss, data redundancy is given high priority.
  • It stores data across clusters of machines, independently of other operations.
  • It is highly scalable: a large amount of data is divided across multiple (cost-effective) machines in a cluster, which can process it in parallel.
  • Hadoop is flexible: it can be used very efficiently with any kind of dataset, whether structured (MySQL data), semi-structured (JSON, XML), or unstructured (images and videos).

Blocks are the smallest unit of data allocated to a file; they are created automatically by the Hadoop system for storing data on different sets of nodes in a distributed system. Large files are automatically sliced into small chunks, called blocks, by Hadoop.
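To make the idea concrete, here is a small back-of-the-envelope sketch (the 128 MB figure is the Hadoop 2.x default block size; older versions used 64 MB):

```python
# Sketch: how many HDFS blocks a file of a given size occupies.
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Hadoop 2.x (128 MB)

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (number of blocks, size of the final block) for a file."""
    if file_size_bytes == 0:
        return 0, 0
    full, remainder = divmod(file_size_bytes, block_size)
    if remainder:
        return full + 1, remainder
    return full, block_size

# A 300 MB file is sliced into three blocks: 128 MB + 128 MB + 44 MB.
```

Note that the last block only occupies as much space as the remaining data, not a full 128 MB.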

The block scanner, as its name suggests, is used to verify whether the small chunks of files, known as blocks, that are created by Hadoop are successfully stored on a DataNode or not. It helps detect the corrupt blocks present on a DataNode.

Following are the steps followed by the block scanner when it detects a corrupted DataNode block-

  • Whenever the block scanner comes across a block that is corrupted, the DataNode reports this particular block to the NameNode.
  • The NameNode then processes the report and creates a new replica of the block from an existing healthy replica.
  • The system does not delete the corrupted block until the replication count of the healthy replicas matches the replication factor, which is 3 by default.

This whole process helps HDFS maintain the integrity of the data during read operations performed by a client.

Security in Hadoop is typically achieved with Kerberos authentication, which involves three high-level steps:

  • Authentication: the client authenticates itself to the authentication server over a secured channel and receives a time-stamped Ticket Granting Ticket (TGT).
  • Authorization: the client uses the received TGT to request a service ticket from the Ticket Granting Server (TGS).
  • Service request: in the last step, the client uses the service ticket to authenticate itself to a specific server, such as the NameNode.

NameNode communicates and gets information from DataNode via messages or signals.  

There are two types of messages/signals that are used for this communication across the channel:

  • Block report signals: These are lists of all the HDFS data blocks stored on a DataNode, corresponding to its local files; the DataNode sends this report to the NameNode.
  • Heartbeat signals: These signals, sent from the DataNode to the NameNode, are taken as a sign of vitality. They are used to check whether the DataNode is alive and functional, and they act as a periodic report. If the signal stops arriving, it implies the DataNode has failed or stopped working. The default heartbeat interval is 3 seconds.
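The liveness check can be pictured with a toy sketch (this is not Hadoop code; the 3-second interval mirrors the HDFS default, while the dead-node timeout here is purely illustrative):

```python
# Toy sketch of a NameNode-style liveness check based on heartbeat timestamps.
HEARTBEAT_INTERVAL = 3   # seconds; the HDFS default heartbeat interval
DEAD_NODE_TIMEOUT = 630  # seconds; illustrative grace period before declaring death

def is_alive(last_heartbeat_ts, now_ts, timeout=DEAD_NODE_TIMEOUT):
    """A DataNode is considered alive if it reported within the timeout window."""
    return (now_ts - last_heartbeat_ts) <= timeout
```

A node that reported a few seconds ago is considered alive, while one that has been silent past the timeout is marked dead and its blocks are re-replicated elsewhere.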

The default ports for Task Tracker, Job Tracker, and NameNode in Hadoop are as below:

  • The default port of Job Tracker is: 50030
  • The default port of Task Tracker is: 50060
  • The default port of NameNode is: 50070

This question is asked by interviewers to check your understanding of the role of a data engineer.

  • They use a systematic approach to develop, test, and maintain data architectures.
  • They align the architecture design with business requirements.
  • They help obtain data from the right sources and, after formulating data set processes, store optimized data.
  • They help to deploy machine learning and statistical models.
  • They dive into the data and develop pipelines to automate tasks where manual participation can be avoided.
  • They help to simplify the data cleansing process.
  • They conduct research to address issues and enhance data reliability, accuracy, flexibility, and quality.

The difference between NAS and DAS is as follows:

| NAS | DAS |
| --- | --- |
| NAS stands for Network Attached Storage | DAS stands for Direct Attached Storage |
| Storage capacity of NAS ranges from 10⁹ to 10¹² bytes | Storage capacity of DAS is around 10⁹ bytes |
| In NAS, storage is distributed over distinct servers on a network | In DAS, storage is attached to the node where the computation takes place |
| It has a moderate storage management cost | It has a high storage management cost |
| Data transmission takes place over Ethernet or TCP/IP | Data transmission takes place over IDE/SCSI |

Below are various skills and technologies used by a data engineer:

  • Programming languages used in machine learning, such as Python, Java, JavaScript, and Scala
  • Knowledge of mathematics (linear algebra and probability) is a must
  • SQL, NoSQL databases, and HiveQL
  • Apache Airflow, Apache Kafka, and Apache Spark
  • The Hadoop ecosystem

In Hadoop, Rack awareness is the concept of choosing the DataNodes which are closer according to the rack information. By default, Hadoop assumes that all the nodes belong to the same rack.

To reduce network traffic while reading/writing HDFS files, the NameNode directs read/write requests to DataNodes on the same or a nearby rack. To obtain this rack information, the HDFS NameNode maintains the rack IDs of each DataNode. This concept in HDFS is known as Rack Awareness.

When the NameNode is down, the entire cluster is down and therefore inaccessible, and all the services running on that cluster are down as well. In this scenario, any user who tries to submit a new job will get an error and the job will fail; all the jobs that are already running will fail too.

So, briefly, when the NameNode goes down, all new as well as existing jobs fail because all services are down. The user has to wait for the NameNode to restart and can run a job once the NameNode is back up.

Four Vs of big data describes four dimensions of big data. These are listed below:

  • Variety
  • Volume
  • Veracity
  • Velocity

The various configuration files present in Hadoop are as follows:

  • Mapred-site
  • YARN-site
  • Core-site
  • HDFS-site
  • Hadoop-env.sh
  • Masters
  • Slaves

The main methods of reducer are given below:

  • setup(): This method is used for the configuration of parameters like the size of input data, distributed cache, etc.
  • reduce(): It acts as the heart of the reducer and is called once per key with the associated list of values.
  • cleanup(): This method is used to clear out all the temporary files and it is called only once at the end of reduce task.

FIFO, or First In First Out, is the simple job-scheduling algorithm in Hadoop in which the tasks or processes that come first are served first. FIFO is the default scheduler in Hadoop. All tasks or processes are placed in a queue and get their turn to execute in their order of submission. The major disadvantage of this type of scheduling is that higher-priority tasks have to wait for their turn, which can impact the process.
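The policy itself is easy to visualize with a toy queue (a sketch, not the actual Hadoop scheduler code):

```python
from collections import deque

# Toy FIFO scheduler: jobs run strictly in submission order,
# regardless of their priority, mirroring Hadoop's default policy.
class FifoScheduler:
    def __init__(self):
        self._queue = deque()

    def submit(self, job):
        self._queue.append(job)

    def next_job(self):
        # Return the oldest submitted job, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None

sched = FifoScheduler()
for job in ["etl-nightly", "ad-hoc-query", "urgent-report"]:
    sched.submit(job)
# "urgent-report" must wait behind both earlier submissions, however urgent it is.
```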

Hadoop operations can be used in three different modes. These are listed below:

  • Standalone mode: NameNode, DataNode, Secondary NameNode, Job Tracker, and Task Tracker do not run as separate daemons in Standalone mode. It is also called Local mode, and it is the mode Hadoop runs in by default.
  • Pseudo-distributed mode: A single node is used in this mode as well, but all the daemons of a cluster run as separate processes, independent of each other.
  • Fully distributed mode: This is the most important mode, as multiple nodes are used here. A few are used for the Resource Manager and NameNode, and the rest of the nodes are used for the NodeManagers and DataNodes.

In Hadoop, replication factor depicts the number of times the framework replicates or duplicates the Data blocks in a system. The default replication factor in Hadoop is 3 which can be manipulated as per the system requirements. The main advantage of the replication process is to ensure data availability.

We can configure the replication factor in hdfs-site.xml file which can be less than or more than 3 according to the requirements.
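For example, the replication factor can be overridden in hdfs-site.xml with the dfs.replication property (shown here lowered to 2; choose the value that suits your cluster):

```xml
<!-- hdfs-site.xml: override the default replication factor of 3 -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```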

In Hadoop, the primary phases of reducer are as follows:

  • Shuffle: In this phase, the mapper’s sorted output becomes the input to the reducer.
  • Sort: In this phase, Hadoop sorts the input to the reducer by key. The sort and shuffle phases happen concurrently.
  • Reduce: This phase occurs after sort and shuffle. In it, the output values associated with a specific key are reduced to consolidate the data into the final reducer output. The reducer output is not sorted again.
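The three phases can be mimicked in plain Python with sorting and grouping (a conceptual sketch, not MapReduce code):

```python
from itertools import groupby
from operator import itemgetter

# Pretend mapper output: unordered (key, value) pairs.
mapper_output = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# Shuffle/sort phase: order the pairs by key so equal keys sit together.
sorted_pairs = sorted(mapper_output, key=itemgetter(0))

# Reduce phase: fold each key's values into a single consolidated output.
reduced = {key: sum(value for _, value in group)
           for key, group in groupby(sorted_pairs, key=itemgetter(0))}
# reduced is {"a": 3, "b": 2}
```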

The distance between two nodes is the sum of their distances to their closest common ancestor in the network topology. We can use the getDistance() method to calculate the distance between two nodes.
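The idea behind the calculation (the real implementation is NetworkTopology.getDistance() in Hadoop's Java code) can be sketched like this, treating each node's network location as a path such as /datacenter/rack/node:

```python
# Sketch of Hadoop's network-distance computation: the distance between two
# nodes is the number of hops each must take up to their closest common ancestor.
def get_distance(location_a, location_b):
    a = location_a.strip("/").split("/")
    b = location_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

# Same node -> 0, same rack -> 2, different racks in one data center -> 4.
```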

In Hadoop, the Context object is used with the Mapper class so that the mapper can interact with the remaining parts of the system. Through the Context object, the job and system configuration details passed to its constructor can be easily obtained.

Information can easily be passed to methods like setup(), map(), and cleanup() using the Context object, and vital information can be made available through it during map operations.

In Apache Hadoop, Safe mode is used for maintenance purposes. It acts as a read-only mode for the NameNode in order to avoid any modifications to the file system. During Safe mode in HDFS, data blocks can’t be replicated or deleted; the NameNode collects block reports and statistics from all the DataNodes during this time.

The available components of Hive Data Model are as below:

  • Tables: These tables are similar to RDBMS tables. Joins, unions, and filters can be applied to them, and all the table data is stored in HDFS.
  • Partitions: We can specify partition keys in each table to determine how data is stored. Using partitions, we can restrict a query to specific partitions rather than scanning the whole table.
  • Buckets: Data in each partition can be further divided into buckets. With buckets, we can easily evaluate queries on a specific sample of the data.

In Hive, SerDe stands for Serialization and Deserialization. SerDe is a built-in library available through the Hadoop API, and it instructs Hive on how a record (row) should be processed.

The deserializer takes the binary representation of a record and translates it into a Java object that Hive can understand. The serializer then takes the Java object Hive has been working with and converts it into a format that HDFS can process and store.

The table-generating functions available in Hive are as follows:

  • json_tuple()
  • stack()
  • explode(array)
  • explode(map)

The objects created by create statement in MySQL are listed below:

  • Database
  • Index
  • Table
  • Trigger
  • Event
  • View
  • Function
  • User
  • Procedure

In Hive, .hiverc acts as the initialization file. Whenever you open the CLI (Command Line Interface) to write code for Hive, .hiverc is the first file that gets loaded. All the parameters that you have initially set are contained in this file.

For example, you can set column headers that you want to be visible in the query results, the addition of any jar files, etc. This file is loaded from the hive conf directory.

Metastore acts as the central repository for Hive metadata. It is used for storing the metadata of Hive tables i.e., schemas and locations.

The metastore persists this metadata in a relational database (RDBMS) rather than in HDFS.

Metastore consists of 3 types of modes for deployment. These are given below.

  • Local Metastore
  • Embedded Metastore
  • Remote Metastore

In Hive, multiple tables can be created for a single data file using the same HDFS directory. As we already know, the metastore acts as the central repository for Hive metadata, storing metadata like schemas and locations.

The data itself remains in the same file, so it becomes a very easy task to retrieve different results from the same data based on the different schemas.

In Hive, there are special tables in which the values of certain columns appear in a repeating manner (skew); these are called skewed tables. In Hive, while creating a particular table, we can declare it as SKEWED. All the skewed values in the table are written into separate files, and the remaining values are stored in another file.

While writing queries, skewed tables help to provide better performance. Syntax to define a particular table as ‘skewed’ during its creation is as written below using an example.

CREATE TABLE TableName (column1 STRING, column2 STRING) SKEWED BY (column1) ON ('value');

In MySQL, we can see the data structure of a table with the help of the DESCRIBE command.

The syntax to use this command is as follows.

DESCRIBE table_name;

We can see the list of all tables in MySQL using the SHOW command.

The syntax to use this command is as follows.

SHOW TABLES;

We can perform various operations on strings as well as on the substrings present in a table. In order to search for a specific string pattern in a table column, we can use the REGEXP operator.

Following are some of the ways big data and data analytics can positively impact a company’s business.

  • New products can be launched in the market according to exact customer needs using data analytics, thus enhancing the company’s revenue.
  • Production costs can be reduced to an extent that eventually helps grow the company’s revenue.
  • Business growth can be increased by using the data efficiently.
  • Big data helps to improve business processes and save time.

Advanced

Below are the steps that need to be followed in order to deploy a big data solution.

  • The first step is to integrate the data. We can use various types of data sources for this purpose, like Salesforce, SAP, MySQL, and other RDBMSs.
  • The second step is to store the extracted data in a database which can be either HDFS or NoSQL.
  • Using various processing frameworks, we can finally deploy our big data solution. We can use Pig, Spark, and MapReduce for this purpose.

FSCK stands for File System Consistency Check. Briefly, we can define FSCK as a command that is used in order to check any inconsistencies or any problems in HDFS file system or at the HDFS level.

Syntax of using FSCK command is as below.

hadoop fsck [GENERIC OPTIONS] <path> [-delete | -move | -openforwrite] [-files [-blocks [-locations | -racks]]]

YARN is short for Yet Another Resource Negotiator. It is considered one of the main components of Hadoop. YARN helps to process and run data stored in HDFS for stream processing, graph processing, batch processing, and interactive processing. So, briefly, we can say that YARN helps to run various types of distributed applications.

Using YARN, the efficiency of the system can be increased as data that is stored in HDFS is processed and run by various types of processing engines as depicted above.

It is also known for optimum utilization of all available resources that results in easy processing of a high volume of data.

In Hadoop, HDFS (the Hadoop Distributed File System) is considered the standard storage mechanism, and it is built on commodity hardware. Hadoop does not require a costly server with high processing power and large storage; we can use inexpensive systems with average processors and RAM. These systems are called commodity hardware.

Commodity hardware is affordable, easy to obtain, and compatible with various operating systems like Linux, Windows, and MS-DOS without requiring any special type of device or equipment. Another benefit of using commodity hardware is its scalability.

The various functions of Secondary NameNode are as follows.

  • FsImage: It stores a copy of the FsImage file and the EditLog.
  • Checkpoint: The Secondary NameNode periodically merges the EditLog into the FsImage; this merged image is called a checkpoint.
  • NameNode crash: The Secondary NameNode’s FsImage can be used to recreate the NameNode in case of a NameNode crash.
  • Update: It helps to keep the FsImage file on the Secondary NameNode updated, refreshing both the FsImage file and the EditLog automatically.

The Combiner, also known as a mini-reducer, acts as an optional step between Map and Reduce. Briefly, we can say that it takes the output from the Map function, summarizes the records that share the same key, and then passes the summarized records as input to the Reducer.

When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate data which, when passed to the Reducer for further processing, can cause congestion in the network. To deal with this congestion, the Hadoop framework uses the Combiner as an intermediate step between Mapper and Reducer to reduce network traffic.

In Hadoop, when we are dealing with big data systems, the size of the data is huge. Therefore, it is not good practice to move this large amount of data across the network; doing so may impact system throughput and cause network congestion.

To avoid these problems, Hadoop uses the concept of data locality. Briefly, it is the process of moving the computation towards the data rather than moving a huge amount of data towards the computation. In this way, data always remains local to its storage location. So, when a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes that contain the data related to that job.

The balancer is a utility provided by HDFS. As we know, DataNodes store the actual data related to any job or process. Datasets are divided into blocks, and these blocks are stored across the DataNodes in a Hadoop cluster. Some of these nodes end up underutilized and some overutilized by the storage of blocks, so a balance needs to be maintained.

This is where the balancer comes in: it analyses the block placement across the various nodes and moves blocks from overutilized to underutilized nodes until the cluster is deemed to be balanced.

In Hadoop, the distributed cache is a utility provided by the MapReduce framework. Briefly, we can say that it can cache files like jar files, archives, and text files when they are needed by an application.

When a MapReduce job is running, this utility caches the read-only files and makes them available to all the DataNodes. Each DataNode gets a local copy of the files, so all of them can be accessed locally. These files remain on the DataNodes while the job is running and are deleted once the job is completed.

The default size of Distributed cache is 10 GB which can be adjusted according to the requirement using local.cache.size.

In Hive, various SerDe implementations are available, and there is also a provision to create your own custom SerDe implementation. A few of the popular implementations are listed below.

  • DelimitedJSONSerDe
  • RegexSerDe
  • ByteStreamTypedSerDe
  • OpenCSVSerde

In Python, we can pass a variable number of arguments to a function when we are unsure how many arguments need to be passed. These arguments are passed using special symbols, as described below.

  • *args (non-keyword arguments): This symbol is used in a function to pass a variable number of non-keyword arguments, on which tuple operations can be performed. We use a single asterisk (*) before the parameter name to accept arguments of variable length.
  • **kwargs (keyword arguments): This symbol is used in a function to pass a variable number of keyword arguments as a dictionary, on which dictionary operations can be performed. We use a double asterisk (**) before the parameter name to denote this argument type.

Function flexibility can be achieved by passing these two types of special symbols.
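A short example shows both symbols in one function (the function name and the argument values are purely illustrative):

```python
def summarize(*args, **kwargs):
    """Accept any number of positional and keyword arguments."""
    total = sum(args)        # args arrives as a tuple of positional values
    labels = sorted(kwargs)  # kwargs arrives as a dict of keyword arguments
    return total, labels

# summarize(1, 2, 3, unit="rows", source="hdfs") returns (6, ["source", "unit"])
```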

The differences between Data warehouse and Database are given below.

| Parameter | Data warehouse | Database |
| --- | --- | --- |
| Definition | A system that collects and stores information from multiple data sources within an organization | An organised collection of logical data that is easy to search, manipulate, and analyse |
| Purpose and usage | Used for analysing your business | Used for recording data and performing various fundamental operations for your business |
| Data availability | Data is refreshed and captured from various data sources when required | Real-time data is always available |
| Type of data stored | Contains only summarized data | Contains detailed data |
| Usage of queries | Complex queries are used | Simple queries are used |
| Tables and joins | In a data warehouse, tables and joins are simple | In a database, tables and joins are complex |

The differences between OLAP and OLTP are given below.

| OLAP | OLTP |
| --- | --- |
| Used for managing informational data | Used for managing operational data |
| Database size is roughly 100 GB to TB | Database size is roughly 100 MB to GB |
| Contains a large volume of data | The volume of stored data is comparatively small |
| Access is mostly read-only, with occasional batch writes | It has both read and write access modes |
| It is partially normalized | It is completely normalized |
| Processing speed depends on many factors, such as query complexity and the number of files involved | It has very high processing speed |
| Market-oriented; mainly used by analysts, managers, and executives | Customer-oriented; mainly used by clerks, clients, and IT professionals |

The differences between the NoSQL and SQL databases are as below.

| Parameter | NoSQL database | SQL database |
| --- | --- | --- |
| History | Developed in the late 2000s with a focus on scalability and rapid application change | Developed in the 1970s with a focus on reducing data duplication |
| Data storage model | Tables with rows and dynamic columns | Tables with fixed rows and columns |
| Schemas | Flexible | Rigid |
| Scaling | Horizontal scaling | Vertical scaling |
| Joins usage | Joins are not required in NoSQL | Joins are typically required in SQL |
| Examples | MongoDB and CouchDB | MySQL, Oracle, Microsoft SQL Server, and PostgreSQL |

In modern applications that have complex and constantly changing data sets, NoSQL seems to be the better option compared to a traditional database, because such applications need a flexible data model that doesn’t have to be defined up front.

NoSQL provides various agile features that help companies go to market faster and ship updates faster. It also helps to store real-time data.

When dealing with an increased data-processing load, it is usually a better approach to scale out rather than scale up. NoSQL is a better option here as it is cost-effective and can deal with huge volumes of data. Although relational databases provide better connectivity with analytical tools, NoSQL is still often preferable because it offers many features a traditional database does not.

In Python, both list and tuple are classes of data structures. The differences between list and tuple are as follows.

| List | Tuple |
| --- | --- |
| Lists are mutable, i.e., they can be modified | Tuples are immutable, i.e., they can't be modified |
| Memory consumption of a list is higher | Memory consumption of a tuple is lower |
| A list is more prone to errors and unexpected changes | A tuple is not prone to such errors and unexpected changes |
| It has many built-in methods | It has fewer built-in methods |
| Operations like insertion and deletion perform better on a list | A tuple is mainly used for accessing elements |
| A list is dynamic, so it is slower than a tuple | A tuple is static, so it is faster |
| Syntax: list_data1 = ['list', 'can', 'be', 'modified', 'easily'] | Syntax: tuple_data1 = ('tuple', 'cannot', 'be', 'modified', 'ever') |
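The mutability difference is easy to demonstrate:

```python
# Lists can be changed in place; tuples raise a TypeError on assignment.
list_data = ["list", "can", "be", "modified"]
tuple_data = ("tuple", "cannot", "be", "modified")

list_data[0] = "lists"            # fine: lists are mutable

try:
    tuple_data[0] = "tuples"      # not allowed: tuples are immutable
except TypeError as exc:
    error_message = str(exc)      # "'tuple' object does not support item assignment"
```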

We can come across a situation in which a table contains multiple duplicate data entries; fetching all of them makes no sense and introduces redundancy when we are fetching records from that table. We need to fetch only the unique data entries.

For achieving this, DISTINCT keyword is provided by the SQL which we can use with the SELECT statement so that we can eliminate the duplicate data entries and can only fetch unique data entries.

The syntax to use this keyword to eliminate duplicate data is as below:

SELECT DISTINCT column1, column2, column3...columnM
FROM table_name1
WHERE [conditions]

We can also use the UNIQUE keyword to handle duplicate data. The UNIQUE constraint is used in SQL to ensure that all the values present in a specific column are different.

COSHH stands for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems.

In a Hadoop system, many tasks are multiplexed and executed in a common data centre. This leads to the Hadoop cluster being shared among many users, which increases system heterogeneity, an issue the default Hadoop schedulers do not give much importance. To rectify that, COSHH was specially designed and implemented to provide scheduling at both the cluster and application levels, which improves job completion time.

This question mainly focuses on knowing how you actually deal with unexpected problems in high-pressure situations.

Unexpected problems are inevitable, and many situations arise in which you encounter them while doing your daily routine jobs or tasks. The same is true of data maintenance.

Data maintenance can be considered one of the daily tasks that need to be monitored properly to make sure all the built-in tasks and corresponding scripts are executing as expected. As an example, to prevent the addition of corrupt indexes into the database, we can create maintenance tasks that block such indexes and thereby avoid any serious damage.

Advantages and disadvantages of cloud computing are as follows.

Advantages:

  • There is no upfront cost for infrastructure; you pay only for the services you use.
  • Maintenance is negligible.
  • It is very easy to use and highly reliable.
  • It provides high storage capacity.
  • It helps to provide recovery, data backup and data control.

Disadvantages:

  • It requires good internet connectivity with excellent bandwidth in order to function well.
  • Security and technical issues might arise.
  • We can only control limited infrastructure.
  • Flexibility will be less.
  • You need to deal with outages.

This question mainly focuses on knowing what problems you have faced while working as a data engineer in your prior experience. We can mention some of the most common problems here as an answer.

  • As data engineers, we need to deal with huge amounts of data storage and the corresponding information derived from that data.
  • Problems can be caused by restrictions on the required resources.
  • Decisions need to be made about the best tools that can help provide the desired results.
  • Continuous processing and transferring of data happen in real-time scenarios.

In the modern world, data has become the new currency. Both the data engineer and data scientist roles revolve around data, but their duties differ as described below.

Data Engineer:

  • Focuses on the collection and preparation of data: designing and implementing pipelines that manipulate and transform unstructured data into a format that data scientists can use for analysis.
  • Given the importance of data, it is the data engineer's duty to keep data safe and secure and to take backups to avoid any loss.
  • Big Data and database management skills are a must for a data engineer.

Data Scientist:

  • Focuses on using that prepared data to extract patterns with analytical tools, mathematics and statistical knowledge, and on providing deep insights that may positively impact the business.
  • After performing the analysis, it is the data scientist's job to convey the results to stakeholders, so good communication skills are a must.
  • Machine learning is a must-have skill for a data scientist.

NFS is the Network File System and HDFS is the Hadoop Distributed File System. The differences between the two are as follows.

NFS:

  • Only small amounts of data can be stored and processed with NFS.
  • Stores data on a single dedicated machine (or a disk of a dedicated network machine); clients access the files over the network.
  • Not fault tolerant: data can be lost if a failure occurs, and it cannot be recovered.
  • No data redundancy occurs, since all data is stored on a single dedicated machine.

HDFS:

  • Large amounts of data, or big data, can be stored and processed with HDFS.
  • Stores data in a distributed manner: data is stored across many dedicated machines or network computers.
  • Fault tolerant: data can be easily recovered if a node fails.
  • Data redundancy can be a point of concern, because the same data files are replicated across multiple dedicated machines.

Feature selection is the process of identifying and selecting the most relevant features that can be input to the machine learning algorithms for the purpose of model creation.

Feature selection techniques discard redundant or unrelated features before they reach a machine learning model, reducing the number of input variables and narrowing the inputs to only the relevant features. Advantages of using feature selection techniques include the following.

  • Training time is saved, since after feature selection only a subset of the desired features is used.
  • It leads to simpler models that are easier to explain than complex ones.
  • The more features there are, the larger the volume of the feature space, which effectively limits data availability. By eliminating unrelated features, this technique reduces dimensionality.
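As an illustrative sketch of a filter-style feature selection technique (the helper names and toy data are hypothetical): each feature is scored by the absolute value of its Pearson correlation with the target, and only the top-scoring features are kept.

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_top_k(features, target, k):
    """Rank features by |correlation with target| and keep the top-k names."""
    scores = {name: abs(correlation(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

features = {
    "age":    [25, 32, 47, 51, 62],
    "noise":  [3, 1, 4, 1, 5],        # unrelated column, should be dropped
    "income": [30, 40, 52, 60, 70],
}
target = [0, 0, 1, 1, 1]
print(select_top_k(features, target, 2))   # ['age', 'income']
```

Libraries such as scikit-learn offer more sophisticated selectors, but the idea is the same: score each candidate feature and keep only the strongest subset.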

There are a few ways to handle missing values in Big Data. These are as follows.

  • Use of median/mean: missing values in a numeric column can be filled with the median or mean of the remaining values in that column.
  • Deletion of rows or columns that have missing values: we can delete rows or columns from a table that contain missing values, but this option should only be used when the number of missing values is small. A column can be deleted if it has missing values in more than half of the table rows; similarly, a row can be deleted if it has missing values in more than half of the table columns.
  • Use of categorical data: if the data in a column can be classified, we can treat it as a categorical variable and fill the empty values with the most frequently used value.
  • Predictive values: if we understand the nature of the variable, we can predict the missing values and fill them in using regression techniques.

Other than the above-mentioned techniques, we can also use the K-NN algorithm, the Random Forest algorithm, the Naive Bayes algorithm, and the Last Observation Carried Forward (LOCF) method to handle missing values in Big Data.
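The first technique above, mean imputation, can be sketched in a few lines (the helper name and data are hypothetical):

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 29]
print(impute_mean(ages))   # [23, 27.5, 31, 27, 27.5, 29]
```

In practice, pandas offers the same idea via `fillna`, and scikit-learn via `SimpleImputer`.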

Outliers are data records that differ from normal records in some of their characteristics. To detect outliers, it is very important to first decide what characterises a normal record. When outliers are fed into algorithms or analytical systems, they can produce abnormal results that distort the analysis, so it is important to detect them to avoid such abnormalities.

Outliers can sometimes be detected just by inspecting a table or graph directly. For example, suppose a table contains the name and age of a few people, and one row records an age of 500; we can immediately tell the value is invalid, since an age of 40, 50 or 55 is plausible but 500 is not. We can guess the age is wrong without knowing the true value. This kind of manual detection works when dealing with a table of limited records, but it becomes impractical once a table contains thousands of records, which is where statistical methods come in.
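For larger tables, a common statistical approach is Tukey's fences: flag values outside the range [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch (the function name and data are hypothetical):

```python
import statistics

def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

ages = [40, 50, 55, 45, 48, 52, 500]
print(find_outliers(ages))   # [500]
```

Other popular choices include z-scores and model-based detectors such as isolation forests.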

The differences between the K-Nearest Neighbours (KNN) and K-means methods are as follows.

KNN:

  • A supervised learning algorithm that can be used for classification or regression. KNN classifies a point based on its K nearest labelled neighbours, so the category of any point can be easily determined.
  • Performs better when all the data is on the same scale.

K-means:

  • An unsupervised learning algorithm used for clustering. You select K clusters and each data point is placed into one of those K clusters.
  • The same-scale consideration does not hold in the same way for K-means.
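The KNN side of the comparison can be sketched in pure Python (the function name and toy points are hypothetical): classify a query point by majority vote among its k nearest labelled neighbours.

```python
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled points."""
    # Sort point indices by squared Euclidean distance to the query
    nearest = sorted(range(len(points)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query)))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["small", "small", "small", "large", "large", "large"]
print(knn_predict(points, labels, (2, 2)))   # 'small'
```

Because distances drive the vote, features on very different scales should be normalised first, which is exactly the same-scale point made above.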

Logistic regression is a predictive model used to analyse large datasets and determine a binary output from a given input variable. The binary output can take only a limited set of values, such as 0/1, true/false or yes/no.

Logistic regression uses a sigmoid function to determine the possible outcomes and their corresponding probabilities of occurrence, and maps both on a graph. An acceptance threshold determines whether a particular instance belongs to the class: if the predicted probability exceeds the threshold, the instance is assigned to the class; otherwise it is not. There are three types of logistic regression, listed below.

  • Binary logistic regression
  • Ordinal logistic regression
  • Multinomial logistic regression
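The sigmoid-plus-threshold mechanism described above can be sketched directly (the function names are hypothetical):

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict(score, threshold=0.5):
    """Binary decision: class 1 if the probability clears the acceptance threshold."""
    return 1 if sigmoid(score) >= threshold else 0

print(sigmoid(0))      # 0.5 -- the decision boundary
print(predict(2.0))    # 1
print(predict(-2.0))   # 0
```

In a fitted model, the score would be a learned linear combination of the input features; here it is passed in directly to keep the sketch minimal.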

A/B testing, also known as split testing, is a randomised statistical experiment performed on two variants (A and B) of a webpage or application: each variant is shown to a set of end users to analyse which of the two creates a larger impact, or which proves more effective and beneficial. A/B testing has many benefits, including the following.

  • It helps to improve user engagement.
  • It helps to provide a better end product by improving content.
  • It helps to increase conversion rates and decrease bounce rates.
  • It greatly simplifies the analysis process.
  • Sales volume can be increased as a result of A/B testing.
  • It helps to reduce risks as whatever is not required in a page or app can be easily removed and mistakes can be avoided.
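Deciding which variant "won" is usually a significance test on the two conversion rates. A minimal sketch, assuming hypothetical conversion counts and a standard two-proportion z-test:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def p_value(z):
    """Two-sided p-value from the standard normal CDF (via math.erf)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Variant A: 200/2000 conversions; variant B: 260/2000 conversions
z = two_proportion_z(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(round(z, 2), round(p_value(z), 4))   # z is about 2.97, well below p = 0.05
```

A small p-value suggests the observed lift of variant B is unlikely to be random noise, so the variant can be rolled out with more confidence.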

Collaborative filtering is a technique that uses various algorithms to provide personalised recommendations to users. It is also known as social filtering. Popular websites that use this kind of filtering include iTunes, Amazon, Flipkart and Netflix.

In collaborative filtering, a user receives personal recommendations compiled from the common interests or preferences of other users, with the help of prediction algorithms. For example, suppose user A visits Amazon and buys items 1 and 2; when user B buys item 1, item 2 will be recommended to user B based on that predictive analysis.
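A minimal user-based sketch of this idea (the helper names and rating vectors are hypothetical): find the user most similar to the target by cosine similarity of their rating vectors, then suggest items that user rated which the target has not.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 means unrated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, others, item_names):
    """Suggest items the most similar user rated that the target has not."""
    best = max(others, key=lambda other: cosine(target, other))
    return [name for name, mine, theirs in zip(item_names, target, best)
            if mine == 0 and theirs > 0]

items  = ["item1", "item2", "item3", "item4"]
user_a = [5, 4, 0, 0]   # target: has not rated item3 or item4
user_b = [5, 5, 4, 0]   # similar taste, also rated item3
user_c = [0, 0, 1, 5]   # dissimilar taste
print(recommend(user_a, [user_b, user_c], items))   # ['item3']
```

Production systems use the same principle at scale, typically with matrix factorisation rather than pairwise similarity.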

The "is" operator checks reference equality: whether two references or variables point to the same object. It returns True or False accordingly.

The "==" operator checks value equality: whether two variables hold equal values. It returns True or False accordingly.

We can take any example with the help of two lists X and Y.

X = [1,2,3,4,5]

Y = [1,2,3,4,5]

Z = Y

  1. X == Y evaluates to True because list X contains the same values as list Y.
  2. X is Y evaluates to False because, even though the two lists hold the same values, they are distinct objects.
  3. Z is Y evaluates to True because both names refer to the same object.

The Python memory manager handles memory management in Python. All Python data structures and objects are stored in a private heap, which only the memory manager maintains; developers cannot access this private heap space directly. The memory manager allocates heap space to objects.

The memory manager contains object-specific allocators that allocate space for particular object types, along with raw memory allocators that ensure space is reserved in the private heap.

Python also provides a garbage collector, so developers do not need to free memory manually. Its main job is to reclaim unused space and make it available for new objects in the private heap.
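Both mechanisms can be observed from the standard library: reference counts via `sys.getrefcount`, and the cyclic garbage collector via the `gc` module. A small sketch (the exact counts can vary by interpreter, so the comments are indicative):

```python
import gc
import sys

data = [1, 2, 3]
alias = data                  # a second reference to the same list
# getrefcount reports one extra reference (its own argument)
print(sys.getrefcount(data))  # typically 3 here

del alias                     # drop one reference
print(sys.getrefcount(data))  # typically 2

# Reference cycles need the cyclic collector: this list refers to itself,
# so its refcount never reaches zero on its own.
a = []
a.append(a)
del a
print(gc.collect() >= 0)      # the collector sweeps up such cycles
```

CPython frees most objects immediately when their refcount hits zero; the `gc` collector exists specifically for cycles like the one above.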

Decorators are one of the most important and powerful tools in Python. They let us modify the behaviour of a function or class without changing it permanently.

A decorator wraps a function or class with another function, modifying the wrapped object's behaviour without making any permanent changes to its source code.

In Python, functions are first-class objects, so they can easily be used or passed as arguments. In a decorator, the function being decorated is passed as an argument to another function and then called inside the wrapper function.
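A minimal sketch of the pattern just described (the decorator name is hypothetical): `log_calls` wraps any function so its name is printed before each call, without touching the function's own code.

```python
import functools

def log_calls(func):
    """Decorator: wrap `func` so its name is printed before each call."""
    @functools.wraps(func)        # preserve the original name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

print(add(2, 3))   # prints "calling add", then 5
```

The `@log_calls` line is syntactic sugar for `add = log_calls(add)`, which is exactly the "function passed as an argument, then called inside the wrapper" description above.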

append(): In Python, when an argument is passed to append(), it is added to the list as a single entity. In other words, when we append one list to another, the whole list is added as a single object at the end, so the length of the list increases by exactly 1. append() has constant time complexity, O(1).

Example: Let’s take an example of two lists as shown below.

list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.append(list2)
list1 will now become: ["Alpha", "Beta", "Gamma", ["Delta", "Eta", "Theta"]]

The length of list1 will now become 4 after addition of second list as a single entity.

extend(): In Python, when an argument is passed to extend(), the argument is iterated over and each of its elements is added to the list individually. The length of the list therefore increases by the number of elements added from the other list. extend() has time complexity O(n), where n is the number of elements in the argument passed to it.

Example: Let’s take an example of two lists as shown below.

list1 = ["Alpha", "Beta", "Gamma"]
list2 = ["Delta", "Eta", "Theta"]
list1.extend(list2)
list1 will now become: ["Alpha", "Beta", "Gamma", "Delta", "Eta", "Theta"].

The length of list1 will now become 6 in this scenario.

In Python, loop statements are used to perform repetitive tasks efficiently. In some scenarios, however, we need to exit a loop early or skip certain iterations. For these cases Python provides loop control statements:

  • Break statement
  • Continue statement
  • Pass statement
  1. Break statement: terminates the loop in which it appears. After a break, control moves to the first statement following the loop, if any. Inside nested loops, break terminates only the innermost loop containing it.
  2. Continue statement: the opposite of break; instead of terminating the loop, it ends the current iteration and continues with the next one. When a continue executes, the rest of the code in the loop body is skipped for that iteration and control shifts to the next iteration.
  3. Pass statement: does nothing when executed. It is used where a statement is syntactically required but no code needs to run, for example to write an empty loop body, or stub functions and classes.
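The three statements can be shown in one short sketch:

```python
# break: stop at the first number divisible by 7
for n in range(1, 20):
    if n % 7 == 0:
        print("found", n)   # found 7
        break

# continue: skip odd numbers, keep only the evens
evens = []
for n in range(10):
    if n % 2:
        continue            # skip the rest of this iteration
    evens.append(n)
print(evens)                # [0, 2, 4, 6, 8]

# pass: a syntactically required body that does nothing yet
def not_implemented_yet():
    pass
```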

In Python, SciPy is an open-source library used to solve engineering, mathematical, technical and scientific problems. With SciPy we can easily manipulate data and perform data visualisation using a wide range of high-level commands. SciPy is pronounced "Sigh Pie".

NumPy acts as the foundation of SciPy, which is built on top of it: SciPy's routines are designed to work with NumPy arrays. SciPy also provides numerical routines for tasks such as optimisation and numeric integration. To set up SciPy on your system, use the commands below for your operating system.

Windows:  

Syntax: Python3 -m pip install --user numpy scipy  

Linux:

Syntax: sudo apt-get install python-scipy python-numpy

Mac:

Syntax: sudo port install py35-scipy py35-numpy

BETWEEN operator: In SQL, the BETWEEN operator tests whether an expression lies within a defined range of values. The range is inclusive, and the values can be of any comparable type, such as dates, numbers or text. BETWEEN can be used with SELECT, INSERT, DELETE and UPDATE statements. The syntax is as follows.

SELECT column_name(s)
FROM table_name
WHERE column_name BETWEEN value1 AND value2;

Output: It returns all values of column_name that lie between value1 and value2, including those two values.

IN operator: In SQL, the IN operator checks whether an expression matches any value in a specified list, eliminating the need for multiple OR conditions. The NOT IN operator functions exactly the opposite way, excluding the listed values from the output. Both can be used with SELECT, INSERT, DELETE and UPDATE statements. The syntax is as follows.

IN: SELECT column_name(s)
      FROM table_name
      WHERE column_name IN (list_of_values);

Output: It returns all values of column_name that match the specified "list_of_values".

NOT IN: SELECT column_name(s)
        FROM table_name
        WHERE column_name NOT IN (list_of_values);

Output: It returns all values of column_name excluding the specified "list_of_values".
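Both operators can be tried end to end with Python's built-in sqlite3 module (the table and values here are hypothetical examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Asha", 40000), ("Ben", 55000), ("Cara", 70000), ("Dev", 90000)])

# BETWEEN is inclusive on both ends
between = conn.execute(
    "SELECT name FROM employees WHERE salary BETWEEN 50000 AND 80000").fetchall()
print(between)   # [('Ben',), ('Cara',)]

# IN matches any value in the list, replacing chained OR conditions
in_list = conn.execute(
    "SELECT name FROM employees WHERE name IN ('Asha', 'Dev')").fetchall()
print(in_list)   # [('Asha',), ('Dev',)]
conn.close()
```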

In SQL, we can give temporary names, called aliases, to columns or tables for a specific query. We use an alias when we do not want to use the original name of the table or column; the alias's scope is temporary and limited to that query.

We use aliases to improve the readability of a column or table name. The change is temporary, and the original names stored in the database never change. Since table or column names are sometimes complex, it is often preferable to give them an easy temporary name. Below is the syntax for both table and column aliases.

Column Alias:

Syntax: SELECT column as alias_name FROM table_name;

Explanation: Here alias_name is the temporary name that is given to column name in the given table table_name.

Table Alias:

Syntax: SELECT column FROM table_name as alias_name;

Explanation: Here alias_name is the temporary name that is given to table table_name.

SQL injection is the insertion of malicious SQL commands into database queries to exploit the user data stored there. Through these statements, attackers can take control of the database and destroy or manipulate sensitive information. These insertions usually happen via inputs on web pages, making SQL injection one of the most common web hacking techniques.

In web applications, web servers communicate with database servers to store and retrieve user data. Attackers supply malicious SQL fragments as input, which execute when the web server queries the database server, compromising the security of the web application.

To avoid security breaches that could impact critical data, use restricted access privileges and user authentication, and avoid running the application under system administrator accounts.
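The standard code-level defence is parameterised queries. A sketch with Python's sqlite3 (the table and the classic `' OR '1'='1` payload are illustrative) contrasts unsafe string formatting with a bound parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'secret')")

user_input = "' OR '1'='1"   # classic injection payload

# UNSAFE: string formatting splices the payload into the SQL itself
unsafe = "SELECT * FROM users WHERE name = '%s'" % user_input
unsafe_rows = conn.execute(unsafe).fetchall()
print(len(unsafe_rows))      # 1 -- the WHERE clause is bypassed

# SAFE: the ? placeholder treats the input as data, never as SQL
safe = conn.execute("SELECT * FROM users WHERE name = ?",
                    (user_input,)).fetchall()
print(len(safe))             # 0 -- no user is literally named "' OR '1'='1"
conn.close()
```

Every major database driver and ORM supports placeholders like this, so raw string interpolation into SQL should never be necessary.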

In SQL, a trigger is a stored procedure that is invoked automatically when a triggering event occurs in the database. Triggering events include inserting, deleting or updating rows in a particular table. For example, a trigger can fire when a new row is added to or deleted from a table, or when a row is updated. The syntax to create a trigger in SQL is as follows.

Syntax:

create trigger [trigger_name]
[before | after]
{insert | update | delete}
on [table_name]
[for each row]
[trigger_body]

Explanation: 

1. Trigger will be created with a name as [trigger_name] whose execution is determined by [before | after].

2. {insert | update | delete} are examples of DML operations.

3. [table_name] is the table which is associated with trigger.

4. [for each row] determines the rows for which trigger will be executed.

5. [trigger_body] determines the operations that needs to be performed after trigger is invoked.
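A concrete instance of that syntax can be run with Python's sqlite3 (the table names and the audit scenario are hypothetical): an AFTER INSERT trigger writes a row to an audit table whenever an order is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE audit_log (order_id INTEGER, note TEXT);

-- Fire after every INSERT on orders and record it in the audit table
CREATE TRIGGER log_new_order
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    INSERT INTO audit_log VALUES (NEW.id, 'order created');
END;
""")
conn.execute("INSERT INTO orders (amount) VALUES (250)")
log = conn.execute("SELECT * FROM audit_log").fetchall()
print(log)   # [(1, 'order created')]
conn.close()
```

`NEW` refers to the row that fired the trigger; SQLite (like most databases) also exposes `OLD` for UPDATE and DELETE triggers.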

Description: Data Engineering is an important term in big data. It is the process of transforming raw data (generated from various sources) into useful information that can serve many purposes. Data Engineering has become one of the most popular career choices today.

According to one study, the global big data and data engineering services market is expected to grow from USD 29.50 billion in 2017 to USD 77.37 billion by 2023, at a Compound Annual Growth Rate (CAGR) of 17.6% over the forecast period (2018–2023, with 2017 as the base year). A data engineer takes on many responsibilities daily, from collecting data to analysing it with the help of many tools.

If you are interested in data engineering and looking for top interview questions and answers in the field, the beginner and advanced level questions above are a good fit, covering data engineering skills such as Python, Big Data, Hadoop, SQL and databases. Data analyst and data engineer jobs are growing fast, and the market has plenty of opportunities for both freshers and experienced engineers worldwide. Good conceptual knowledge and a strong hold on the underlying logic will help you crack interviews at many reputed companies. The questions above are designed to build a deep understanding of data engineering concepts; we have tried to cover almost every topic.

Going through the material above, you will find questions from beginner to advanced level to match your expertise. They will give you an edge over other applicants for data engineering jobs. To study data engineering topics in depth, you can enroll in big data courses on KnowledgeHut to boost your basic and advanced skills.

Best of Luck.

