

This is a frequently asked question in Spark interview questions.
The Spark API provides several key features that are useful for real-time processing; most of them come with good library support alongside the real-time processing capability.
Below are the key features provided by the Spark framework:
Spark Core is the heart of the Spark framework and supports a functional programming style in languages such as Java, Scala, and Python; however, most new releases typically arrive for the JVM languages first and are introduced for Python later.

APIs such as reduce, collect, and aggregation, together with stream, parallel stream, and Optional support, can easily handle use cases where we are dealing with large volumes of data.
Key points are as follows:
Spark is primarily used as a data processing framework; however, it can also be used for data analysis and data science.
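As a minimal sketch of the functional style described above (assuming a local Spark setup; the object name, dataset, and values are illustrative), a word count using the core RDD API in Scala might look like this:

```scala
import org.apache.spark.sql.SparkSession

object FunctionalStyleSketch {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration
    val spark = SparkSession.builder()
      .appName("functional-style-sketch")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext

    // A tiny in-memory dataset; in practice this would come from a file or other source
    val lines = sc.parallelize(Seq(
      "spark makes data processing simple",
      "spark supports a functional style"
    ))

    // Functional transformations: flatMap, map, reduceByKey
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}
```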
Expect to come across this popular question in Apache Spark interview questions.
Spark Streaming provides a micro-batch-oriented stream processing engine. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and it can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
Below are the other key benefits that Spark Streaming supports.
Spark Streaming receives an input data stream and divides it into micro-batches, which are then processed by the Spark engine to generate the final stream of results in batches.
The sketch below illustrates this micro-batch workflow.
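A minimal sketch of the micro-batch workflow, assuming a text stream on a local TCP socket (the host, port, and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Each micro-batch covers a 5-second window of incoming data
    val ssc = new StreamingContext(conf, Seconds(5))

    // Ingest lines from a TCP socket; Kafka, Flume, or Kinesis sources work similarly
    val lines = ssc.socketTextStream("localhost", 9999)

    // High-level functions (flatMap, map, reduceByKey) applied to each micro-batch
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```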
Spark SQL provides programmatic abstractions in the form of DataFrames and Datasets, which work on the principle of a distributed SQL query engine. Spark SQL simplifies interaction with large amounts of data through the DataFrame and Dataset APIs.
Spark SQL also plays a vital role in query optimization through the Catalyst optimizer, and it supports UDFs, built-in functions, and aggregate functions.
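A brief sketch of the DataFrame API and SQL side by side (the input file, column names, and view name are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()

    // Illustrative input: a JSON file with "department" and "salary" columns
    val employees = spark.read.json("employees.json")

    // DataFrame API with a built-in aggregate function
    employees
      .groupBy("department")
      .agg(avg("salary").alias("avg_salary"))
      .show()

    // The same query expressed as SQL against a temporary view
    employees.createOrReplaceTempView("employees")
    spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department").show()

    spark.stop()
  }
}
```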
It's no surprise that this one pops up often in Spark interview questions.
- Performance
- Data Processing
- Interactivity
- Difficulty
- Independent of Hadoop
A common question in interview questions on Spark, don't miss this one.
Below are the core components of the Spark ecosystem (a short sketch of their entry points follows this list).
Spark Core:
Spark Core is the basic engine for large-scale parallel and distributed data processing. It performs various important functions such as memory management, job monitoring, fault tolerance, job scheduling, and interaction with the storage system.
Spark Streaming:
Spark Streaming enables scalable, fault-tolerant processing of live data streams by dividing the incoming data into micro-batches.
Spark SQL:
Spark SQL provides structured data processing through the DataFrame and Dataset APIs and a distributed SQL query engine.
MLlib:
MLlib is Spark's scalable machine learning library, covering common algorithms such as classification, regression, clustering, and collaborative filtering.
GraphX:
GraphX is Spark's API for graphs and graph-parallel computation.
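A compact, illustrative sketch of the entry points for these components (the object name and data are assumptions; GraphX construction is only noted in a comment for brevity):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.ml.classification.LogisticRegression

object EcosystemEntryPoints {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ecosystem-sketch").master("local[*]").getOrCreate()

    // Spark Core: the low-level RDD API, reached through SparkContext
    val rdd = spark.sparkContext.parallelize(1 to 10)

    // Spark SQL: DataFrames, Datasets, and SQL queries via SparkSession
    val df = spark.range(10).toDF("id")

    // Spark Streaming: micro-batch stream processing via StreamingContext
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // MLlib: machine learning algorithms (DataFrame-based API shown here)
    val lr = new LogisticRegression()

    // GraphX: graph-parallel computation; a Graph is built from vertex and edge RDDs
    // (see org.apache.spark.graphx.Graph)

    println(s"core count = ${rdd.count()}, sql rows = ${df.count()}")
    spark.stop()
  }
}
```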
One of the most frequently posed Spark interview questions, be ready for it.
Apache Spark has the following key features:
Polyglot:
Spark code can be written in Java, Scala, Python, or R. It also provides interactive modes in Scala and Python.
Performance:
Apache Spark can be up to 100 times faster than MapReduce for in-memory workloads.
Data Formats:
Spark supports multiple data sources such as Parquet, CSV, JSON, Hive, Cassandra, and HBase (a short read sketch follows this list).
Lazy Evaluation:
Spark delays its execution until it is necessary. Transformations are added to the DAG and are executed only when an action is performed.
Real-time Computation:
Spark's real-time computation has low latency because of its in-memory processing and maximum use of the cluster.
Hadoop Integration:
Spark provides good compatibility with Hadoop. Spark is a potential replacement for MapReduce functions of Hadoop as Spark can run on top of an existing Hadoop cluster using YARN.
Machine Learning:
As Spark ships with many built-in libraries, including MLlib, it provides Data Engineers and Data Scientists with a powerful unified engine that is fast and easy to use.
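As mentioned in the Data Formats point above, a minimal read sketch (the file paths are placeholder assumptions):

```scala
import org.apache.spark.sql.SparkSession

object DataFormatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-formats-sketch")
      .master("local[*]")
      .getOrCreate()

    // The same reader API handles several formats; paths below are placeholders
    val parquetDf = spark.read.parquet("events.parquet")
    val jsonDf    = spark.read.json("events.json")
    val csvDf     = spark.read
      .option("header", "true")      // treat the first line as column names
      .option("inferSchema", "true") // let Spark guess column types
      .csv("events.csv")

    parquetDf.printSchema()
    jsonDf.printSchema()
    csvDf.printSchema()

    spark.stop()
  }
}
```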
A Spark application can be run in the following three modes:
Local mode:
This mode runs the entire Spark application on a single machine. It achieves parallelism through threads on that single machine. This is a common way to learn Spark, to test applications, or to experiment iteratively during local development. However, local mode is not recommended for running production applications.
Cluster mode:
Cluster mode is the most common way of running Spark applications on a computer cluster. In cluster mode, the user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager then launches the driver process on one of the worker nodes inside the cluster, in addition to the executor processes, which means the cluster manager is responsible for maintaining all Spark application-related processes.
Client mode:
Client mode is nearly the same as cluster mode, except that the Spark driver remains on the client machine, i.e., the machine that submitted the application. This means the client machine is responsible for maintaining the Spark driver process, while the cluster manager maintains the executor processes. A brief sketch of how these modes are selected follows.
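A minimal sketch of how the mode is chosen (the object name and paths are illustrative): local mode is typically hard-coded only during development, while client and cluster modes are usually selected at submit time.

```scala
import org.apache.spark.sql.SparkSession

object DeployModeSketch {
  def main(args: Array[String]): Unit = {
    // Local mode: driver and executors run as threads inside this single JVM
    val spark = SparkSession.builder()
      .appName("deploy-mode-sketch")
      .master("local[*]") // use all available cores on this machine
      .getOrCreate()

    println(spark.range(100).count())
    spark.stop()

    // For client or cluster mode, omit .master(...) above and pass the options to spark-submit, e.g.:
    //   spark-submit --master yarn --deploy-mode client  --class DeployModeSketch app.jar
    //   spark-submit --master yarn --deploy-mode cluster --class DeployModeSketch app.jar
  }
}
```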
A staple in Spark interview questions, be prepared to answer this one.
RDD
RDDs are the low-level API; they were the primary API in the Spark 1.x series and are still available in 2.x, although they are not commonly used directly. However, all Spark code you run, whether it uses DataFrames or Datasets, compiles down to RDDs.
RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable, partitioned collection of records that can be operated on in parallel.
DataFrame
DataFrames are table-like collections with well-defined rows and columns. Each column must have the same number of rows as all other columns, and each column has type information that must be consistent for every row in the collection. To Spark, a DataFrame represents an immutable, lazily evaluated plan that specifies what operations to apply to data residing at a location in order to generate some output. When we perform an action on a DataFrame, we instruct Spark to perform the actual transformations and return the result.
Dataset
Datasets are a foundational type of the Structured APIs; DataFrames are simply Datasets of type Row.
Datasets are like DataFrames, but they are strictly a JVM (Java Virtual Machine) language feature that works only with Java and Scala. We can also say Datasets are a 'strongly typed, immutable collection of objects that are mapped to a relational schema' in Spark.
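A short sketch contrasting the three abstractions (the Person case class and sample values are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record type for the typed Dataset API
case class Person(name: String, age: Int)

object StructuredApisSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-apis-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD: low-level, schema-less collection of objects
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 34), Person("Raj", 41)))

    // DataFrame: Dataset[Row] with named, typed columns and a lazily evaluated plan
    val df = rdd.toDF()
    df.filter($"age" > 35).show()

    // Dataset: strongly typed; field access is checked at compile time
    val ds = rdd.toDS()
    ds.filter(_.age > 35).show()

    spark.stop()
  }
}
```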
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. SparkContext hands a logical execution plan over to the DAGScheduler, which in turn translates it into a set of stages that are submitted as TaskSets for execution.
TaskScheduler is responsible for submitting tasks for execution in a Spark application. It tracks the executors in the application using the executorHeartbeatReceived and executorLost methods, which report active and lost executors, respectively. Spark comes with the following custom TaskSchedulers: TaskSchedulerImpl, the default TaskScheduler (which the two YARN-specific TaskSchedulers extend); YarnScheduler, for Spark on YARN in client deploy mode; and YarnClusterScheduler, for Spark on YARN in cluster deploy mode.
BackendScheduler is a pluggable interface that supports various cluster managers. Cluster managers differ in their task scheduling modes and resource-offer mechanisms, and Spark abstracts these differences behind the BackendScheduler contract.
The Spark driver is responsible for translating user code into the actual Spark jobs executed on the cluster.
The driver prepares the context and declares the operations on the data using RDD transformations and actions. It submits the serialized RDD graph to the master, where the master creates tasks out of it and submits them to the workers for execution. An executor is a distributed agent responsible for the execution of tasks.
The key point for reference:
The Spark driver coordinates the different job stages in which the tasks are actually executed, and it must have the resources and network connectivity required to execute the operations requested on the RDDs.
Lazy evaluation in Spark means a value is computed only when it is really required. For example, when Spark records a transformation, nothing is computed at that point; only when an action is applied does Spark compute the data. In short, Spark delays evaluation until it is necessary.
E.g., in Scala: lazy val lazydata = 10
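Beyond Scala's lazy val, a small Spark sketch of the same idea (values are illustrative): the transformations below only build the DAG, and nothing is computed until the action at the end.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-eval-sketch")
      .master("local[*]")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 1000000)

    // Transformations: only recorded in the DAG, not executed yet
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n.toLong * n)

    // Action: this is the point where Spark actually schedules and runs the job
    val total = squared.reduce(_ + _)
    println(s"total = $total")

    spark.stop()
  }
}
```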
This question is a regular feature in Spark interview questions for experienced, be ready to tackle it.
| Feature Criteria | Apache Spark | Hadoop |
|---|---|---|
| Speed | Up to 100 times faster than Hadoop MapReduce (in-memory) | Slower than Spark |
| Processing | Supports both real-time & batch processing | Batch processing only |
| Difficulty | Easier to learn because of high-level modules | Harder to learn |
| Recovery | Allows recovery of partitions | Fault-tolerant |
| Interactivity | Has interactive modes | No interactive mode except Pig & Hive |