Spark Interview Questions

With Spark interviews getting harder, it's time for you to get smarter with the latest interview-cracking skills. Our expert-authored Spark interview questions will be the best guide in preparing for Spark interviews and will help you answer questions on the key features of Spark, RDDs, the Spark engine, MLlib, GraphX, the Spark driver, the Spark ecosystem, etc., and land your dream job as a Spark Developer, Spark programmer, etc. With the help of the following Apache Spark interview questions, boost your confidence and ace your upcoming Spark interview.

  • 4.6 Rating
  • 19 Question(s)
  • 20 Mins of Read
  • 3291 Reader(s)


  • DAGScheduler:

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan into a physical plan of stages: SparkContext hands over a logical plan to the DAGScheduler, which in turn translates it into a set of stages that are submitted as TaskSets for execution.

  • TaskScheduler:

TaskScheduler is responsible for submitting tasks for execution in a Spark application. The TaskScheduler tracks the executors in a Spark application using the executorHeartbeatReceived and executorLost methods, which inform it about active and lost executors, respectively. Spark comes with the following TaskSchedulers: TaskSchedulerImpl, the default TaskScheduler (which the following two YARN-specific TaskSchedulers extend); YarnScheduler, for Spark on YARN in client deploy mode; and YarnClusterScheduler, for Spark on YARN in cluster deploy mode.

  • SchedulerBackend:

SchedulerBackend (sometimes called the backend scheduler) is a pluggable interface that supports various cluster managers. Cluster managers differ in their task-scheduling modes and resource-offer mechanisms, and Spark abstracts these differences behind the SchedulerBackend contract.

  • BlockManager:

BlockManager is a key-value store for blocks of data that runs on every node of a Spark application, i.e., the driver and the executors; it manages the storage of RDD partitions, shuffle data, and broadcast data.

The Spark driver, in turn, is responsible for the translation of Spark user code into the actual Spark jobs executed on the cluster.

The Spark driver prepares the context and declares the operations on the data using RDD transformations and actions. The driver submits the serialized RDD graph to the master, where the master creates tasks out of it and submits them to the workers for execution. An executor is a distributed agent responsible for the execution of tasks.

Below are the key points for reference:

  • The Spark driver plays a vital role; execution kicks off from the main() function.
  • The driver controls the nodes in the cluster and performs the following three operations:
  •  maintaining information about the Spark application
  •  responding to a user’s program or input
  •  analyzing, distributing, and scheduling work across the executors
  • Every Spark application has its own executor processes.
  • Executors perform all the data processing.
  • They read data from and write data to external sources.
  • They store the computation results in memory.
  • They interact with the storage systems.

The Spark driver coordinates the different job stages, while the executors actually execute the tasks. The executors must have the resources and network connectivity required to execute the operations requested on the RDDs.

  • Parquet files
  • JSON
  • Hive
  • Cassandra, MongoDB
  • Text files
  • CSV files
  • MySQL

Lazy evaluation in Spark means a value is computed only when it is really required. For instance, when Spark performs a transformation, nothing is computed at that point; only when we apply an action does it compute the data. Spark delays its evaluation until it is necessary.

E.g., Scala's own lazy evaluation: lazy val lazydata = 10
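The idea can also be sketched in plain Python (an illustration of the concept, not the Spark API; LazyPipeline is a made-up class): transformations are merely recorded, and only an action such as collect() triggers the computation.

```python
# Pure-Python sketch of Spark-style lazy evaluation (not the Spark API).
class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.steps = []  # transformations are recorded here, not executed

    def map(self, fn):
        self.steps.append(("map", fn))  # transformation: just record it
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))  # transformation: just record it
        return self

    def collect(self):
        # action: only now is the recorded work actually performed
        result = self.data
        for kind, fn in self.steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

pipeline = LazyPipeline([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has been computed yet; collect() triggers evaluation.
print(pipeline.collect())  # [20, 30, 40]
```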

| Feature Criteria | Apache Spark | Hadoop |
| --- | --- | --- |
| Speed | 100 times faster than Hadoop | Slower than Spark |
| Processing | Supports both real-time & batch processing | Batch processing only |
| Ease of learning | Easy, because of high-level modules | Tough to learn |
| Failure recovery | Allows recovery of partitions | |
| Interactivity | Has interactive modes | No interactive mode, except Pig & Hive |
  • One can execute Spark operations without using Hadoop; for instance, we can develop and run Spark code on the local system, even on the Windows platform.
  • Spark can also read and then process data from databases and NoSQL stores.
  • Spark does not have the ability to store the records, which is the reason it requires a distributed storage system.
  • Another reason is that Spark processes huge volumes of records, which are difficult to store and process on a single node or local machine; that is also one reason Hadoop needs to be integrated when high-volume data is involved.

GraphX is the part of the Spark framework used for graphs and graph-based parallel processing. GraphX extends the Spark RDD by introducing a new Graph abstraction, a directed multigraph with properties attached to each vertex and edge, to support graph computation.

  • An extension of the Spark RDD to perform computation on graph data.
  • Follows a directed-multigraph data structure.
  • Supports operators such as joinVertices and mapReduceTriplets.
  • Supports both supervised and unsupervised algorithms.
  • GraphX optimizes the representation of vertex and edge types when they are primitive data types (e.g., Int, Double), reducing the in-memory footprint by storing them in specialized arrays.
  • GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages).
  • The package to import is: import org.apache.spark.graphx._
  • val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
  • val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
  • val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

Graphs in Spark are immutable, distributed, and fault-tolerant. Changes to the values or structure of a graph are accomplished by producing a new graph with the desired changes.
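The property-graph model can be sketched in plain Python (an illustration only, not the GraphX API): each vertex and edge carries attributes, and aggregating values along edges toward their destination vertices resembles what the aggregateMessages operator does.

```python
# Pure-Python sketch of a directed property graph (not the GraphX API).
vertices = {1: ("Alice", 28), 2: ("Bob", 27), 3: ("Carol", 65)}
edges = [(1, 2, 7), (2, 3, 4), (1, 3, 2)]  # (src, dst, edge property)

# Send each edge's property to its destination vertex and sum per vertex,
# roughly what GraphX's aggregateMessages operator does.
in_weight = {v: 0 for v in vertices}
for src, dst, weight in edges:
    in_weight[dst] += weight

print(in_weight)  # {1: 0, 2: 7, 3: 6}
```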

MLlib is a scalable machine learning library that comes bundled with the Spark framework. Spark provides high-quality, well-tested algorithms for performing data science operations.

Below are the key points:

  • MLlib is a scalable machine learning library that provides tested and production-ready machine learning algorithms.
  • This library is used by data scientists for performing data analytics.
  • Supports both supervised and unsupervised machine learning algorithms.
  • The package to import is: import org.apache.spark.mllib._
  • Code snippet (Python):
                                  data ="libsvm").load("hdfs://...")
                                  model = KMeans(k=10).fit(data)

MLlib utilizes the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. If native libraries are not available at runtime, you will see a warning message and a pure JVM implementation will be used instead.

  • Logistic regression, naive Bayes: used for classification.
  • Generalized linear regression, survival regression: regression techniques.
  • Decision trees, random forests, and gradient-boosted trees.
  • Alternating least squares (ALS): for recommendation.
  • K-means, Gaussian mixtures (GMMs): for clustering.
  • Latent Dirichlet allocation (LDA): for topic modeling.
  • Frequent pattern mining: frequent item sets, association rules, sequential pattern mining.
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection.
  • Pipelines: tools for constructing, evaluating, and tuning ML pipelines.
  • Persistence: saving and loading algorithms, models, and pipelines.
  • Utilities: linear algebra, statistics, data handling, etc.
| DataFrame | Dataset |
| --- | --- |
| A DataFrame is structured into named columns and provides the same behaviour as a table in an RDBMS. | A Dataset is a distributed collection of data that provides the benefits of both RDDs and DataFrames. |
| A DataFrame does not require schema or meta information about the data and does not perform strict type checking. | To create a Dataset, we need to provide schema information about the records, and it follows strict type checking. |
| A DataFrame does not allow lambda functions. | A Dataset supports lambda functions. |
| A DataFrame does not come with its own optimizer engine. | A Dataset comes with the Spark SQL optimizer engine, called the Catalyst optimizer. |
| A DataFrame does not support any encoding technique at runtime. | A Dataset comes with encoders, which provide the technique to convert JVM objects into the Dataset representation. |
| Incompatible with domain objects: once a DataFrame is created, we cannot regenerate the domain object. | Regenerating the domain object is possible, because a Dataset needs the schema information before it is created. |
| A DataFrame does not support compile-time safety. | A Dataset maintains schema information; if the schema is incorrect, an exception is generated at compile time. |
| Once a DataFrame is created, we cannot perform RDD operations on it. | A Dataset lets us use RDD operations as well, along with the SQL query processor. |


The Spark API provides various key components that are very useful for real-time processing; most of these components have a well-supported library along with real-time processing capability.

Below are the key components provided by the Spark framework:

  • Spark Core
  • Spark Streaming
  • Spark SQL
  • GraphX
  • MLlib

Spark Core is the heart of the Spark framework, with well-supported functional programming capability for languages like Java, Scala, and Python; however, most new releases come for the JVM languages first and are introduced for Python later.

APIs such as reduce, collect, and the aggregation operations, together with streams, parallel streams, and optionals, can easily handle all the use cases where we are dealing with high volumes of data.

Bullet points are as follows:

  • Spark Core is the distributed execution engine for large-scale parallel and distributed data processing.
  • Spark Core provides real-time processing for large data sets.
  • It handles memory management and fault recovery.
  • It handles scheduling, distributing, and monitoring jobs on a cluster.
  • Spark Core comes with map, flatMap, reduce, reduceByKey, and groupByKey, which handle key-value-pair-based data processing for large data sets.
  • Spark Core also supports aggregation operations.
  • Spark Core supports Java, Scala, and Python.
  • Code snippet (where textReader is an RDD of lines, e.g., from sc.textFile): val counts = textReader.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
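The flatMap/map/reduceByKey chain above can be sketched in plain Python to show what each step produces (an illustration of the idea, not the Spark API; the sample lines are made up):

```python
# Pure-Python sketch of the flatMap / map / reduceByKey word count.
lines = ["spark,hadoop,spark", "hadoop,hive"]

# flatMap: split each line into words and flatten into one list
words = [w for line in lines for w in line.split(",")]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts per key
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'spark': 2, 'hadoop': 2, 'hive': 1}
```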

Spark is primarily used as a data processing framework; however, we can also use it to perform data analysis and data science.

Spark Streaming provides a micro-batch-oriented stream processing engine. Spark allows data to be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.

Below are the other key benefits that Spark Streaming provides:

  • Spark Streaming is one of the features of Spark used to process real-time data efficiently.
  • Spark Streaming is often implemented with the Kafka and ZooKeeper messaging APIs; Kafka is a fault-tolerant messaging system that can form a messaging cluster.
  • It provides high-throughput and fault-tolerant stream processing.
  • It provides the DStream data structure, which is basically a stream of RDDs, to process the real-time data.
  • Spark Streaming fits scenarios where the interaction is of a Kafka-to-database or Kafka-to-data-science-model type.

Spark Streaming works on batches: it receives an input data stream and divides it into micro-batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.
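The division of a stream into micro-batches can be sketched in plain Python (an illustration of the idea, not the Spark Streaming API; micro_batches is a made-up helper):

```python
# Pure-Python sketch of micro-batching: an unbounded stream is cut into
# small batches, and each batch is processed independently.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Each batch could be handed to a batch engine, just as Spark Streaming
# hands micro-batches to the Spark core engine for processing.
results = [sum(b) for b in micro_batches(range(1, 8), batch_size=3)]
print(results)  # [6, 15, 7]
```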

The diagram below illustrates the workflow of Spark Streaming.

Spark SQL provides a programmatic abstraction in the form of DataFrames and Datasets, which work on the principle of a distributed SQL query engine. Spark SQL simplifies interaction with large amounts of data through the DataFrame and Dataset abstractions.

  • Spark SQL provides relational processing along with Spark's functional programming.
  • Supports querying data using SQL and the Hive Query Language.
  • Comprises the Data Source API, the DataFrame API, the Interpreter & Optimizer, and the SQL Service.
  • Spark SQL also provides the newer Dataset API, which has the capabilities of both the DataFrame and the RDD core.
  • Spark SQL is highly optimized for SQL query-based operations on flat files and JSON.
  • Spark SQL supports a variety of languages: Java, Scala, Python, and R.
  • Code snippet: val sqlContext = new SQLContext(sc: SparkContext)
  • A DataFrame can be created using the below approaches:
    •  structured data files
    •  tables in Hive
    •  external databases
    •  an existing RDD

Spark SQL plays a vital role in optimization through the Catalyst optimizer; Spark SQL also supports UDFs, built-in functions, and aggregate functions.

  •  Spark follows a master/slave architecture:
    •  Master daemon (master/driver process)
    •  Worker daemon (slave process)
  • A Spark cluster has a single master.
  • Any number of slaves/workers run on commodity servers.
  • When we submit a Spark job, it triggers the Spark driver.
  • Through the SparkContext, the driver supports:
  • Getting the current status of the Spark application
  • Cancelling a job
  • Cancelling a stage
  • Running a job synchronously
  • Running a job asynchronously
  • Accessing a persistent RDD
  • Un-persisting an RDD
  • Programmable dynamic allocation

The master driver is the central point and the entry point of the Spark shell, which supports the languages Scala, Python, and R. Below is the sequential process the driver follows to execute a Spark job:

  • The driver runs the main() function of the application, which creates the SparkContext.
  • The driver program, which runs on the master node of the Spark cluster, schedules the job execution.
  • It translates the RDDs into an execution graph and splits the graph into multiple stages.
  • The driver stores metadata about all the Resilient Distributed Datasets and their partitions.
  • The driver program converts a user application into smaller execution units known as tasks, which are grouped into stages.
  • Tasks are then executed by the executors, i.e., the worker processes that run individual tasks.

The complete process can be tracked through the cluster manager's user interface. The driver also exposes information about the running Spark application through a Web UI at port 4040.

Executors are worker-node processes in charge of running individual tasks when a Spark job is submitted. They are launched at the beginning of a Spark application and typically run for the entire lifetime of the application. Once they have run their tasks, they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs, through the Block Manager.

Below are the key points on executors:

  • Every Spark application has its own executor processes.
  • Executors perform all the data processing.
  • They read data from and write data to external sources.
  • Executors store computation results in memory, in cache, or on hard disk drives.

An executor also works as a distributed agent responsible for the execution of tasks. When a job is launched, Spark triggers the executors, which act as worker-node processes responsible for running the individual tasks assigned by the Spark driver.

Below are the steps a Spark job follows once it is submitted:

  • A standalone application starts and instantiates a SparkContext instance; it is only then that you can call the application a driver.
  • The driver program asks the cluster manager for resources to launch executors.
  • The cluster manager launches executors.
  • The driver process runs through the user application.
  • Depending on the actions and transformations over the RDDs, tasks are sent to the executors.
  • Executors run the tasks and save the results.
  • If any worker crashes, its tasks will be sent to different executors to be processed again.
  • The driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG).

Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than the other nodes, Spark can preemptively launch a "speculative" copy of the task on another node and take its result if that copy finishes first.
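The re-execution idea can be sketched in plain Python (a toy illustration, not Spark's scheduler; run_with_retry and the node functions are made-up names):

```python
# Pure-Python sketch of re-executing a failed task on another node.
def run_with_retry(task, nodes):
    """Try the task on each node in turn; fall back to the next on failure."""
    for node in nodes:
        try:
            return node(task)
        except RuntimeError:
            continue  # this node failed; reschedule the task on the next one
    raise RuntimeError("task failed on all nodes")

def healthy_node(task):
    return task * 2  # stand-in for actually processing a partition

def crashed_node(task):
    raise RuntimeError("node crashed")

# The task first lands on a crashed node and is transparently re-run elsewhere.
print(run_with_retry(21, [crashed_node, healthy_node]))  # 42
```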

A resilient distributed dataset (RDD) is the core of the Spark framework: a fault-tolerant collection of elements that can be operated on in parallel.

Below are the key points on RDD:

  • An RDD is an immutable distributed collection of objects.
  • RDDs work on the in-memory computation paradigm.
  • An RDD is divided into logical partitions, which are computed on different worker nodes.
  • An RDD stores the state of memory as an object across jobs, and the object is sharable between those jobs.
  • Data sharing using RDDs is faster than I/O and disk because it uses in-memory computation.
  • The name describes how an RDD works:
    • Resilient: fault-tolerant; with the help of RDDs, Spark is able to recover or recompute missing or damaged partitions after node failures.
    • Distributed: the data resides on multiple nodes in a cluster.
    • Dataset: a collection of partitioned data with primitive values or values of values, e.g., tuples or other objects.

We can create an RDD using the below approaches:

By referencing a dataset:

  • val byTextFile = sc.textFile("hdfs://...")   // or an s3:// path

By parallelizing a local collection:

  • val byParallelize = sc.parallelize(localSeq, numSlices)   // localSeq: a local Seq, numSlices: Int

By converting a DataFrame to an RDD:

  • val byDF = df.rdd

RDDs predominately support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

  • In-memory: the ability to perform operations in primary memory, not on disk.
  • Immutable or read-only: emphasizes creating immutable data sets.
  • Lazily evaluated: Spark computes the records only when an action is performed, not at the transformation level.
  • Cacheable: we can cache the records for faster processing.
  • Parallel: Spark has the ability to parallelize operations on the data stored in an RDD.
  • Partitioned: Spark has the ability to partition the records; by default, the partition size follows the HDFS block size of 128 MB.
  • Created by parallelizing an existing collection in your driver program, or
  • by referencing a dataset in an external storage system, such as a shared file system, HDFS, or HBase.
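The partitioning behaviour can be sketched in plain Python (an illustration of hash partitioning, not Spark's partitioner API; partition is a made-up helper): records with the same key always hash to the same partition, which is what lets operations like reduceByKey aggregate each key on a single node.

```python
# Pure-Python sketch of hash-partitioning key-value records across workers.
def partition(records, num_partitions):
    """Place each (key, value) record into a partition chosen by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

records = [(1, "a"), (2, "b"), (1, "c")]
parts = partition(records, 2)
# Both records with key 1 land in the same partition.
print(parts)  # [[(2, 'b')], [(1, 'a'), (1, 'c')]]
```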


Other than core data processing, there are libraries for SQL, graph computation, machine learning, and stream processing. The programming languages that Spark supports are Python, Java, R, and Scala. Data scientists and application developers incorporate Spark in their applications to query, analyse, and transform data at scale. Tasks that are most frequently associated with Spark include SQL batch jobs, machine learning tasks, etc.

Professionals can opt for a career as a Spark Developer, Big Data Developer, Big Data Engineer, and related profiles. According to published salary data, the average salary for a "big data spark developer" ranges from approximately $105,767 per year for a Data Warehouse Engineer to $133,184 per year for a Data Engineer.

There are many companies that use Apache Spark. According to iDatalabs, most of the companies using Apache Spark are found in the United States, particularly in the Computer Software industry. Mostly, these companies have 50-200 employees, with revenue of 1M-10M dollars. Hortonworks Inc, DataStax, Inc., and Databricks Inc are some of the top industry majors.

Are you wondering how to crack the Spark Interview and what could be the probable Spark Interview Questions asked? Then you should realize that every interview is different and the scope of jobs differ in every organisation. Keeping this in mind, we have designed the most common Apache Spark Interview Questions and Answers to help you crack your interview successfully.  

We have compiled the most frequently asked Apache Spark Interview Questions with Answers for both experienced as well as freshers. These Spark SQL interview questions will surely help you to get through your desired Spark Interview.

After going through these Spark interview questions and answers you will be able to confidently face an interview and will be prepared to answer your interviewer in the best manner. Spark coding interview questions here are suggested by the experts.

Prepare well and in time! All the best!

