Are you ready to ace your upcoming Spark interview? Since these job interview questions are so common, hiring managers will expect you to answer them smoothly and without hesitation, so it is important to prepare effective responses to the questions employers typically ask.
The Spark API provides various key features that are very useful for real-time processing with Spark; most of these features come with well-supported libraries alongside the real-time processing capability.
Below are the key features provided by the Spark framework:
Spark Core is the heart of the Spark framework, with good support for functional programming practice in languages such as Java, Scala, and Python; however, most new releases arrive for the JVM languages first and are introduced for Python later.
APIs such as reduce, collect, and aggregate, together with stream, parallel stream, and optional constructs, can easily handle the use cases where we deal with large volumes of data.
The key points are as follows:
Spark is primarily used as a data processing framework; however, it can also be used to perform data analysis and data science.
Spark Streaming provides a micro-batch-oriented stream processing engine. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets,
and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
Below are the other key benefits that Spark Streaming supports.
Spark Streaming works on batches: it receives an input data stream and divides it into micro-batches, which the Spark engine then processes to generate the final stream of results, also in batches.
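As an illustration, here is a minimal sketch of a Spark Streaming word count over a TCP socket, written against the DStream API; the host, port, and batch interval are placeholder values for the example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // A 5-second batch interval: the input stream is cut into 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    // "localhost" and 9999 are placeholders for this sketch
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print() // each micro-batch prints its own partial result

    ssc.start()
    ssc.awaitTermination()
  }
}
```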
Spark SQL provides a programmatic abstraction in the form of DataFrames and Datasets, and can work as a distributed SQL query engine. Spark SQL simplifies interaction with large amounts of data through the DataFrame and Dataset APIs.
Spark SQL plays a vital role in optimization through the Catalyst optimizer, and it also supports UDFs, built-in functions, and aggregate functions.
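As a quick illustration, here is a minimal Spark SQL sketch; the file name people.json and the UDF name shout are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlExample").getOrCreate()

// Load a DataFrame and expose it as a SQL view ("people.json" is a placeholder path)
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

// SQL queries are planned and optimized by the Catalyst optimizer before execution
spark.sql("SELECT name, age FROM people WHERE age >= 18").show()

// Registering a UDF that can then be used inside SQL
spark.udf.register("shout", (s: String) => s.toUpperCase)
spark.sql("SELECT shout(name) FROM people").show()
```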
The driver is the central point and entry point of the Spark shell (supported in Scala, Python, and R). Below is the sequential process the driver follows to execute a Spark job.
The complete process can be tracked through the cluster manager's user interface. The driver also exposes information about the running Spark application through a web UI on port 4040.
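For example, a minimal sketch of starting an application and locating the driver's web UI; the application name and the alternate port are arbitrary choices for the example:

```scala
import org.apache.spark.sql.SparkSession

// Starting a Spark application launches the driver, which serves a web UI
// (default port 4040; spark.ui.port moves it elsewhere if 4040 is taken)
val spark = SparkSession.builder()
  .appName("DriverUiExample")
  .master("local[*]") // placeholder: any cluster manager URL works here
  .config("spark.ui.port", "4050")
  .getOrCreate()

println(spark.sparkContext.uiWebUrl) // e.g. Some(http://host:4050)

spark.stop()
```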
Executors are worker-node processes in charge of running individual tasks when a Spark job is submitted. They are launched at the beginning of a Spark application and typically run for its entire lifetime. Once they have run a task, they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs, through the Block Manager.
Below are the key points on executors:
An executor also works as a distributed agent responsible for the execution of tasks. When a job is launched, Spark starts the executors, which act as worker processes responsible for running the individual tasks assigned by the Spark driver.
Below are the steps a Spark job follows once it is submitted:
Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if a node does not crash but is simply much slower than the others, Spark can preemptively launch a "speculative" copy of the task on another node and take its result if that copy finishes first.
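Speculative execution is off by default and is controlled through configuration. Here is a sketch of the relevant settings; the values shown are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SpeculationExample")
  .set("spark.speculation", "true")           // re-launch slow tasks speculatively
  .set("spark.speculation.multiplier", "1.5") // "slow" = 1.5x the median task duration
  .set("spark.speculation.quantile", "0.75")  // start checking after 75% of tasks finish
```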
A resilient distributed dataset (RDD) is the core abstraction of the Spark framework: a fault-tolerant collection of elements that can be operated on in parallel.
Below are the key points on RDD:
We can create an RDD using the following approaches:
By referencing a dataset in external storage:
```scala
val byTextFile = sc.textFile("hdfs://...") // or an s3:// path
```
By parallelizing an existing collection:
```scala
// parallelize takes a local Scala collection, split into numSlices partitions
val byParallelizeOperation = sc.parallelize(Seq(1, 2, 3), numSlices = 2)
```
By converting a DataFrame to an RDD:
```scala
val byDF = df.rdd // every DataFrame exposes its underlying RDD[Row]
```
RDDs predominantly support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
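The split is easiest to see in code. In this sketch (which assumes an existing SparkContext sc, as in the snippets above), the map and filter calls build lineage lazily, and only the actions at the end trigger computation:

```scala
val numbers = sc.parallelize(1 to 10)

// Transformations: each returns a new RDD, but nothing is computed yet
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions: these trigger the actual computation and return values to the driver
val total = evens.reduce(_ + _)
val first = evens.take(2)
println(s"total=$total, first=${first.mkString(",")}")
```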
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. SparkContext hands a logical execution plan over to the DAGScheduler, which in turn translates it into a set of stages that are submitted as TaskSets for execution.
TaskScheduler is responsible for submitting tasks for execution in a Spark application. It tracks the executors in a Spark application using the executorHeartbeatReceived and executorLost methods, which inform it about active and lost executors, respectively. Spark comes with the following TaskSchedulers: TaskSchedulerImpl, the default TaskScheduler (which the two YARN-specific TaskSchedulers extend); YarnScheduler, for Spark on YARN in client deploy mode; and YarnClusterScheduler, for Spark on YARN in cluster deploy mode.
SchedulerBackend is a pluggable interface to support various cluster managers. Cluster managers differ in their custom task scheduling modes and resource-offer mechanisms, and Spark abstracts these differences behind the SchedulerBackend contract.
The driver is responsible for translating Spark user code into actual Spark jobs executed on the cluster.
The Spark driver prepares the context and declares the operations on the data using RDD transformations and actions. The driver submits the serialized RDD graph to the master, where the master creates tasks out of it and submits them to the workers for execution. An executor is a distributed agent responsible for the execution of tasks.
Below are the key points for reference:
The Spark driver coordinates the different job stages, while the executors actually execute the tasks. Executors must have the resources and network connectivity required to execute the operations requested on the RDDs.
Lazy evaluation in Spark means a value is computed only when it is actually required. For instance, when Spark performs a transformation, nothing is computed at that point; only when we apply an action is the data actually computed. In other words, Spark delays evaluation until it is necessary. Scala's lazy val illustrates the same idea:

```scala
lazy val lazyData = 10 // not evaluated until lazyData is first referenced
```
| Feature Criteria | Apache Spark | Hadoop |
|---|---|---|
| Speed | Up to 100 times faster than Hadoop | Slower than Spark |
| Processing | Supports both real-time and batch processing | Batch processing only |
| Difficulty | Easy to learn because of high-level modules | Tough to learn |
| Recovery | Allows recovery of partitions | Fault-tolerant |
| Interactivity | Has interactive modes | No interactive mode except Pig and Hive |
GraphX is the part of the Spark framework used for graphs and graph-parallel processing. GraphX extends the Spark RDD with a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge, designed to support graph computation.
Graphs in GraphX are immutable, distributed, and fault-tolerant. Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes.
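A minimal sketch of building and transforming a property graph; the vertex names and edge labels are invented for the example, and an existing SparkContext sc is assumed:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices carry a String property; edges carry a String label
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

// A directed multigraph with properties attached to each vertex and edge
val graph = Graph(vertices, edges)

// "Modifying" the graph produces a new, immutable graph
val upper = graph.mapVertices((_, name) => name.toUpperCase)
println(upper.vertices.collect().mkString(", "))
```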
MLlib is a scalable machine learning library that comes bundled with the Spark framework. Spark provides high-quality, well-tested algorithms for performing data science operations.
Below are the key points:
Code snippet:

```python
from pyspark.ml.clustering import KMeans

# Load data in LIBSVM format (the HDFS path is elided here)
data = spark.read.format("libsvm").load("hdfs://...")
model = KMeans(k=10).fit(data)
```
MLlib utilizes the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. If native libraries are not available at runtime, you will see a warning message and a pure JVM implementation will be used instead.
| DataFrame | Dataset |
|---|---|
| A DataFrame is structured into named columns and behaves like a table in an RDBMS. | A Dataset is a distributed collection of data that provides the benefits of both RDDs and DataFrames. |
| A DataFrame does not require schema or meta information about the record and does not enforce strict type checking. | Creating a Dataset requires schema information about the record, and strict type checking is enforced. |
| A DataFrame does not allow lambda functions. | A Dataset supports lambda functions. |
| A DataFrame does not come with its own optimization engine. | A Dataset comes with Spark SQL's optimization engine, the Catalyst optimizer. |
| A DataFrame does not support any encoding technique at runtime. | A Dataset comes with encoders, which provide the mechanism to convert JVM objects into a Dataset. |
| Incompatible with domain objects: once a DataFrame is created, we cannot regenerate the domain object. | Regeneration of the domain object is possible, because a Dataset retains the schema information of the record. |
| A DataFrame does not provide compile-time safety. | A Dataset maintains schema information; if the schema is incorrect, an exception is generated at compile time. |
| Once a DataFrame is created, we cannot perform RDD operations on it. | A Dataset allows RDD operations as well, alongside the SQL query processor. |
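The contrast is clearest side by side. Here is a small sketch; the Person case class and the sample rows are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DfVsDs").getOrCreate()
import spark.implicits._

val people = Seq(Person("alice", 29), Person("bob", 35))

// DataFrame: untyped rows; a column typo here would fail only at runtime
val df = people.toDF()
df.filter("age > 30").show()

// Dataset: typed; the lambda is checked against Person at compile time
val ds = people.toDS()
ds.filter(p => p.age > 30).show()

// A Dataset can regenerate the domain objects; a DataFrame yields generic Rows
val backToPeople: Array[Person] = ds.collect()
```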
Are you wondering how to crack the Spark interview and what the probable Spark interview questions might be? You should realize that every interview is different and the scope of the job differs in every organisation. Keeping this in mind, we have designed the most common Apache Spark interview questions and answers to help you crack your interview successfully.
We have compiled the most frequently asked Apache Spark interview questions with answers for both experienced candidates and freshers. These Spark SQL interview questions will surely help you get through your desired Spark interview.
After going through these Spark interview questions and answers, you will be able to face an interview confidently and answer your interviewer in the best manner. The Spark coding interview questions here are suggested by experts.