In this section we will look at some of the advanced concepts of Apache Spark, such as the RDD (Resilient Distributed Dataset), which is the building block of Spark processing. We will also look at transformations and actions, which are the basic units of Spark processing.
We have already seen how to create a SparkContext and how to create RDDs from it. Let us now go deeper into how a Spark program is executed and what happens internally.
Spark defines an RDD interface with a set of properties that every RDD type must implement. These properties mainly describe an RDD's dependencies and data locality, which the Spark engine uses for scheduling. There are five main properties, each corresponding to a method on the RDD class: the list of partitions, the function to compute each partition, the dependencies on parent RDDs, an optional partitioner, and optional preferred locations for each partition.
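As a sketch, the five properties correspond to the following methods on Spark's `org.apache.spark.rdd.RDD` class; the signatures below follow the Spark source, but the surrounding class body is simplified here for illustration:

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of the five core properties of the RDD interface.
// The method names and signatures follow org.apache.spark.rdd.RDD;
// the rest of the class is elided.
abstract class SketchRDD[T] {
  // 1. Compute the data of a single partition.
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // 2. The set of partitions this RDD is divided into.
  protected def getPartitions: Array[Partition]
  // 3. Dependencies on parent RDDs, used to build the lineage graph.
  protected def getDependencies: Seq[Dependency[_]]
  // 4. Optional preferred locations per partition (data locality hint).
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // 5. Optional partitioner (e.g. a HashPartitioner for key-value RDDs).
  val partitioner: Option[Partitioner] = None
}
```

This block needs the Spark library on the classpath; it only illustrates the shape of the interface and is not meant to be a working RDD implementation.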
There are two types of functions defined on RDDs, called transformations and actions. Transformations are functions that return another RDD, while actions are functions that return something other than an RDD. As we already know, Spark evaluates lazily: only when an action is invoked does Spark evaluate all the transformations and finally produce the output.
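Lazy evaluation is easy to demonstrate even without a cluster. The following pure-Scala sketch (an analogy only, not Spark itself) uses a lazy `view` in place of a chain of transformations, with a terminal `sum` playing the role of an action:

```scala
// Analogy using plain Scala collections, no Spark required: a view chains
// work lazily the way transformations do, and nothing executes until a
// terminal operation -- the "action" -- forces the result.
var elementsEvaluated = 0
val doubled = (1 to 5).view.map { x => elementsEvaluated += 1; x * 2 }
assert(elementsEvaluated == 0)   // "transformation" declared, nothing ran yet
val total = doubled.sum          // "action": the whole chain finally executes
assert(elementsEvaluated == 5)
assert(total == 30)              // 2 + 4 + 6 + 8 + 10
println(s"total = $total")
```

Spark's RDDs behave the same way at a much larger scale: the lineage of transformations is only a description of work until an action forces it.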
Also, there are two types of transformations in Spark: narrow and wide transformations.
The difference between narrow and wide transformations has significant ramifications for how tasks are executed and how they perform. Narrow transformations are operations in which each output partition depends on only one (or a small, known set of) partitions of the parent RDD, so they can be evaluated without any information about the other partitions. Examples of narrow transformations are map, filter and flatMap. Wide transformations, on the other hand, cannot be executed on arbitrary rows: they need the data to be partitioned in a particular way, and therefore require a shuffle (repartitioning) of the data. Examples of wide transformations are sort, reduceByKey, groupByKey and join.
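The distinction can be observed directly in a spark-shell session (assuming the usual `sc` SparkContext; the data and partition count here are arbitrary) by inspecting an RDD's dependencies:

```scala
// Narrow vs. wide, observed through RDD.dependencies in a spark-shell session.
val nums = sc.parallelize(1 to 10, numSlices = 2)

// Narrow: each output partition depends on exactly one parent partition,
// so dependencies reports a OneToOneDependency.
val narrow = nums.map(_ * 2)
println(narrow.dependencies)

// Wide: the values for a given key may live in any parent partition,
// so a shuffle is needed and dependencies reports a ShuffleDependency.
val wide = nums.map(n => (n % 3, n)).reduceByKey(_ + _)
println(wide.dependencies)
```

Seeing `ShuffleDependency` in the output is the programmatic counterpart of the stage boundary you will see in the Web UI later in this section.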
In Spark’s execution paradigm, a Spark application does not evaluate anything to produce an output; it just builds an execution graph called a DAG (directed acyclic graph) until the driver program calls an action, at which point a Spark job is launched. Each job consists of stages, which are the steps in the transformation of the data needed to materialize the final RDD. Each stage is a set of tasks, the smallest units of execution, which run in parallel on the executors launched on the cluster. Every task in a stage executes the same code on a different piece of data, and one task cannot span more than one executor. Each executor is allocated a number of slots for running tasks and may run several tasks in parallel over its lifetime. The number of tasks in a stage is the same as the number of partitions in the output RDD of that stage.
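The stage boundary created by a shuffle can be seen in an RDD's lineage with `toDebugString`. A hedged spark-shell sketch (the input path `input.txt` is a placeholder):

```scala
// The lineage printed by toDebugString shows the stage boundary: the
// indentation level changes at the shuffle introduced by reduceByKey.
val words  = sc.textFile("input.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

println(counts.toDebugString) // two stages: the read/map side, then the shuffled reduce side
counts.count()                // the action: launches one job covering both stages
```

Each call to an action such as `count()` or `collect()` launches its own job, which is why the Web UI lists one job per action invoked by the driver.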
The diagram below shows the relationship between the jobs, stages and tasks of an application.
Let us see an example program to calculate the frequency of each word in a text file.
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> val lines = sc.textFile("/Users/home/Downloads/Spark.txt")
lines: org.apache.spark.rdd.RDD[String] = /Users/home/Downloads/Spark.txt MapPartitionsRDD at textFile at <console>:25

scala> val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:26

scala> counts.collect()
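As a possible follow-up in the same session, the most frequent words could be pulled back to the driver instead of collecting everything; this sketch assumes the `counts` RDD from above:

```scala
// Hedged follow-up: fetch only the five most frequent words.
// sortBy is itself a wide transformation (it triggers a shuffle);
// take is the action that brings the results to the driver.
val top5 = counts.sortBy(_._2, ascending = false).take(5)
top5.foreach { case (word, count) => println(s"$word: $count") }
```

Unlike `collect()`, which materializes the entire RDD on the driver, `take` limits how much data leaves the cluster.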
Looking at the Spark Web UI, we can see the job and stages created for the application.
The details of each stage and the DAG can be viewed by clicking on the Stage links as below:
In this section we looked at the lowest-level abstraction in Spark, the RDD. This will help us express our application's processing in terms of transformations and actions.