In this section we take a detailed look at RDDs, one of the most important features of Spark. A clear understanding of this topic is essential for writing a good Spark application. We will also see how RDDs overcome the drawbacks of MapReduce processing.
Resilient Distributed Datasets
As we have already seen, RDDs are immutable, partitioned, distributed datasets that Spark uses for data processing. They are also fault-tolerant: if a failure occurs in the network or on a cluster node, the lost data can be recreated at any stage of processing. An RDD can be created in three ways: by parallelizing an existing collection in the driver program, by reading a dataset from external storage such as HDFS, HBase, Cassandra or a database, or by transforming an existing RDD. Spark uses RDDs to achieve the same or similar processing results as MapReduce, but much faster, with speedups usually quoted as being of the order of 10 to 100x. Let us see how.
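To make the ideas of immutability, partitioning and recreating lost data concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark code: the class `ToyRDD` and its methods `parallelize`, `map`, `partition`, `collect` and `lose_partition` are hypothetical names chosen for illustration. It assumes that a derived dataset remembers only its parent and the function applied to it, so a lost partition can be rebuilt from that record. Computed partitions are kept in a dictionary purely to simulate a node holding data.

```python
class ToyRDD:
    """Hypothetical toy stand-in for an RDD (NOT the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None, num_partitions=0):
        self._source = source          # tuple of tuples; set only on a root dataset
        self._parent = parent          # dataset this one was derived from
        self._fn = fn                  # per-element function applied to the parent
        self.num_partitions = num_partitions
        self._materialized = {}        # partition index -> computed tuple
        self.computations = 0          # how many partitions were (re)computed

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split a local collection into immutable partitions (round-robin)."""
        items = list(data)
        parts = tuple(tuple(items[i::num_partitions]) for i in range(num_partitions))
        return cls(source=parts, num_partitions=num_partitions)

    def map(self, fn):
        """Return a NEW dataset; this one is left unchanged."""
        return ToyRDD(parent=self, fn=fn, num_partitions=self.num_partitions)

    def _compute(self, index):
        if self._source is not None:
            return self._source[index]
        # Rebuild this partition from the matching parent partition.
        return tuple(self._fn(x) for x in self._parent.partition(index))

    def partition(self, index):
        """Return one partition, computing it only if it is not held already."""
        if index not in self._materialized:
            self._materialized[index] = self._compute(index)
            self.computations += 1
        return self._materialized[index]

    def collect(self):
        """Gather all partitions into one local list."""
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, index):
        """Simulate a node failure that drops one computed partition."""
        self._materialized.pop(index, None)


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)      # pretend the node holding partition 1 failed
after = squares.collect()      # only partition 1 is rebuilt
```

In this sketch `after` equals `before`, and only the lost partition is recomputed rather than the whole dataset, which is the property that makes the toy dataset "resilient".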
Data Sharing is Slow in MapReduce
MapReduce has established itself as one of the leading technologies for processing and generating huge datasets in parallel using distributed algorithms in distributed environments. It lets users and developers run parallel computations through high-level APIs, without having to deal with the intricacies of work distribution, concurrency and fault tolerance.
When data has to be reused between computation stages in MapReduce, the only way to do so is to write the output of the first stage to physical storage, e.g. HDFS. Although the MapReduce framework gives users many abstractions for using a cluster’s compute power, we developers are a never-satisfied lot and still want more out of it. Both iterative and interactive applications need very fast data sharing across parallel jobs. Data sharing in MapReduce is slow, however, because of the disk I/O, serialization, deserialization and replication involved in writing intermediate results to stable storage. It is often reported that Hadoop applications spend almost 90% of their time reading from and writing to storage systems.
Iterative Operations on MapReduce
Iterative operations reuse the intermediate results of one or more steps across multiple later stages within the application. The diagram below shows how MapReduce works in iterative applications, and how the overheads of replication, disk I/O, serialization and deserialization affect the performance of the whole application.
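The cost pattern described above can be sketched in plain Python. This is a rough illustration under stated assumptions, not Hadoop code: the functions `run_stage_via_disk` and `iterate_via_disk` are hypothetical, local temporary files stand in for stable storage such as HDFS, `pickle` stands in for serialization, and replication is not modeled at all.

```python
import os
import pickle
import tempfile


def run_stage_via_disk(stage_fn, in_path, out_path, stats):
    """Run one stage: read input from disk, compute, write output to disk."""
    with open(in_path, "rb") as f:
        data = pickle.load(f)           # disk read + deserialization
    stats["disk_reads"] += 1
    result = stage_fn(data)
    with open(out_path, "wb") as f:
        pickle.dump(result, f)          # serialization + disk write
    stats["disk_writes"] += 1


def iterate_via_disk(initial, stage_fn, iterations):
    """Chain stages so that every intermediate result goes through storage."""
    stats = {"disk_reads": 0, "disk_writes": 0}
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "stage_0.pkl")
        with open(path, "wb") as f:
            pickle.dump(initial, f)     # the input dataset itself lives on disk
        stats["disk_writes"] += 1
        for i in range(1, iterations + 1):
            next_path = os.path.join(workdir, f"stage_{i}.pkl")
            run_stage_via_disk(stage_fn, path, next_path, stats)
            path = next_path
        with open(path, "rb") as f:
            final = pickle.load(f)      # reading the final answer back
        stats["disk_reads"] += 1
    return final, stats


final, stats = iterate_via_disk([1, 2, 3], lambda xs: [x * 2 for x in xs], iterations=3)
```

In this toy run every one of the three iterations pays for one read and one write, so the number of storage round trips grows linearly with the number of iterations even though the computation itself is trivial.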
Interactive Operations on MapReduce
In interactive operations, the user runs ad-hoc queries on the same subset of data, and each time the query goes to disk and performs I/O to fetch the data before returning the result to the user. This increases the query response time and hampers the user experience. The diagram below shows how this works in MapReduce.
Data Sharing using Spark RDD
We saw why data sharing between intermediate steps is slow in MapReduce, for iterative and interactive applications alike. The cause is the disk I/O, serialization and replication that are built into Hadoop MapReduce and core to how it works. Spark was developed to overcome this slowness. It is built on the concept of Resilient Distributed Datasets, which are essentially in-memory objects that are partitioned and distributed across a cluster and are also fault-tolerant. The outputs of intermediate stages therefore do not need to be written to stable storage and can be read from memory, which avoids the biggest bottleneck of MapReduce systems. This in-memory sharing is what can make Spark 10 to 100 times faster than MapReduce.
Let us now see how iterative and interactive operations work with Spark’s Resilient Distributed Datasets.
Iterative Operations on Spark RDD
The diagram below shows how Spark’s RDDs work in iterative applications. Intermediate results are written to memory instead of stable disk storage, and subsequent steps read the same in-memory RDD objects. Results are spilled to disk only when memory (RAM) is insufficient to hold the whole RDD. Overall, this optimization makes the system much faster than the equivalent MapReduce application.
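The behaviour described above, keeping intermediate results in memory and touching disk only when they do not fit, can be sketched in plain Python. This is a toy model, not Spark's actual memory manager: the function `iterate_in_memory` and its parameter `max_items_in_memory` are hypothetical, and the number of list elements is used as a crude stand-in for memory size.

```python
import os
import pickle
import tempfile


def iterate_in_memory(initial, stage_fn, iterations, max_items_in_memory):
    """Chain stages in memory; spill a result to disk only if it is too large."""
    stats = {"kept_in_memory": 0, "spilled_to_disk": 0}
    current = initial
    with tempfile.TemporaryDirectory() as workdir:
        for i in range(iterations):
            current = stage_fn(current)
            if len(current) <= max_items_in_memory:
                # The result fits: the next stage reads the same in-memory object.
                stats["kept_in_memory"] += 1
            else:
                # The result does not fit: write it out and read it back.
                path = os.path.join(workdir, f"spill_{i}.pkl")
                with open(path, "wb") as f:
                    pickle.dump(current, f)
                with open(path, "rb") as f:
                    current = pickle.load(f)
                stats["spilled_to_disk"] += 1
    return current, stats


# Each stage doubles the dataset, so it eventually outgrows the assumed budget.
final, stats = iterate_in_memory(
    [1, 2], lambda xs: xs + xs, iterations=3, max_items_in_memory=8
)
```

Here the first two results (4 and 8 items) stay in memory and only the third (16 items) goes through disk, so the disk cost is paid only in the exceptional case rather than on every iteration.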
Interactive Operations on Spark RDD
The diagram below shows how interactive systems can benefit from Spark RDD processing. If different queries are run on the same set or subset of data, that data can be kept in memory for a faster response time. By default, a transformed RDD may be recomputed each time an action is run on it, but this can be avoided with Spark’s caching mechanism, which keeps the computed RDD in memory, distributed across the machines of the cluster.
In this section we saw how RDDs make Apache Spark a fast, fault-tolerant, distributed processing engine.
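The difference between recomputing on every action and caching can be sketched with a small pure-Python model. It is not Spark code: the class `ToyLazyDataset`, its methods and the helper `expensive_load` are hypothetical, and the model runs on a single machine. In Spark itself the corresponding calls on an RDD are `cache()` and `persist()`.

```python
class ToyLazyDataset:
    """Hypothetical toy model of a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function producing the data
        self._cache_enabled = False
        self._cached = None
        self.compute_calls = 0         # how often the data had to be computed

    def cache(self):
        """Ask for the data to be kept after it is first computed."""
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached        # served from memory, no recomputation
        self.compute_calls += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data
        return data

    # Three "actions", each of which needs the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())

    def maximum(self):
        return max(self._materialize())


def expensive_load():
    # Stand-in for reading and transforming a large dataset.
    return [x * x for x in range(1, 6)]


uncached = ToyLazyDataset(expensive_load)
uncached_results = (uncached.count(), uncached.total(), uncached.maximum())

cached = ToyLazyDataset(expensive_load).cache()
cached_results = (cached.count(), cached.total(), cached.maximum())
```

Both datasets return the same answers, but the uncached one runs `expensive_load` once per query (three times) while the cached one runs it only once, which is why caching pays off for repeated interactive queries on the same data.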