Spark was developed by Matei Zaharia in 2009 as a research project at UC Berkeley's AMPLab, which focused on big data analytics. The fundamental goal behind developing the framework was to overcome the inefficiencies of MapReduce. Even though MapReduce was a huge success and gained wide acceptance, it could not be applied to a wide range of problems: it is not efficient for multi-pass applications that require low-latency data sharing across multiple parallel operations. Many data analytics applications fall into this category, including iterative machine learning and graph algorithms and interactive ad-hoc querying.
MapReduce does not fit such use cases because each job has to read its input from disk and write its results back to disk, so a multi-pass workflow must be expressed as a chain of distinct jobs.
Spark offers a much better programming abstraction called the RDD (Resilient Distributed Dataset), which can be kept in memory between queries and cached for repeated computations. RDDs are read-only collections of objects partitioned across different machines, and they are fault-tolerant: an exact copy can be recomputed from scratch in case of a process or node failure. Although RDDs are not a general shared-memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other. We will see the concepts of RDDs in detail in the following sections and understand how Spark uses them to process data at such high speed.
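To make the two ideas above concrete before we dive into the real API, here is a toy sketch in plain Python (not Spark itself, and not Spark's actual implementation) of what "read-only data plus recomputable lineage plus optional caching" means. The `ToyRDD` class and its methods are hypothetical names invented for illustration only.

```python
class ToyRDD:
    """Hypothetical, minimal stand-in for an RDD: immutable data
    plus the lineage (a recipe function) needed to rebuild it."""

    def __init__(self, lineage):
        self._lineage = lineage   # how to recompute the data from scratch
        self._cache = None        # materialized only after .cache()

    def cache(self):
        # Materialize the dataset in memory so repeated queries reuse it.
        self._cache = list(self._lineage())
        return self

    def collect(self):
        # Serve from the in-memory copy if present; otherwise recompute
        # via lineage, which is also how lost data would be recovered.
        if self._cache is not None:
            return self._cache
        return list(self._lineage())

    def map(self, f):
        # Transformations return a NEW read-only dataset; the parent
        # is never mutated, which is what makes recomputation safe.
        return ToyRDD(lambda: (f(x) for x in self.collect()))


base = ToyRDD(lambda: range(5))
squares = base.map(lambda x: x * x).cache()
print(squares.collect())  # [0, 1, 4, 9, 16]
```

Even this toy version shows why lineage is cheap insurance: nothing is copied eagerly, and a "failure" costs only a re-run of the recipe, while `cache()` trades memory for speed on repeated access, exactly the trade-off Spark exploits.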