
Evolution of Apache Spark


Spark was developed by Matei Zaharia in 2009 as a research project at UC Berkeley's AMPLab, which focused on Big Data analytics. The primary goal behind developing the framework was to overcome the inefficiencies of MapReduce. Even though MapReduce was a huge success and gained wide acceptance, it could not be applied to a wide range of problems: it is inefficient for multi-pass applications that require low-latency data sharing across multiple parallel operations. Many data analytics applications fall into this category, including:

  1. Iterative algorithms used in machine learning and graph processing
  2. Interactive business intelligence and data mining, where data from different sources is loaded into memory and queried repeatedly
  3. Streaming applications that keep updating existing data and must maintain the current state based on the latest data


MapReduce is a poor fit for such use cases because each job must read its input from disk and write its results back to disk, so multi-pass workloads pay the disk I/O cost again and again.

Spark offers a much better programming abstraction called the RDD (Resilient Distributed Dataset), which can be kept in memory between queries and cached for repeated use. RDDs are read-only collections of objects partitioned across different machines, and they are fault-tolerant: if a process or node fails, a lost partition can be rebuilt from scratch. Although RDDs are not a general shared-memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other. We will look at RDDs in detail in the following sections and understand how Spark uses them to process data at such speed.
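
To make the idea concrete, here is a minimal sketch in Scala (the input path and log contents are hypothetical) showing how an RDD derived from a text file can be cached in memory and queried repeatedly, instead of re-reading the file from disk on every pass:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCachingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input path; any line-oriented text file works.
    val lines = sc.textFile("data/events.log")

    // errors is a read-only, partitioned RDD derived from lines;
    // cache() asks Spark to keep its partitions in memory between queries.
    val errors = lines.filter(_.contains("ERROR")).cache()

    // The first action materializes the RDD and populates the cache.
    println(errors.count())

    // Subsequent queries reuse the in-memory partitions instead of re-reading disk.
    println(errors.filter(_.contains("timeout")).count())

    sc.stop()
  }
}
```

The first action (the initial count) materializes the cached partitions; later queries over errors then run against memory, which is exactly the multi-pass pattern MapReduce handles poorly. If a partition is lost to a failure, Spark recomputes just that partition from the original file and the filter that produced it.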
