
Apache Spark Tutorial

Evolution of Apache Spark

Spark was developed by Matei Zaharia in 2009 as a research project at UC Berkeley's AMPLab, which focused on big data analytics. The fundamental goal behind the framework was to overcome the inefficiencies of MapReduce. Even though MapReduce was a huge success and gained wide acceptance, it could not be applied to every class of problem: it is inefficient for multi-pass applications that require low-latency data sharing across multiple parallel operations. Many data analytics applications fall into this category, including:

  • Iterative algorithms, used in machine learning and graph processing
  • Interactive business intelligence and data mining, where data from different sources is loaded into memory and queried repeatedly
  • Streaming applications that keep updating existing data and must maintain the current state based on the latest data

 
History of Spark

MapReduce does not fit such use cases because data has to be read from disk storage and written back to disk between distinct jobs.

Spark offers a much better programming abstraction called the RDD (Resilient Distributed Dataset), which can be kept in memory between queries and cached for repeated processing. RDDs are read-only collections of objects partitioned across different machines, and they are fault-tolerant: an exact copy can be recreated from scratch in case of process or node failure. Although RDDs are not a general shared-memory abstraction, they represent a sweet spot between expressivity on the one hand and scalability and reliability on the other. We will see the concepts of RDDs in detail in the following sections and understand how Spark uses them to process data at such speed.
