Apache Spark vs MapReduce: A Detailed Comparison

Blog Author

Dr. Manish Kumar Jain

Published

02nd May, 2024

Why We Need Big Data Frameworks

Big data is primarily defined by the volume of a data set. Big data sets are generally huge – measuring tens of terabytes – and sometimes crossing the threshold of petabytes. It is surprising to know how much data is generated every minute. As estimated by DOMO:

Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth.

To store and process even a fraction of this amount of data, we need Big Data frameworks: traditional databases cannot store this much data, and traditional processing systems cannot process it quickly. This is where frameworks like Apache Spark and MapReduce come to our rescue, helping us gain deep insights into this huge amount of structured, unstructured, and semi-structured data and make more sense of it.

Market Demands for Spark and MapReduce

Apache Spark was originally developed in 2009 at UC Berkeley by the team who later founded Databricks. Since its launch Spark has seen rapid adoption and growth. Most cutting-edge technology organizations like Netflix, Apple, Facebook, and Uber have massive Spark clusters for data processing and analytics. The demand for Spark is increasing at a very fast pace. According to the marketanalysis.com report forecast, the global Apache Spark market will grow at a CAGR of 67% between 2019 and 2022. The global Spark market revenue is rapidly expanding and may grow to $4.2 billion by 2022, with a cumulative market valued at $9.2 billion (2019 – 2022).

MapReduce has been around a little longer, having been developed in 2006 and gaining industry acceptance during its initial years. But in the last 5 years or so, with Apache Spark gaining more ground, demand for MapReduce as a processing engine has declined. Still, it cannot be said in black and white that MapReduce will be completely replaced by Apache Spark in the coming years. Both technologies have their own pros and cons, as we will see below. One solution cannot fit everywhere, so MapReduce will have its own takers depending on the problem to be solved.

Also, Spark and MapReduce do complement each other on many occasions.

Both technologies have made inroads into almost every industry that touches everyday life: telecommunication, e-commerce, banking, insurance, healthcare, medicine, agriculture, biotechnology, and more.

Apache Spark vs MapReduce

To decide which technology fits our use case, we need to understand how Apache Spark and MapReduce compare with each other and what their pros and cons are.

A typical MapReduce job involves at least 4 disk operations, whereas the equivalent Spark job involves only 2. This is one reason Spark is much faster than MapReduce. Spark also caches intermediate data that can be reused in further iterations, which improves its performance further. The more iterative the process, the better Spark performs, thanks to in-memory processing and caching. This is where MapReduce falls behind Spark, because it performs disk read/write operations for every iteration.

Let’s compare Spark and MapReduce on several other parameters to understand where to use Spark and where to use MapReduce:

Attributes	MapReduce	Apache Spark
Speed/Performance	MapReduce is designed for batch processing and is not as fast as Spark. It is used to gather data from multiple sources, process it once, and store it in a distributed data store like HDFS. It is best suited to cases where memory is limited and the data being processed is too big to fit in the available memory.	Spark can be 10-100 times faster because of in-memory processing and its caching mechanism. It can deliver near real-time analytics and is used in credit card processing, fraud detection, machine learning, data analytics, IoT sensor processing, etc.
Cost	As it is an Apache open-source project, there is no software cost. Hardware cost is lower for MapReduce because it works with less memory (RAM) than Spark. Even commodity hardware is sufficient.	Spark is also an Apache open-source project, so there is no license cost. Hardware cost is higher than for MapReduce: although Spark can work on commodity hardware, it needs a lot more memory (RAM) because it should be able to fit the data in memory for optimal performance. The cluster needs somewhat higher-end commodity hardware with lots of RAM, or performance suffers.
Ease of Use	MapReduce is a bit complex to write. MapReduce programs are written in Java, and the APIs are a bit complex for new programmers, so there is a steep learning curve. Apache Pig offers a higher-level scripting syntax that is easier for SQL developers to pick up. Also, there is no interactive mode available in MapReduce.	Spark has APIs in Scala, Java, Python, and R for all basic transformations and actions. It also has rich Spark SQL APIs for SQL-savvy developers; these cover most SQL functions, and more are added with each new release. Spark also lets you write User Defined Functions and User Defined Aggregate Functions (UDF/UDAF) when you need custom functions.
Compatibility	MapReduce is compatible with all the data sources and file formats Hadoop supports. However, MapReduce needs an external scheduler like YARN or Mesos to run; it has no built-in scheduler like Spark’s default standalone scheduler.	Apache Spark can run in standalone mode using its default scheduler. It can also run on YARN or Mesos, on-premise or in the cloud. Spark supports most data formats, such as Parquet, Avro, ORC, and JSON. It also supports multiple languages and has APIs for Java, Scala, Python, and R.
Data Processing	MapReduce can only be used for batch processing, where throughput is more important and latency can be compromised.	Spark supports batch as well as stream processing, so it fits both use cases and can be used for a Lambda design, where applications need both a speed layer and a slower batch/data processing layer.
Security	MapReduce has more security features. It can enjoy all the Hadoop security benefits and integrate with Hadoop security projects such as Knox Gateway and Sentry.	Spark's own security is a bit bare by comparison. Spark supports authentication via a shared secret. It can integrate with HDFS and use HDFS ACLs and file-level permissions. Spark can also run on YARN and leverage Kerberos.
Fault Tolerance	MapReduce uses replication for fault tolerance. If any slave daemon fails, the master daemons reschedule all pending and in-progress operations to another slave. This method is effective, but even a single failure can significantly increase completion times.	In Spark, RDDs are the building blocks, and Spark uses RDDs and the DAG for fault tolerance. If an RDD is lost, it is automatically recomputed by reapplying the original transformations.
Latency	MapReduce has high latency.	Spark provides low-latency performance.
Interactive Mode	MapReduce does not have any interactive mode of operation.	Spark can also be used interactively for data processing. It supports interactive shells for Scala, Python, and R out of the box.
Machine Learning/Graph Processing	No built-in support for these. A separate library such as Apache Mahout has to be used for ML.	Spark has dedicated modules for ML and graph processing.

What is Spark?

As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Instead of just “map” and “reduce” functions, Spark gives developers a large set of operations called transformations and actions. These operations can be combined arbitrarily, and the Spark execution engine ultimately turns them into map- and reduce-style steps that it optimizes for performance.

Spark is written in Scala. It can run in standalone mode using its own default resource manager, or in cluster mode using the YARN or Mesos resource manager. Hadoop is not mandatory for Spark; it can also be used with S3 or Cassandra. But in the majority of cases, Hadoop is the best fit as Spark’s data storage layer.

Features of Spark

1. Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. It does this by keeping intermediate results in memory and performing disk read/write operations only when essential, with the help of its DAG, query optimizer, and a highly optimized physical execution engine.

2. Fault Tolerance: Apache Spark achieves fault tolerance using a Spark abstraction called RDD (Resilient Distributed Dataset), which is designed to handle worker node failure.

3. Lazy Evaluation: All processing (transformations) on Spark RDDs/Datasets is lazily evaluated, i.e. the output RDD/Dataset is not available right after a transformation; it is computed only when an action is performed.

4. Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.

5. Multiple Language Support: Spark provides multiple programming language support and you can use it interactively from the Scala, Python, R, and SQL shells.

6. Reusability: Spark code can be used for batch-processing, joining streaming data against historical data as well as running ad-hoc queries on the streaming state.

7. Machine Learning: Apache Spark comes with an out-of-the-box machine learning library called MLlib, which can be used for complex, predictive data analytics.

8. Graph Processing: GraphX is Apache Spark's API for graphs and graph-parallel computation. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.

9. Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.
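The lazy evaluation described in feature 3 can be illustrated without a cluster. The sketch below is only an analogy that uses a plain Scala collection view in place of an RDD; the names `LazyDemo`, `evaluated`, `squares`, and `total` are made up for illustration and are not part of Spark.

```scala
// Plain Scala analogy for lazy evaluation (no Spark dependency).
// A collection view stands in for an RDD: map only records the work to do.
object LazyDemo {
  // Counts how many elements the "transformation" has actually processed.
  var evaluated = 0

  val numbers = (1 to 5).toList

  // "Transformation": nothing runs yet because the view is lazy.
  val squares = numbers.view.map { n => evaluated += 1; n * n }

  // "Action": forces the pipeline and returns a concrete result.
  def total(): Int = squares.sum
}
```

Reading `LazyDemo.evaluated` before calling `total()` gives 0, because defining `squares` did no work. Only the call to `total()` runs the mapping function over the five elements, much as a Spark transformation does nothing until an action such as `count()` is called.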

Spark Example in Scala (Spark shell can be used for this)

// “sc” is the Spark context; textFile() reads the file into an RDD of lines
val textFile = sc.textFile("data.txt")
// Return the number of items (lines) in this RDD; count() is an action
textFile.count()
// Demo filtering. filter() is a transformation; by itself it does no real work
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Demo chaining: how many lines contain "Spark"? count() is an action
textFile.filter(line => line.contains("Spark")).count()
// Number of words in the line with the most words; reduce() is an action
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
// Word count, traditional map-reduce style; collect() is an action
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()

Sample Spark Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

Sample Spark Actions

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

These definitions are taken from the Spark RDD Programming Guide.
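The requirement that the function passed to reduce be commutative and associative is easier to see with a small experiment. The sketch below uses plain Scala collections rather than Spark; `ReduceDemo` and `parallelReduce` are hypothetical names, and splitting the data with `grouped` is only a rough stand-in for how partitions are reduced separately and then merged.

```scala
// Plain Scala sketch (no Spark dependency) of a partition-wise reduce.
object ReduceDemo {
  // Reduce each partition independently, then combine the partial results.
  // Assumes data is non-empty and partitions is at least 1.
  def parallelReduce(data: Seq[Int], partitions: Int)(f: (Int, Int) => Int): Int = {
    val size = math.max(1, math.ceil(data.size.toDouble / partitions).toInt)
    data.grouped(size).map(_.reduce(f)).reduce(f)
  }
}
```

With addition, `ReduceDemo.parallelReduce(1 to 8, 2)(_ + _)` agrees with the sequential `(1 to 8).reduce(_ + _)`. With subtraction, which is neither commutative nor associative, the partitioned result differs from the sequential one, which is why such a function is unsafe to use in a parallel reduce.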

What is MapReduce?

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Programmers have been writing parallel programs for a long time in languages like C++, Java, C#, and Python. But each has its own nuances, and handling them is the programmer's responsibility. There is a risk of application crashes, performance hits, and incorrect results. Also, when such systems grow very large, they tend to be poorly fault tolerant and difficult to maintain.

MapReduce has simplified all of this. Fault tolerance, parallel execution, and resource management are the responsibility of the resource manager and the framework. Programmers only have to concentrate on business logic by writing map and reduce functions.

Brief Description of MapReduce Architecture

A MapReduce application has broadly two functions called map and reduce.

1. Map: The mapper process takes key/value pairs as input, performs some computation on them, and produces intermediate results as key/value pairs

i.e. map(k1,v1) ---> list(k2,v2)

2. Reduce: The reducer process receives an intermediate key and the set of values for that key, with keys arriving in sorted order. It processes these values and generates the output for each key.

i.e. reduce(k2, list(v2)) ---> list(v3)

An optional function called a “Combiner” can also be defined (to optimize bandwidth):

If defined, it runs after the Mapper and before the Reducer on every node that has run a map task.

The Combiner receives as input all data emitted by the Mapper instances on a given node.

The Combiner's output is sent to the Reducers instead of the output from the Mappers.

It is a "mini-reduce" process that operates only on data generated by one machine.
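The flow of map, combiner, shuffle, and reduce can be sketched in a few lines of plain Scala. This is a single-process simulation for illustration only, not the Hadoop API: `WordCountMR`, `mapper`, `combiner`, `reducer`, and `run` are hypothetical names, and each inner sequence passed to `run` stands for the documents held by one node.

```scala
// Single-process simulation of map -> combine -> shuffle -> reduce.
object WordCountMR {
  // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document.
  def mapper(docName: String, contents: String): Seq[(String, Int)] =
    contents.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Combiner: a "mini-reduce" over the map output of a single node.
  def combiner(pairs: Seq[(String, Int)]): Seq[(String, Int)] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }.toSeq

  // reduce(k2, list(v2)) -> v3: sum the counts collected for one word.
  def reducer(word: String, counts: Seq[Int]): Int = counts.sum

  // Each inner Seq holds the (name, contents) documents on one simulated node.
  def run(nodes: Seq[Seq[(String, String)]]): Map[String, Int] = {
    val combined = nodes.flatMap { docs =>
      combiner(docs.flatMap { case (name, text) => mapper(name, text) })
    }
    // Shuffle: group the intermediate values by key before reducing.
    combined.groupBy(_._1).map { case (w, ps) => (w, reducer(w, ps.map(_._2))) }
  }
}
```

For a node holding the text "spark map spark", the combiner forwards one pair per distinct word, (spark, 2) and (map, 1), instead of three separate (word, 1) pairs, which is the bandwidth saving the Combiner is meant to provide.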

How does MapReduce work?

MapReduce is usually applied to huge datasets. A MapReduce job splits the input data into smaller independent chunks (called input splits in Hadoop) and then processes them independently using map tasks and reduce tasks. Below is an example.

MapReduce Word Count (Pseudocode)

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));

Map: Apply a function to all the elements of the list

list1 = [1,2,3,4,5];
square x = x * x
list2 = Map square(list1)
print list2
-> [1,4,9,16,25]

Reduce: Combine all the elements of the list for a summary

list1 = [1,2,3,4,5];
A = reduce (+) list1
Print A
-> 15
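For readers who want to try this in a Scala shell, the same two operations can be written with ordinary Scala collections. This is a minimal sketch; `ListDemo` is a hypothetical name that only groups the definitions together.

```scala
// The map and reduce list operations written with plain Scala collections.
object ListDemo {
  val list1 = List(1, 2, 3, 4, 5)

  def square(x: Int): Int = x * x

  // Map: apply square to every element, producing a new list.
  val list2 = list1.map(square)

  // Reduce: combine all elements into a single summary value with +.
  val a = list1.reduce(_ + _)
}
```

Here `ListDemo.list2` is the list of squares and `ListDemo.a` is the sum of the original elements, matching the two results shown above.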

Pros and Cons of MapReduce vs Spark

MapReduce is best suited to the analysis of archived data, where the data is too large to fit in memory and instant or intermediate results are not required. MapReduce also scales very well, and the cluster can be scaled horizontally with ease using commodity machines.
Offline analytics is a good fit for MapReduce, for example top products per month or unique clicks per banner.
MapReduce is also suited to web crawling, crawling tweets at scale, and NLP tasks like sentiment analysis.
Another use case for MapReduce is de-duplicating data from social networking sites, job sites, and other similar sites.
MapReduce is also heavily used in data mining for generating a model and then classifying data with it.
Spark is fast, so it can be used for near real-time data analysis.
A lot of organizations are moving from legacy ETL systems like Informatica to Spark as their ETL processing layer. Spark has a very good, optimized SQL processing module that fits ETL requirements, as it can read from multiple sources and write to many kinds of data sources.
Spark can also handle streaming data, so it is well suited to a Lambda design. Most graph processing algorithms, like PageRank, perform multiple iterations over the same data and require a message-passing mechanism. Spark has great support for graph processing through the GraphX module.
Almost all machine learning algorithms work iteratively. Spark has a built-in scalable machine learning library called MLlib, which contains high-quality algorithms that leverage iterations and can yield better results than the one-pass approximations sometimes used on MapReduce.

Where is Spark Usually Used?

Spark is used by 1000+ organizations in production. Many of these organizations are known to run Spark clusters of 1000+ nodes. In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data in 23 minutes, 3X faster than the previous Hadoop MapReduce record (which used 2100 machines) while using 10X fewer machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB. Several production workloads use Spark to do ETL and data analysis on PBs of data. Below are some examples of where Spark is used across industries:

AsiaInfo: Uses Spark Core, Streaming, MLlib, and GraphX together with Hadoop to build cost-effective data center solutions for its customers in the telecom industry as well as other industrial sectors.

Atp: Predictive models and learning algorithms to improve the relevance of programmatic marketing.

Credit Karma: Creates personalized experiences using Spark

eBay Inc: Using Spark core for log transaction aggregation and analytics

Kelkoo: Using Spark Core, SQL, and Streaming. Product recommendations, BI and analytics, real-time malicious activity filtering, and data mining.

Conclusion

Hadoop MapReduce is more mature because it has been around longer, and its support in the open-source community is also better. It can be beneficial for really big data use cases where memory is limited and the data will not fit in RAM. Most of the time, a Spark use case will also involve Hadoop and other tools like Hive, Pig, and Impala, so when these technologies complement each other it is a win for both Spark and MapReduce.

Dr. Manish Kumar Jain

International Corporate Trainer

Dr. Manish Kumar Jain is an accomplished author, international corporate trainer, and technical consultant with 20+ years of industry experience. He specializes in cutting-edge technologies such as ChatGPT, OpenAI, generative AI, prompt engineering, Industry 4.0, web 3.0, blockchain, RPA, IoT, ML, data science, big data, AI, cloud computing, Hadoop, and deep learning. With expertise in fintech, IIoT, and blockchain, he possesses in-depth knowledge of diverse sectors including finance, aerospace, retail, logistics, energy, banking, telecom, healthcare, manufacturing, education, and oil and gas. Holding a PhD in deep learning and image processing, Dr. Jain's extensive certifications and professional achievements demonstrate his commitment to delivering exceptional training and consultancy services globally while staying at the forefront of technology.
