
Apache Spark Vs MapReduce

  • by Nitin Kumar
  • 09th May, 2019
  • Last updated on 11th Mar, 2021
  • 20 mins read

Why we need Big Data frameworks

Big data is primarily defined by the volume of a data set. Big data sets are generally huge – measuring tens of terabytes – and sometimes crossing the threshold of petabytes. It is surprising to know how much data is generated every minute. As estimated by DOMO:

Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth.

You can read DOMO's full report, including industry-specific breakdowns, here.

To store and process even a fraction of this data, we need Big Data frameworks: traditional databases cannot store data at this scale, and traditional processing systems cannot process it quickly enough. Frameworks like Apache Spark and MapReduce come to our rescue here, helping us derive deep insights from huge volumes of structured, semi-structured and unstructured data and make more sense of it.

Market Demands for Spark and MapReduce

Apache Spark was originally developed in 2009 at UC Berkeley by the team that later founded Databricks. Since its launch, Spark has seen rapid adoption and growth. Cutting-edge technology organizations like Netflix, Apple, Facebook and Uber run massive Spark clusters for data processing and analytics, and demand for Spark is increasing at a very fast pace. According to a forecast report from marketanalysis.com, the global Apache Spark market will grow at a CAGR of 67% between 2019 and 2022, with revenue reaching $4.2 billion by 2022 and a cumulative market value of $9.2 billion (2019 – 2022).

MapReduce has been around a little longer, having been developed in 2006, and it gained industry acceptance during its initial years. Over the last five years or so, with Apache Spark gaining more ground, demand for MapReduce as a processing engine has declined. Still, it cannot be said in black and white that Apache Spark will completely replace MapReduce in the coming years. Both technologies have their own pros and cons, as we will see below. One solution cannot fit every situation, so MapReduce will keep its takers depending on the problem to be solved.

Also, Spark and MapReduce do complement each other on many occasions.

Both these technologies have made inroads into all walks of everyday life. You name the industry and they are there: telecommunications, e-commerce, banking, insurance, healthcare, medicine, agriculture, biotechnology, and more.

What is Spark?

As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Instead of just "map" and "reduce" functions, Spark defines a large set of operations, called transformations and actions, for developers. The Spark execution engine ultimately compiles these operations into map/reduce-style stages, and they can be combined arbitrarily for highly optimized performance.


Spark is developed in the Scala language. It can run on Hadoop in standalone mode using its own default resource manager, or in cluster mode using the YARN or Mesos resource managers. Hadoop is not mandatory for Spark; it can also be used with S3 or Cassandra. But in the majority of cases, Hadoop is the best fit as Spark's data storage layer.

Features of Spark


Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. It achieves this by minimising disk read/write operations: intermediate results are kept in memory, and disk is touched only when essential. Under the hood, Spark relies on a DAG scheduler, a query optimizer and a highly optimized physical execution engine.

Fault Tolerance: Apache Spark achieves fault tolerance through its core abstraction, the RDD (Resilient Distributed Dataset), which is designed to handle worker node failure.

Lazy Evaluation: All processing (transformations) on Spark RDDs/Datasets is lazily evaluated, i.e. the output RDD/Dataset is not computed right away after a transformation, but only when an action is performed (see the sketch after this list).

Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.

Multiple Language Support: Spark provides multiple programming language support and you can use it interactively from the Scala, Python, R, and SQL shells.

Reusability: Spark code can be reused for batch processing, joining streaming data against historical data, as well as running ad-hoc queries on streaming state.

Machine Learning: Apache Spark comes with out-of-the-box support for machine learning in the form of MLlib, which can be used for complex, predictive data analytics.

Graph Processing: GraphX is Apache Spark's API for graphs and graph-parallel computation. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.

Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.
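To see lazy evaluation in action, here is a minimal sketch for the spark-shell (which provides the SparkContext "sc"); data.txt is a hypothetical input file:

// Transformations only: no file is read and no work is done yet
val lines = sc.textFile("data.txt")
val errors = lines.filter(line => line.contains("ERROR"))
// The action triggers the actual job: only now is the file read and filtered
val numErrors = errors.count()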

Where is Spark usually used?

Spark is used by 1000+ organizations in production, and many of them are known to run Spark clusters of 1000+ nodes. In terms of data size, Spark has been shown to work well up to petabytes. It won the 2014 Daytona GraySort Benchmark by sorting 100 TB of data 3X faster than Hadoop MapReduce using 10X fewer machines (Spark took 23 minutes on 206 machines, versus Hadoop's earlier record of 72 minutes on 2100 machines), and it has also been used to sort 1 PB. Several production workloads use Spark to do ETL and data analysis on PBs of data. Below are some examples where Spark is used across industries:


AsiaInfo: Uses Spark Core, Streaming, MLlib, GraphX and Hadoop to build cost-effective data centre solutions for customers in the telecom industry as well as other industrial sectors.

Atp: Predictive models and learning algorithms to improve the relevance of programmatic marketing.

Credit Karma: Creates personalized experiences using Spark

eBay Inc: Using Spark core for log transaction aggregation and analytics

Kelkoo: Using Spark Core, SQL, and Streaming. Product recommendations, BI and analytics, real-time malicious activity filtering, and data mining.

More examples can be found on Apache's Powered By page.

Spark Example in Scala (Spark shell can be used for this)

// "sc" is the Spark context; textFile turns the file into an RDD
val textFile = sc.textFile("data.txt")
// Return number of items (lines) in this RDD; count() is an action
textFile.count()
// Demo filtering.  Filter is a transform.  By itself this does no real work
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Demo chaining – how many lines contain “Spark”?  count() is an action.
textFile.filter(line => line.contains("Spark")).count()
// Length of line with most words.  Reduce is an action.
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
// Word count – traditional map-reduce.  collect() is an action
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()

Sample Spark Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

Sample Spark Actions

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

These descriptions are adapted from the RDD Programming Guide.
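For a small illustration of the operations above, runnable in the spark-shell (which provides "sc"):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val squares = nums.map(n => n * n)           // transformation: 1, 4, 9, 16, 25
val evens = squares.filter(n => n % 2 == 0)  // transformation: 4, 16
val combined = squares.union(evens)          // transformation: 7 elements in total
combined.count()                             // action: 7
squares.reduce((a, b) => a + b)              // action: 55
evens.collect()                              // action: Array(4, 16)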

What is MapReduce?

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Programmers have been writing parallel programs for a long time in languages like C++, Java, C# and Python. But such programs have their own nuances, and maintaining them is the programmer's responsibility: applications can crash, take performance hits, or produce incorrect results. Moreover, as such systems grow very large, they become less fault tolerant and harder to maintain.

MapReduce simplified all of this. Fault tolerance, parallel execution and resource management are the responsibility of the resource manager and the framework; programmers concentrate only on business logic, by writing just the map and reduce functions.

Brief Description of MapReduce Architecture

A MapReduce application has broadly two functions called map and reduce.

Map: The mapper process takes input as key/value pairs, processes them (i.e. performs some computation), and then produces intermediate results as key/value pairs,

i.e. map(k1,v1) ---> list(k2,v2)

Reduce: The reducer process receives an intermediate key and a set of values in sorted order. It processes these and generates output key/value pairs by grouping the values for each key,

i.e. reduce(k2, list(v2)) ---> list(v3)

An optional function, the "Combiner", can also be defined (to optimize bandwidth); a plain-Scala sketch of the idea follows this list:

- If defined, it runs after the Mapper and before the Reducer, on every node that has run a map task.
- The Combiner receives as input all data emitted by the Mapper instances on a given node.
- The Combiner's output, rather than the Mappers' output, is sent to the Reducers.
- It is a "mini-reduce" process which operates only on data generated by one machine.

How does MapReduce work?

MapReduce is usually applied to huge datasets. A MapReduce job splits the input data into smaller independent chunks called input splits, which are processed independently by map tasks; their outputs are then aggregated by reduce tasks. Below is an example.

MapReduce Word Count (Pseudocode)

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));

Map: apply a function to all the elements of a list

list1 = [1,2,3,4,5]
square x = x * x
list2 = Map square(list1)
print list2
-> [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary value

list1 = [1,2,3,4,5]
A = reduce (+) list1
print A
-> 15
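The same two ideas can be expressed directly in plain Scala collections, which is also how Spark's RDD API generalizes them:

val list1 = List(1, 2, 3, 4, 5)
val list2 = list1.map(x => x * x)   // Map: List(1, 4, 9, 16, 25)
val total = list1.reduce(_ + _)     // Reduce: 15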

Apache Spark vs MapReduce

Now that we have seen how Apache Spark and MapReduce work, we need to understand how the two technologies compare with each other and what their pros and cons are, so as to get a clear understanding of which technology fits our use case.


In a typical multi-stage job, MapReduce involves at least 4 disk operations whereas Spark involves only 2. This is one reason Spark is much faster than MapReduce. Spark also caches intermediate data that can be reused in further iterations, improving its performance further: the more iterative the process, the better Spark performs, thanks to in-memory processing and caching. MapReduce, by contrast, pays for disk read/write operations on every iteration (a minimal caching sketch follows below).
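A minimal sketch of how caching helps iterative jobs, runnable in the spark-shell (which provides "sc"); data.txt is a hypothetical input file:

val lengths = sc.textFile("data.txt").map(line => line.length).cache()
var total = 0L
for (i <- 1 to 10) {
    // the first action reads from disk and materializes the RDD in memory;
    // every later iteration reuses the cached copy instead of re-reading the file
    total += lengths.reduce((a, b) => a + b)
}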

Let's compare Spark and MapReduce on several other parameters, attribute by attribute, to understand where to use Spark and where to use MapReduce.

Speed/Performance

MapReduce: Designed for batch processing, and not as fast as Spark. It is used to gather data from multiple sources, process them once, and store the result in a distributed data store like HDFS. It is best suited where memory is limited and the data is so big that it will not fit in the available memory.

Apache Spark: 10-100 times faster because of in-memory processing and its caching mechanism, and can deliver near real-time analytics. It is used in credit card processing, fraud detection, machine learning and data analytics, IoT sensors, etc.

Cost

MapReduce: As part of Apache open source, there is no software cost. Hardware cost is lower than Spark's, as MapReduce works with less memory (RAM); even commodity hardware is sufficient.

Apache Spark: Also Apache open source, so no license cost. Hardware cost is higher than MapReduce's: even though Spark can work on commodity hardware, it needs a lot more RAM, since for optimal performance it should be able to fit all the data in memory. The cluster needs fairly high-end commodity hardware with lots of RAM, or performance suffers.

Ease of Use

MapReduce: A bit complex to write. MapReduce is written in Java, and the APIs are complex for new programmers, so there is a steep learning curve involved. Pig has SQL-like syntax, which makes it easier for SQL developers to get on board. Also, there is no interactive mode available in MapReduce.

Apache Spark: Has APIs in Scala, Java, Python and R for all basic transformations and actions, plus rich Spark SQL APIs for SQL-savvy developers that cover most SQL functions, with more added in each new release. Spark also supports user-defined functions and user-defined aggregate functions (UDF/UDAF) for anyone who needs custom functions.

Compatibility

MapReduce: Compatible with all data sources and file formats Hadoop supports. But MapReduce needs an external scheduler like YARN or Mesos to run; it has no inbuilt scheduler like Spark's default/standalone scheduler.

Apache Spark: Can run in standalone mode using its default scheduler, or on YARN or Mesos, on-premise or in the cloud. Spark supports most data formats, such as Parquet, Avro, ORC and JSON, and it supports multiple languages with APIs for Java, Scala, Python and R.

Data Processing

MapReduce: Can only be used for batch processing, where throughput matters more and latency can be compromised.

Apache Spark: Supports batch as well as stream processing, so it fits both use cases and suits Lambda designs where applications need both a speed layer and a slower batch layer.

Security

MapReduce: Has more security features. MapReduce enjoys all the Hadoop security benefits and integrates with Hadoop security projects like Knox Gateway and Sentry.

Apache Spark: A bit bare at the moment; Spark currently supports authentication via a shared secret. Spark can integrate with HDFS and use HDFS ACLs and file-level permissions. It can also run on YARN, leveraging the capability of Kerberos.

Fault Tolerance

MapReduce: Uses replication for fault tolerance. If any slave daemon fails, master daemons reschedule all pending and in-progress operations to another slave. This method is effective, but a single failure can significantly increase completion times.

Apache Spark: RDDs are the building blocks, and Spark uses RDD lineage and the DAG for fault tolerance. If an RDD partition is lost, it is automatically recomputed by replaying the original transformations.

Latency

MapReduce: High latency.

Apache Spark: Low-latency performance.

Interactive Mode

MapReduce: No interactive mode of operation.

Apache Spark: Can also be used interactively for data processing, with out-of-the-box shells for Scala, Python and R.

Machine Learning/Graph Processing

MapReduce: No built-in support; Mahout has to be used for ML.

Apache Spark: Dedicated modules (MLlib and GraphX) for ML and graph processing.

Both these technologies, MapReduce and Spark, have their pros and cons:

MapReduce is best suited for analysis of archived data, where the data size is huge, it is not going to fit in memory, and instant results and intermediate solutions are not required. MapReduce also scales very well, and the cluster can be horizontally scaled with ease using commodity machines.

Offline analytics is a good fit for MapReduce: think top products per month, or unique clicks per banner.

MapReduce is also suited for web crawling, crawling tweets at scale, and NLP workloads like sentiment analysis.

Another use case for MapReduce is de-duplicating data from social networking sites, job sites and other similar sites.

MapReduce is also heavily used in data mining, for generating a model and then classifying data with it.

Spark is fast, and so can be used for near real-time data analysis.

A lot of organizations are moving to Spark as their ETL processing layer from legacy ETL systems like Informatica. Spark has a very good, optimized SQL processing module which fits ETL requirements, as it can read from multiple sources and can also write to many kinds of data sources.

Spark can also handle streaming data, so it is best suited for Lambda designs; a minimal streaming sketch follows.
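A minimal Spark Streaming word count, using the classic DStream API in the spark-shell (which provides "sc"); the socket source on localhost:9999 is a hypothetical test input (e.g. fed by netcat):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))   // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                    // print each batch's word counts
ssc.start()
ssc.awaitTermination()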

Most graph processing algorithms, like PageRank, perform multiple iterations over the same data, and this requires a message passing mechanism. Spark has great support for graph processing through its GraphX module, as sketched below.
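A minimal GraphX PageRank sketch for the spark-shell; followers.txt is a hypothetical edge list with one "sourceId destinationId" pair per line:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers.txt")
val ranks = graph.pageRank(0.0001).vertices   // iterate until the convergence tolerance is met
ranks.take(5).foreach(println)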

Almost all machine learning algorithms work iteratively. Spark has a built-in scalable machine learning library called MLlib, which contains high-quality algorithms that leverage iteration and can yield better results than the one-pass approximations sometimes used on MapReduce. A minimal example follows.
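A minimal sketch of an iterative MLlib algorithm (the RDD-based KMeans API) in the spark-shell; kmeans_data.txt is a hypothetical file of space-separated numbers:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                  // iterative algorithm, so caching pays off
val model = KMeans.train(points, 2, 20)     // k = 2 clusters, at most 20 iterations
model.clusterCenters.foreach(println)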


Conclusion:

Hadoop MapReduce is more mature, as it has been around for a longer time, and its support in the open-source community is also better. It can be beneficial for really big data use cases where memory is limited and the data will not fit in RAM. Most of the time, a Spark use case will involve Hadoop and other tools like Hive, Pig and Impala, and when these technologies complement each other, it is a win for both Spark and MapReduce.


Nitin Kumar

Blog Author

I am an alumnus of IIT (ISM) Dhanbad. I have 15+ years of experience in the software industry, working in the Investment Banking and Financial Services domain. I have worked for Wall Street banks like Morgan Stanley and JP Morgan Chase. I have been working on Big Data technologies like Hadoop, Spark and Cloudera for 3+ years.
