
Apache Spark Tutorial

Why Apache Spark

Introduction

In this section we will look at why one should use Apache Spark over competing technologies, whether proprietary or open source. We will also look at some of its use cases and the industries where it is being used.

Apache Spark enables users to write their applications in Scala, Python, Java and R programming languages. This gives developers the flexibility to use the programming languages of their choice in writing their big data applications or data science applications.   

  • Scala: Apache Spark is written in Scala, so Spark applications written in Scala perform better than those written in Python. Also, new APIs become available in Scala first and only later in Java, Python, and R. Scala has another advantage over Java in lines of code: being a functional language, Scala needs far fewer lines than Java to implement the same functionality. It is statically typed, which can be a good thing or a bad thing depending on the use case. But Scala has the disadvantage of a steep learning curve, and far fewer developers are available in Scala than in Java or Python.  
  • Java: Java, on the other hand, has the advantage of vast community support over languages like Scala, Python, and R, even though Python is catching up fast. With Java 8, the functional differences between Scala and Java have narrowed, and Java has become more similar to Scala. 
  • Python: Python has an advantage over Scala and Java in data science and machine learning applications due to its easier syntax and readability. Python has evolved a lot over the last ten years, and many open-source libraries have been written for it. But Python's Hadoop integration is not as strong.  
  • R: R is also a very good choice for Data Science and analytics due to its simplicity.  
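Whichever language you choose, Spark exposes the same functional, chained style of transformations. The sketch below is a hypothetical stdlib-only Python illustration of that style (a toy in-memory class standing in for a Spark RDD, not actual Spark code), expressing the classic word count:

```python
class MiniRDD:
    """A toy, in-memory stand-in for a Spark RDD (illustration only)."""

    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        # Apply f to each item and flatten the results into one sequence.
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduce_by_key(self, f):
        # Combine all values that share the same key using f.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

lines = MiniRDD(["spark is fast", "spark is general"])
counts = (lines
          .flat_map(str.split)                 # split lines into words
          .map(lambda w: (w, 1))               # pair each word with a count of 1
          .reduce_by_key(lambda a, b: a + b)   # sum the counts per word
          .collect())
print(dict(counts))  # {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}
```

In real Spark the same chain (`flatMap`, `map`, `reduceByKey`, `collect`) looks nearly identical in Scala, Java, Python, or R, which is what makes the multi-language support practical.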

Apache Spark use cases

Apache Spark is currently being used in almost all industries like: 

  • Finance 
  • Healthcare 
  • Retail 
  • Travel 
  • Media 
  • Energy 
  • Gaming  

In the finance industry, a typical use case of Apache Spark is building a data warehouse for batch processing and daily reporting. Financial services companies use Spark MLlib to build and train machine learning models for transaction analysis and fraud detection. Some banks also use Spark to classify the text in money transfers and transactions.  

The healthcare industry has recently started adopting Apache Spark and its machine learning library to provide high-tech facilities to patients. Hospitals use Spark-enabled healthcare applications and tools to analyze a patient's medical history and identify possible health issues. Healthcare produces massive amounts of data, and Spark fits here again, quickly processing the data and producing insights from it. Hospitals have also started using Spark to schedule operating rooms (ORs), optimizing their usage and sparing patients unnecessary delays.  

The retail industry uses Apache Spark and MLlib for inventory updates based on sales, and for sales prediction during promotional events and sales seasons. Historical data on customer sales and purchases is also used to predict customer behavior and make suggestions that help increase sales.  

The travel and tourism industry uses Apache Spark and MLlib for customer segmentation; this data-driven approach can extract actionable insights about typical customer behavior and intent. 


How Apache Spark is set apart from other frameworks

There are several technologies available today that provide alternatives to Apache Spark, such as Hadoop MapReduce for batch processing and Apache Storm, Samza, and Flink for stream processing. But each of these technologies has drawbacks that set Apache Spark apart. Hadoop MapReduce is a well-established batch-processing technology that is widely used in the industry, but Apache Spark beats MapReduce in batch processing speed by a huge margin, on the order of 10 to 100x. Spark's stream-processing capability also makes it the first choice over MapReduce.
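A large part of that speed difference comes from where intermediate results live: MapReduce writes them back to disk between jobs, while Spark can cache them in memory and reuse them across iterations. The hypothetical stdlib Python sketch below (an analogy, not Spark or Hadoop code) counts how often an expensive data-loading step runs under each approach:

```python
load_calls = 0

def load_dataset():
    """Stand-in for an expensive read from distributed storage."""
    global load_calls
    load_calls += 1
    return [1, 2, 3, 4]

# MapReduce-style: every iteration re-reads its input from storage.
for _ in range(3):
    total = sum(x * x for x in load_dataset())
assert load_calls == 3  # the expensive load ran every iteration

# Spark-style: load once, cache in memory, iterate over the cached data
# (analogous to calling rdd.cache() before an iterative algorithm).
load_calls = 0
cached = load_dataset()
for _ in range(3):
    total = sum(x * x for x in cached)
assert load_calls == 1  # the expensive load ran only once
```

This is why the gap is widest for iterative workloads such as machine learning, where the same dataset is scanned many times.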

Apache Storm, Flink, and Samza provide stream-processing functionality similar to Spark Streaming, but each has limitations. Samza only supports Java and is tightly coupled with Apache Kafka. Flink is very new and has not been widely used in production-grade deployments. Storm offers better streaming capabilities than Apache Spark, but it does not guarantee message ordering and needs external software to take care of this; Storm also does not offer batch processing. Keeping all this in perspective, Apache Spark is the best choice, as it covers so many capabilities under one umbrella: high-speed batch processing, stream processing, machine learning (MLlib), and graph processing (GraphX). Spark's multi-language support is another big advantage over other technologies.
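One design choice behind this trade-off: Spark Streaming processes records in small micro-batches rather than one record at a time, as Storm does, trading a little latency for throughput and a simpler unified batch/stream model. The hypothetical stdlib sketch below illustrates the micro-batching idea (grouped by count here for simplicity; real Spark Streaming slices the stream by time interval):

```python
def micro_batches(stream, batch_size):
    """Group a stream of records into fixed-size micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand a full batch to the batch engine
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

events = range(7)
print(list(micro_batches(events, batch_size=3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each emitted batch can then be handed to the same engine that runs Spark's batch jobs, which is how one codebase serves both processing modes.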

Conclusion

We have seen above how Apache Spark is used across multiple industries and how it solves different problems. We also saw how it is different from other technologies like Apache Flink and Storm.
