Apache Spark was developed by a team at UC Berkeley in 2009. Since then, Apache Spark has seen a very high adoption rate from top-notch technology companies like Google, Facebook, Apple, Netflix etc. The demand has been ever increasing day by day. According to marketanalysis.com survey, the Apache Spark market worldwide will grow at a CAGR of 67% between 2019 and 2022. The Spark market revenue is zooming fast and may grow up $4.2 billion by 2022, with a cumulative market valued at $9.2 billion (2019 - 2022).
As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”.
Spark is a cluster computing framework, somewhat similar to MapReduce but has a lot more capabilities, features, speed and provides APIs for developers in many languages like Scala, Python, Java and R. It is also friendly for database developers as it provides Spark SQL which supports most of the ANSI SQL functionality. Spark also has out of the box support for Machine learning and Graph processing using components called MLlib and GraphX respectively. Spark also has support for streaming data using Spark Streaming.
Spark is developed in Scala programming language. Though the majority of use cases of Spark uses HDFS as the underlying data file storage layer, it is not mandatory to use HDFS. It does work with a variety of other Data sources like Cassandra, MySQL, AWS S3 etc. Apache Spark also comes with its default resource manager which might be good enough for the development environment and small size cluster, but it also integrates very well with YARN and Mesos. Most of the production-grade and large clusters use YARN and Mesos as the resource manager.
Features of Spark
- Speed: According to Apache, Spark can run applications on Hadoop cluster up to 100 times faster in memory and up to 10 times faster on disk. Spark is able to achieve such a speed by overcoming the drawback of MapReduce which always writes to disk for all intermediate results. Spark does not need to write intermediate results to disk and can work in memory using DAG, lazy evaluation, RDDs and caching. Spark has a highly optimized execution engine which makes it so fast.
- Fault Tolerance: Spark’s optimized execution engine not only makes it fast but is also fault tolerant. It achieves this using abstraction layer called RDD (Resilient Distributed Datasets) in combination with DAG, which is built to handle failures of tasks or even node failures.
- Lazy Evaluation: Spark works on lazy evaluation technique. This means that the processing(transformations) on Spark RDD/Datasets are evaluated in a lazy manner, i.e. the output RDDs/datasets are not available after transformation will be available only when needed i.e. when any action is performed. The transformations are just part of the DAG which gets executed when action is called.
- Multiple Language Support: Spark provides support for multiple programming languages like Scala, Java, Python, R and also Spark SQL which is very similar to SQL.
- Reusability: Spark code once written for batch processing jobs can also be utilized for writing processed on Stream processing and it can be used to join historical batch data and stream data on the fly.
- Machine Learning: MLlib is a Machine Learning library of Spark. which is available out of the box for creating ML pipelines for data analysis and predictive analytics also
- Graph Processing: Apache Spark also has Graph processing logic. Using GraphX APIs which is again provided out of the box one can write graph processing and do graph-parallel computation.
- Stream Processing and Structured Streaming: Spark can be used for batch processing and also has the capability to cater to stream processing use case with micro batches. Spark Streaming comes with Spark and one does not need to use any other streaming tools or APIs. Spark streaming also supports Structure Streaming. Spark streaming also has in-built connectors for Apache Kafka which comes very handy while developing Streaming applications.
- Spark SQL: Spark has an amazing SQL support and has an in-built SQL optimizer. Spark SQL features are used heavily in warehouses to build ETL pipelines.
Spark is being used in more than 1000 organizations who have built huge clusters for batch processing, stream processing, building warehouses, building data analytics engine and also predictive analytics platforms using many of the above features of Spark. Let’s look at some of the use cases in a few of these organizations.
What are the Different Apache Spark Applications?
Streaming is basically unstructured data produced by different types of data sources. The data sources could be anything like log files generated while customers using mobile apps or web applications, social media contents like tweets, facebook posts, telemetry from connected devices or instrumentation in data centres. The streaming data is usually unbounded and is being processed as received from the data source.
Then there is Structured streaming which works on the principle of polling data in intervals and then this interval data is processed and appended or updated to the unbounded result table.
Apache Spark has a framework for both i.e. Spark Streaming to handle Streaming using micro batches and DStreams and Structured Streaming using Datasets and Data frames.
Let us try to understand Spark Streaming from an example.
Suppose a big retail chain company wants to get a real-time dashboard to keep a close eye on its inventory and operations. Using this dashboard the management should be able to track how many products are being purchased, shipped and delivered to customers.
Spark Streaming can be an ideal fit here.
The order management system pushes the order status to the queue(could be Kafka) from where Streaming process reads every minute and picks all the orders with their status. Then Spark engine processes these and emits the output status count. Spark streaming process runs like a daemon until it is killed or error is encountered.
As defined by Arthur Samuel in 1959, “Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell gave a definition which is more specifically from an engineering perspective, “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”. ML solves complex problems that could not be solved with just mathematical numerical methods or means. ML is not supposed to make perfect guesses. In ML’s domain, there is no such thing. Its goal is to make a prediction or make guesses which are good enough to be useful.
MLlib is the Apache Spark’s scalable machine learning library. MLlib has multiple algorithms for Supervised and Unsupervised ML which can scale out on a cluster for classification, regression, clustering, collaborative filtering. MLlib interoperates with Python’s math/numerical analysis library NumPy and also with R’s libraries. Some of these algorithms are also applicable to streaming data. MLlib helps Spark provide sentiment analysis, customer segmentation and predictive intelligence.
A very common use case of ML is text classification, say for categorising emails. An ML pipeline can be trained to classify emails by reading an Inbox. A typical ML pipeline looks like this. ML is a subject in itself so it is not possible to deep dive here.
Fog Computing is another use case of Apache Spark. To understand Fog computing we need to understand IoT first. IoT basically connects all our devices so that they can communicate with each other and provide solutions to the users of those devices. This would mean huge amounts of data and current cloud computing may not be sufficient to cater to so much data transfer, data processing and online demand of customer’s request.
Fog computing can be ideal here as it takes the work of processing to the devices on the edge of the network. This would need very low latency, parallel processing of ML and complex graph analytical algorithms, all of which are readily available in Apache spark out of the box and can be pick and choose as per the requirements of the processing. So it is expected that as IoT gains momentum Apache spark will be the leader in Fog computing.
- Event Detection:Apache Spark is increasingly used in event detection like credit card fraud detection, money laundering activities etc. Apache spark streaming along with MLlib and Apache Kafka forms the backbone of a fraud financial transaction detection.
Credit card transactions of a cardholder can be captured over a period of time to categorize user’s spending habits. Models can be developed and trained to predict any anomaly in the card transaction and along with Spark streaming and Kafka in real time.
- Interactive Analysis:Spark’s one of the most popular features is its ability to provide users with interactive analytics. MapReduce does provide tools like Pig and Hive for interactive analysis, but they are too slow in most of the cases. But Spark is very fast and swift and that’s why it has gained so much ground in the interactive analysis.
Spark interfaces with programming languages like R, Python, SQL and Scala which caters to a bigger set of developers and users for interactive analysis.Spark also came up with Structured Streaming in version 2.0 which can be used for interactive analysis with live data as well as join the live data with batch data output to get more insight into the data. Structured streaming in future has the potential to boost Web Analytics by allowing users to query user’s live web session. Even machine learning can be applied to live session data for more insights.
- Data Warehousing: Data warehousing is another function where Apache Spark has is getting tremendous traction. Due to an increasing volume of data day by day, the tradition ETL tools like Informatica along with RDBMS are not able to meet the SLAs as they are not able to scale horizontally. Spark along with Spark SQL is being used by many companies to migrate to Big Data based Warehouse which can scale horizontally as the load increases.
With Spark, even the processing can be scaled horizontally by adding machines to the Spark engine cluster.These migrated applications embed the Spark engine and offer a web UI to allow users to create, run, test and deploy jobs interactively. Jobs are primarily written in native Spark SQL or other flavours of SQL. These Spark clusters have been able to scale to process many terabytes of data every day and the clusters can be hundreds to thousands of nodes.
Companies Using Apache Spark
Apache Spark at Alibaba:
Alibaba is the world’s one of the biggest e-commerce players. Alibaba’s online shopping platform generates Petabytes of data as it has millions of users every day doing searches, shopping and placing orders. These user interactions are represented as complex graphs. The processing of these data points is done using Spark’s Machine learning component MLlib and then used to provide better user shopping experience by suggesting products based on choice, trending products, reviews etc.
Apache Spark at MyFitnessPal:
MyFitnessPal is one of the largest health and fitness lifestyle portals. It has over 80 million active users. The portal helps its users follow and achieve a healthy lifestyle by following a proper diet and fitness regime. The portal uses the data added by users about their food, exercise and lifestyles to identify the best quality food and effective exercise. Using Spark the portal is able to scan through the huge amount of structured and unstructured data and pull out best suggestions for its users.
Apache Spark at TripAdvisor:
TripAdvisor has a huge user base and generates a mammoth amount of data every day. It is one of the biggest names in the Travel and Tourism industry. It helps users plan their personal and official trips around the world. It uses Apache Spark to process petabytes of data from user interactions and destination details and gives recommendations on planning a perfect trip based on users choice and preferences. They help users identify best airlines, best prices on hotels and airlines, best places to eat, basically everything needed to plan any trip. It also ranks these places, hotels, airlines, restaurants based on user feedback and reviews. All this processing is done using Apache Spark
Apache Spark at Yahoo:
Yahoo is known to have one of the biggest Hadoop Cluster and everyone is aware of Yahoo’s contribution to the development of Big Data system. Yahoo is also heavily using Apache Spark Machine learning capabilities to identify topics and news which users are interested in. This is similar to trending tweets or hashtags on Twitter or Facebook. Earlier these Machine Learning algo were developed in C/C++ with thousands of lines of code. While today with Spark and Scala/Pythons these algorithms can be implemented in few hundreds of lines of code. This is a big leap in turnover time as well as code understanding and maintenance. This has been made possible due to Spark to a great extent.
Apache Spark Use cases
Spark is used in Finance industry across different functional and technology domains.
A typical use case is building a Data Warehouse for batch processing and daily reporting. The Spark data frames abstraction has been used as a generic ingestion platform capable of ingesting data from multiple sources of different formats.
Financial services companies also use Apache Spark MLlib to create and train models for fraud detection. Some of the banks have started using Spark as a tool for classifying text in money transfers.
Some of the companies use Apache spark as log collection, an analysis engine and detection engine.
Let’s look at Spain's 2nd biggest bank BBVA use case where every money transfer a customer makes goes through an engine that infers a category from its textual description. This engine has been developed in Spark, mixes MLLib and own implementations, and is currently into production serving more than 5M customers daily.
The challenges that the BBVA technology team faced while building this ML were many:
- They did not know the data source in advance
- They did not have a labelled set
- A fraction of texts is useless (detection rather than classification)
- Distribution of categories is imbalanced
- Prefer false negatives over false positives
- Very short text, language not even syntactically correct
The engineers solved these problems using the Spark MLlib pipeline using some other NLP tools like word2vec.
- TF-IDF features + linear classifier (98% precision, 21% recall
- Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD)
- Implemented in Spark/Scala, using MLlib classes
- Own classes implemented for Multi-class Logistic Regression, VLAD
- Scala dependency injection useful to quickly setup variants of the above steps
Healthcare industry is the newest in adopting advanced technologies like big data and machine learning to provide hi-tech facilities to their patients. Apache Spark is penetrating fast and is becoming the heartbeat in the latest Healthcare applications. Hospitals use these Spark enabled healthcare applications to analyze patients medical history to identify possible health issues based on history and learning.
Also, healthcare produces massive amounts of data and to process so much of the data in quick time and provide insights based on that itself was a challenge which Spark solves with ease.
Another very interesting problem in hospitals is when working with Operating Room(OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty and unused operating rooms leading to longer waiting times for patients for their procedures.
Let’s see a use case. For a basic surgical procedure, it costs around $15-20 per minute. So, OR is a scarce and valuable resource and it needs to be utilized carefully and optimally. OR efficiency differs depending on the OR staffing and allocation, not the workload. So the loss of efficiency means a loss for the patient. So time and management are the utmost importance here.
Spark and MLlib solve the problem by developing a predictive model that would identify available OR time 2 weeks in advance, allows hospitals to confirm waitlist cases two weeks in advance instead of when blocks normally release 4 days out. This OR Scheduling can be done by getting the historical data and running then linear regression model with multiple variables.
This model works because:
- Can coordinate waitlist scheduling logistics with physicians and patients within 2 weeks of surgery.
- Plan staff scheduling and resources so there are less last-minutes staffing issues for nursing and anaesthesia
- Utilization metrics show where elective surgical schedule and level demand can be maximized.
Big retail chains have this usual problem of optimising their supply chain to minimize cost and wastage, improve customer service and gain insights into customer’s shopping behaviour to serve better and in the process optimize their profit.
To achieve these goals these retail companies have a lot of challenges like to keep the inventory up to date based on sales and also to predict sales and inventory during some promotional events and sale seasons. Also, they need to keep a track on customer’s orders transit and delivery. All these pose huge technical challenges. Apache Spark and MLlib is being used by a lot of these companies to capture real-time sales and invoice data, ingest it and then figure out the inventory. The technology can also be used to identify in real-time the order’s transit and delivery status. Spark MLlib analytics and predictive models are being used to predict sales during promotions and sale seasons to match the inventory and be ready for the event. The historical data on customer’s buying behaviour is also used to provide the customer with personalized suggestions and improve customer satisfaction. A lot of stores have started using sensors to get data on customer’s location within the store, their preferences, shopping behaviour, etc to provide on-the-spot suggestions and help to find, buy a product by sending messages, using displays etc.
Airline customer segmentation is a challenging field to understand due to customer’s complex behaviour. Amadeus is one of the main IT solution providers in the airline industry. It has the resources and infrastructure to manage all the ticketing and booking data as well as understanding the Airline needs and market particularities. By combining different data sources produced by different airline systems, they have applied unsupervised machine learning techniques to improve our understanding of customer behaviour.
Challenges in the airline industry are to understand the health of the business:
- Are any segments growing or shrinking
- How is the yield developing
- Tune marketing to specific interests within segments
- Optimize product offers using fare structures and media offers
Traditional approaches for segmentation were based on business intuition and manually crafter rules set. But these approaches have limitations and prejudices which can sometimes be negative for the business. On the contrary, the data-driven approach is resilient against turn-over, prejudices and market change.
With a data-driven approach and using Spark and MLlib, the model is able to extract actionable insights on typical customer behaviour and intentions. Supervised and supervised learning using Spark MLlib techniques at scale are used to train models for prediction. These are then used to assist the customer in deploying the newfound insights into day-to-day operations.
Media companies Netflix, Hotstar etc are using Apache Spark at the heart of their technology engine to drive their business. When a user turns on Netflix, he is able to see his favourite content playing automatically. This is achieved through recommendation engines built on Machine learning algorithms and Spark MLlib. Netflix uses historical data from users content selection, trains its ML algorithms, tests it offline and then deploys it live and checks if it works in Production as well.
Netflix has built an engine something called Time Travel using Apache Spark and other big data technologies to: Snapshot online services and use the snapshot data offline to generate features and share facts and features between experiments without calling live systems.
If someone is interested in exploring the details of the use case, one can look at the below link:
Apache Spark is spreading its roots everywhere. A common man not related to software industry may not realise it but there are applications running or extracting data from his home environment and processed in Spark to make his life better and easier. An example we will discuss below is the British Gas.
British Gas is a 200-year-old company. Connected Homes is BG’s IoT “startup”. It is a leader in the UK’s connected home market. Connected Homes is trying to predict the usage consumption patterns of the electricity, gas at the homes and provide consumers with insights so they can smartly use their devices and reduce energy consumption and save energy and money. Connected homes use Apache Spark at the core of its Data Engineering and ML engine.
The challenges are there are millions of electric and gas meters and the meters are read every 30 minutes.
- Gas and electricity meter readings
- Thermostat temperature data
- Connected boiler data
- Real-time energy consumption data
- Introducing motion sensors, window and door sensors, etc
Apache Spark MLlib is used to apply machine learning to these data for disaggregation, similar home comparison and smart meters used in indirect algorithms for non-smart customers.
The analytics engine is used to show customers how they have spent energy, what are their top 3 spends, how can they reduce their energy consumption by showing patterns from smart consumers and smart meters etc. This gives customers a lot of insight and educates customers on optimally using energy at their homes.
Online Gaming industry is another beneficiary of the Apache Spark technology.
Riot Games uses Spark for Combating abusive language in chat in the team games. The challenges in online gaming are:
- 1% of all players are consistently unsportsmanlike
- 2% of all games infected by serious toxicity
- In-Game Toxicity 95% of all serious toxicity comes from players who are otherwise sportsmanlike
To solve this the game developers tried to predict the words used by the gamers in the context of the game or the scenario. They used the “Word2Vec” a neural model which has 256 dimensions embedding months of chat logs. Each word in the chat is document split in spaces and lowercase. The model was trained on NLP for acronyms, short forms, colloquial words etc and the deviations could be huge. The team built a model trained to predict bad/toxic language. The gaming company has 100+ million users every month and so the data is huge. They used Spark MLlib to train their models using different algorithms, one of them Logistic Regression Random Forest Gradient Boosted Trees. The results were impressive for them as they tuned their models for better precision.
Benefits of having Apache Spark for Individual Companies
Many of the companies across industries have been benefiting from Apache Spark. The reasons could be different:
- Speed of execution
- Multi-language support
- Machine learning library
- Graph processing library
- Batch processing as well as Stream & Structured stream processing
Apache Spark is beneficial for small as well as large enterprise. Spark offers a complete solution to many of the common problems like ETL and warehousing, Stream data processing, common use case of supervised and unsupervised learning for data analytics and predictive modelling. So with Apache Spark, the technology team does not require to look out for different technology stack and multiple vendors for a solution. This reduces the learning curve for additional development and maintenance. Also, since Spark has support for multiple languages Scala, Java, Python & R, it is easy to find developers.
Though there are so many benefits of Apache Spark as we have seen above, there are few limitations which Apache Spark has. We should be aware of these limitations before we decide to adopt any technology.
- Apache Spark does not come with an inbuilt file system and it has to depend on HDFS in most of the use cases. If not, it has to be used with some cloud-based data platform.
- Even though Spark has Stream processing feature it is not exactly real-time processing. It processes in batches which are called micro-batches.
- Apache Spark is expensive as it catches a lot of data and memory is not cheap.
- Spark faces issues while working with HDFS which has a very large number of small files.
- Though Spark MLlib provides machine learning capabilities, it does not come with a very exhaustive list of algorithms. It can solve a lot of ML problems but not all.
- Apache Spark does not have automatic code optimization process in place and so the code needs to be optimized manually.
Reasons Why You Should Learn Apache Spark
We have seen the wide impact and use cases if Apache Spark. So we know that Spark has become a buzzword these days. We should now also understand why we should learn Spark.
- Spark offers a complete package for developers and can act as a unified analytics engine. Hence it increases the productivity for the developers. So it ROI for any firm is high and most of the companies dependent on technology are aware of the fact and also willing to put their money in Spark.
- Learning Spark can help explore the world of Big data and data science. Both these technology fields are the future and bringing transformational changes in almost all industries. So getting exposed to Spark is becoming a necessity for all firms.
- With fast-paced Spark adoption by organizations, it is opening up new prospects in business and many of the applications have proved that instead of business driving technology it is becoming vice versa now, that technology is driving business.
- According to a survey, there is a huge demand for Spark engineers. Today, there are well over 1,000 contributors to the Apache Spark project across 250+ companies worldwide. Recently, Indeed.com listed over 2,400 full-time open positions for Apache Spark professionals across various industries including enterprise technology, e-commerce/retail, healthcare, and life sciences, oil and gas, manufacturing, and more.
- Apache Spark developers earn the highest average salary among all other programmers. So this is another and one of the major incentives one can get to learn and expertise Spark.
One can look at an old survey by Databricks to understand the importance and impact of Apache Spark by 2016. By 2019 these number would have grown much bigger.
Apache Spark has capabilities to process huge amount of data in a very efficient manner with high throughput. It can solve problems related to batch processing, near real-time processing, can be used to apply lambda architecture, can be used for Structured streaming. Also, it can solve many of the complex data analytics and predictive analytics problems with the help of the MLlib component which comes out of the box. Apache Spark has been making a big impact on the whole data engineering and data science gamut at scale.