Apache Spark Use Cases & Applications

  • by Nitin Kumar
  • 04th Jun, 2019
  • Last updated on 18th Jun, 2019
  • 8 mins read

Apache Spark was developed by a team at UC Berkeley in 2009. Since then, it has been adopted by leading technology companies such as Google, Facebook, Apple and Netflix, and demand keeps growing. According to a marketanalysis.com survey, the worldwide Apache Spark market will grow at a CAGR of 67% between 2019 and 2022. Spark market revenue may grow to $4.2 billion by 2022, with a cumulative market value of $9.2 billion (2019 - 2022).

As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”.

Spark is a cluster computing framework, somewhat similar to MapReduce, but it offers more capabilities and more speed, and it provides APIs for developers in languages such as Scala, Python, Java and R. It is also friendly to database developers, as Spark SQL supports most ANSI SQL functionality. Spark has out-of-the-box support for machine learning and graph processing through components called MLlib and GraphX respectively, and it supports streaming data through Spark Streaming.

Spark is written in the Scala programming language. Although the majority of Spark use cases rely on HDFS as the underlying storage layer, HDFS is not mandatory: Spark works with a variety of other data sources such as Cassandra, MySQL and AWS S3. Spark also ships with its own default resource manager, which may be good enough for a development environment or a small cluster, but it integrates well with YARN and Mesos too. Most production-grade and large clusters use YARN or Mesos as the resource manager.

Features of Spark

  1. Speed: According to Apache, Spark can run applications on a Hadoop cluster up to 100 times faster in memory and up to 10 times faster on disk than MapReduce. Spark achieves this by overcoming a drawback of MapReduce, which always writes intermediate results to disk. Spark does not need to write intermediate results to disk and can work in memory using the DAG, lazy evaluation, RDDs and caching, on top of a highly optimized execution engine.
  2. Fault Tolerance: Spark’s execution engine is not only fast but also fault tolerant. It achieves this with an abstraction called the RDD (Resilient Distributed Dataset) in combination with the DAG, which is built to handle task failures and even node failures.
  3. Lazy Evaluation: Spark evaluates lazily. Transformations on RDDs/Datasets are not executed when they are declared: the output RDDs/Datasets are not available right after a transformation and are produced only when needed, i.e. when an action is performed. Until then the transformations are just part of the DAG, which is executed when an action is called.
  4. Multiple Language Support: Spark supports multiple programming languages such as Scala, Java, Python and R, and also Spark SQL, which is very similar to SQL.
  5. Reusability: Spark code written for batch processing jobs can be reused for stream processing, and it can be used to join historical batch data with stream data on the fly.
  6. Machine Learning: MLlib is Spark’s machine learning library. It is available out of the box for creating ML pipelines for data analysis and predictive analytics.
  7. Graph Processing: Spark also supports graph processing. Using the GraphX APIs, which are again provided out of the box, one can write graph processing jobs and perform graph-parallel computation.
  8. Stream Processing and Structured Streaming: Spark can be used for batch processing and can also serve stream processing use cases using micro-batches. Spark Streaming comes with Spark, so no other streaming tools or APIs are needed. Spark also supports Structured Streaming, and it has built-in connectors for Apache Kafka, which come in very handy when developing streaming applications.
  9. Spark SQL: Spark has strong SQL support and a built-in SQL optimizer. Spark SQL features are used heavily in warehouses to build ETL pipelines.
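
The lazy evaluation and fault tolerance points in the list above can be illustrated with a small sketch. This is plain Python, not Spark's API: the `LazyDataset` class and its `map`, `filter` and `collect` methods are hypothetical stand-ins that only mimic how transformations are recorded as a plan and run when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original input records
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: only extends the plan, runs nothing.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: only extends the plan, runs nothing.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replays the whole lineage over the source.
        # Replaying from the source is also how lost results could be rebuilt.
        records = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)    # still 0: nothing has executed yet
result = plan.collect()             # the action triggers execution
print(calls_before_action, result)  # 0 [6, 8]
```

Because the plan keeps its full lineage, calling `collect()` again simply recomputes the same result from the source, which is the idea behind rebuilding lost data after a task or node failure.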

Spark is used in more than 1000 organizations, which have built huge clusters for batch processing, stream processing, data warehouses, data analytics engines and predictive analytics platforms using many of the above features. Let’s look at some of the use cases in a few of these organizations.

What are the different Apache Spark applications?

Streaming Data: 

Streaming data is basically unstructured data produced continuously by different types of data sources. The sources could be anything: log files generated while customers use mobile apps or web applications, social media content like tweets and Facebook posts, telemetry from connected devices, or instrumentation in data centres. Streaming data is usually unbounded and is processed as it is received from the data source.

Then there is Structured Streaming, which works on the principle of polling the source at intervals; the data that arrives in each interval is processed and then appended to, or used to update, an unbounded result table.

Apache Spark has a framework for both: Spark Streaming handles streaming using micro-batches and DStreams, while Structured Streaming uses Datasets and DataFrames.

Let us try to understand Spark Streaming from an example.

Suppose a big retail chain company wants to get a real-time dashboard to keep a close eye on its inventory and operations. Using this dashboard the management should be able to track how many products are being purchased, shipped and delivered to customers.

Spark Streaming can be an ideal fit here.

The order management system pushes each order status to a queue (this could be Kafka), from which the streaming process reads every minute and picks up all the orders with their status. The Spark engine then processes these and emits a count per status. The Spark Streaming process runs like a daemon until it is killed or an error is encountered.
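
The per-minute status count described above can be sketched in plain Python as follows. This is only an illustration of the micro-batch idea, not Spark Streaming code: the event tuples, the `micro_batches` and `status_counts` helpers and the 60-second interval are assumptions, and a real job would read from the queue continuously rather than from a list.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, order_id, status) events into fixed-width batches."""
    batches = {}
    for ts, order_id, status in events:
        batches.setdefault(ts // batch_interval, []).append((order_id, status))
    return [batches[key] for key in sorted(batches)]


def status_counts(batch):
    """Count how many events in one batch carry each order status."""
    return Counter(status for _, status in batch)


# Hypothetical order events: (seconds since start, order id, status).
events = [
    (5, "o1", "PURCHASED"),
    (20, "o2", "PURCHASED"),
    (45, "o1", "SHIPPED"),
    (70, "o3", "PURCHASED"),
    (95, "o2", "SHIPPED"),
    (110, "o1", "DELIVERED"),
]

# One dictionary of status counts per one-minute batch, ready for a dashboard.
per_batch = [dict(status_counts(b)) for b in micro_batches(events, batch_interval=60)]
```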

Machine learning:

As defined by Arthur Samuel in 1959, “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell gave a more engineering-oriented definition: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” ML solves complex problems that are hard to solve with conventional mathematical or numerical methods alone. ML is not supposed to make perfect guesses; in ML’s domain, there is no such thing. Its goal is to make predictions that are good enough to be useful.

MLlib is Apache Spark’s scalable machine learning library. It provides multiple supervised and unsupervised algorithms for classification, regression, clustering and collaborative filtering, all of which can scale out on a cluster. MLlib interoperates with Python’s numerical library NumPy and with R’s libraries. Some of these algorithms are also applicable to streaming data. MLlib helps Spark support use cases such as sentiment analysis, customer segmentation and predictive intelligence.

A very common use case of ML is text classification, say for categorising emails. An ML pipeline can be trained to classify emails by reading an inbox. ML is a subject in itself, so it is not possible to deep dive here; a typical ML pipeline looks like this.

[Figure: a typical machine learning pipeline]

Fog computing: 

Fog computing is another use case of Apache Spark. To understand fog computing we need to understand IoT first. IoT basically connects all our devices so that they can communicate with each other and provide solutions to the users of those devices. This means huge amounts of data, and current cloud computing may not be sufficient to cater to so much data transfer, data processing and online customer demand.

Fog computing can be ideal here, as it moves the processing work to the devices at the edge of the network. This needs very low latency, parallel processing of ML and complex graph analytics algorithms, all of which are readily available in Apache Spark out of the box and can be picked and chosen as per the requirements of the processing. So it is expected that as IoT gains momentum, Apache Spark will be a leader in fog computing.

  • Event Detection: Apache Spark is increasingly used in event detection, such as detecting credit card fraud and money laundering activities. Spark Streaming, along with MLlib and Apache Kafka, forms the backbone of fraudulent financial transaction detection.
    Credit card transactions of a cardholder can be captured over a period of time to categorize the user’s spending habits. Models can then be developed and trained to predict any anomaly in card transactions and, together with Spark Streaming and Kafka, applied in real time.
  • Interactive Analysis: One of Spark’s most popular features is its ability to provide users with interactive analytics. The Hadoop ecosystem does provide tools like Pig and Hive on top of MapReduce for interactive analysis, but they are too slow in most cases. Spark is much faster, which is why it has gained so much ground in interactive analysis.
    Spark interfaces with languages like R, Python, SQL and Scala, which caters to a bigger set of developers and users for interactive analysis. Spark also introduced Structured Streaming in version 2.0, which can be used for interactive analysis of live data as well as for joining live data with batch output to get more insight into the data. In future, Structured Streaming has the potential to boost web analytics by allowing users to query a user’s live web session. Even machine learning can be applied to live session data for more insights.
  • Data Warehousing: Data warehousing is another function where Apache Spark is getting tremendous traction. Due to the increasing volume of data day by day, traditional ETL tools like Informatica, along with an RDBMS, are not able to meet SLAs because they cannot scale horizontally. Spark along with Spark SQL is being used by many companies to migrate to a Big Data based warehouse, which can scale horizontally as the load increases.
    With Spark, even the processing can be scaled horizontally by adding machines to the Spark cluster. These migrated applications embed the Spark engine and offer a web UI that lets users create, run, test and deploy jobs interactively. Jobs are primarily written in native Spark SQL or other flavours of SQL. These Spark clusters have been able to scale to process many terabytes of data every day, and the clusters can be hundreds to thousands of nodes.
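
As a rough illustration of the event detection idea above, the sketch below flags a card transaction whose amount is far from the cardholder's usual spending. It uses a simple z-score rule in plain Python as a stand-in for a trained model; the amounts, the `is_anomalous` helper and the threshold of 3 standard deviations are assumptions chosen for the example, not part of any Spark or MLlib API.

```python
from statistics import mean, pstdev


def is_anomalous(history, amount, threshold=3.0):
    """Return True when `amount` lies more than `threshold` std devs from the mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # All past amounts are identical, so any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical past transaction amounts for one cardholder.
history = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0, 21.0]

# Score two new transactions: a typical one and an unusually large one.
flags = [is_anomalous(history, amount) for amount in (23.0, 400.0)]
```

In a streaming setup, the same kind of scoring function would be applied to each incoming transaction as it arrives from the queue.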

Companies using Apache Spark

Apache Spark at Alibaba:

Alibaba is one of the world’s biggest e-commerce players. Alibaba’s online shopping platform generates petabytes of data, as millions of users search, shop and place orders every day. These user interactions are represented as complex graphs. The data points are processed using Spark’s machine learning component, MLlib, and then used to provide a better shopping experience by suggesting products based on user choices, trending products, reviews etc.

Apache Spark at MyFitnessPal:

MyFitnessPal is one of the largest health and fitness lifestyle portals, with over 80 million active users. The portal helps its users achieve a healthy lifestyle by following a proper diet and fitness regime. It uses the data users add about their food, exercise and lifestyle to identify the best quality food and effective exercise. Using Spark, the portal is able to scan through the huge amount of structured and unstructured data and pull out the best suggestions for its users.

Apache Spark at TripAdvisor:

TripAdvisor has a huge user base and generates a mammoth amount of data every day. It is one of the biggest names in the travel and tourism industry and helps users plan their personal and official trips around the world. It uses Apache Spark to process petabytes of data from user interactions and destination details, and gives recommendations for planning a perfect trip based on users’ choices and preferences. It helps users identify the best airlines, the best prices on hotels and flights, the best places to eat, basically everything needed to plan any trip. It also ranks these places, hotels, airlines and restaurants based on user feedback and reviews. All this processing is done using Apache Spark.

Apache Spark at Yahoo:

Yahoo is known to have one of the biggest Hadoop clusters, and everyone is aware of Yahoo’s contribution to the development of the Big Data ecosystem. Yahoo also makes heavy use of Apache Spark’s machine learning capabilities to identify topics and news that users are interested in. This is similar to trending tweets or hashtags on Twitter or Facebook. Earlier, these machine learning algorithms were developed in C/C++ with thousands of lines of code, while today, with Spark and Scala/Python, they can be implemented in a few hundred lines. This is a big leap in turnaround time as well as in code readability and maintenance, and Spark has made it possible to a great extent.

Apache Spark Use cases

Finance

Spark is used in the finance industry across different functional and technology domains.

A typical use case is building a data warehouse for batch processing and daily reporting. Spark’s DataFrame abstraction has been used as a generic ingestion platform capable of ingesting data from multiple sources in different formats.

Financial services companies also use Apache Spark MLlib to create and train models for fraud detection. Some of the banks have started using Spark as a tool for classifying text in money transfers.

Some companies use Apache Spark as a log collection, analysis and detection engine.

Let’s look at the use case of BBVA, Spain’s 2nd biggest bank, where every money transfer a customer makes goes through an engine that infers a category from its textual description. This engine was developed in Spark, mixes MLlib with BBVA’s own implementations, and is currently in production serving more than 5M customers daily.

The BBVA technology team faced many challenges while building this ML engine:

  • They did not know the data source in advance
  • They did not have a labelled set
  • A fraction of texts is useless (detection rather than classification)
  • Distribution of categories is imbalanced
  • Prefer false negatives over false positives
  • Very short text, language not even syntactically correct

The engineers solved these problems with a Spark MLlib pipeline, together with other NLP tools like word2vec.

  • TF-IDF features + linear classifier (98% precision, 21% recall)
  • Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD)
  • Implemented in Spark/Scala, using MLlib classes
  • Own classes implemented for Multi-class Logistic Regression, VLAD
  • Scala dependency injection useful to quickly set up variants of the above steps
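
The 98% precision and 21% recall figures above, together with the stated preference for false negatives over false positives, can be read off a confusion matrix. The sketch below shows the arithmetic in plain Python; the counts are hypothetical numbers chosen only to reproduce those two percentages and are not BBVA's data.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)  # how many flagged items were right
    recall = true_pos / (true_pos + false_neg)     # how many relevant items were found
    return precision, recall


# Hypothetical counts: 150 texts were assigned the category, 147 of them correctly,
# while 553 texts that belonged to the category were left unassigned.
precision, recall = precision_recall(true_pos=147, false_pos=3, false_neg=553)
```

Few false positives (high precision) come at the cost of many false negatives (low recall), which matches the trade-off the team said it preferred.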

HealthCare:

The healthcare industry is one of the newest adopters of advanced technologies like big data and machine learning to provide hi-tech facilities to patients. Apache Spark is penetrating fast and is becoming the heartbeat of the latest healthcare applications. Hospitals use these Spark-enabled applications to analyze patients’ medical history and identify possible health issues based on history and learning.

Also, healthcare produces massive amounts of data; processing that much data quickly and providing insights from it was itself a challenge, which Spark solves with ease.

Another very interesting problem is Operating Room (OR) scheduling: within a hospital setting it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waiting times for patients’ procedures.

Let’s see a use case. A basic surgical procedure costs around $15-20 per minute, so the OR is a scarce and valuable resource that needs to be utilized carefully and optimally. OR efficiency depends on OR staffing and allocation, not on the workload, and a loss of efficiency means a loss for the patient. So time and management are of the utmost importance here.

Spark and MLlib address the problem with a predictive model that identifies available OR time 2 weeks in advance, which allows hospitals to confirm waitlist cases two weeks ahead instead of when blocks normally release, 4 days out. This OR scheduling can be done by taking the historical data and running a linear regression model with multiple variables on it.

This model works because:

  • Hospitals can coordinate waitlist scheduling logistics with physicians and patients within 2 weeks of surgery.
  • Staff scheduling and resources can be planned so that there are fewer last-minute staffing issues for nursing and anaesthesia.
  • Utilization metrics show where the elective surgical schedule and level demand can be maximized.
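
The idea of fitting a linear regression with multiple variables on historical data can be sketched without Spark as below. This is plain Python solving ordinary least squares through the normal equations, not MLlib's implementation; the two features (booked cases and average case length in minutes), the target (unused OR minutes) and all numbers are invented for illustration.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend a constant column for the intercept
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]  # [intercept, weight_1, weight_2, ...]


def predict(weights, row):
    """Apply the fitted intercept and weights to one feature row."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Hypothetical history: (booked cases, average case minutes) -> unused OR minutes.
rows = [(2, 60), (4, 90), (6, 45), (3, 120), (8, 30)]
targets = [360.0, 270.0, 255.0, 270.0, 210.0]

weights = fit_linear_regression(rows, targets)
predicted = predict(weights, (5, 60))  # forecast for a block two weeks out
```

The toy data here is exactly linear, so the fit recovers the generating coefficients; real historical data would be noisy and would need many more variables and rows.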

Retail: 

Big retail chains have the usual problem of optimising their supply chain to minimize cost and wastage, improve customer service, and gain insights into customers’ shopping behaviour so that they can serve customers better and, in the process, optimize their profit.

To achieve these goals, retail companies face a lot of challenges, such as keeping the inventory up to date based on sales and predicting sales and inventory during promotional events and sale seasons. They also need to keep track of the transit and delivery of customers’ orders. All these pose huge technical challenges. Apache Spark and MLlib are being used by a lot of these companies to capture real-time sales and invoice data, ingest it and then work out the inventory. The technology can also be used to identify an order’s transit and delivery status in real time. Spark MLlib analytics and predictive models are being used to predict sales during promotions and sale seasons, so that inventory can be matched and made ready for the event. Historical data on customers’ buying behaviour is also used to provide personalized suggestions and improve customer satisfaction. A lot of stores have started using sensors to get data on customers’ location within the store, their preferences, shopping behaviour etc., in order to provide on-the-spot suggestions and help customers find and buy a product by sending messages, using displays etc.

Travel: 

Airline customer segmentation is a challenging field because of customers’ complex behaviour. Amadeus is one of the main IT solution providers in the airline industry. It has the resources and infrastructure to manage all the ticketing and booking data, and it understands airlines’ needs and market particularities. By combining data sources produced by different airline systems, Amadeus has applied unsupervised machine learning techniques to improve its understanding of customer behaviour.

Challenges in the airline industry are to understand the health of the business:

  • Are any segments growing or shrinking
  • How is the yield developing
  • Tune marketing to specific interests within segments
  • Optimize product offers using fare structures and media offers

Traditional approaches to segmentation were based on business intuition and manually crafted rule sets. But these approaches have limitations and prejudices, which can sometimes be negative for the business. In contrast, the data-driven approach is resilient against staff turnover, prejudices and market change.

With a data-driven approach using Spark and MLlib, the model is able to extract actionable insights on typical customer behaviour and intentions. Supervised and unsupervised learning techniques from Spark MLlib are used at scale to train models for prediction. These are then used to assist the customer in deploying the newfound insights into day-to-day operations.

Media: 

Media companies like Netflix and Hotstar use Apache Spark at the heart of their technology engine to drive their business. When users turn on Netflix, they see their favourite content playing automatically. This is achieved through recommendation engines built on machine learning algorithms and Spark MLlib. Netflix uses historical data from users’ content selections to train its ML algorithms, tests them offline, then deploys them live and checks whether they work in production as well.

Netflix has built an engine called Time Travel using Apache Spark and other big data technologies to snapshot online services and use the snapshot data offline to generate features, and to share facts and features between experiments without calling live systems.

If someone is interested in exploring the details of the use case, one can look at the below link:

Energy: 

Apache Spark is spreading its roots everywhere. A person outside the software industry may not realise it, but there are applications extracting data from their home environment and processing it in Spark to make their life better and easier. An example we will discuss below is British Gas.

British Gas is a 200-year-old company. Connected Homes is BG’s IoT “startup” and a leader in the UK’s connected home market. Connected Homes is trying to predict electricity and gas consumption patterns in homes and give consumers insights so that they can use their devices smartly, reduce energy consumption and save money. Connected Homes uses Apache Spark at the core of its data engineering and ML engine.

The challenge is scale: there are millions of electricity and gas meters, and the meters are read every 30 minutes.

The data includes:

  • Gas and electricity meter readings
  • Thermostat temperature data
  • Connected boiler data
  • Real-time energy consumption data
  • Introducing motion sensors, window and door sensors, etc

Apache Spark MLlib is used to apply machine learning to these data for disaggregation, similar home comparison and smart meters used in indirect algorithms for non-smart customers.

The analytics engine is used to show customers how they have spent energy, what their top 3 spends are, and how they can reduce their energy consumption, by showing patterns from smart consumers, smart meters etc. This gives customers a lot of insight and educates them on using energy optimally at home.

Gaming:

The online gaming industry is another beneficiary of Apache Spark technology.

Riot Games uses Spark for combating abusive language in chat in its team games. The challenges in online gaming are:

  • 1% of all players are consistently unsportsmanlike
  • 2% of all games are infected by serious toxicity
  • 95% of all serious in-game toxicity comes from players who are otherwise sportsmanlike

To solve this, the game developers tried to predict the words used by gamers in the context of the game or scenario. They used Word2Vec, a neural model, to build 256-dimensional embeddings from months of chat logs. Each chat message is treated as a document, lowercased and split on spaces. The model had to cope with acronyms, short forms, colloquial words etc., where the deviations could be huge. The team then built a model trained to predict bad/toxic language. The gaming company has 100+ million users every month, so the data is huge. They used Spark MLlib to train their models with different algorithms, among them logistic regression, random forests and gradient-boosted trees. The results were impressive, as they tuned their models for better precision.
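
The preprocessing step mentioned above (treat each chat message as a document, lowercase it and split it on spaces) is simple enough to sketch in plain Python. The sample messages and the `tokenize` helper are made up for illustration; the resulting token lists are the kind of input a Word2Vec trainer would consume.

```python
def tokenize(message):
    """Lowercase a chat message and split it on whitespace."""
    return message.lower().split()


# Hypothetical chat lines; acronyms and short forms are kept as they are typed.
chat_log = ["GG WP team", "gj  mid, nice gank"]

# One token list ("document") per chat message.
documents = [tokenize(message) for message in chat_log]
```

Because the split is on whitespace only, punctuation stays attached to words (for example `mid,`), which is one reason chat vocabularies end up so large and irregular.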

Benefits of having Apache Spark for Individual companies

Many of the companies across industries have been benefiting from Apache Spark. The reasons could be different:

  • Speed of execution
  • Multi-language support
  • Machine learning library
  • Graph processing library
  • Batch processing as well as Stream & Structured stream processing

Apache Spark is beneficial for small as well as large enterprises. Spark offers a complete solution to many common problems, such as ETL and warehousing, stream data processing, and common supervised and unsupervised learning use cases for data analytics and predictive modelling. So with Apache Spark, the technology team does not need to look for different technology stacks and multiple vendors for a solution. This reduces the learning curve for additional development and maintenance. Also, since Spark supports multiple languages (Scala, Java, Python and R), it is easy to find developers.

Limitations:

Though Apache Spark has many benefits, as we have seen above, it also has a few limitations. We should be aware of these limitations before we decide to adopt any technology.

  • Apache Spark does not come with a built-in file system and depends on HDFS in most use cases. Otherwise, it has to be used with some other storage layer, such as a cloud-based data platform.
  • Even though Spark has a stream processing feature, it is not exactly real-time processing. It processes data in small batches called micro-batches.
  • Apache Spark can be expensive to run, as it caches a lot of data and memory is not cheap.
  • Spark faces issues while working with HDFS when there is a very large number of small files.
  • Though Spark MLlib provides machine learning capabilities, it does not come with a very exhaustive list of algorithms. It can solve a lot of ML problems but not all.
  • Outside of Spark SQL’s built-in optimizer, Apache Spark does not optimize code automatically, so code written against lower-level APIs such as RDDs needs to be optimized manually.

Reasons why you should learn Apache Spark

We have seen the wide impact and use cases of Apache Spark, and we know that Spark has become a buzzword these days. We should now also understand why we should learn Spark.

  • Spark offers a complete package for developers and can act as a unified analytics engine, which increases developer productivity. So its ROI for any firm is high, and most technology-dependent companies are aware of this and willing to put their money into Spark.
  • Learning Spark can help one explore the world of Big Data and data science. Both fields are the future and are bringing transformational changes to almost all industries, so exposure to Spark is becoming a necessity for all firms.
  • Fast-paced Spark adoption by organizations is opening up new prospects in business, and many applications have shown that instead of business driving technology, technology is now driving business.
  • According to a survey, there is a huge demand for Spark engineers. Today, there are well over 1,000 contributors to the Apache Spark project across 250+ companies worldwide. Recently, Indeed.com listed over 2,400 full-time open positions for Apache Spark professionals across various industries including enterprise technology, e-commerce/retail, healthcare and life sciences, oil and gas, manufacturing, and more.
  • Apache Spark developers earn the highest average salary among programmers, which is another major incentive to learn Spark and build expertise in it.

One can look at an old survey by Databricks to understand the importance and impact of Apache Spark by 2016. By 2019 these numbers would have grown much bigger.

Conclusion

Apache Spark can process huge amounts of data very efficiently and with high throughput. It can solve problems related to batch processing and near-real-time processing, can be used to implement the lambda architecture, and supports Structured Streaming. It can also solve many complex data analytics and predictive analytics problems with the help of the MLlib component, which comes out of the box. Apache Spark has been making a big impact on the whole data engineering and data science gamut at scale.

Nitin Kumar

Blog Author

I am an alumnus of IIT (ISM) Dhanbad. I have 15+ years of experience in the software industry, working in the investment banking and financial services domain. I have worked for Wall Street banks like Morgan Stanley and JP Morgan Chase, and have been working on Big Data technologies like Hadoop, Spark and Cloudera for 3+ years.

Join the Discussion

Your email address will not be published. Required fields are marked *

1 comments

Navya venkey 18 Jun 2019

It’s really really great information for becoming a better Blogger. Keep sharing, Thanks

Suggested Blogs

5 Big Data Challenges in 2021

The year 2019 saw some enthralling changes in volume and variety of data across businesses, worldwide. The surge in data generation is only going to continue. Foresighted enterprises are the ones who will be able to leverage this data for maximum profitability through data processing and handling techniques. With the rise in opportunities related to Big Data, challenges are also bound to increase.Below are the 5 major Big Data challenges that enterprises face in 2020:1. The Need for More Trained ProfessionalsResearch shows that since 2018, 2.5 quintillion bytes (or 2.5 exabytes) of information is being generated every day. The previous two years have seen significantly more noteworthy increments in the quantity of streams, posts, searches and writings, which have cumulatively produced an enormous amount of data. Additionally, this number is only growing by the day. A study has predicted that by 2025, each person will be making a bewildering 463 exabytes of information every day.A report by Indeed, showed a 29 percent surge in the demand for data scientists yearly and a 344 percent increase since 2013 till date. However, the searches by job seekers skilled in data science continue to grow at a snail’s pace at 14 percent. In August 2018, LinkedIn reported claimed that US alone needs 151,717 professionals with data science skills. This along with a 15 percent discrepancy between job postings and job searches on Indeed, makes it quite evident that the demand for data scientists outstrips supply. The greatest data processing challenge of 2020 is the lack of qualified data scientists with the skill set and expertise to handle this gigantic volume of data.2. Inability to process large volumes of dataOut of the 2.5 quintillion data produced, only 60 percent workers spend days on it to make sense of it. A major portion of raw data is usually irrelevant. And about 43 percent companies still struggle or aren’t fully satisfied with the filtered data. 3. 
Syncing Across Data SourcesOnce you import data into Big Data platforms you may also realize that data copies migrated from a wide range of sources on different rates and schedules can rapidly get out of the synchronization with the originating system. This implies two things, one, the data coming from one source is out of date when compared to another source. Two, it creates a commonality of data definitions, concepts, metadata and the like. The traditional data management and data warehouses, and the sequence of data transformation, extraction and migration- all arise a situation in which there are risks for data to become unsynchronized.4. Lack of adequate data governanceData collected from multiple sources should have some correlation to each other so that it can be considered usable by enterprises. In a recent Big Data Maturity Survey, the lack of stringent data governance was recognized the fastest-growing area of concern. Organizations often have to setup the right personnel, policies and technology to ensure that data governance is achieved. This itself could be a challenge for a lot of enterprises.5. Threat of compromised data securityWhile Big Data opens plenty of opportunities for organizations to grow their businesses, there’s an inherent risk of data security. Some of the biggest cyber threats to big players like Panera Bread, Facebook, Equifax and Marriot have brought to light the fact that literally no one is immune to cyberattacks. As far as Big Data is concerned, data security should be high on their priorities as most modern businesses are vulnerable to fake data generation, especially if cybercriminals have access to the database of a business. However, regulating access is one of the primary challenges for companies who frequently work with large sets of data. Even the way Big Data is designed makes it harder for enterprises to ensure data security. 
Working with data distributed across multiple systems makes it both cumbersome and risky.Overcoming Big Data challenges in 2020Whether it’s ensuring data governance and security or hiring skilled professionals, enterprises should leave no stone unturned when it comes to overcoming the above Big Data challenges. Several courses and online certifications are available to specialize in tackling each of these challenges in Big Data. Training existing personnel with the analytical tools of Big Data will help businesses unearth insightful data about customer. Frameworks related to Big Data can help in qualitative analysis of the raw information.
1333
5 Big Data Challenges in 2021

The year 2019 saw some enthralling changes in volu... Read More

How to install Apache Spark on Windows?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.In this document, we will cover the installation procedure of Apache Spark on Windows 10 operating systemPrerequisitesThis guide assumes that you are using Windows 10 and the user had admin permissions.System requirements:Windows 10 OSAt least 4 GB RAMFree space of at least 20 GBInstallation ProcedureStep 1: Go to the below official download page of Apache Spark and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.The page will look like below.Step 2:  Once the download is completed unzip the file, to unzip the file using WinZip or WinRAR or 7-ZIP.Step 3: Create a folder called Spark under your user Directory like below and copy paste the content from the unzipped file.C:\Users\\SparkIt looks like below after copy-pasting into the Spark directory.Step 4: Go to the conf folder and open log file called, log4j.properties. template. Change INFO to WARN (It can be ERROR to reduce the log). This and next steps are optional.Remove. template so that Spark can read the file.Before removing. template all files look like below.After removing. template extension, files will look like belowStep 5: Now we need to configure path.Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment VariablesAdd below new user variable (or System variable) (To add new user variable click on New button under User variable for )Click OK.Add %SPARK_HOME%\bin to the path variable.Click OK.Step 6: Spark needs a piece of Hadoop to run. 
For Hadoop 2.7, you need to install winutils.exe.You can find winutils.exe from below pageDownload it.Step 7: Create a folder called winutils in C drive and create a folder called bin inside. Then, move the downloaded winutils file to the bin folder.C:\winutils\binAdd the user (or system) variable %HADOOP_HOME% like SPARK_HOME.Click OK.Step 8: To install Apache Spark, Java should be installed on your computer. If you don’t have java installed in your system. Please follow the below processJava Installation Steps:Go to the official Java site mentioned below  the page.Accept Licence Agreement for Java SE Development Kit 8u201Download jdk-8u201-windows-x64.exe fileDouble Click on Downloaded .exe file, you will the window shown below.Click Next.Then below window will be displayed.Click Next.Below window will be displayed after some process.Click Close.Test Java Installation:Open Command Line and type java -version, then it should display installed version of JavaYou should also check JAVA_HOME and path of %JAVA_HOME%\bin included in user variables (or system variables)1. In the end, the environment variables have 3 new paths (if you need to add Java path, otherwise SPARK_HOME and HADOOP_HOME).2. Create c:\tmp\hive directory. This step is not necessary for later versions of Spark. When you first start Spark, it creates the folder by itself. However, it is the best practice to create a folder.C:\tmp\hiveTest Installation:Open command line and type spark-shell, you get the result as below.We have completed spark installation on Windows system. Let’s create RDD and     Data frameWe create one RDD and Data frame then will end up.1. We can create RDD in 3 ways, we will use one way to create RDD.Define any list then parallelize it. It will create RDD. Below is code and copy paste it one by one on the command line.val list = Array(1,2,3,4,5) val rdd = sc.parallelize(list)Above will create RDD.2. Now we will create a Data frame from RDD. 
Follow the steps below to create the DataFrame.
import spark.implicits._
val df = rdd.toDF("id")
The above code creates a DataFrame with id as its column.
To display the data in the DataFrame, use the command below.
df.show()
It displays the contents of the DataFrame: a single id column holding the values 1 to 5.

How to uninstall Spark from a Windows 10 system:

Follow the steps below to uninstall Spark on Windows 10.
Remove the following system/user variables:
SPARK_HOME
HADOOP_HOME
To remove the system/user variables:
Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables, find SPARK_HOME and HADOOP_HOME, select each of them, and press the DELETE button.
Find the Path variable and choose Edit -> select %SPARK_HOME%\bin -> press the DELETE button.
Select %HADOOP_HOME%\bin -> press the DELETE button -> press the OK button.
Open Command Prompt, type spark-shell and press Enter. The command now fails with an error, which confirms that Spark has been uninstalled from the system.
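As a rough cross-check of the environment setup (Steps 5 to 8) or of the uninstall, the variables can also be inspected from a script. The sketch below is an illustration only and not part of the original procedure: it is written in Python, the function name check_spark_setup is hypothetical, and it assumes the layout described in this guide, namely a bin folder under SPARK_HOME and winutils.exe under HADOOP_HOME\bin.

```python
import os
from pathlib import Path


def check_spark_setup(env=None):
    """Return a list of problems found in the Spark-related environment variables.

    Hypothetical helper: `env` is any mapping of variable names to values and
    defaults to the environment of the current process.
    """
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    hadoop_home = env.get("HADOOP_HOME")

    # SPARK_HOME should point at the folder that contains Spark's bin directory.
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not (Path(spark_home) / "bin").is_dir():
        problems.append("SPARK_HOME has no bin folder: " + spark_home)

    # HADOOP_HOME should point at the folder whose bin directory holds winutils.exe.
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found under HADOOP_HOME\\bin")

    # Only the presence of JAVA_HOME is checked, not the Java version.
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")

    return problems


if __name__ == "__main__":
    issues = check_spark_setup()
    if issues:
        for issue in issues:
            print("PROBLEM:", issue)
    else:
        print("Spark environment variables look consistent")
```

After a successful installation the script should report no problems; after the uninstall steps it should list SPARK_HOME and HADOOP_HOME as not set. Run it from a newly opened Command Prompt, because changes to environment variables are only visible to processes started after the change.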
How to install Apache Spark on Windows?


Top In-demand Jobs During Coronavirus Pandemic

With positive COVID-19 cases exceeding two crore (20 million) globally, and over 281,000 jobs lost in the US alone, the impact of the coronavirus pandemic has already been catastrophic for workers worldwide. While the tourism and supply chain industries are the hardest hit, the healthcare and transportation sectors have been less severely affected. According to a Goldman Sachs report, the number of unemployed individuals in the US could climb to 2.25 million. However, despite these alarming figures, NBC News states that this is merely 20% of the total unemployment rate of the US. Job portals like LinkedIn, Shine, and Monster are also witnessing continued hiring for specific roles. So, what are the roles defining the pandemic job sector?

Top In-demand Jobs During Coronavirus Pandemic

Healthcare specialist
For obvious reasons, the demand for healthcare specialists has spiked globally. This includes doctors, nurses, surgical technologists, virologists, diagnostic technicians, pharmacists, and medical equipment providers.

Logistics personnel
This largely involves shipping and delivery companies, which employ a broad range of people, from warehouse managers to transportation, packaging and fulfillment staff. Presently, Amazon is hiring over 100,000 workers for its operations while adjusting salaries and schedules to accommodate the situation.

Online learning companies
Teaching and learning are at the forefront of the current global scenario. With most individuals either working from home or anticipating the loss of a job, many are upskilling or acquiring new skills to qualify for broader job roles. The demand for teachers and trainers for these courses, and for academic counselors, has also shot up. Remote learning facilities and online upskilling have made these courses much more accessible to individuals as well.

Remote meeting and communication companies
Remote working is heavily dependent on communication and meeting tools such as Zoom, Slack, and Microsoft Teams. The efficiency of these tools, and the effectiveness of managing projects through remote communication, have enabled several industries to sustain themselves through the global pandemic. Even project management is taking an all-new shape thanks to these modern tools. Moreover, several schools are relying on these tools to continue education through online classes.

Psychologists/Mental health-related businesses
Many companies and individuals are seeking help to cope with the situation. This has created a surge in the demand for psychologists. Businesses like PwC and Starbucks have introduced or enhanced their mental health coaching. Mental health and wellness apps like Headspace have seen a 400% increase in demand from top companies like Adobe and GE.

Data analysts
Job portals like Shine have seen a surge in the hiring of data analysts. The simple reason is that there is constant demand for information about the coronavirus, its status, and its impact on the global economy, different markets, and many other industries. Companies are also rapidly hiring data analysts to study current customer behavior and gauge public sentiment.

How to find a job during the coronavirus pandemic
Whether you are looking for a job change, have already felt the impact of the coronavirus, or are at risk of losing your job, here are some ways to stay afloat despite the trying times.
Be proactive on job portals, especially professional networking sites like LinkedIn, to expand your network
Practise phone and video job interviews
Expand your work portfolio by taking on more freelance projects
Pick up new skills by leveraging the online courses available

Stay focused on your current job even in uncertain times
Job security is of paramount importance during a global crisis like this.
Andrew Seaman, an editor at LinkedIn, notes that recruiters are taking a ‘business as usual’ approach despite concerns about COVID-19. The only change, he remarks, is that interviews may be conducted over a video call rather than in person. If the outbreak is not contained soon enough, though, hiring may eventually take a hit.

Useful links