In data science, data processing is the most essential step after data acquisition. Data today qualifies as big data almost everywhere because of its large volume, its variety, and the velocity with which it is generated. For processing this big data, analytics environments rely on two popular platforms: MapReduce from Hadoop or Apache Spark. Notable features of Apache Spark include flexibility, reusability, in-memory computation on RDDs, fault tolerance, real-time stream processing, and more. If you’re keen on building your career in the field of Data Science, exploring KnowledgeHut’s data science with python syllabus is a great starting point, covering all the relevant concepts in data science.
In 2009, Apache Spark began as a research project at the UC Berkeley AMPLab, and in early 2010 it was open-sourced. The main reason for creating the Apache Spark framework was to address MapReduce's inefficiencies. Although MapReduce was quite popular and widely used, it could not solve a wide range of problems, especially for multi-pass applications requiring low-latency data sharing across multiple parallel operations. After its debut, Spark grew a large development community, and in 2013 the project moved to the Apache Software Foundation. Today, a community of hundreds of developers from various companies collaborates on it.
Apache Spark is more recent and, although it cannot entirely replace MapReduce, it is more helpful for processing big datasets at relatively high speed. It is an open-source unified analytics engine for very large datasets. It uses RAM for fast in-memory computation and a cluster computing model for general-purpose processing, which lets it handle a variety of workloads. It covers all three main types of big data processing: batch, streaming, and iterative. It has therefore become the choice of many industries for data processing, and there is great demand for Spark professionals with a data science background. A data science certificate can build the necessary expertise in data science for learning Spark.
This article covers the Spark ecosystem, its architecture, and its utility, including some clear situations where it benefits a company by saving time and money. We will also look at some of the newer alternatives that are being developed to overcome some of Spark's shortcomings. Finally, we will explore the best practices to follow while using Spark.
Architecture of Spark
The Spark ecosystem is shown in the following diagram, which illustrates the different layers built around the core Spark engine.
The architecture of Spark follows a typical flow, which can be observed in the following illustration.
Let us consider the elements in Spark architecture.
Driver
The Driver (Program) is the main process that runs the application's main function and creates an object called SparkContext. The SparkContext is the first point of contact for coordinating the Spark application in the cluster. The driver converts the user program into tasks and then schedules those tasks onto the executors.
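As a minimal sketch of what a driver program looks like in PySpark (the application name and the numbers are placeholders), the script below creates a SparkSession, whose underlying SparkContext is the object the driver uses to coordinate the application:

```python
from pyspark.sql import SparkSession

# The driver program starts here: building a SparkSession creates
# the SparkContext that coordinates the application on the cluster.
spark = (
    SparkSession.builder
    .appName("example-driver")   # placeholder application name
    .getOrCreate()
)

sc = spark.sparkContext          # the SparkContext created by the driver

# Work defined on the driver is broken into tasks and scheduled on executors.
numbers = sc.parallelize(range(1_000_000))
print(numbers.sum())

spark.stop()
```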
Cluster Manager
- The cluster manager is responsible for allotting resources to the applications and arranging them through the resource manager. It decides how many executors are launched; Spark can run on several kinds of clusters and fixes the number of executors accordingly.
- Spark works with different types of cluster managers: Hadoop YARN, Apache Mesos, or the Spark Standalone scheduler. (A short sketch of how the choice is made follows.)
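A small sketch of how the cluster manager is selected; the master URL formats below are the standard ones, but the host names and ports are placeholders:

```python
from pyspark import SparkConf, SparkContext

# Choose a cluster manager via the master URL (hosts/ports are placeholders):
#   "local[*]"           - run locally, one thread per core (no cluster manager)
#   "spark://host:7077"  - Spark Standalone scheduler
#   "yarn"               - Hadoop YARN
#   "mesos://host:5050"  - Apache Mesos
conf = SparkConf().setAppName("cluster-manager-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.master)   # confirms which cluster manager the application is using
sc.stop()
```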
Executor
- An executor is a process launched on a worker node that runs the tasks assigned to it and reports on their progress.
- It is responsible for running tasks on the worker node, keeping the RDD partitions created during processing in memory, and sending the results back to the driver.
- Each executor handles the portion of the application allotted to it; the work is spread across multiple executors and worker nodes depending upon the size of the data. (A brief configuration sketch follows.)
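For illustration, executor resources are usually fixed when the application is configured. The property names below are standard Spark settings, but the values are placeholders rather than recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative executor settings (values are placeholders, not recommendations).
spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "2")       # cores per executor
    .config("spark.executor.memory", "4g")     # memory per executor
    .getOrCreate()
)
```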
Worker Node
- The Spark architecture also works on the Master-Slave principle wherein the worker node is a slave under the master.
- A worker node runs the application code in the cluster.
- The number of worker nodes depends on the volume of jobs and distribution criteria.
The Spark architecture depends upon two abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
The term "Resilient Distributed Datasets" refers to a collection of data objects that may be stored in memory on worker nodes. Three important words have special meaning as follows:
- Resilient: Ability to restore data in the event of a failure
- Distributed: Different nodes share large data in smaller portions.
- Dataset: Group of data.
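A short PySpark sketch of these ideas, assuming a SparkContext named sc is already available (for example via spark.sparkContext):

```python
# An RDD is a distributed collection: here a small Python list is split
# into 4 partitions that can live in memory on different worker nodes.
rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3), ("d", 4)], numSlices=4)

print(rdd.getNumPartitions())   # 4 -> "Distributed"
rdd.cache()                     # keep the partitions in memory for fast reuse
# Spark records how the RDD was built (its lineage), so lost partitions
# can be recomputed after a failure -> "Resilient"
```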
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that conducts a series of computations on the data. Each node represents an RDD partition, and each edge is a transformation applied on top of the data. Directed means the edges have a direction, and acyclic means the graph never completes a cycle. The graph is a visual representation, like a flow chart, of the tasks to be executed in the indicated order.
Data processing in Spark involves two main steps, which resemble MapReduce. The first is Transformation, in which an initial RDD creates further RDDs and the DAG is built up. Because Spark uses lazy evaluation, execution takes place only when it is required. The second step is the Action, which triggers the processing and produces the result, and the result is then returned to the driver.
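A brief sketch of the two steps, again assuming an existing SparkContext sc; the transformations only extend the DAG, and nothing runs until the action is invoked:

```python
# Transformations: build new RDDs and extend the DAG, but run nothing yet.
lines  = sc.parallelize(["spark is fast", "spark is lazy", "hadoop mapreduce"])
words  = lines.flatMap(lambda line: line.split())   # transformation
sparky = words.filter(lambda w: w == "spark")       # transformation

# Action: triggers execution of the whole DAG and returns a result to the driver.
print(sparky.count())   # 2
```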
Factors to Consider When Assessing Spark
Deciding when to use Spark in place of an existing processing platform such as MapReduce is a tricky issue. Whenever a big data processing problem is under consideration, many factors must be weighed, such as speed, fault tolerance, scalability, latency, the languages required, the APIs available and the cost of implementation. We must therefore consider the situation we are dealing with and judge whether the benefits of Spark meet the needs; opting for Spark just for the sake of it is not a wise step. Here’s a handy list of situations where Spark is a good choice and those where it isn’t.
Factors in favour of Spark usage:
- When data is large, streaming and iterative, Spark’s efficiency and speed are an asset.
- If the cost of running Spark clusters is not a concern, Spark can work with other complex systems.
- When you need to use multiple languages and libraries, such as Python, Scala and R, on the same cluster.
- When you can afford sufficient RAM capacity for processing large data, as Spark’s operation depends primarily on in-memory computation.
- When you possess a competent bunch of people aware of all aspects of Spark, its architecture, and operating parameters.
Factors that are not in favour:
- It is highly expensive if the in-memory requirements are very high.
- It is less efficient in handling small-sized datasets.
- Debugging is often required and can be complex because of the way Spark operates.
- MLlib has a limited set of algorithms for machine learning problems compared to other similar libraries, and the performance of the available algorithms is also less satisfactory than that of other machine learning tools.
- In a multi-user system relying on shared memory, Spark is slow to respond, which makes processing more time-consuming.
These are only some of the factors for and against choosing Spark for processing large data. However, opinions in favour of and against Spark can differ depending on the users and what they find during actual usage of Spark in data science projects.
Alternatives to Spark
Until recently, Hadoop was the chosen platform for distributed data processing. But with the growing popularity of Spark, businesses have transitioned from Hadoop to Spark. As technology trends in the IT sector are dynamic, alternatives to Spark have also come up. There are already some distributed computing frameworks that provide attractive and established alternatives to Spark. The two main alternatives are the Dask and Ray frameworks.
Dask
Dask, first released in 2015, aims to provide a strong parallel computing framework that is incredibly user-friendly for Python programmers and can operate on anything from a single laptop to a cluster. Dask is lighter than Spark and easier to incorporate into existing code and hardware.
Spark requires a certain amount of learning and practice because of its novel API and execution paradigm. In contrast, Dask is a pure Python framework that works directly with Pandas data frames and NumPy array data structures, which is a real advantage for most data scientists, who can begin using it almost immediately. Dask also integrates with Joblib, Scikit-learn's parallel computing library, making it possible to run Scikit-learn code in parallel with minor code modifications.
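A short sketch of how familiar Dask feels to Pandas users; the file pattern and column names are placeholders:

```python
import dask.dataframe as dd

# Reads many CSV files lazily into one Dask dataframe (path pattern is a placeholder).
df = dd.read_csv("sales-2023-*.csv")

# Same API as Pandas; nothing executes until .compute() is called.
revenue_per_region = df.groupby("region")["revenue"].sum()
print(revenue_per_region.compute())
```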
Ray
The Ray framework was developed at RISELab, UC Berkeley, as a simple and universal API for building distributed applications. The Ray core is packaged with the following four libraries for accelerating machine learning and deep learning workloads:
- Tune (Scalable Hyperparameter Tuning)
- RLlib (Scalable Reinforcement Learning)
- Ray Train (Distributed Deep Learning)
- Datasets (Distributed Data Loading and Compute, in beta)
These libraries enable Ray to be used for major machine learning use cases such as simulation, distributed training, complex computations and deployment in interactive settings, while preserving all of Hadoop and Spark's desirable characteristics.
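A minimal sketch of Ray's core API: decorating a plain Python function with @ray.remote turns it into a distributed task.

```python
import ray

ray.init()  # starts Ray locally; on a cluster it connects to the running head node

@ray.remote
def square(x):
    # Runs as a distributed task, potentially on another machine.
    return x * x

# Launch tasks in parallel and gather the results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))   # [0, 1, 4, 9, 16, 25, 36, 49]

ray.shutdown()
```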
Both Dask and Ray are exceptionally good in standard natural language processing tasks (text normalization, calculating word frequency tables, etc.).
In situations where Spark is used as an ETL tool, other Apache products such as Apache Storm and Apache Flink can be good alternatives for real-time stream processing, and Apache Flume can be an alternative for processing large amounts of log data. For data discovery or exploration tasks, Python or R with MySQL/PostgreSQL can be the tools of choice for data scientists rather than Spark.
Similarly, when business intelligence and reporting are required, it is challenging to process semi-structured data with Spark before a BI developer can use it for dashboarding purposes. Here, companies can choose between Snowflake, Google BigQuery or similar tools.
Further, in machine learning projects, Apache Spark can handle complex data thanks to its capacity for large-scale transformations. Apache Spark can thus be used to create the training dataset, but deploying a model in production might require a separate system, such as Redis or Cassandra, to serve data in real time. Google Dataflow and FlinkML are two good alternatives to Spark in such cases.
Spark Best Practices
Best practices inherently optimize performance while keeping costs in check. The following practices help achieve those objectives when using Spark.
- To have a clear understanding of the Spark architecture, the data abstractions such as RDDs, data frames and datasets, and a clear view of the roles and responsibilities of the driver and executors and the nature of the tasks.
- To select the correct number of cores, partitions and tasks so that the workload is balanced across all tasks and distributed smoothly for parallel processing, avoiding bottlenecks (see the sketch after this list).
- To spill data to disk only when the need arises, i.e. when a particular executor reports that it is running short of memory on its worker node. Frequent shuffling, spilling to disk and excessive garbage collection should be avoided as far as possible.
- To have expertise in debugging, as errors can occur when large data is processed and debugging it is usually a complex process; adequately skilled people can handle it efficiently.
- To work with smaller chunks of data for efficient distribution in parallel processing. In the beginning, the transformations and actions can be tried on a small sample to confirm the approach; if found satisfactory, it can be scaled up for further processing.
- To watch out for the latest versions so that new features can be utilized to improve performance.
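The following is a rough PySpark sketch of the partitioning, memory and sampling points above; the file name and all numeric values are placeholders rather than recommendations.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Placeholder input; in practice this would be a large dataset.
df = spark.read.parquet("events.parquet")

# Try the logic on a small sample first, then scale up once it looks correct.
sample = df.sample(fraction=0.01)
sample.show(5)

# Balance the workload: control partition counts explicitly (values are illustrative).
spark.conf.set("spark.sql.shuffle.partitions", "200")
df = df.repartition(200)

# Spill to disk only when memory runs short, instead of recomputing from scratch.
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())
```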
These are some of the best practices to follow, based on the architecture setup, expertise and load distribution strategy. When actually processing big data, many further tips and precautions apply to specific functions and methods, depending on the actual stage of processing the data with Spark.
Does this Spark your Interest?
In this article, we have seen the important concepts, the pros and cons, and the newer alternatives of Apache Spark. Even so, Spark remains a popular choice, with frequent releases and updates from Apache. Speed, simplicity and multiple language support are major reasons companies prefer Apache Spark for faster data processing. Companies like Uber, Netflix, IBM, Databricks and many more have deployed Spark at massive scale for data processing applications. Although Spark may not be the greatest tool for every project, it is a tool worth considering if you operate in today's Big Data world. If you are looking to enhance your Apache Spark skills, go through our data science with python syllabus and see if the course adds value to your existing skillset.
In conclusion, Spark in the data science domain is an incredibly versatile Big Data platform with strong data processing capabilities. Since it is an open-source framework, it continues to be improved and developed, with new features and functionalities introduced frequently. As Big Data applications become increasingly broad and challenging, Apache Spark and its applications will evolve accordingly.