Apache Spark has many features that make it a strong choice as a big data processing engine, and several of them set it apart from other engines in the same space. Let us look in detail at the main features that distinguish it from its competition.
- Fault Tolerance
- Dynamic Nature
- Lazy Evaluation
- Real-Time Stream Processing
- Speed
- Reusability
- Advanced Analytics
- In-Memory Computing
- Supporting Multiple Languages
- Integrated with Hadoop
- Cost Efficient
- Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this through RDDs (Resilient Distributed Datasets) and the DAG (directed acyclic graph) of operations that produced them. The DAG records the lineage of the transformations needed to compute each result, so if a worker node fails, Spark can rebuild the lost data by rerunning the relevant steps from that lineage and arrive at the same result.
- Dynamic Nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.
- Lazy Evaluation: Spark does not evaluate any transformation immediately. Each transformation is added to the DAG, and the computation runs only when an action is called. Because the whole chain of transformations is visible to the engine before anything executes, Spark can make optimization decisions about how to run it.
- Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.
- Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. It achieves this by minimizing disk reads and writes for intermediate results: data is kept in memory, and Spark goes to disk only when essential. It also relies on a DAG scheduler, a query optimizer, and a highly optimized physical execution engine.
- Reusability: The same Spark code can be reused for batch processing, for joining streaming data against historical data, and for running ad hoc queries on streaming state.
- Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data processing and data science across multiple industries. Spark provides libraries for both machine learning (MLlib) and graph processing (GraphX), which companies across sectors use to tackle complex problems on highly scalable clusters. Databricks provides an advanced analytics platform built on Spark.
- In-Memory Computing: Unlike Hadoop MapReduce, Apache Spark can process tasks in memory and does not need to write intermediate results back to disk, which makes processing much faster. Spark can also cache intermediate results so that they can be reused in the next iteration. This gives an added performance boost to iterative and repetitive workloads, where the result of one step is used again later or a common dataset is shared across multiple tasks.
- Supporting Multiple Languages: Spark has built-in multi-language support, with most of its APIs available in Java, Scala, Python, and R. The R API also offers features aimed at data analytics. In addition, Spark ships with Spark SQL, which lets you query data with SQL-like syntax, so SQL developers find it easy to use and face a much gentler learning curve.
- Integrated with Hadoop: Apache Spark integrates well with HDFS, the Hadoop file system. It supports multiple file formats, including Parquet, JSON, CSV, ORC, and Avro. Hadoop can serve Spark as either an input data source or an output destination.
- Cost Efficient: Apache Spark is open source software, so there is no licensing fee and users only have to budget for hardware. Spark also reduces other costs because stream processing, machine learning, and graph processing are built in. There is no vendor lock-in, which makes it easy for organizations to pick and choose Spark features to suit their use case.
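The lazy evaluation, caching, and lineage-based recovery described in the list above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `ToyDataset` and all of its methods are hypothetical names invented for this illustration, not Spark's API, and it works on in-process lists rather than distributed partitions.

```python
class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, name="source", op=None, trace=None):
        self.source = source   # root data; only set on the first dataset
        self.parent = parent   # previous step in the lineage
        self.name = name       # label of the step that produces this dataset
        self.op = op           # function applied to the parent's rows
        self.trace = trace if trace is not None else []  # shared log of executed steps
        self.keep = False      # True once cache() has been requested
        self.cached = None     # materialized rows, if kept

    def _derive(self, name, op):
        # Transformations only record a new lineage node; nothing is computed.
        return ToyDataset(parent=self, name=name, op=op, trace=self.trace)

    def map(self, fn):
        return self._derive("map", lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return self._derive("filter", lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Ask for the rows to be kept once an action has computed them.
        self.keep = True
        return self

    def drop_cache(self):
        # Simulates a failed worker losing the rows it held in memory.
        self.cached = None

    def lineage(self):
        # Walk back to the source and return the steps in execution order.
        steps, node = [], self
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))

    def _compute(self):
        if self.cached is not None:
            return self.cached  # reuse cached rows, skipping earlier steps
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.op(self.parent._compute())
        self.trace.append(self.name)  # record that this step actually ran
        if self.keep:
            self.cached = rows
        return rows

    def collect(self):
        # Action: the only call that triggers execution.
        return self._compute()


numbers = ToyDataset(source=range(1, 6))
squares = numbers.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)

plan = evens.lineage()              # the plan is known before anything runs
ran_before_action = list(numbers.trace)

first = evens.collect()             # runs source, map, filter
second = evens.collect()            # reruns only filter; map comes from the cache
squares.drop_cache()                # simulate losing the cached rows
third = evens.collect()             # lost rows are rebuilt from the lineage

print(plan, ran_before_action)
print(first, second, third)
print(numbers.trace)
```

In this toy model, the trace stays empty until `collect()` is called, the second `collect()` skips the cached `map` step, and dropping the cache forces the full chain to rerun and produce the same rows. That mirrors, in miniature, how the recorded lineage lets results be recovered after a failure.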
Looking at the features above, it is fair to say that Apache Spark is one of the most advanced and popular Apache projects for big data processing. It has separate modules for machine learning, streaming, and structured and unstructured data processing.