
Now that we have some understanding of Spark, let us dive deeper and understand its components. Apache Spark consists of the Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR. You can use the Spark Core Engine along with any of the other five components mentioned above; it is not necessary to use all the Spark components together. Depending on the use case and application, any one or more of these can be used along with Spark Core.
Let us look at each of these components in detail.

Spark Core: Spark Core is the heart of the Apache Spark framework. It provides the execution engine for the Spark platform, which the other components are built on top of and use as needed. Spark Core provides in-memory computing and the ability to reference datasets stored in external storage systems. It is responsible for all the basic I/O functions, scheduling, and monitoring, and fault recovery and effective memory management are among its other important functions.
Spark Core uses a special data structure called the RDD (Resilient Distributed Dataset). In distributed processing systems like MapReduce, data sharing requires intermediate results to be written to and then read back from permanent storage such as HDFS or S3, which is slow because of the serialization, deserialization, and disk I/O involved. RDDs overcome this: they are in-memory, fault-tolerant data structures that can be shared across different tasks within the same Spark process. RDDs are immutable, partitioned collections and can contain any type of object: Python, Scala, Java, or user-defined class objects. They can be created either by transforming an existing RDD or by loading data from external sources such as HDFS or HBase. We will look at RDDs and their transformations in depth in later sections of this tutorial.
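To make this concrete, here is a minimal spark-shell style sketch of creating and transforming an RDD. It assumes a SparkContext named `sc` is already available, as it is in the shell; in a standalone application you would build one from a SparkConf first.

```scala
// Assumes `sc` (a SparkContext) is in scope, as in spark-shell.
// An RDD can also be loaded from external storage, e.g. sc.textFile("hdfs://...").
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations are lazy: they only describe the computation
val squares = numbers.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// collect() is an action; it triggers the distributed computation
println(evens.collect().mkString(", "))   // prints: 4, 16
```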
Spark SQL: Spark SQL grew out of Shark, which was the first interactive SQL engine on the Hadoop ecosystem. Shark was built on top of the Hive codebase and achieved performance improvements by swapping out Hive's physical execution engine. But due to the limitations of Hive, Shark could not reach the performance it was meant to, so the Shark project was stopped and Spark SQL was built with that knowledge on top of the Spark Core engine to leverage the power of Spark. You can read more about Shark in the blog post by Reynold Xin, one of the Spark SQL maintainers.
Spark SQL is so named because it works with data in a way similar to SQL; in fact, Spark SQL aims to be compatible with the SQL-92 standard. The gist is that it lets developers write declarative code while the engine uses as much of the data and stored structure (RDDs) as it can to optimize the resulting distributed query behind the scenes. The goal is to let users focus on the business use case rather than on the distributed nature of the data. Users can perform extract, transform, and load operations on data from a variety of sources and formats, such as JSON, Parquet, or Hive, and then run ad hoc queries using Spark SQL.
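As an illustration, here is a small sketch of loading semi-structured data and running an ad hoc SQL query over it. The file name `orders.json` and its columns are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .master("local[*]")
  .getOrCreate()

// Load JSON data; Spark infers the schema automatically
val orders = spark.read.json("orders.json")   // hypothetical input file

// Register a temporary view so the data can be queried with plain SQL
orders.createOrReplaceTempView("orders")

val bigSpenders = spark.sql(
  """SELECT customer_id, SUM(amount) AS total
     FROM orders
     GROUP BY customer_id
     HAVING SUM(amount) > 1000""")

bigSpenders.show()
```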
The DataFrame is the main abstraction in Spark SQL. A DataFrame is a distributed collection of data organized into named columns; in earlier versions of Spark SQL, DataFrames were referred to as SchemaRDDs. The DataFrame API integrates with procedural Spark code to provide tight integration between procedural and relational processing. It evaluates operations lazily, which enables relational optimizations and lets Spark optimize the overall data processing workflow. All relational functionality in Spark can be accessed through the SQLContext or HiveContext.
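A minimal sketch of the DataFrame API is shown below, assuming the SparkSession `spark` from the previous snippet. Note that the transformations only build up a plan; nothing runs until the action at the end.

```scala
// Needed for the $"column" syntax and toDF
import spark.implicits._

// A DataFrame: distributed rows organized into named columns
val people = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")

// Lazy: filter and select just extend the logical plan
val over40 = people.filter($"age" > 40).select($"name")

// show() is an action; it triggers execution of the optimized plan
over40.show()
```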
Catalyst, an extensible optimizer, is at the core of Spark SQL. It is an optimization framework written in Scala that helps developers improve both their productivity and the performance of the queries they write. Using Catalyst, Spark developers can concisely specify complex relational optimizations and query transformations in a few lines of code, making the best use of Scala's powerful programming constructs such as pattern matching and runtime metaprogramming. Catalyst also eases the process of adding optimization rules, data sources, and data types, for example for machine learning domains.
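Continuing the previous snippet, `explain(true)` can be used to inspect the plans that Catalyst produces before execution; this is a handy way to see the optimizer at work.

```scala
// Prints the parsed, analyzed, optimized, and physical plans
// that Catalyst generated for the query built above
over40.explain(true)
```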
Spark Streaming: This Spark library is primarily maintained by Tathagata Das, with help from Matei Zaharia. As the name suggests, it is a library for streaming data. It is a very popular Spark library because it takes Spark's big data processing power and cranks up the speed: Spark Streaming can handle gigabytes of data per second. This combination of big and fast data has a lot of potential. Spark Streaming is used for analyzing a continuous stream of data; a common example is processing log data from a website or server.
Technically, Spark Streaming is not true streaming. What it really does is break the incoming data into small chunks that are processed together as small RDDs. So it does not process data byte by byte as it arrives; instead, it processes data every second, every two seconds, or at some other fixed interval. Strictly speaking, Spark Streaming is near real-time, or micro-batching, rather than real-time, but this is sufficient for the vast majority of applications.
Spark Streaming can be configured to talk to a variety of data sources. We can simply listen on a port that has data being pushed to it, or we can connect to sources like Amazon Kinesis, Kafka, or Flume; connectors are available for all of these. Another good thing about Spark Streaming is that it is reliable. It has a concept called checkpointing, which periodically stores state to disk, and, depending on the kind of data source or receiver being used, it can pick up data from the point of failure. This is a very robust mechanism for handling failures such as disk or node failures. Spark Streaming offers exactly-once message guarantees and helps recover lost work without writing any extra code or adding additional configuration.
Just as Spark SQL has the DataFrame/Dataset built on top of the RDD, Spark Streaming has the DStream. A DStream is a collection of RDDs that represents the entire data stream. The good thing about DStreams is that most of the built-in RDD functions, such as map and flatMap, can also be applied to them, and a DStream can be broken into its individual RDDs and processed one chunk at a time. Spark developers can reuse the same code for stream and batch processing, and can also integrate streaming data with historical data.
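The sketch below pulls these ideas together: a two-second batch interval, a socket source, checkpointing, and familiar RDD-style operators applied to a DStream. The host, port, and checkpoint directory are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, one for processing
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

// A 2-second batch interval: each micro-batch becomes a small RDD
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("checkpoint/")   // periodic state snapshots for failure recovery

// Listen on a TCP socket; Kafka, Kinesis, Flume, etc. have dedicated connectors
val lines = ssc.socketTextStream("localhost", 9999)

// The usual RDD-style operators work on DStreams as well
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()
ssc.start()             // start receiving and processing micro-batches
ssc.awaitTermination()
```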
MLlib: Today many companies focus on building customer-centric data products and services, which need machine learning to produce predictive insights, recommendations, and personalized results. Data scientists can solve these problems using popular languages like Python and R, but they spend a lot of time building and supporting infrastructure for those languages. Spark has built-in support for doing machine learning and data science at massive scale across a cluster: MLlib, which stands for Machine Learning Library.
MLlib is a low-level machine learning library that can be called from Java, Scala, and Python. It is simple to use, scales well, and integrates easily with other tools and frameworks, which eases the development and deployment of scalable machine learning pipelines. Machine learning is a subject in itself and cannot be covered in detail here, but MLlib provides common algorithms for classification, regression, clustering, and collaborative filtering, along with utilities for feature extraction, model evaluation, and building pipelines.
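As a small illustration, the sketch below trains a logistic regression model with the DataFrame-based spark.ml API. The data is toy data, and the snippet assumes a SparkSession with its implicits imported (as in spark-shell).

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._   // assumes a SparkSession named `spark`

// Toy labelled data: (label, two numeric features)
val training = Seq(
  (1.0, 2.0, 30.0),
  (0.0, 1.0,  5.0),
  (1.0, 3.0, 45.0),
  (0.0, 0.5,  2.0)
).toDF("label", "f1", "f2")

// MLlib estimators expect a single vector column of features
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10)

// A Pipeline chains feature preparation and model training,
// and runs distributed like any other Spark job
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.transform(training).select("label", "prediction").show()
```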
GraphX: For graphs and graph-parallel processing, Apache Spark provides another API called GraphX. Graph here does not mean charts, line graphs, or bar graphs; it means graphs in the computer science sense, such as social networks, which consist of vertices, where each vertex is an individual user, and edges connecting users to one another. The edges represent the relationships between users in the network.
GraphX is useful for computing overall information about a graph network: for example, it can count how many triangles appear in the graph and apply the PageRank algorithm to it. It can measure things like connectedness, degree distribution, average path length, and other high-level properties of a graph. It can also join graphs together and transform graphs quickly, and it supports the Pregel API for traversing a graph. Spark GraphX provides the Resilient Distributed Graph (RDG, an abstraction over Spark RDDs), whose API data scientists use to perform graph operations through various computational primitives. Similar to basic RDD operations like map and filter, property graphs also have basic operators; these operators take user-defined functions (UDFs) and produce new graphs with transformed properties and structure.
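A minimal GraphX sketch (Scala only, assuming a SparkContext `sc` as in spark-shell) that builds a tiny social graph and computes degrees and PageRank might look like this:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; here the attribute is a user name
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Edges carry a source id, a destination id, and a relationship attribute
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

val graph = Graph(users, follows)

// High-level measures: degree of each vertex
graph.degrees.collect().foreach(println)

// Run PageRank until it converges within the given tolerance
val ranks = graph.pageRank(0.001).vertices
ranks.join(users)
     .map { case (_, (rank, name)) => (name, rank) }
     .collect()
     .foreach(println)
```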
SparkR: The R programming language is widely used by data scientists because of its simplicity and its ability to run complex algorithms. But R's data processing capacity is limited to a single node, which makes it unsuitable for processing huge amounts of data. This problem is solved by SparkR, an R package in Apache Spark. SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large distributed datasets, and it also supports distributed machine learning through Spark MLlib.
Together, these components make Apache Spark a complete big data processing engine. All of them are provided out of the box, and we can use them separately or together.