
Introduction
In this section we will look at the Apache Spark architecture in detail and try to understand how it works internally. We will also cover the main technical terms associated with Spark's architecture, such as Driver, Executor, Master, Cluster, and Worker.
Now that we have a fair understanding of Spark and its main features, let us dive deeper into its architecture and understand the anatomy of a Spark application. We know that Spark is a distributed cluster computing framework and that it works in a master-slave fashion. Whenever we need to execute a Spark program, we perform an operation called "spark-submit". We will go over the details of what this means in later sections, but put simply, spark-submit is like invoking the main program, as we do in Java. On performing a spark-submit on a cluster, a master and one or more slaves are launched to accomplish the task written in the Spark program. There are different modes of launching a Spark program, such as standalone, client, and cluster mode; we will look at these options in detail later.
To visualize the architecture of a Spark cluster, let us look at the diagram below and understand each component and its functions.
Whenever we want to run an application, we perform a spark-submit with some parameters. Say we submit an application A; this leads to the creation of one Driver process for A (which usually runs on the master node) and one or more Executors on the worker nodes. This entire set of a Driver and Executors is exclusive to application A. Now say we want to run another application B and perform a spark-submit: another set of one Driver and a few Executors is started, totally independent of the Driver and Executors of application A. Even if both Drivers run on the same machine in the cluster, they are mutually exclusive, and the same applies to the Executors. So a Spark cluster consists of a master node and worker nodes that can be shared across multiple applications, but each application runs mutually exclusive of the others.
When we launch a Spark application using a resource manager such as YARN, there are two ways to do it: cluster mode and client mode. In cluster mode, YARN creates and manages an Application Master in which the Driver runs, and the client can go away once the application has started. In client mode, the Driver keeps running on the client, and the Application Master only requests resources from YARN.
To launch a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class \
    --master yarn \
    --deploy-mode cluster \
    [options] \
    <app jar> \
    [app options]
To launch it in client mode:
$ ./bin/spark-shell --master yarn --deploy-mode client

When we run Spark in standalone mode, a master node first needs to be started, which can be done by executing:
./sbin/start-master.sh
This creates a master node on the machine where the command is executed. Once the master starts, it prints a Spark URL of the form spark://HOST:PORT, which can be used to start the worker nodes.
Several worker nodes can be started on different machines on the cluster using the command:
./sbin/start-slave.sh <master-spark-URL>
The master’s web UI can be accessed at http://localhost:8080 (or http://<master-host>:8080 from another machine).
We will see these scripts in detail in the Spark Installation section.

Spark Driver: The Driver is the process that runs the main() function of the application and creates the SparkContext. A Driver is a separate JVM process, and it is the Driver's responsibility to analyze, distribute, schedule, and monitor work across the worker nodes. Each application launched or submitted on a cluster has its own separate Driver running, and even if multiple applications run simultaneously on a cluster, their Drivers do not talk to each other in any way. The Driver program also hosts several components that are part of the application, such as the SparkContext and the schedulers that plan and assign tasks.
The Spark application which we want to run is instantiated within the Spark Driver.
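To make this concrete, below is a minimal sketch of a self-contained driver program in Scala (the object name, the input-path argument, and the word-count logic are illustrative, not taken from any particular application). Everything inside main() runs in the Driver JVM, while the RDD operations it describes are executed on the executors.
import org.apache.spark.sql.SparkSession

// A hypothetical, minimal driver application: main() runs inside the Driver JVM.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // Creating the SparkSession (and with it the SparkContext) is the Driver's job.
    val spark = SparkSession.builder()
      .appName("WordCountDriver")
      .getOrCreate()

    // The work described here is scheduled by the Driver but executed on the Executors.
    // args(0) is assumed to be a path to a text file.
    val counts = spark.sparkContext
      .textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}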
Spark Executor: The Driver program launches the tasks that run on the individual worker nodes. These tasks operate on the subset of RDD partitions present on that node. The processes running on the worker nodes are called executors, and the actual program written in your application is executed by these executors. After starting up, the Driver program interacts with the cluster manager (YARN, Mesos, or Spark's standalone manager) to acquire resources on the worker nodes and then assigns tasks to the executors. Tasks are the basic units of execution.
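Executor resources are typically requested through configuration when the session is built. A brief sketch in Scala (the property values shown are arbitrary examples, and spark.executor.instances applies when running on YARN):
import org.apache.spark.sql.SparkSession

// Hypothetical executor sizing; tune these values for your own cluster.
val spark = SparkSession.builder()
  .appName("ExecutorConfigExample")
  .config("spark.executor.instances", "4") // how many executors to request (YARN)
  .config("spark.executor.cores", "2")     // cores per executor
  .config("spark.executor.memory", "2g")   // heap memory per executor
  .getOrCreate()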
SparkSession and SparkContext: SparkContext is the heart of any Spark application. It can be thought of as a bridge between your program and the Spark environment and all that it has to offer, and it is used as the entry point to kick-start the application. SparkContext can be used to create RDDs, as shown below:
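For example, in Scala (a minimal sketch, assuming the SparkContext is available as sc, as it is in the shell):
// Parallelize a local collection into a distributed RDD.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)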
Here, distData is the RDD created using the SparkContext.
SparkSession is a unified entry point into a Spark application, and it also encapsulates the SparkContext. SparkSession was introduced in Spark 2.x; prior to this, Spark had different contexts for different use cases, such as SQLContext for SQL queries, HiveContext for running Spark on Hive, StreamingContext, and so on. SparkSession removes the confusion over which context to use by subsuming SQLContext and HiveContext. It is instantiated using a builder and is an important component of Spark 2.0:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("SparkSessionExample")
  .getOrCreate()
In the Spark interactive Scala shell, the SparkSession and SparkContext are automatically provided by the environment, so there is no need to create them manually; in standalone applications, we need to create them explicitly.
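For instance, in the shell the pre-created session can be used directly (a quick illustration):
scala> spark.range(5).count()
res0: Long = 5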
| Mode | Driver | When To Use |
| Client Mode | Driver runs on the machine from which the Spark job is submitted. | Suitable when the submitting machine is close to the cluster, so network latency is low; chances of failure are higher, since the application depends on the client's network connection. |
| Cluster Mode | Driver is launched on one of the machines in the cluster, not on the client machine from which the job is submitted. | Suitable when the submitting machine is far from the cluster; chances of failure due to network issues are lower. |
| Standalone Mode | Driver is launched on the machine where the master script is started. | Useful for development and testing; not recommended for production-grade applications. |
Conclusion
In this section we have understood the internals of Apache Spark, which are very important, as we will have to look into many of these processes when working with Spark in a production environment. Most of this understanding comes in handy while debugging and tuning our applications.