
Introduction
In this section we will look at the Apache Spark architecture in detail and try to understand how it works internally. We will also cover the main technical terms associated with Spark's architecture, such as Driver, Executor, Master, Cluster, and Worker.
Now that we have a fair understanding of Spark and its main features, let us dive deeper into its architecture and understand the anatomy of a Spark application. We know that Spark is a distributed cluster computing framework and that it works in a master-slave fashion. Whenever we need to execute a Spark program, we perform an operation called "spark-submit". We will go over the details of what this means in later sections, but put simply, spark-submit is like invoking the main program, as we do in Java. On performing a spark-submit on a cluster, a master and one or more slaves are launched to accomplish the task written in the Spark program. There are different modes of launching a Spark program, such as standalone, client, and cluster mode; we will look at these options in detail later.
To visualize the architecture of a Spark cluster, let us look at the diagram below and understand each component and its functions.
Whenever we want to run an application, we perform a spark-submit with some parameters. Say we submit an application A; this leads to the creation of one Driver process for A (which usually runs on the master node) and one or more Executors on the worker nodes. This entire set of a Driver and Executors is exclusive to application A. Now say we want to run another application B and perform a spark-submit: another set of one Driver and a few Executors is started, totally independent of the Driver and Executors of application A. Even if both Drivers run on the same machine in the cluster, they are mutually exclusive, and the same applies to the Executors. So a Spark cluster consists of a master node and worker nodes that can be shared across multiple applications, but each application runs mutually exclusive of the others.
When we launch a Spark application using a resource manager such as YARN, there are two ways to do it: cluster mode and client mode. In cluster mode, YARN creates and manages an Application Master in which the Driver runs, and the client can go away once the application has started. In client mode, the Driver keeps running on the client, and the Application Master only requests resources from YARN.
To launch a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class \
    --master yarn \
    --deploy-mode cluster \
    [options] \
    <app jar> \
    [app options]
To launch it in client mode:
$ ./bin/spark-shell --master yarn --deploy-mode client

When we run Spark in standalone mode, a master node first needs to be started, which can be done by executing:
./sbin/start-master.sh
This creates a master node on the machine where the command is executed. Once the master starts, it prints a Spark URL of the form spark://HOST:PORT, which can be used to start the worker nodes.
Several worker nodes can be started on different machines on the cluster using the command:
./sbin/start-slave.sh <master-spark-URL>
The master’s web UI can be accessed at http://localhost:8080 (or http://<master-host>:8080 from another machine).
We will see these scripts in detail in the Spark Installation section.

Spark Driver: The Driver is the process that runs the main() function of the application and creates the SparkContext. A Driver is a separate JVM process, and it is the Driver's responsibility to analyze, distribute, schedule, and monitor work across the worker nodes. Each application launched or submitted on a cluster has its own separate Driver running, and even if multiple applications run simultaneously on a cluster, their Drivers do not talk to each other in any way. The Driver program also hosts several components that are part of the application, such as the SparkContext and the schedulers that plan and assign tasks.
The Spark application which we want to run is instantiated within the Spark Driver.
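To make this concrete, below is a minimal sketch of a self-contained driver program in Scala (the object name, the input-path argument, and the word-count logic are illustrative, not taken from any particular application). Everything inside main() runs in the Driver JVM, while the RDD operations it describes are executed on the executors.
import org.apache.spark.sql.SparkSession

// A hypothetical, minimal driver application: main() runs inside the Driver JVM.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // Creating the SparkSession (and with it the SparkContext) is the Driver's job.
    val spark = SparkSession.builder()
      .appName("WordCountDriver")
      .getOrCreate()

    // The work described here is scheduled by the Driver but executed on the Executors.
    // args(0) is assumed to be a path to a text file.
    val counts = spark.sparkContext
      .textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}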
Spark Executor: The Driver program launches the tasks that run on the individual worker nodes. These tasks operate on the subset of RDD partitions present on that node. The processes running on the worker nodes are called executors, and the actual program written in your application is executed by these executors. After starting up, the Driver program interacts with the cluster manager (YARN, Mesos, or Spark's standalone manager) to acquire resources on the worker nodes and then assigns tasks to the executors. Tasks are the basic units of execution.
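Executor resources are typically requested through configuration when the session is built. A brief sketch in Scala (the property values shown are arbitrary examples, and spark.executor.instances applies when running on YARN):
import org.apache.spark.sql.SparkSession

// Hypothetical executor sizing; tune these values for your own cluster.
val spark = SparkSession.builder()
  .appName("ExecutorConfigExample")
  .config("spark.executor.instances", "4") // how many executors to request (YARN)
  .config("spark.executor.cores", "2")     // cores per executor
  .config("spark.executor.memory", "2g")   // heap memory per executor
  .getOrCreate()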
SparkSession and SparkContext: SparkContext is the heart of any Spark application. It can be thought of as a bridge between your program and the Spark environment and all that it has to offer, and it is used as the entry point to kick-start the application. SparkContext can be used to create RDDs, as shown below:
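For example, in Scala (a minimal sketch, assuming the SparkContext is available as sc, as it is in the shell):
// Parallelize a local collection into a distributed RDD.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)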
Here, distData is the RDD created using the SparkContext.
SparkSession is a unified entry point into a Spark application, and it also encapsulates the SparkContext. SparkSession was introduced in Spark 2.x; prior to this, Spark had different contexts for different use cases, such as SQLContext for SQL queries, HiveContext for running Spark on Hive, StreamingContext, and so on. SparkSession removes the confusion over which context to use by subsuming SQLContext and HiveContext. It is instantiated using a builder and is an important component of Spark 2.0:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("SparkSessionExample")
  .getOrCreate()
In the Spark interactive Scala shell, the SparkSession and SparkContext are automatically provided by the environment, so there is no need to create them manually; in standalone applications, we need to create them explicitly.
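For instance, in the shell the pre-created session can be used directly (a quick illustration):
scala> spark.range(5).count()
res0: Long = 5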
| Mode | Driver | When To Use |
| Client Mode | Driver runs on the machine from which the Spark job is submitted. | Suitable when the submitting machine is close to the cluster, so network latency is low; chances of failure are higher, since the application depends on the client's network connection. |
| Cluster Mode | Driver is launched on one of the machines in the cluster, not on the client machine from which the job is submitted. | Suitable when the submitting machine is far from the cluster; chances of failure due to network issues are lower. |
| Standalone Mode | Driver is launched on the machine where the master script is started. | Useful for development and testing; not recommended for production-grade applications. |
Conclusion
In this section we have understood the internals of Apache Spark, which are very important, as we will have to look into many of these processes when working with Spark in a production environment. Most of this understanding comes in handy while debugging and tuning our applications.