DataFrames, Datasets, and Spark SQL Essentials

Updated on Oct 7, 2025

Spark SQL and its two interfaces, DataFrames and Datasets, are central to Spark performance. For structured data they are the most important features for getting the best performance out of Spark, because they use more efficient storage formats and a query optimizer.

Spark SQL was introduced in Spark 1.0, DataFrames in Spark 1.3, and Datasets in Spark 1.6.
Most Spark developers now use DataFrames and Datasets for the bulk of their data processing.

Introduction to Datasets and DataFrame Guide in Apache Spark

DataFrames and Datasets are higher-level Spark APIs that are built on top of RDDs. Although anything we want to do with our data can be done with RDDs, the higher-level APIs let us become productive with Spark more quickly, especially if we have an RDBMS and SQL background.

Spark SQL is Spark's module for structured data processing. Its added benefit is a schema for the data, which RDDs do not have. The schema gives Spark more information about the data it is processing, so it can perform more optimizations during processing. We can also work on the data with interactive SQL queries that adhere to the ANSI SQL:2003 standard, and Spark SQL is compatible with Hive.

The other option for querying and processing data is the DataFrame. A DataFrame is a distributed collection of Row objects. A Row holds the data for one record and gives access to each of its columns, so a DataFrame can be thought of as a database table with data organized in rows and columns. In Spark 2.0 the higher-level APIs were unified under Dataset: a DataFrame is simply a Dataset of rows, i.e. Dataset[Row]. A DataFrame can be converted to an RDD and back to a DataFrame whenever required.
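As a rough illustration of that round trip, the sketch below converts a DataFrame to an RDD[Row] and rebuilds a DataFrame from it by reusing the original schema. The object name RoundTripExample, the local master, the column names, and the sample country rows are assumptions made for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RoundTripExample {
  // Convert a DataFrame to an RDD[Row], then rebuild a DataFrame with the same schema.
  def roundTrip(spark: SparkSession, df: DataFrame): DataFrame = {
    val rows: RDD[Row] = df.rdd             // DataFrame -> RDD[Row]
    spark.createDataFrame(rows, df.schema)  // RDD[Row] + schema -> DataFrame
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("RoundTripExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq((1, "India"), (2, "USA")).toDF("Id", "Name")
    roundTrip(spark, df).show()
    spark.stop()
  }
}
```

Going through the RDD drops the optimizations Spark applies to DataFrame operations, so it is usually worth doing only for logic that cannot be expressed with the DataFrame API.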

Querying DataFrames and Datasets is straightforward. Queries can be written in a domain-specific language (DSL) that is relational in nature, which gives Spark room to optimize them.

The diagram below shows the steps of query execution for Spark SQL, DataFrames, and Datasets.

Query execution in SparkSQL

When a query is submitted, it is first parsed into an unresolved logical plan, meaning the plan still contains unresolved attributes and relations. Spark then looks these up in the catalog to fill in the missing information, which produces a resolved logical plan. A series of optimizations is applied to this plan to produce an optimized logical plan; the engine that performs this work is the Catalyst optimizer. The optimized plan is then converted into multiple physical plans, and a cost model selects the best one. Finally, the code generation step runs and the query is executed, producing the final output as RDDs.
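These stages can be inspected from code. The sketch below, under the assumption of a local session and a small hypothetical Country DataFrame, prints the plans with explain(true) and also reads them from the DataFrame's queryExecution. The object name ExplainExample and the helper plans are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExplainExample {
  // Return the plan at each stage as text: parsed, analyzed, optimized, physical.
  def plans(df: DataFrame): Seq[String] = {
    val qe = df.queryExecution
    Seq(qe.logical, qe.analyzed, qe.optimizedPlan, qe.executedPlan).map(_.toString)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("ExplainExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val countries = Seq((1, "India"), (2, "USA")).toDF("Id", "Name")
    val query = countries.where(col("Id") === 1).select("Name")

    // extended = true prints the logical plans as well as the physical plan.
    query.explain(true)
    plans(query).foreach(println)
    spark.stop()
  }
}
```

Comparing the analyzed and optimized plans for a query is a quick way to see what Catalyst changed.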

Let’s look at how to create DataFrames and Datasets and how to execute Spark SQL. We have already seen how to create a SparkContext. Spark 2.0 introduced SparkSession, a simplified entry point for Spark applications that encapsulates the SparkContext. Earlier versions of Spark had separate contexts for different use cases, such as SQLContext, HiveContext, and SparkContext. These are now unified in SparkSession, so developers no longer have to decide which context to use.

Another benefit of SparkSession is that, unlike SparkContext, we can create multiple SparkSessions when needed, as shown below.

SparkSession Code
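A minimal sketch of creating a SparkSession and a second session from it might look like the following. The application name and the local master are assumptions; in the Spark shell a session named spark already exists, so the builder call is not needed there.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  // Create an additional session from an existing one.
  def second(spark: SparkSession): SparkSession = spark.newSession()

  def main(args: Array[String]): Unit = {
    // Entry point: builds a new session or returns the existing one.
    val spark = SparkSession.builder()
      .appName("SparkSessionExample") // assumed name
      .master("local[*]")             // assumed local run
      .getOrCreate()

    // The encapsulated SparkContext is still reachable when the RDD API is needed.
    val sc = spark.sparkContext

    // The new session has its own SQL configuration and temporary views
    // but shares the same underlying SparkContext.
    val other = second(spark)
    println(other.sparkContext eq sc)

    spark.stop()
  }
}
```

Separate sessions are useful when different parts of an application need isolated temporary views or SQL settings while sharing one set of cluster resources.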

Let’s look at how we can create a DataFrame and query the data. DataFrames can be created by loading data from external sources such as the local filesystem, HDFS, S3, an RDBMS, HBase, etc. They can also be created from existing DataFrames by applying transformations. For simplicity, we will create a DataFrame using the method below:

Spark code
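One simple way to do this, sketched here as an assumption about the approach, is to build the DataFrame from an in-memory list of tuples. The name listDF follows the text; the object name, the local master, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CreateDataFrameExample {
  // Build a DataFrame from a local list of tuples, without naming the columns.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._ // brings toDF() into scope for local collections
    Seq((1, "India"), (2, "USA"), (3, "Canada")).toDF() // hypothetical rows
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("CreateDataFrameExample").master("local[*]").getOrCreate()
    val listDF = build(spark)
    listDF.show()
    spark.stop()
  }
}
```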

We can collect() the RDD behind our DataFrame, but the raw rows do not show the data the way the DataFrame is meant to present it. So we call .show(), which gives us a nice tabular view of our data. We can also check the schema using .schema or .printSchema().

If we type listDF. and press Tab in the Spark shell, we can see all the available methods.

Spark code
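The inspection calls mentioned above can be sketched as follows, assuming a DataFrame named listDF built from hypothetical sample tuples. The helper columnTypes is illustrative and not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object InspectDataFrameExample {
  // Summarize the schema as (column name, type name) pairs.
  def columnTypes(df: DataFrame): Seq[(String, String)] =
    df.schema.fields.map(f => (f.name, f.dataType.simpleString)).toSeq

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("InspectDataFrameExample").master("local[*]").getOrCreate()
    import spark.implicits._
    val listDF = Seq((1, "India"), (2, "USA")).toDF() // hypothetical rows

    val rows: Array[Row] = listDF.collect() // brings every row to the driver
    rows.foreach(println)

    listDF.show()          // tabular view of the data
    listDF.printSchema()   // schema printed as a tree
    println(listDF.schema) // schema as a StructType value
    println(columnTypes(listDF))
    spark.stop()
  }
}
```

Note that collect() pulls the entire DataFrame into driver memory, so on large data show() is the safer way to take a quick look.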

Also, since we did not provide any column names for our DataFrame, we see _1 and _2 as the default names. We can replace them with proper names using the toDF() function.

Spark code
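A short sketch of the renaming step is shown here. The name listDF2 and the column Id follow the text; the second column name Name and the sample rows are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameColumnsExample {
  // Replace the default column names (_1, _2) with meaningful ones.
  def named(df: DataFrame): DataFrame = df.toDF("Id", "Name")

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("RenameColumnsExample").master("local[*]").getOrCreate()
    import spark.implicits._
    val listDF = Seq((1, "India"), (2, "USA")).toDF() // hypothetical rows
    val listDF2 = named(listDF)
    listDF2.printSchema()
    spark.stop()
  }
}
```

toDF() must be given exactly as many names as the DataFrame has columns; the names can also be passed directly when the DataFrame is first created from the list.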

We can query a DataFrame much as we would query a table using SQL, by using Spark's domain-specific language. The table below shows the equivalents.

SQL	Spark DSL
SELECT * FROM Country	listDF2.show()
SELECT Id FROM Country	listDF2.select("Id").show()
SELECT * FROM Country WHERE Id = 1	listDF2.select("*").where(col("Id") === 1).show()

Spark code
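Put together as a runnable sketch, the three DSL queries look like this. The object name, the helper byId, the column Name, and the sample rows are assumptions; the col function comes from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DslQueryExample {
  // DSL equivalent of: SELECT * FROM Country WHERE Id = <id>
  def byId(df: DataFrame, id: Int): DataFrame =
    df.select("*").where(col("Id") === id)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("DslQueryExample").master("local[*]").getOrCreate()
    import spark.implicits._
    val listDF2 = Seq((1, "India"), (2, "USA")).toDF("Id", "Name") // hypothetical rows

    listDF2.show()              // SELECT * FROM Country
    listDF2.select("Id").show() // SELECT Id FROM Country
    byId(listDF2, 1).show()     // SELECT * FROM Country WHERE Id = 1
    spark.stop()
  }
}
```

The === operator builds a column expression for Spark to evaluate, which is why the ordinary Scala == is not used here.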

A DataFrame can also be saved to a local filesystem, HDFS, S3, etc. Here we simply save it to the local filesystem.

Spark code
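One possible way to save the DataFrame, sketched under the assumption that CSV output is acceptable, uses the DataFrame writer. The temporary output directory, the object name, and the helper saveAndReload are hypothetical choices for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SaveDataFrameExample {
  // Write the DataFrame as CSV with a header row, then read it back to check the result.
  def saveAndReload(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode(SaveMode.Overwrite).option("header", "true").csv(path)
    spark.read.option("header", "true").csv(path)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("SaveDataFrameExample").master("local[*]").getOrCreate()
    import spark.implicits._
    val listDF2 = Seq((1, "India"), (2, "USA")).toDF("Id", "Name") // hypothetical rows

    // Hypothetical output location under a fresh temporary directory.
    val out = Files.createTempDirectory("country-demo").resolve("country_csv").toString
    saveAndReload(spark, listDF2, out).show()
    spark.stop()
  }
}
```

Spark writes a directory of part files rather than a single file. Without schema inference the reloaded columns come back as strings; a format such as Parquet (write.parquet) keeps the column types.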

Working with Spark SQL is very similar to working with DataFrames. The advantage is that we can use familiar ANSI SQL queries instead of the DSL, which is convenient and shortens the learning curve considerably. To do so, we only need to register our DataFrame as a temporary table or view, and then we can run SQL queries against it. Let’s see how to do it below:

Spark code
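A minimal sketch of registering a view and querying it with SQL follows. The view name Country matches the SQL examples in the text; the object name, the helper queryById, and the sample rows are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TempViewExample {
  // Register the DataFrame as a temporary view and query it with plain SQL.
  def queryById(spark: SparkSession, df: DataFrame, id: Int): DataFrame = {
    df.createOrReplaceTempView("Country")
    spark.sql(s"SELECT * FROM Country WHERE Id = $id")
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("TempViewExample").master("local[*]").getOrCreate()
    import spark.implicits._
    val listDF2 = Seq((1, "India"), (2, "USA")).toDF("Id", "Name") // hypothetical rows

    queryById(spark, listDF2, 1).show()
    spark.sql("SELECT Id FROM Country").show() // the view stays registered for this session
    spark.stop()
  }
}
```

A temporary view is scoped to the SparkSession that created it and disappears when that session ends. The result of spark.sql is itself a DataFrame, so SQL and DSL calls can be mixed freely.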

Let’s move on to Datasets. Datasets are very similar to DataFrames, with the distinction that a Dataset is a strongly typed collection of objects and is therefore type safe. This lets us catch some errors at compile time, which is not possible with DataFrames because their column names and types are only checked at run time. Otherwise, using Datasets is much like using DataFrames, including the processing and querying we have seen earlier.

Creating Dataset

Spark code
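As a sketch of one common approach, a Dataset can be created from a local collection of case class instances. The case class Country, its field names, and the sample rows are assumptions made for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object CreateDatasetExample {
  // Defined outside any method so that Spark can derive an encoder for it.
  case class Country(id: Int, name: String)

  // Build a typed Dataset from a local collection of case class instances.
  def build(spark: SparkSession): Dataset[Country] = {
    import spark.implicits._ // brings toDS() and the encoders into scope
    Seq(Country(1, "India"), Country(2, "USA")).toDS() // hypothetical rows
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("CreateDatasetExample").master("local[*]").getOrCreate()
    val ds = build(spark)

    // Typed filter: a misspelled field name here fails at compile time.
    ds.filter(c => c.id == 1).show()
    spark.stop()
  }
}
```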

Creating Dataset from DataFrames

Spark code
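An existing DataFrame can be turned into a Dataset with as[T], provided its column names and types line up with the target class. The sketch below assumes a hypothetical case class Country whose fields match the DataFrame columns.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DataFrameToDatasetExample {
  // Field names must match the DataFrame's column names.
  case class Country(id: Int, name: String)

  // Attach a type to an untyped DataFrame.
  def toTyped(spark: SparkSession, df: DataFrame): Dataset[Country] = {
    import spark.implicits._ // provides the encoder for Country
    df.as[Country]
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("DataFrameToDatasetExample").master("local[*]").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "India"), (2, "USA")).toDF("id", "name") // hypothetical rows

    val ds = toTyped(spark, df)
    ds.map(c => c.name.toUpperCase).show() // fields are now accessed as typed members
    spark.stop()
  }
}
```

If a column is missing or has an incompatible type, as[T] fails when the query is analyzed rather than at compile time, so the DataFrame schema should be checked first.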

Creating Dataset from RDD

Spark code
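Similarly, an RDD of case class instances can be converted with toDS(). This is a sketch under the assumption that the RDD elements are instances of a hypothetical case class Country; the sample rows are illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object RddToDatasetExample {
  case class Country(id: Int, name: String)

  // Convert an RDD of case class instances into a typed Dataset.
  def fromRdd(spark: SparkSession, rdd: RDD[Country]): Dataset[Country] = {
    import spark.implicits._ // brings toDS() for RDDs into scope
    rdd.toDS()               // spark.createDataset(rdd) does the same thing
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a stand-alone run.
    val spark = SparkSession.builder().appName("RddToDatasetExample").master("local[*]").getOrCreate()

    // Hypothetical sample data distributed as an RDD.
    val rdd = spark.sparkContext.parallelize(Seq(Country(1, "India"), Country(2, "USA")))
    fromRdd(spark, rdd).show()
    spark.stop()
  }
}
```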

A Tale of Three APIs: RDDs, DataFrames, and Datasets, Spark SQL

Now that we have covered all three APIs that Spark provides (RDD, DataFrame, and Dataset), we should understand when to use each one and how they compare in performance.

RDD: We should use RDDs in the following use cases:

When we are working with unstructured data like media streams, texts, logs, etc.
When we want low-level transformations and actions and fine control over our data
When we do not care about the schema of the data and don’t want to represent it in a columnar fashion
When we want to use functional programming constructs and domain-specific expressions
When we can give up some of the performance optimizations and memory efficiency that the higher-level APIs provide

DataFrame: We should use DataFrames in the following use cases:

When we are dealing with structured and semi-structured data
When performance and memory usage are key to our application
When we want to represent our data in tabular format and think in terms of SQL-like processing and querying

We should remember that DataFrames and Datasets have been a unified API since Spark 2.0, so most functionality is now available in both.

Dataset: We should go with Dataset for the following reasons:

When we want to work with semi-structured and structured data
When type safety is important in our application
When we want to catch type errors at the development stage with compile-time errors
When we want to have a tabular view of our data with type information
When performance and memory are of utmost importance

We can summarize the three APIs on the basis of performance as follows:

performance

Conclusion

In this section we looked at the higher-level APIs and at when and how to use each of them. We also compared their performance, which gives a clearer picture of where each one fits.
