Even though Spark is a general-purpose computing engine, it also handles streaming data; Spark's streaming support can be used for near-real-time processing.
Structured Streaming supports untyped, SQL-like operations such as select, where, and groupBy, as well as typed, RDD-like operations such as map, filter, and flatMap.
Most of the common operations on DataFrames/Datasets can be applied to Structured Streaming; only a few operations are not supported, which we will see later.
Let us see some examples:
case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)

val df: DataFrame = ... // streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10") // using untyped APIs
ds.filter(_.signal > 10).map(_.device)   // using typed APIs

// Running count of the number of updates for each device type
df.groupBy("deviceType").count() // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.deviceType).agg(typed.avg(_.signal)) // using typed API
We can also register the streaming DataFrame as a temporary view or table, the same way we do for regular DataFrames.
df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates") // returns another streaming DF
We can also check whether a DataFrame is streaming using the following:
df.isStreaming
Aggregating data over a sliding event-time window is easy with Structured Streaming, and it is very similar to grouped aggregation. In a grouped aggregation, aggregate values (e.g., counts) are maintained for each unique value in the user-specified grouping column. In window-based aggregations, aggregate values are maintained for each window that a row's event time falls into.
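As a minimal sketch, assume a streaming DataFrame named words (a hypothetical name) with an event-time column timestamp and a word column; counting words in 10-minute windows that slide every 5 minutes could then be expressed as:

import org.apache.spark.sql.functions.window
// assumes `import spark.implicits._` is in scope for the $-column syntax

// Group by both the sliding event-time window and the word, then count
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()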
From a logical point of view, the key idea in event time is to treat application-specified timestamps as an arbitrary field in the data, allowing records to arrive out-of-order. We can then use standard operators and incremental processing to update results grouped by event time. In practice, however, it is useful for the processing system to have some loose bounds on how late data can arrive, for two reasons:
- Allowing arbitrarily late data might require storing arbitrarily large states. For example, if we count data by a 1-minute event time window, the system needs to remember a count for every 1-minute window since the application began, because a late record might still arrive for any particular minute. This can quickly lead to large amounts of state, especially if combined with another grouping key. The same issue happens with joins.
- Some sinks do not support data retraction, making it useful to be able to write out the results for a given event time after a timeout. For example, a custom downstream application may want to start working with a "final" result and might not support retractions. Append-mode sinks also do not support retractions.
Structured Streaming lets developers set a watermark for event-time columns using the withWatermark operator. This operator gives the system a delay threshold tC for a given timestamp column C. At any point in time, the watermark for C is max(C) − tC, that is, tC seconds before the maximum event time seen so far in C. Note that this choice of watermark is naturally robust to backlogged data: if the system cannot keep up with the input rate for a period of time, the watermark will not move forward arbitrarily during that time, and all events that arrived within at most tC seconds of being produced will still be processed. When present, watermarks affect when stateful operators can forget old state (e.g., if grouping by a window derived from a watermarked column) and when Structured Streaming will output data with an event-time key to append-mode sinks. Different input streams can have different watermarks.
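Continuing the hypothetical words example above, a watermark can be attached to the event-time column before the windowed aggregation; the 10-minute threshold is an illustrative assumption:

// Allow events up to 10 minutes late; older window state can then be dropped
val lateTolerantCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()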

Since Spark 2.0, Structured Streaming has supported joins (inner joins and some types of outer joins) between a streaming DataFrame/Dataset and a static one.
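As a minimal sketch, assume events is a streaming DataFrame and deviceInfo is a static DataFrame, both with a deviceId column (all names are hypothetical):

// Enrich the stream with static reference data
val enriched = events.join(deviceInfo, "deviceId")                         // inner equi-join with a static DF
val enrichedOuter = events.join(deviceInfo, Seq("deviceId"), "left_outer") // left outer join with a static DF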
Spark 2.3 added support for stream-stream joins, that is, joining two streaming Datasets/DataFrames. The challenge of generating join results between two data streams is that, at any point in time, the view of the dataset is incomplete for both sides of the join, making it much harder to find matches between inputs. Any row received from one input stream can match any future, yet-to-be-received row from the other input stream.
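A sketch of such a join, loosely following the ad-monetization example in the Spark documentation; impressions and clicks are assumed to be streaming DataFrames with the columns used below:

import org.apache.spark.sql.functions.expr

// Bound how late each side can be so that old join state can eventually be dropped
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// Match each click to an impression with the same ad id within the following hour
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """)
)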
Many use cases require more advanced stateful operations than aggregations. For example, you may have to track sessions from streams of events. For such sessionization, you have to save arbitrary types of data as state and perform arbitrary operations on that state using the events that arrive in every trigger. Since Spark 2.2, this can be done using the mapGroupsWithState operation and the more powerful flatMapGroupsWithState operation.
Both operations allow user-defined code to be applied to grouped Datasets to update user-defined state.
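As a minimal sketch, assume a streaming Dataset[Event] named events, where Event and UserCount are hypothetical case classes; a running per-user event count kept in arbitrary state might look like this:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
// assumes `import spark.implicits._` is in scope for the required encoders

case class Event(user: String, action: String)
case class UserCount(user: String, count: Long)

// Combine the previous count stored in state with the events received in this trigger
def updateCount(user: String, newEvents: Iterator[Event], state: GroupState[Long]): UserCount = {
  val updated = state.getOption.getOrElse(0L) + newEvents.size
  state.update(updated)
  UserCount(user, updated)
}

val runningCounts = events
  .groupByKey(_.user)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateCount)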
A continuous processing mode, added in Apache Spark 2.3, executes Structured Streaming jobs using long-lived operators, as in traditional streaming systems such as Telegraph and Borealis. This mode enables lower latency at the cost of less operational flexibility (e.g., limited support for rescaling the job at runtime).
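A minimal sketch of opting into this mode; the source, the console sink, and the 1-second checkpoint interval are illustrative assumptions:

import org.apache.spark.sql.streaming.Trigger

// Request continuous execution with an epoch (checkpoint) interval of 1 second
val query = events
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()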
The key enabler for this execution mode was choosing a declarative API for Structured Streaming that is not tied to the execution strategy. For example, the original Spark Streaming API had some operators based on processing time that leaked the concept of micro-batches into the programming model, making it hard to move programs to another type of engine.
In contrast, Structured Streaming’s API and semantics are independent of the execution engine: continuous execution is similar to having a much larger number of triggers. Note that unlike systems based purely on unsynchronized message passing, such as Storm, we do retain the concept of triggers and epochs in this mode so the output from multiple nodes can be coordinated and committed together to the sink.
We have seen how streaming works in Spark, which makes it a very suitable choice for both batch and streaming data processing. It is also a very good choice for implementing a Lambda architecture.