A new breed of ‘Fast Data’ architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage. - Dean Wampler (Renowned author of many big data technology-related books)
Dean Wampler makes an important point in one of his webinars: the demand for stream processing is increasing every day. The main reason is that processing large volumes of data is no longer sufficient; data must also be processed quickly, so that insights can be derived in real time and organizations can react to changing business conditions as they happen.
Hence, there is a need to understand the concept of stream processing and the technology behind it.
Think of streaming as an unbounded, continuous, real-time flow of records; processing those records in a similarly continuous fashion is stream processing.
AWS (Amazon Web Services) defines streaming data as data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes). This data needs to be processed sequentially and incrementally, on a record-by-record basis or over sliding time windows, and is used for a wide variety of analytics including correlations, aggregations, filtering, and sampling.
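To make this definition concrete, here is a minimal, framework-free Java sketch of processing an unbounded stream record by record while maintaining a sliding-window aggregate (the one-minute window and the record shape are assumptions for illustration only):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: records arrive one at a time and are processed incrementally,
// while a sliding time window keeps a running aggregate over the most recent data.
public class SlidingWindowAverage {
    private static final long WINDOW_MS = 60_000;            // assumed 1-minute sliding window
    private final Deque<long[]> window = new ArrayDeque<>(); // [timestampMs, value] pairs

    // Called once per incoming record, in arrival order.
    public double onRecord(long timestampMs, long value) {
        window.addLast(new long[] { timestampMs, value });
        // Evict records that have fallen out of the sliding window.
        while (!window.isEmpty() && window.peekFirst()[0] < timestampMs - WINDOW_MS) {
            window.removeFirst();
        }
        // Recompute the aggregate over the records still inside the window.
        return window.stream().mapToLong(r -> r[1]).average().orElse(0.0);
    }
}
```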
In the stream processing method, computation happens continuously as the data flows through the system.
Stream processing is highly beneficial when the events you wish to track happen frequently and close together in time. It is also the best fit when an event needs to be detected and responded to immediately.
There is a subtle difference between stream processing, real-time processing (near real-time), and complex event processing (CEP). Let's quickly look at some examples to understand the difference.
Multiple tools are available to accomplish the above-mentioned stream, real-time, or complex event processing: Spark Streaming, Kafka Streams, Flink, Storm, Akka, and Structured Streaming, to name a few.
In this article, we will look at Spark Streaming and Kafka Streams in depth, as these two have historically occupied a significant share of the market.
Kafka is essentially a high-performance message broker through which all of your data can flow before being redistributed to applications. In other words, Kafka works as a data pipeline.
Typically, Kafka Streams supports per-second stream processing with millisecond latency.
Kafka Streams is a client library for processing and analyzing data stored in Kafka. It lets you define the processing in two ways: through the high-level Streams DSL or through the lower-level Processor API.
It also does not do micro-batching; records are processed one at a time, which makes it "true streaming."
Kafka Streams is built upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple (yet efficient) management of application state. It is based on many concepts already contained in Kafka, such as scaling by partitioning.
Also, for this reason, it comes as a lightweight library that can be integrated into an application.
The application can then be packaged, deployed, and operated however you prefer, for example as a standalone Java process or inside a container.
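As a quick, hedged illustration (the topic names, application ID, and broker address below are placeholders, not from the original article), the following minimal Kafka Streams word-count sketch shows the library embedded in an ordinary Java application: it reads records one at a time from an input topic, maintains counts as application state, and writes the results back to Kafka.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        // Basic configuration; the application ID and broker address are placeholders.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("input-topic");

        // Process each record as it arrives: split into words, group by word, and count.
        lines.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
             .groupBy((key, word) -> word)
             .count()
             .toStream()
             .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        // The topology runs inside this ordinary Java process; no separate cluster is needed.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Starting more instances of this same process with the same application ID is how the workload scales out across Kafka partitions, which matches the "no separate processing cluster" point in the comparison table further below.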
Spark Streaming receives live input data streams, collects the data for a short interval, and divides it into micro-batches (built as RDDs), which are then processed by the Spark engine to generate the final stream of results, also in micro-batches. The following data flow diagram illustrates how Spark Streaming works.
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs. Think of an RDD as the underlying abstraction for distributing data over a cluster of computers.
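To make the DStream/RDD relationship concrete, here is a minimal Spark Streaming word-count sketch in Java (the socket source, host, port, and 5-second batch interval are illustrative assumptions, not from the original article):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class NetworkWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Local two-thread master and a 5-second micro-batch interval (illustrative values).
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Each 5-second batch of text lines becomes one RDD inside this DStream.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(word -> new Tuple2<>(word, 1))
                     .reduceByKey(Integer::sum);

        counts.print();          // emit the per-batch word counts
        jssc.start();            // start receiving and processing data
        jssc.awaitTermination(); // block until the streaming job is stopped
    }
}
```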
This makes it very easy for developers to use a single framework to satisfy all their processing needs. They can use MLlib (Spark's machine learning library) to train models offline and use those models directly to score live data in Spark Streaming. In fact, some models perform continuous, online learning and scoring.
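As a hedged sketch of that offline-train/online-score pattern, continuing from the previous example (the model type, save path, and comma-separated input format are assumptions for illustration), a previously saved MLlib model can be loaded once and then applied inside the DStream transformations:

```java
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.streaming.api.java.JavaDStream;

// Assumes 'jssc' and 'lines' come from the previous Spark Streaming sketch.
// Parse each incoming line of comma-separated numbers into an MLlib feature vector.
JavaDStream<Vector> features = lines.map(line -> {
    String[] parts = line.split(",");
    double[] values = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
        values[i] = Double.parseDouble(parts[i]);
    }
    return Vectors.dense(values);
});

// Load a model that was trained offline with MLlib (the path is hypothetical).
LogisticRegressionModel model = LogisticRegressionModel.load(
        jssc.sparkContext().sc(), "hdfs:///models/offline-trained-lr");

// Score each feature vector as it flows through the micro-batches.
JavaDStream<Double> predictions = features.map(vector -> model.predict(vector));
predictions.print();
```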
Not all real-life use cases need data to be processed in true real time; a delay of a few seconds is often acceptable in exchange for a unified framework like Spark Streaming that can also handle large volumes of data. It provides a broad range of capabilities by integrating with the other Spark tools to perform a variety of data processing.
Now that we understand at a high level what these tools are, it is natural to be curious about the differences between them. The following table briefly explains the key differences between the two.
| Sr. No | Spark Streaming | Kafka Streams |
|--------|-----------------|---------------|
| 1 | Data received from live input streams is divided into micro-batches for processing. | Processes each record as it arrives (true real-time). |
| 2 | A separate processing cluster is required. | No separate processing cluster is required. |
| 3 | Needs reconfiguration for scaling. | Scales easily by just adding Java processes; no reconfiguration is required. |
| 4 | At-least-once semantics. | Exactly-once semantics. |
| 5 | Better at processing groups of rows (groupBy, ML, window functions, etc.). | Provides true record-at-a-time processing; better for per-record work such as row parsing and data cleansing. |
| 6 | A standalone framework. | Can be used as part of a microservice, as it is just a library. |
The following are a couple of the many industry use cases where Kafka Streams is being used:
Broadly, Kafka Streams is suitable for microservice integration use cases and offers greater flexibility.
The following are a couple of the many industry use cases where Spark Streaming is being used:
Broadly, Spark Streaming is suitable for requirements that involve batch processing of massive datasets and bulk processing, and for use cases that go beyond just data streaming.
Dean Wampler beautifully explains the factors to evaluate when choosing a tool for a given use case, as summarized below:
| Sr. No | Evaluation Characteristic | Requirement | Typical Use Case Requirement |
|--------|---------------------------|-------------|------------------------------|
| 1 | Latency tolerance | Pico- to microseconds (real real-time) | Flight control systems for space programs, etc. |
| 1 | Latency tolerance | < 100 microseconds | Regular stock market trading transactions; medical diagnostic equipment output |
| 1 | Latency tolerance | < 10 milliseconds | Credit card verification window when a consumer buys online |
| 1 | Latency tolerance | < 100 milliseconds | Dashboards requiring human attention; machine learning models |
| 1 | Latency tolerance | < 1 second to minutes | Machine learning model training |
| 1 | Latency tolerance | 1 minute and above | Periodic short jobs (typical ETL applications) |
| 2 | Velocity (transaction/event frequency) | 10K–100K events per second | Websites |
| 2 | Velocity (transaction/event frequency) | > 1M events per second | Nest thermostats; big spikes during specific time periods |
| 3 | Types of data processing | Data processing requirement | Bulk data processing; individual event/transaction processing; training and/or serving machine learning models |
| 4 | Use of tool | Flexibility of implementation | Kafka Streams: flexible, as it is provided as a library. Spark Streaming: less flexible, as it is part of a distributed framework. |
Kafka Streams is still best used in a ‘Kafka -> Kafka’ context, while Spark Streaming could be used for a ‘Kafka -> Database’ or ‘Kafka -> Data science model’ type of context.
That said, when these two technologies are connected, they bring together complete data collection and processing capabilities; this combination is widely used in commercial use cases and occupies a significant share of the market.