Stream Processing

TL;DR

Stream Processing is a big data paradigm.
In Batch Processing,
- We need to have all data stored ahead of time.
- We process data in batches.
- We aggregate the results across all batches at the end.
- Batch processing tries to process all the data at once.
In Stream Processing,
- Data come as a never-ending continuous stream of events.
- Stream processing naturally fits with time series data.
- Data are processed in real-time and we can respond to the events faster.
- Stream processing distributes the processing over time.
You can use stream processing when:
- When the data is huge and cannot be stored, stream processing is the only solution.
- Stream processing works when processing can be done with a single pass over the data or has temporal locality.
- Stream processing fits into use-cases where approximate results are sufficient.
You should not use stream processing when:
- Processing needs multiple passes through full data.
- Processing needs to have random access to data.
- Examples: training machine learning models, etc.
When doing stream processing using a message broker:
- We use a message broker system (Kafka, NATS, RabbitMQ, etc.).
- We create applications and write code to receive messages, do some calculations, and publish back results (actors).
When using a stream processing framework (Flink, Kafka Streams, etc.):
- We only write the logic for actors.
- We connect the actors and data streams.
A stream processor will take care of the hard work (collecting data, running actors in the right order, collecting results, scaling, and so on).
Streaming SQL allows users to write SQL-like statements to query streaming data.
A window is a working memory on top of a stream. The most common types of streaming windows:
- Sliding Length Window Keeps last N events and triggers for each new event.
- Batch Length Window Keeps last N events and triggers once for every N event.
- Sliding Time Window Keeps events triggered at last N time units and triggers for each new event.
- Batch Time Window Keeps events triggered at last N time units and triggers once for the time period in the end.
Event Sourcing is an architectural pattern built on top of stream processing.
- Changes to an application state are stored as a sequence of events.
- These events can be replayed and queried to reconstruct the state of the application at any time.

TL;DR

Read More