TL;DR

  • Stream Processing is a big data paradigm.
  • In Batch Processing,
    • We need to have all data stored ahead of time.
    • We process data in batches.
    • We aggregate the results across all batches at the end.
    • Batch processing tries to process all the data at once.
  • In Stream Processing,
    • Data come as a never-ending continuous stream of events.
    • Stream processing naturally fits with time series data.
    • Data are processed in real-time and we can respond to the events faster.
    • Stream processing distributes the processing over time.
  • You can use stream processing when:
    • When the data is huge and cannot be stored, stream processing is the only solution.
    • Stream processing works when processing can be done with a single pass over the data or has temporal locality.
    • Stream processing fits into use-cases where approximate results are sufficient.
  • You should not use stream processing when:
    • Processing needs multiple passes through full data.
    • Processing needs to have random access to data.
    • Examples: training machine learning models, etc.
  • When doing stream processing using a message broker:
    • We use a message broker system (Kafka, NATS, RabbitMQ, etc.).
    • We create applications and write code to receive messages, do some calculations, and publish back results (actors).
  • When using a stream processing framework (Flink, Kafka Streams, etc.):
    • We only write the logic for actors.
    • We connect the actors and data streams.
  • A stream processor will take care of the hard work (collecting data, running actors in the right order, collecting results, scaling, and so on).
  • Streaming SQL allows users to write SQL-like statements to query streaming data.
  • A window is a working memory on top of a stream. The most common types of streaming windows:
    • Sliding Length Window Keeps last N events and triggers for each new event.
    • Batch Length Window Keeps last N events and triggers once for every N event.
    • Sliding Time Window Keeps events triggered at last N time units and triggers for each new event.
    • Batch Time Window Keeps events triggered at last N time units and triggers once for the time period in the end.
  • Event Sourcing is an architectural pattern built on top of stream processing.
    • Changes to an application state are stored as a sequence of events.
    • These events can be replayed and queried to reconstruct the state of the application at any time.

Read More