What Time Is It? Understanding the Complexity of Data Streaming Tools

I argue that static documentation is insufficient to reason about the stateful operations of data streaming tools.

A representation of the complexity in data streaming tools

In a computer program, when a value changes over time we call it state. This is why it is useful to have two different words: choosing “state” instead of “value” signals to the reader that we are deliberately composing two things, namely value and time.

Each of those concepts is simpler to reason about on its own, but put together they demand much more care. When you hear people say “state is inherently complex”, this is what they mean. It is especially relevant when learning about data streaming tools, because they must handle state in many areas, and at scale: windowed aggregations, joins and other stateful operations, not to mention horizontal scaling, memory management through watermarks and checkpoints, fault tolerance and more.
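To make the composition of value and time concrete, here is a minimal sketch in plain Python (no streaming framework; the function name and event shape are my own invention) of a tumbling-window word count. Note that even this toy version quietly assumes all events have arrived before we count, which is exactly the assumption that watermarks exist to manage in a real stream:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group (timestamp, word) events into tumbling windows of
    `window_size` time units and count words per window.
    The state here is the per-window counts: a value that only
    makes sense together with the time dimension partitioning it."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, word in events:
        window_start = (ts // window_size) * window_size
        windows[window_start][word] += 1
    # In a real stream we could never "finish" this loop: deciding
    # when a window is complete (late/out-of-order events) is where
    # watermarks and the real complexity come in.
    return {w: dict(c) for w, c in sorted(windows.items())}

events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "b")]
print(tumbling_window_counts(events, 10))
# {0: {'a': 2, 'b': 1}, 10: {'a': 1, 'b': 1}}
```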

So how best to understand the state management of these tools? Let’s take a look at what some of the most popular options provide to educate and inform users in this regard:

| | Flink | Kafka Streams | Spark Structured Streaming | Storm |
| --- | --- | --- | --- | --- |
| Word count of reference docs | 20,042 | 45,492 | 19,908 | 28,682 |
| How users are informed about stateful operations | t, d, u, s ¹ | t, d, u | t, d, u | t, d, u |
| Execution plan checked against documented capabilities before running it? | Yes ² | No ³ | Yes ⁴ | Yes ⁵ |
| If a simulator is available, where does it run? | N/A | N/A | N/A | N/A |

Key: t = written text, d = diagrams & charts, a = animations, u = unit test facility, s = simulator. Simulator location: 1 = local, 2 = browser + server / cloud, 3 = browser only.

Based on the above, we can see that all four tools rely overwhelmingly on static written text and diagrams, weighing in at roughly 20,000–45,000 words of reference documentation each, and that none of them provides a simulator for its stateful operations.

With that in mind, consider again that stateful data streaming problems necessarily involve time, and as such they are a fundamentally dynamic concern. Written documentation, by contrast, is static, and for that reason I submit it is an inefficient and inadequate medium for the purpose.

Indeed, I was affected by this issue personally when I ran into trouble with one of these tools. I still don't know whether the problem I encountered was due to my misunderstanding of the (20,000-word) documentation or to a bug.


So what’s the solution? Well, consider that we learn best by a combination of reading and doing rather than by reading alone. The “doing” is something that happens in real time, and one way to achieve this is by simulation.

In our case, I will define a simulation thus:

A means to observe the effects of a stateful operation such that, given the same input as its production equivalent, the simulation produces the same output.

Given that the stated purpose of a simulator here is to educate and inform, I will add to the definition that it must require zero installation or setup. Further, to reduce cost and complexity, it should require minimal resources and ideally run fully serverless or standalone. Finally, since the goal is to represent stateful operations, it should be able to render them visually and dynamically, for example with animated forms.
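The contract in that definition can be sketched in a few lines. The names below (`initial_state`, `step`, `RunningSum`) are hypothetical, chosen only to illustrate the idea: any streaming operator expressible as a fold over its input fits this shape, and the simulator differs from production only in that it exposes every intermediate state (the thing a visual simulator would animate) rather than just the final one:

```python
class RunningSum:
    """A toy stateful operation: the running sum of its inputs.
    Stands in for any operator expressible as a fold over a stream."""
    def initial_state(self):
        return 0
    def step(self, state, item):
        return state + item

def simulate(operation, inputs):
    """Replay `inputs` through `operation`, yielding every
    intermediate state so its evolution can be observed."""
    state = operation.initial_state()
    for item in inputs:
        state = operation.step(state, item)
        yield state

def production_run(operation, inputs):
    """The 'production equivalent': same logic, final result only."""
    state = operation.initial_state()
    for item in inputs:
        state = operation.step(state, item)
    return state

inputs = [1, 2, 3, 4]
states = list(simulate(RunningSum(), inputs))
print(states)  # [1, 3, 6, 10]
# The definition's requirement: same input, same output.
assert states[-1] == production_run(RunningSum(), inputs)
```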

I think there’s an opportunity for these tools (or new ones) to provide visual simulators as the primary means of reasoning about their stateful operations, and as a complement to their existing documentation.

So with that out of the way, if we could build such a simulator what would it look like, how would it work and how could it be built? Here’s a motivational blueprint!

  1. Make the simulator available in a web browser.
  2. Write the core functions for the streaming solution and its stateful operations in a hosted language that compiles to code that can run in a browser. Then the same code can be used for both a production implementation and the browser-based simulation.
  3. Represent unbounded inputs using generators over lazy sequences.
  4. Define the execution plan specification and plan validation rules as data. Then, both the code that checks plans against the rules and any written reference guide can parse this same data, avoiding the possibility of inconsistencies.
  5. Within the simulator, represent stateful operations as declarative example-based or property-based BDD style given-when-then constructs, with an animated accumulation of results over time.
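And to illustrate points 3 and 5 together, here is a sketch of an unbounded input as a generator over a lazy sequence, driven through a given-when-then construct. The stream contents, the `running_max` operation and the helper names are all invented for the example; a visual simulator would animate the accumulating states that the `when` step yields, where here we merely assert on them:

```python
import itertools

def sensor_readings():
    """An unbounded input: a generator over a lazy sequence."""
    for i in itertools.count():
        yield i % 5

def running_max(inputs):
    """The stateful operation under test: yields the max seen so far."""
    current = None
    for item in inputs:
        current = item if current is None else max(current, item)
        yield current

def given_when_then(given, when, then):
    """Declarative, example-based check of a stateful operation.
    Returns the accumulated states so they could be animated."""
    states = list(when(given))
    assert then(states), f"expectation failed for states {states}"
    return states

# GIVEN the first ten readings of an unbounded stream,
# WHEN they flow through running_max,
# THEN the accumulated states are non-decreasing.
states = given_when_then(
    given=itertools.islice(sensor_readings(), 10),
    when=running_max,
    then=lambda s: all(a <= b for a, b in zip(s, s[1:])),
)
print(states)  # [0, 1, 2, 3, 4, 4, 4, 4, 4, 4]
```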

In conclusion, I hope you can appreciate the benefits that simulations would bring in this space, and I also hope to have suitably motivated other people in the community to take the baton!


Credits.

Igor Garcia for your feedback and advice: thank you 🙏


  1. Flink provides an operations playground but it doesn’t specifically cover stateful operations. There’s also a worked example based on fraud detection, but the explanations are 100% written.
  2. Flink performs semantic checks for jobs defined using the Table API and SQL. However, the pre-execution validation is not exhaustive, and certain subtle errors or issues might only manifest as unexpected behavior or silent failures during runtime.
  3. Kafka Streams doesn’t have a distinct pre-execution validation phase in the traditional sense. Instead it relies on a combination of static type checking and the TopologyTestDriver as part of a unit testing strategy.
  4. Spark Structured Streaming has a multi-layered validation process to ensure correctness and feasibility of computations before their execution. The UnsupportedOperationChecker enforces streaming-specific rules during the logical planning stage.
  5. Apache Storm's pre-execution topology checks focus on structural and configuration validity, and exceptions are thrown to indicate structural problems, configuration errors and authorization failures. Although it provides facilities for programmatically defining and inspecting topology structure and configuration, a dedicated validation API against documented capabilities is absent.

Published: 2025-05-28

Tagged: spark streaming storm kafka data flink
