2025-06-16, Frannz Salon
Stream processing systems promise fresh results, strong consistency, and S3-based cost savings. But pitfalls exist:
- Backfilling takes too long due to incremental state maintenance.
- Consistency causes system stalls during joins.
- S3 costs spike with cache misses.
This talk explores these issues, mitigations, and hard truths.
Stream processing systems seem magical: they deliver much fresher results compared to batch processing, promise the highest levels of consistency, and leverage S3 to reduce state storage costs.
But is it too good to be true? In the world of data systems, there’s no such thing as a free lunch. Every benefit comes with trade-offs.
Here are three pitfalls that new stream processing practitioners often overlook:
- Backfilling Takes Forever
  Stream processing systems continuously maintain internal states to enable incremental computation. However, this comes at a cost: bootstrapping a streaming job (or creating a materialized view in the database context) can take an arbitrarily long time. The larger the historical data or the more complex the processing, the worse this problem becomes.
- Consistency Isn't Free
  Many stream processing systems claim to offer "strong" consistency, even across multiple streaming jobs. However, this level of consistency has a price: system stalls during events like join amplifications or dependency mismatches. These bottlenecks can significantly impact real-time performance and overall system reliability.
- S3 is Cost-Effective, Until It’s Not
  Modern stream processing systems often use S3 as the primary storage for maintaining states, as it promises lower costs compared to in-memory or on-disk alternatives. But here’s the catch: S3 access costs can skyrocket when cache misses are too frequent. What starts as a cost-saving measure can quickly turn into a major expense.
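To make the S3 point concrete, here is a rough back-of-the-envelope sketch of how request charges scale with the cache miss rate. Everything in it is an assumption for illustration, not a figure from the talk: the function name `monthly_get_cost`, the workload of 100,000 state reads per second, the rule that every cache miss becomes exactly one S3 GET, and the list price of $0.0004 per 1,000 GET requests (pricing varies by region and storage class and may change).

```python
def monthly_get_cost(reads_per_sec: float, miss_rate: float,
                     price_per_1k_gets: float = 0.0004) -> float:
    """Estimate monthly S3 GET charges (USD) for state reads that miss the cache.

    Assumptions (hypothetical): each cache miss triggers exactly one S3 GET,
    a month has 30 days, and only request charges are counted.
    """
    seconds_per_month = 30 * 24 * 3600
    s3_gets = reads_per_sec * miss_rate * seconds_per_month
    return s3_gets / 1000 * price_per_1k_gets


if __name__ == "__main__":
    # Same workload, three different cache miss rates.
    for miss_rate in (0.001, 0.01, 0.1):
        cost = monthly_get_cost(reads_per_sec=100_000, miss_rate=miss_rate)
        print(f"miss rate {miss_rate:.1%}: ~${cost:,.2f}/month in GET requests")
```

Under these assumptions the request bill grows linearly with the miss rate, so a cache that degrades from a 0.1% to a 10% miss rate multiplies this line item by 100 without any change in the workload itself.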
In this talk, I’ll dive deep into these three pitfalls, explaining their causes, possible mitigations, and the hard truths about unsolvable challenges. I’ll share real-world examples of how these issues manifest and the “bloody facts” of how they can bite even the most experienced practitioners.
Stream, Store, Scale
Level: Intermediate
Yingjun Wu is the founder of RisingWave Labs (https://www.risingwave.com/), a database company developing RisingWave, a distributed SQL database for stream processing. Before founding the company, Yingjun was a software engineer on the Redshift team at Amazon Web Services and a researcher in the Database group at IBM Almaden Research Center. He received his PhD from the National University of Singapore and was a visiting PhD student at Carnegie Mellon University. He has been working in the field of stream processing and database systems for over a decade.