Berlin Buzzwords 2024

Streaming doesn’t have to be hard
2024-06-11, Frannz Salon

Nearly half of data scientists find it difficult to adopt streaming technologies. The audience will learn about the technological barrier and the operational burden that inspired us to look for better solutions, and how we built a stream-batch unified Python dataframe API.


Machine learning is moving towards real-time. In online prediction, real-time observability, and continual learning, streaming technologies play an indispensable role. Hence, data scientists are increasingly exposed to streaming technologies. However, our surveys show that close to half of data scientists shy away from streaming technologies because “streaming is hard, both technically and operationally”.

Unlike batch systems, which operate on bounded data, streaming systems are designed with unbounded data in mind and operate with unique concepts such as “event time”, “processing time”, “watermark”, etc. Furthermore, making batch and streaming systems work together is a difficult challenge.
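To make these concepts concrete, here is a minimal sketch in plain Python of event-time tumbling windows closed by a watermark. It is not tied to any particular engine: the function name `tumbling_window_counts`, the `allowed_lateness` parameter, and the simple watermark rule (largest event time seen so far minus the allowed lateness) are all assumptions made for illustration. The order of the input list stands in for processing time, while the first element of each tuple is the event time.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size, allowed_lateness):
    """Count events per event-time window (hypothetical helper, illustration only).

    events: iterable of (event_time, payload) in arrival order (processing time).
    A window [start, start + window_size) is emitted once the watermark reaches
    its end; events arriving for an already-closed window are dropped as late.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, payload in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            # The window for this event has already been closed.
            dropped.append((event_time, payload))
        else:
            open_windows[start] += 1

        # Assumed watermark rule: max event time seen minus allowed lateness.
        watermark = max(watermark, event_time - allowed_lateness)

        # Emit every open window whose end is at or before the watermark.
        for closed in sorted(w for w in open_windows if w + window_size <= watermark):
            emitted.append((closed, open_windows.pop(closed)))

    return emitted, dropped, dict(open_windows)


# Events arrive out of order: event time 3 arrives after 12, and 8 after 25.
events = [(1, "a"), (4, "b"), (12, "c"), (3, "d"), (25, "e"), (8, "f")]
emitted, dropped, still_open = tumbling_window_counts(
    events, window_size=10, allowed_lateness=5
)
```

In this run, the out-of-order event at time 3 is still counted because the watermark has not yet passed the end of its window, whereas the event at time 8 arrives after the watermark has moved to 20 and is dropped. A batch system never faces this choice, because all the data is present before the computation starts.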

For a large percentage of data scientists and companies, the barrier to adopting streaming technologies in existing batch workloads is too high, despite the high potential return on investment. We faced the same challenges and set out to find better solutions. Since batch is just a special case of streaming, we believe data scientists should be able to define feature logic without worrying about whether the workload will be deployed in batch or streaming.

Inspired by Flink SQL, our solution is a stream-batch unified Python dataframe API that is decoupled from backend execution. This allows us to write the transformation logic once, parse it into a standard query plan, and execute it everywhere, across streaming and batch engines (even across different languages and systems). We don’t need to worry about differences in dialects, query plans, and how computations are carried out under the hood.
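The idea of writing logic once and running it on different engines can be illustrated with a toy sketch. This is not the library presented in the talk: `Plan`, `run_batch`, and `run_streaming` are hypothetical names, and the two "engines" are plain Python functions standing in for real batch and streaming backends. The point is only that the transformation is captured as a backend-agnostic plan, and each backend decides how to execute it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Backend-agnostic logical plan: an ordered tuple of (op, argument) steps."""
    steps: tuple = ()

    def filter(self, predicate):
        return Plan(self.steps + (("filter", predicate),))

    def mutate(self, **new_columns):
        return Plan(self.steps + (("mutate", new_columns),))


def _apply(plan, row):
    """Run one row through the plan; return the new row, or None if filtered out."""
    for op, arg in plan.steps:
        if op == "filter":
            if not arg(row):
                return None
        elif op == "mutate":
            row = {**row, **{name: fn(row) for name, fn in arg.items()}}
    return row


def run_batch(plan, table):
    """Toy batch engine: consumes a bounded list and returns the full result."""
    return [out for row in table if (out := _apply(plan, row)) is not None]


def run_streaming(plan, source):
    """Toy streaming engine: lazily yields results as rows arrive from an iterator."""
    for row in source:
        out = _apply(plan, row)
        if out is not None:
            yield out


# The feature logic is defined once, with no reference to either engine.
plan = (
    Plan()
    .filter(lambda r: r["amount"] > 0)
    .mutate(amount_usd=lambda r: r["amount"] * r["fx_rate"])
)

rows = [
    {"amount": 10, "fx_rate": 2.0},
    {"amount": -5, "fx_rate": 2.0},
    {"amount": 3, "fx_rate": 1.0},
]

batch_result = run_batch(plan, rows)
stream_result = list(run_streaming(plan, iter(rows)))
```

Both executions produce the same rows from the same plan. A real system would compile the plan into each engine's own query language or operators rather than interpreting it row by row, but the separation between defining the logic and choosing where it runs is the same.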

Introducing a single, unified API lowers the barrier to entry for data scientists who are new to streaming and for companies that are looking to get started with it. It protects data scientists and platform engineers from elusive bugs that often arise from subtle discrepancies across engines. It also reduces vendor lock-in and allows us to choose the best engine for each workload without prohibitive switching costs. Through this work, we aspire to close the gap between batch and streaming and work with the community to push for better unification standards.

This talk is for data scientists who are new to streaming and companies that are interested in learning more about and adopting streaming technologies. The audience will learn about basic streaming concepts, how we created an open-source library that unifies batch and streaming, and how they can use this to adopt streaming technologies into existing batch workloads with minimal code changes.

See also: Slides (3.6 MB)

Chloe has a background in data science and started working on streaming systems when she was a Founding Engineer at Claypot AI, a startup tackling challenges in real-time machine learning. She led the infrastructure development of an open-source real-time feature engineering platform and worked on translating and optimizing streaming workloads that served low-latency use cases. Later, she brought her streaming expertise to Voltron Data, where she now leads the development of streaming technologies.