The local-first paradigm promises transformative benefits — user-owned data, seamless offline capabilities, and instant interactions. But how do you get started? In this lightning talk, we’ll cover the key concepts and show you how to begin your local-first journey.
Mixture of Encoders is a vector-native alternative that models both structured and unstructured data in a unified embedding space. We will introduce the method, show how it powers natural language search and real-time recommendations, and share open-source tools and benchmarks for replacing complex hybrid stacks.
While powerful, dense vector search is not a plug-and-play feature that will scale straight out-of-the-box, particularly when it comes to extracting the maximum performance from limited compute resources. Come learn how we tuned dense vector indexes for our 100M+ document dataset, and drastically sped up our queries.
You’re using local LLMs. For example, to power RAG. You want to deploy them in production, but you don’t know where: which type of GPU? How large should it be? Should you use a larger model but quantize more aggressively?
Our benchmark results and their interpretation will give you some answers.
Struggling to identify relevant filters among too many facets and frustrating results navigation? We explore an AI Filter Assistant for statistical data (SDMX) showing how LLMs can be leveraged to suggest the best filters for your natural language query, helping you refine the results in Apache Solr. We share wins, fails, and lessons learned.
Learn how to build a custom On-Air sign using Apache Kafka®, Apache Flink®, and Apache Iceberg™! See how to capture events like Zoom meetings and camera usage with Python, process data with FlinkSQL, analyze trends from your Iceberg tables, and bring it all together with a practical IoT project that easily scales out.
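As a flavor of the kind of logic this pipeline pushes into FlinkSQL, here is a minimal Python sketch of the On-Air decision itself: the sign is lit while any meeting or camera session is still open. The event shape and field names are illustrative assumptions, not taken from the talk.

```python
# Hypothetical sketch: derive the On-Air sign state from an ordered
# stream of start/stop events (e.g. Zoom meetings, camera usage).
def sign_state(events):
    """Return 'ON AIR' if any session has started but not yet stopped."""
    open_sessions = set()
    for event in events:  # events are assumed ordered by time
        key = (event["source"], event["session_id"])
        if event["type"] == "start":
            open_sessions.add(key)
        elif event["type"] == "stop":
            open_sessions.discard(key)
    return "ON AIR" if open_sessions else "OFF"

events = [
    {"source": "zoom",   "session_id": 1, "type": "start"},
    {"source": "camera", "session_id": 7, "type": "start"},
    {"source": "zoom",   "session_id": 1, "type": "stop"},
]
print(sign_state(events))  # camera session 7 is still open
```

In the actual project this aggregation would run continuously in FlinkSQL over Kafka topics rather than in a batch loop.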
The fascinating journey towards releasing version 10.0 of the popular Java search engine Apache Lucene: an inspiring and challenging venture seen through the eyes of its release manager, made possible by the vibrant Lucene community, and culminating in deploying the new major version to production in record time.
If you have been living under a rock and have not heard that Airflow 3 is out, and that it solves most of the pain points you had with Airflow 2, this talk is for you. You will learn how you can boost your Data Engineering and AI/ML workflows (without having to rewrite your DAGs) with what the Airflow 3 community has worked on over the last 12 months.
At Climate Policy Radar, we're building an open-source knowledge graph for climate policy. In this talk, we'll share how we combine in-house expertise with scalable data infrastructure to identify key concepts in thousands of global climate policy documents. We'll also touch on ontology design, equitable evaluation, and the climate impacts of AI.
What are some key concepts and design decisions behind modern, scalable, highly performant databases?
Learn how a database delivers sub-millisecond 99th-percentile latency at throughputs of millions of operations per second, at scale, and how you can use it.
Lucene 10 came out last year in October. One of the changes raised the minimum version requirement to Java 21, which should allow introducing new features such as native access to the file system cache and therefore better memory mapping. Is it as easy as it sounds?
Drawing from experience reviewing over 1,000 open-source releases, I'll address common misconceptions, frequent compliance issues, and the evolution of policies to mitigate these challenges. Attendees will gain practical insights to ensure smoother project releases and foster a compliant, collaborative, open-source community.
In this talk, I share our journey in making QuestDB, an Apache 2.0-licensed open-source time-series database, a significantly faster analytical database. I'll walk through how we identified opportunities for improvement, the key changes we implemented, and how those changes delivered dramatic performance improvements.
Intent-based clustering is our approach to overcoming some limitations of modern hybrid search systems. We show how upfront, LLM-supported in-depth query understanding can improve steps like retrieval, clustering, validation, and presentation. We address various aspects from prototype to production in a large-scale, high-volume e-commerce search.
K8s StatefulSets present significant hurdles for scaling and migrating large-scale cloud database workloads. We'll cover scaling strategies beyond vanilla StatefulSets and share lessons on executing zero-downtime live migrations using custom controllers, durable execution workflows, and tackling complex synchronization problems in ClickHouse Cloud.
Reproducibility in embedding benchmarks is challenging, especially with embedding models that are instruction-tuned and increasingly large. Learn how MTEB tackles prompt variability, scaling issues, and large datasets to ensure fair and consistent evaluations, setting a standard for benchmarking in embeddings.
Want to avoid cloud-hosted AIs, and run your LLMs on your own systems, but not sure where to start? Or even what everything means? Join us to see how easy it can be, and what a beginner needs to know!
Picture billions of messages pouring in daily from thousands of data providers around the globe, which are then processed and published to customers. How can one design a telemetry system to capture, publish, and then index essential information about the data flowing through the system to give internal teams visibility to aid in troubleshooting?
gamma_flow is an open-source Python package for real-time spectral data analysis. Designed for speed and efficiency, it avoids large models, opting instead for a novel supervised dimensionality reduction approach. This enables seamless denoising, classification, and disentangling of single-label or multi-label spectra.
This talk explains locking mechanisms (MVCC, lock queues) in PostgreSQL, focusing on the table-level locks acquired by Data Definition Language (DDL) operations. If not managed well, schema changes can result in downtime. Not all operations require the same level of locking, and PostgreSQL offers tools and techniques to minimize locking impact.
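One widely used technique of the kind this talk covers: bounding how long a DDL statement may wait for its table lock, so a migration fails fast instead of queuing behind a long transaction and blocking all other traffic on the table. This sketch only builds the SQL text; table and column names are made up for illustration.

```python
# Illustrative helper (not from the talk): wrap a DDL statement with a
# PostgreSQL lock_timeout guard so it errors out quickly rather than
# stalling the whole lock queue behind it.
def guarded_ddl(statement, lock_timeout_ms=2000):
    """Return a migration script that bounds the DDL's lock wait time."""
    return "\n".join([
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms';",
        f"{statement};",
        "COMMIT;",
    ])

script = guarded_ddl("ALTER TABLE orders ADD COLUMN note text")
print(script)
```

A migration runner would execute this script and retry on a lock-timeout error, ideally outside peak hours.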
Running open source databases on Kubernetes? Learn best practices for high availability, security, backups, and disaster recovery. Discover key pitfalls to avoid and see how Operators simplify database management for MySQL, MongoDB, and PostgreSQL in Kubernetes environments.
AI can (also) enhance fact-checking and news classification. We developed a platform integrating search, an intelligent assistant, and a RAG system to support reliable journalism. By leveraging diverse data and analytics, we empower everyone with insights for accuracy and transparency, fostering collaboration for trustworthy information.
With Apache NiFi, a multimodal data pipelining tool, you can assemble existing and/or custom Java & Python processors into a variety of flows. Watch a rich data pipeline be constructed from Kafka, stored using the Apache Iceberg table format and consumed from Trino.
We all do things in our day to day work that are deemed ‘non-promotable’ - these are tasks that are crucial for project success but might not get you promoted. This is commonly known as glue work, a term coined by Tanya Reilly. Being glue doesn’t mean an end to your career, and it isn’t something that you can’t recover from.
Aurea Imaging is an AgTech scaleup focusing on precision farming in apple orchards. We've built the Treescout, an edge device mounted on top of a tractor that unlocks the potential of each tree. We used an innovative technology stack to meet the requirements of an outdoor rural setup. Our journey was full of failures, learnings, and ongoing challenges.
In this talk we present current progress on Mahout's new quantum compute layer named Qumat. We will give an overview of the project, explain why we built Qumat, and show its current state. We will present a demo of Qumat in action, and end with calls to action for researchers and engineers interested in using it and contributing to the project.
We think of generating source code from a prompt as an AI-powered feature of modern IDEs, but the general problem has a rich history in research efforts and domain-specific programming systems. In this session, you'll learn about the history of program synthesis, its relationship to the history of AI, and what lessons this history has for us today.
Apache Solr 9.8 introduces the LLM module, opening the door to end-to-end natural language query support through vector-backed semantic search (K-Nearest Neighbors).
This talk explores the open source contribution from both the indexing and query angles and what’s coming next for Solr in terms of integrations with Large Language Models.
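For context, Solr's dense vector search is queried via the `knn` query parser, passing an embedding of the user's query text. The sketch below just assembles such a query string; the field name and vector values are assumptions for illustration.

```python
# Hypothetical helper: build a Solr knn query string of the form
# {!knn f=<vector_field> topK=<n>}[v1, v2, ...], where the natural
# language query has already been embedded into a float vector.
def knn_query(vector, field="embedding", top_k=10):
    values = ", ".join(str(v) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"

q = knn_query([0.12, 0.43, 0.88], top_k=5)
print(q)  # {!knn f=embedding topK=5}[0.12, 0.43, 0.88]
```

In an end-to-end setup, the LLM module would handle the embedding step so clients can send plain text instead of raw vectors.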
The use of semantic reranking on top of a ‘cheaper’ retrieval step is common in modern search applications. The reranking depth represents the number of documents that we select to retrieve and feed into the reranking model. We experiment with different models and datasets and we present our findings including some counterintuitive ones.
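The retrieve-then-rerank pattern with a configurable depth can be sketched in a few lines. The scoring functions below are toy stand-ins for a real first-stage retriever and a more expensive semantic reranking model; nothing here is from the talk's actual experiments.

```python
# Minimal sketch of two-stage search: cheap retrieval over the corpus,
# then an expensive reranker applied only to the top `depth` candidates.
def search(query, corpus, cheap_score, rerank_score, depth=50, k=10):
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    head, tail = candidates[:depth], candidates[depth:]  # only `head` pays the reranker cost
    reranked = sorted(head, key=lambda d: rerank_score(query, d), reverse=True)
    return (reranked + tail)[:k]

corpus = ["apple tart", "apple pie recipe", "banana bread"]
cheap = lambda q, d: len(set(q.split()) & set(d.split()))   # toy lexical overlap
rerank = lambda q, d: float(q in d)                         # toy "semantic" model
results = search("apple pie", corpus, cheap, rerank, depth=2, k=2)
print(results)
```

The `depth` parameter is exactly the knob the talk studies: larger values give the reranker more candidates to fix, at higher cost and, sometimes, counterintuitive quality effects.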
Data in organizations is traditionally split between operational and analytical estates. Join us for an account of our journey combining Apache Kafka and Apache Iceberg to create a solution that addresses both estates with one data source.
Join us for food and drinks at Palais Kulturbrauerei!
With a little experience it's easy to find site search queries that don't work. With live examples picked from a variety of high-profile websites, I'll show you how to easily break search - and discuss what we mean by 'broken', the different kinds of failure and what they reveal about the underlying search engine and how we might improve it.
How do different document parsing and chunking strategies impact RAG pipeline performance? Using real-life documents and LLM-generated question/answer pairs, we assess multiple methods – both open-source and commercial – showing that parsing quality significantly affects response accuracy and that the best approach may depend on the question type.
A comprehensive workshop in which you will gain practical knowledge about how to deploy, configure, interact with and use advanced features of Apache Iceberg. Presented using a local coding environment based on Jupyter notebooks and a Docker Compose stack.
Search is integral to Uber's core business and user experience. In this talk, we’ll explore the unique challenges of Search at Uber and chart the evolution of Uber’s Search Platform—from leveraging Elasticsearch to developing an in-house solution, and finally, innovating in collaboration with the OpenSearch community.
We present an extensible hybrid search solution using Elasticsearch, built on a multi-index architecture and allowing the integration of multiple embedding models. Our approach addresses the challenges of searching a vast and heterogeneous collection, using different chunking granularity and offering an alternative to reciprocal rank fusion.
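As background for the baseline this approach offers an alternative to: reciprocal rank fusion combines several ranked lists using only document ranks, with the standard constant k = 60. A minimal sketch:

```python
# Reciprocal rank fusion: each list contributes 1/(k + rank) per
# document it contains (rank is 1-based); fused score is the sum.
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["d1", "d2", "d3"]   # lexical ranking (toy)
vector = ["d3", "d1", "d4"]   # embedding ranking (toy)
fused = reciprocal_rank_fusion([bm25, vector])
print(fused)
```

Because RRF ignores the underlying scores entirely, multi-index setups with heterogeneous embedding models often look for fusion schemes that can weigh evidence more finely, which is where this talk picks up.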
Observability is the ability to measure the state of the whole system. OpenTelemetry can be used to instrument applications and diagnose issues. But frontend instrumentation is often an afterthought.
Join me as I show how OTel, RUM agents and Synthetic Monitoring can help us identify and diagnose issues in all layers of our applications.
Postgres continues to be widely used, and Postgres-derived closed-source databases such as AlloyDB and AWS Aurora have gained popularity in recent years. In this talk, you’ll learn about the architecture of these radically different kinds of systems, what each of these companies means when they say “Postgres-compatible”, and how to choose one!
Stream processing systems promise fresh results, strong consistency, and S3-based cost savings. But pitfalls exist:
- Backfilling takes too long due to incremental state maintenance.
- Consistency causes system stalls during joins.
- S3 costs spike with cache misses.
This talk explores these issues, mitigations, and hard truths.
In this talk, we present miniCOIL — our attempt to make a sparse neural retrieval model as it should be — combining the benefits of dense and lexical retrieval without propagating their drawbacks. We will share how to design and train a lightweight model that is performant on out-of-domain data and demonstrate its capabilities.
Not all data is good—some is bad, and much is messy. Poor data quality affects customers, employees, and decisions. This session traces issues from symptoms to root causes and explores strategies to fix them. Managing data quality is like battling a seven-headed beast, but with the right approach, you can turn chaos into clarity.
Change Data Capture (CDC) is a powerful technique that enables organisations to react to data changes in real-time. In this talk we will explore FlinkCDC, a component of Apache Flink, and demonstrate how it leverages Flink's robust stream processing capabilities to implement the CDC pattern.
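The core of the CDC pattern is applying a stream of change events to a downstream copy of the data. This toy sketch shows the idea; the event shape is an illustrative simplification of the Debezium-style envelopes FlinkCDC works with, not its actual API.

```python
# Toy CDC consumer: replay insert/update/delete events onto a replica.
def apply_changes(replica, events):
    for event in events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            replica[key] = event["row"]       # upsert the latest row image
        elif event["op"] == "delete":
            replica.pop(key, None)            # tombstone: drop the row
    return replica

replica = {}
events = [
    {"op": "insert", "key": 1, "row": {"name": "alice"}},
    {"op": "update", "key": 1, "row": {"name": "alicia"}},
    {"op": "insert", "key": 2, "row": {"name": "bob"}},
    {"op": "delete", "key": 2, "row": None},
]
apply_changes(replica, events)
```

FlinkCDC's value is doing this continuously and fault-tolerantly straight from database logs, with exactly-once guarantees this sketch does not attempt.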
How do you know if your RAG system is actually working? We’ll share a real-world case study on evaluating RAG in production—tackling messy data, chunking fails, and unexpected chatbot behavior—so you can measure quality with confidence.
Discover how OpenSearch powers search and observability at scale! Now part of The Linux Foundation, OpenSearch is evolving with vector search, NLP, and real-time analytics. Join this session to explore its latest innovations, performance boosts, and expanding ecosystem—directly from the project's Chief Evangelist.
This session builds on the foundational OSS technologies of the modern Lakehouse – Apache Kafka, Spark, Unity Catalog and MLflow – and shows how anyone can analyze supernova data coming from NASA's satellites, query data streams with natural language, and plot their own map of cosmic events.
This talk will focus on learnings gathered when building an enterprise search platform with multimodal content – ranging from highly domain-specific content to images to unstructured content. Problems of extraction, inference, and relevance shall be discussed, while showcasing cross-domain search at scale.
With data moving faster than ever, detecting problems as they happen is crucial. This talk covers how to build a real-time anomaly detection system using Apache Kafka for streaming, Apache Flink for processing, and AI for pattern recognition. Plus, we’ll explore Apache Iceberg for storing historical data to refine models.
Many knowledge chatbots and search engines use RAG. Despite their popularity, these chatbots are often worse than ChatGPT and frustrate users by failing to answer even the simplest questions. In my talk, I reveal how ineffective chunking strategies are a key culprit and demonstrate how to refine chunking to build more reliable RAG systems.
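To make the failure mode concrete, here are two chunking strategies of the kind such talks contrast: naive fixed-width chunking, which happily splits mid-sentence and strands answers across chunk boundaries, versus a greedy sentence-aware variant. Both functions are illustrative sketches, not any speaker's production code.

```python
# Naive fixed-width chunking: ignores sentence boundaries entirely.
def fixed_chunks(text, size=200):
    return [text[i:i + size] for i in range(0, len(text), size)]

# Greedy sentence-aware chunking: pack whole sentences per chunk.
def sentence_chunks(text, max_chars=200):
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 2 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks("First sentence. Second sentence. Third.", max_chars=20))
```

Real systems layer on overlap, structure-aware splitting (headings, tables), and token-based limits, but even this simple contrast shows why chunk boundaries drive retrieval quality.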
Data warehouses, lakes, lakehouses, streams, fabrics, hubs, vaults, and meshes. We sometimes choose deliberately, sometimes influenced by trends, yet often get an organic blend. But the choices have orders of magnitude in impact on operations cost and iteration speed. Let's dissect the paradigms and their operational aspects once and for all.
RAG revolutionized AI by merging search and generation, and agentic behavior takes this search to the next level by enabling LLMs to make decisions and call tools. This talk covers agentic behavior's key features: tool integration and reasoning, along with a live demo.
Robust Search Evaluation is both a “must have” for any modern-day Search team and an “afterthought” that never gets the team’s full attention. This is especially true with the various open source search engines. Most teams build their own data collection and eval tools. Some use standalone open source tools. We present a better solution!
Traditional document retrieval systems struggle with visually rich documents as they discard visual elements during text extraction. This talk shows how vision language models (VLMs) can address these limitations and presents a new benchmark for evaluating document retrieval systems across languages, domains, and document types.
Mercari is a Japan-based second-hand e-commerce marketplace. For a long time, we have relied on Elasticsearch for retrieval and DNN Learning to Rank for ranking. With the development of deep learning and LLMs, we re-architected our search system. In this talk, I would like to share how we did it and how we convinced internal stakeholders to take on the new technology.
This talk delves into delay accounting, an often-overlooked feature that provides valuable insights into CPU time shortages and application latency. Attendees will learn how to leverage these kernel metrics for better performance analysis and system optimization.
Tired of waiting for batch jobs? See how we transformed our data pipeline using Apache Iceberg to stream quality data into Snowflake and ClickHouse simultaneously. Learn about our battle-tested architecture, performance gains, and how we maintain data consistency across dual analytics engines.
ColPali is revolutionary—here’s why: it combines document retrieval with a vision-based large language model, allowing you to search directly within images without needing to extract text. However, running the full model on personal hardware can be challenging due to its computational demands. And thus we’ve released a quantized version of ColPali.
We explore the integration of machine learning inference processors in OpenSearch pipelines, focusing on multi-modal search capabilities. We demonstrate how these processors enhance ingest, search request, and response processing for text, image, and audio data, significantly improving search and analytical capabilities in a multi-modal world.
Apache Flink is uniquely positioned to serve as the backbone for AI agents, providing them with stream processing as a powerful new tool. We'll explore how Flink jobs can be transformed into "Agents"—autonomous, goal-driven entities that dynamically interact with data streams, trigger actions, and adapt their behavior based on real-time insights.
Coding LLMs are now part of our daily work, making coding easier. In this talk, we share how we built an in-house LLM for AI code completion in JetBrains products, covering design choices, data preparation, training, and model evaluation.
As real-time data processing becomes increasingly critical across industries, the need for reliable, transparent, and actionable data lineage strategies has never been greater. Understanding where your data comes from, how it flows through your systems, and how it is transformed along the way is essential for maintaining trust and ensuring compliance.