Join us as we kick off Berlin Buzzwords 2026!
Over 25 years, open source has become vital digital infrastructure. However, its future relies on human resilience, not just code. To combat burnout, funding gaps, and new regulations, we must move beyond old methods and address sustainability through global policy, security, and community health.
Low-resource languages expose weaknesses in NLP systems that are often hidden by benchmark data. Drawing on experience annotating fieldwork data, this talk shows how ambiguity and annotation decisions reveal fundamental data quality issues relevant to real-world NLP pipelines.
In this presentation, we will talk through moving a major open source project into a foundation, and the benefits that open governance and a vendor-neutral home have proven through sustained growth in community contributions.
Retrieval-Augmented Generation (RAG) systems rely on pre-chunked documents, tying retrieval to arbitrary boundaries. This talk explores an experimental approach that surfaces semantically relevant text spans, without chunking. We'll share surprising findings and examine whether this technique points toward a viable chunk-free retrieval paradigm.
With Apache Solr 10 out, there are plenty of goodies coming up for vector-search aficionados.
We cover scalar and binary quantization to speed up your search and reduce the memory footprint, as well as early termination and hybrid approaches to navigating the HNSW graph.
Join us if you want to learn about the big steps forward of Apache Solr vector search!
KAFKA-6020 has been open for 7 years. This talk demos broker-side filtering for Kafka with low latency (p99 < 25 ms). A live demo with working code shows how it reduces network costs, simplifies consumers, and enables new use cases. Real-world validation comes from financial services and logistics deployments.
Open source mentorship changed overnight with AI tools. Contributors submitted polished code they couldn’t explain, making learning harder to assess. This talk shares what we learned mentoring Outreachy contributors—what failed, what worked, and what we’re still figuring out.
Discover how attackers can manipulate AI coding assistants through hidden text, typosquatting and code errors. Learn to detect concealed instructions and set up trusted dependencies to keep unsafe code out of your environment.
When does Apache Flink solve real problems versus add complexity? Explore use cases where Flink becomes essential, such as fraud detection, CDC, and real-time analytics, versus cases where batch or Kafka Streams suffice. Compare stream engines (Flink, Spark) with platforms (Kafka, Pulsar) to confidently decide when streaming delivers value.
Discover how OpenSearch breaks linear scaling. Inspired by Apache Pinot, the Star-Tree index moves performance dependency from document count to field cardinality. Learn how we extended Lucene’s DocValues to build multi-dimensional materialized views that deliver sub-second analytics on billion-scale datasets for observability workloads.
How are security decisions, big and small, made in a distributed open source community? Come find out at this session, where attendees will gain insights and examples (both good and bad) to take back to their own projects.
AI agents have quietly become some of the most demanding users of modern data platforms, and most of those platforms weren't built with them in mind. In this panel, leaders from Snowflake, Elastic, ClickHouse, and Xata share what agentic workloads actually look like in production: what broke, what had to be rebuilt, and where the architecture is heading.
Relevance feedback loops used to take months. AI agents can now compress the process to seconds. This talk explores agentic retrieval: systems where agents adjust scoring models, schema, and indexing in real time. Learn how to build retrieval infrastructure with verifiable APIs that enable agents to optimize their own search context.
Learn Spark 4.1's brand-new Declarative Pipelines, a paradigm shift replacing imperative code with simple declarations. We'll build a real-time data pipeline together, processing streaming ADS-B flight data from tens of thousands of aircraft overhead.
Streamlit is an open source Python library that lets you present data to people, without having to become a frontend developer. It's easy to learn, fast to build with, and should be in every data-wrangler's toolkit. In this talk you'll learn everything you need to know to get started.
We rethink data systems by putting streams at the center. Expanding on Martin Kleppmann's "Turning the Database Inside Out", this talk shows how Apache Kafka and Apache Iceberg together provide durable storage, indexing, and rich views that eliminate brittle ETL and unify real-time and historical analysis. A new way to see databases, and streams.
Can PostgreSQL become a serious analytics engine? With emerging columnar extensions, PostgreSQL is pushing beyond OLTP into OLAP territory. This talk explores the current columnar landscape, architectural trade-offs, and how far PostgreSQL can go compared to analytical engines like ClickHouse.
A brief look at how scientific data came under political pressure during the presidency of Donald Trump, and how scientists and data repositories in Europe worked to protect public access to evidence-based research, keeping that data available to data science in a well-structured way.
Every PDF hides a world of structure, metadata and embedded signals that can silently influence AI-based processing. With ultraviolets, we reveal how those can be exploited for malicious purposes and even become powerful tools for smarter applications. Designing for both humans and machines becomes a vital aspect of AI experience design.
The session will cover operational best practices, including metadata management, file sizing, compaction strategies, and performance tuning at scale. Attendees will leave with practical guidance for designing and operating open, flexible, multi-engine data architectures built on Apache Iceberg, enabling faster analytics, lower operational flexibility
What if, instead of learning only from examples, Language Models could explore crafted Environments, little worlds where they can act and improve autonomously?
Join me to see how Reinforcement Learning Environments work, how to build them with open-source tools, and how to use them to evaluate and train LLMs/Agents.
Apache DataFusion is moving beyond batch into streaming. We built Streamling, a Rust streaming engine that uses DataFusion planning and Arrow RecordBatch streams for real-time SQL/WASM transforms. This talk covers how we built it, highlights key features (FFI plugins, WASM transforms, and dynamic tables), and shares production lessons.
Choosing the best engine for each data task sounds right, but in modern data stacks doing so requires expertise and effort. Apache Wayang, a recently graduated TLP, addresses this by decoupling logical dataflows from execution engines. From big data platforms to SQL and ML engines, Wayang enables cross-platform execution that maximizes performance.
All software benchmarks and claims of performance are carefully crafted lies and this talk is no different. Instead of giving you a quick “do steps one, two, three for a magic speedup”, we aim to explain how we arrived at the changes we made and how we rigorously tested those changes to make sure we understand their impact.
Event-driven Agents calling LLMs can be combined with Pattern Recognition and Anomaly Detection in Apache Flink in smart ways to increase cost efficiency, avoid hallucinations and enforce predictable, deterministic behavior. Specifically in a business process context, this architecture provides opportunities for continuous real-time process mining.
GreenOps adoption is stalled by missing data from cloud providers. SPRUCE is an open-source, scalable platform built on Apache Spark that enriches cloud usage reports with open models to quantify carbon impact, build insights, and help teams reduce both emissions and cloud spend.
Why do smart agents make dumb mistakes? The culprit is context, an old problem with new solutions. Let's fast-forward through 20 years of search evolution to fix the missing link in today’s Agentic AI.
We’ll demonstrate how to combine Knowledge Graphs and Vector Search to build reliable, context-aware applications using open-source tools.
Why do we even need traditional search when AI can do everything? Or is it foolish to ignore simple, proven techniques for delivering great results? What's the best way to combine old and new? Join our panel of experts for a fun and provocative debate!
We will show you how we built a local newspaper RAG system and all the problems we ran into along the way, from trustworthiness and customer wishes to search optimization and generation problems. Local villages that LLMs know nothing about, content that is semantically the same, and outdated information are only part of the journey we made.
Iceberg maintenance procedures work. Orchestrating them across hundreds of tables is the problem. Floe is an open-source system that treats maintenance as policy: glob patterns, schedules, and health-driven triggers that gate operations on real table metrics. Supports 7 catalogs, executes via Spark or Trino.
In this talk, we look at anomaly detection as a complementary way of working with metrics. Instead of relying on predefined limits, anomaly detection focuses on identifying behavior that deviates from what is normally observed over time. The focus is on how developers can interpret these signals, where anomaly detection is useful, and where it is not.
Join us for food and drinks at Palais Kulturbrauerei!
For a golden set, you need queries. Even if you have them, you can’t judge all docs for each query. Only the top N. How do we rank the top N? See the circular dependency? We’ll talk about ways to untangle it: lexical search, significant terms, training an embedder from scratch, etc. By iteratively refining data and queries, we'll get there.
Fuzzy search is a double-edged sword: it fixes typos but drowns users in noise on large corpora. At INA, we revived ancient phonetic algorithms to improve relevance. This session compares fuzzy vs. phonetic search on a massive archive, showing how "sounding right" beats "spelling close."
Learning new things is hard, but a useful way to think about new things is by comparing them to things you already know. In this talk, we'll compare writes between 3 different popular data services: Postgres, Apache Kafka and Apache Iceberg. In doing so, we'll learn a bit about the evolution of how we've thought of data storage as developers.
Search engines are converging with analytical data systems. This talk explores how columnar data layouts, SIMD-accelerated execution, and bulk-oriented processing are reshaping search internals. We examine where traditional models fall short and how hardware-aware techniques from analytics engines are defining the next search infrastructure.
n8n-gitops is an open-source CLI that applies GitOps principles to n8n workflows. This talk shows how workflows can be exported, reviewed, versioned, and deployed from Git instead of manually promoted via the UI. Through a live demo, we explore safer deployments, rollbacks, and lessons learned operating automation as code.
You can finally stop caring about co-partitioning, state stores and eventual consistency. Kafi Streams, built on (Py)DBSP, treats streaming like batch — strongly consistent, no special concepts. An Open Source Python library for the 80% of use cases that don't need extreme scale. Fully incremental stream processing for everyone, from day one.
Hybrid search fails on complex intent: vector search misses constraints, keywords miss nuance. This talk explores fine-tuning SLMs for 'Query Understanding'—transforming vague inputs into structured requests. Learn to extract metadata, expand terms, and route intent to build a search engine that does the hard work for your users.
Traditional OpenSearch segments are context-blind, scattering data across multiple segments. We introduce Context-Aware Segments (CAS), an architecture that brings "sharding" logic to the segment level. By enforcing document locality during indexing, we slashed query latency and minimized data footprint through superior pruning and compression.
Most people know DuckDB as a fast analytics tool for notebooks and scripts. But embedded OLAP enables much more: browser-based analytics via WebAssembly, serverless data processing, and lightweight data apps — without heavy infrastructure. This talk shows how DuckDB changes the way we build data-driven applications.
Modern ML applications demand features computed in near real-time with sub-100ms latencies. This talk dives into Chronon, an OSS feature platform bridging streaming data infrastructure and production ML. Using a two-tower search pipeline example, we'll show how we can chain embeddings with tabular features while minimizing hot-path computation.
Agentic systems can break not because information is missing, but because persuasively wrong context gets promoted into action. We examine a recurring pattern: retrieval metrics improve while agent behavior degrades as distractors enter multi-step loops. We show why relevance, reliability, and security are tightly connected in agentic retrieval.
IResearch is an Apache 2.0 C++ search engine built to live inside databases. We'll benchmark it against leading open-source search engines, show why vectorized scoring is the next frontier for information retrieval engines, share the mistakes we made over a decade of development and explore how database-native search fits modern query execution.
Real-time data is awesome… until you realize it’s leaking names, emails, and locations. In this talk, you’ll learn how to keep streaming data private, from simple masking to tricks that beat re-identification. All with live demos and some juicy real-world stories.
Observability is moving from vendor stacks to open standards. This talk presents a design where OpenTelemetry provides collection and semantic context, and Apache Iceberg is the data layer for logs, metrics, and traces. We cover portability, governance, agent investigation, and write-path pitfalls: drift, small files, compaction.
This talk takes four reliability patterns from distributed systems and shows what they look like inside an agent architecture. How to shadow-test an agent. Why your circuit breakers need confidence thresholds. What an eval harness looks like when your system is non-deterministic. And why human oversight degrades faster than anyone admits.
Many so-called “agent failures” are actually context failures in disguise. In this session, we’ll explore how to tell whether your agent truly saw and used the right context, using techniques like tracing and attribution, golden datasets for context-aware evaluation, and targeted probes to test retrieval quality.
Competitive search now needs dense embeddings, sparse vectors, ColBERT, and cross-encoder reranking. Most teams run four separate containers. This talk shows how to serve all four from one process, walks through building a hybrid retrieval pipeline with real benchmark data, and covers where each retrieval mode wins and where it wastes compute.
When users expect alerts for new products matching an uploaded image, the problem becomes inverse hybrid search. Unlike top-K search, alerting must guarantee fetch-all semantics: zero missed matches across all saved searches, combining vector similarity, boolean filters, and lexical signals. We show why this breaks traditional scaling intuition.
We’ve normalised extraordinary inefficiency in stream processing. Thousands of events/sec don't justify repartition storms, serialization overhead, and state migration. This talk explores a different path: the Kafka Streams DSL, Flink-like exactly-once semantics, Project Loom, and a challenge to the assumption that stream processing must be distributed.
Coding agents succeed in verifiable loops (compiler + tests), but large repos still expose retrieval weaknesses.
This session explores how lexical, structural, and semantic search can provide cleaner context for LLMs. We compare tradeoffs and evaluation approaches to improve reliability without inflating token cost.
Hidden bias in datasets silently breaks machine learning systems in production. This talk shows how to detect data imbalance, leakage, and coverage gaps early using practical metrics, visualizations, and open-source tools—before misleading offline metrics turn into costly real-world failures.
Many novel machine learning techniques started as clever hacks that just happened to work, but the demands of building real systems can be at odds with this creative culture. Learn about our open-source stack to improve quality-of-life for ML researchers and infrastructure teams alike — and how their concerns aren't as different as you might think.
In this talk, we'll explore how Apache Solr introduced the capability to upgrade an index in place with zero downtime. This upgrade path helps prepare the index for a future Solr major version upgrade without needing to recreate the index from source, as is the case with Lucene-based search engines today.
RDF was designed for the semantic web, but it turns out to be a perfect fit for systems where structure emerges from user interaction, not upfront design. This talk covers how to build applications entirely on RDF triples, translate natural language to SPARQL with small, open source language models, and discover implicit knowledge in user input.
Formal methods are powerful tools to verify software systems' correctness and reliability. However, manually writing system specs is time-consuming and hard to maintain. LLMs can help with this burden.
We'll share new research into tools to automate formal methods workflows and learnings from how LLMs currently perform.
Server-side inference is the bottleneck of modern AI, creating costs and privacy hurdles. But what if the solution is scaling down to the browser? This session investigates Client-Side AI using WebGPU, ONNX Runtime, and Transformers.js. We’ll explore the reality of hardware access, model size, and the 2026 trade-offs of browser-based execution.
This talk explains how constraints work in Postgres by exploring the pg_constraint catalog and core concepts like table vs. column constraints, constraint triggers, domains and constraint deferrability through SQL queries. It then covers what’s new in Postgres 18 including temporal keys, NOT NULL as a first-class constraint, NOT ENFORCED and more.
In 2026, AI is moving beyond digital tasks into the physical world. It increasingly interacts with instruments, experiments, and real-world data. Physicists stand at this frontier, using deep learning, LLMs, and agents to analyze nature itself. What have we learned about AI when it meets reality?
For years, Germany’s largest classifieds website relied on a search-first relevance approach because structured data was sparse. This talk shares how we introduced Vespa in the Motors category, enriched signals with embeddings and extracted attributes, and migrated step by step; what worked, what failed, and which lessons only a real PoC reveals.
The LLM ecosystem changes faster than most teams can adapt. This talk shares our experience and the practical lessons we’ve learned while building an intelligent search product in a world where models, tools, and best practices constantly evolve.
Join us as we wrap up Berlin Buzzwords.