09:15
09:15
15min
Opening Session

Join us as we kick off Berlin Buzzwords 2023

Kesselhaus
09:35
09:35
40min
What defines the “open” in “open AI”?
Jennifer Ding

This talk focuses on unpacking this year’s big buzzwords of “open AI” and “responsible AI” to highlight the range of (sometimes contradictory) activities that exist under these umbrella terms and how

Kesselhaus
10:15
10:15
20min
BREAK
Kesselhaus
10:15
20min
BREAK
Maschinenhaus
10:15
20min
BREAK
Palais Atelier
10:15
20min
BREAK
Frannz Salon
10:40
10:40
20min
Declarative Data Collections for Portable Parallelism
Zhibo Li

This talk introduces a novel programming model: the user declares data collections with their properties, and these declarations can be transparently ported to multiple platforms, including GPUs.

Maschinenhaus
10:40
20min
Migrate Data, <Mesh> in mind
Aydan Rende

For quite some time, Hadoop served as the data warehouse for Kleinanzeigen. In this presentation, our objective is to provide an overview of our approach, which involves implementing a cloud-based data pipeline with the help of dbt and Airflow.

Palais Atelier
10:40
20min
Vectorize Your Open Source Search Engine
Atita Arora

Fascinated by vector search but don't know where to start?
Join us to crack the code and leverage the potential of vector search to delight your users.

Kesselhaus
11:10
11:10
40min
How to train your general purpose document retriever model
Tom Veasey, Quentin Herreros

A practical guide for training learned sparse models to outperform BM25 on zero-shot document retrieval tasks

Maschinenhaus
11:10
40min
Kaldb: serverless lucene at petabyte scale
Suman Karumuri

In this talk, we share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at PB scale.

Kesselhaus
11:10
40min
Using TensorFlow in a Solr Query Parser
Radu Gheorghe, Rafał Kuć

Tutorial on writing Solr query parsers that use TensorFlow for Java to augment queries.

Frannz Salon
12:00
12:00
40min
Apache Airflow in Production - Bad vs Best Practices
Bhavani Ravi

This talk will explore the bad and best practices when deploying Apache Airflow in a production environment. From common pitfalls such as misconfigured tasks and lack of scalability,

Palais Atelier
12:00
40min
Boosting Ranking Performance with Minimal Supervision
Jo Kristian Bergum

Using generative Large Language Models (LLMs) to generate synthetic labeled data to train in-domain ranking models. Distilling the knowledge and power of generative LLMs into effective ranking models.

Kesselhaus
12:00
40min
ClickHouse: what is behind the fastest columnar database
Olena Kutsenko

Columnar databases seem to be full of mysteries and confusion. In this introduction to ClickHouse, we'll take apart its building blocks to see how it achieves its remarkable performance.

Maschinenhaus
12:00
40min
When Probably is Good Enough
Savannah Norem

Probabilistic data structures give developers room to massively cut down on space requirements while sacrificing a bit of accuracy, so when is probably good enough?

Frannz Salon
12:40
12:40
80min
LUNCH
Kesselhaus
12:40
80min
LUNCH
Maschinenhaus
12:40
80min
LUNCH
Palais Atelier
12:40
80min
LUNCH
Frannz Salon
14:00
14:00
20min
Cooking up a new search system: Recipe search at Cookpad
Matt Williams

How we successfully transitioned the search system for the world's largest recipe sharing platform to a modern stack – including the wins, fails, team structures, and processes along the way.

Frannz Salon
14:00
40min
Data Ops on Kubernetes with Kubernetes Operators
George Hantzaras

The rise of cloud-native technologies has revolutionised the way organisations store, process, and manage their data. We'll explore the power that Kubernetes gives us to efficiently manage them.

Palais Atelier
14:00
40min
Introducing Multi-valued Vector Fields in Apache Lucene
Alessandro Benedetti

Supporting multiple vectors in a field dedicated to k-nearest-neighbors search has long been a fundamental problem for Apache Lucene.
This talk describes how it was finally designed and implemented.

Kesselhaus
14:00
40min
Tip of the Iceberg
Fokko Driesprong

Apache Iceberg is an open table format that has wide support among open-source and cloud vendors. After this talk, you'll be comfortable with all the concepts and how to use Iceberg.

Maschinenhaus
14:50
14:50
20min
Big data in the service of reliable news
Benjamin Dauvissat, Radu Pop

Data vs. fake news: using available data to offer a critical view of the world

Frannz Salon
14:50
20min
Joining Dozens of Data Streams in Distributed Stream Processing Systems
Yingjun Wu

This talk will explore the techniques and best practices for joining dozens of data streams, focusing on different joining mechanisms, such as binary joins and delta joins, as well as pros and cons.

Palais Atelier
14:50
20min
No Mean Feat: Upgrading a Customized Solr to Upstream Solr
Shikhar Srivastava

Learn how the News Search Infrastructure Team at Bloomberg migrated from a customized implementation of Apache Solr to the upstream Apache Solr

Maschinenhaus
14:50
20min
Privacy-Preserving Web Search
Lara Perinetti

An ethical overview of how a privacy-focused search engine has to adapt its behavior from crawling to ranking web documents without knowing anything about the user and still be as relevant as possible

Kesselhaus
15:20
15:20
40min
Building On-Ramps for Non-Code Contributors in Open Source
Natali Vlatko, Celeste Horgan

Open source software is so much more than code – docs, community and infra need maintaining. How do you attract and keep non-code contributors? Let two experienced practitioners show you the way!

Frannz Salon
15:20
40min
Laptop-sized ML for Text, with Open Source
Nick Burch

Advanced ML models for text may need hundreds of machines, but with open source tools and pre-trained models, you can do a lot just on your laptop or Docker container. Discover what and how!

Palais Atelier
15:20
40min
Model Fine-tuning For Search: From Algorithms to Infra
Bo Wang, Maximilian Werk

Deep learning for search has become a hot topic, but pre-trained neural nets do not perform as well as expected. We will discuss the algorithms behind model fine-tuning, and how to scale it up.

Maschinenhaus
15:20
40min
Towards a decentralized and collaborative search engine
Aline Paponaud, Lucian Precup

In this session we will share our vision towards an alternative, decentralized and collaborative search engine, from social considerations to technical implementation.

Kesselhaus
16:00
16:00
30min
BREAK
Kesselhaus
16:00
30min
BREAK
Maschinenhaus
16:00
30min
BREAK
Palais Atelier
16:00
30min
BREAK
Frannz Salon
16:30
16:30
40min
How to not kill people
Lars Albertsson

As AI grows, software manages more risks to humans. Moving fast and breaking things won't do. We will look at aviation to learn how successful risk management structures might look in software & AI.

Kesselhaus
16:30
40min
Ingesting over 4 million rows a second on a single instance
Javier Ramirez

When we set out to write a fast open source time series database, we realised we would need every trick in the book to make it as performant as possible. This talk will show what's inside.

Maschinenhaus
16:30
40min
ML with Domain-Specific Ontology for IT Security Industry
Qi Wu, Bertram Sändig

The BSI provides current data on acute IT threat situations. We developed a system for detecting threats: crawling, automatic analysis with NER, NEL, provision and use of dedicated tools for evaluating

Palais Atelier
16:30
40min
Who broke the build? -Using Kuttl to test and Release faster
Ram Mohan Rao Chukka

No one wants to be responsible for breaking the build. But what can you do as a developer to avoid being the bad guy? How can project leads enable their teams to reduce the occurrence of broken builds?

Frannz Salon
17:20
17:20
40min
A Kafka Client’s Request: There and Back Again
Danica Fine

Understand how data moves into and out of Apache Kafka® by taking a look at the producer and consumer request life cycle. Follow a request from an initial call to send() or poll(), all the way to disk

Maschinenhaus
17:20
40min
Building Real-Time Applications: Cyclist Crash Detection
Tomáš Neubauer

In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.

Palais Atelier
17:20
40min
Creating chaos in containers
Maish Saidel-Keesing

Chaos engineering is hard; in containers it is even harder.
This session will show attendees the considerations and get them started on their way to making more resilient applications in the cloud.

Frannz Salon
17:20
60min
The Debate Returns (with more vectors) Which Search Engine?
Charlie Hull

It's that old question - which search engine should I choose for my project? Elasticsearch, Solr, OpenSearch (all based on Lucene), or Vespa, or maybe one of the new vector search engines?

Kesselhaus
18:00
18:00
120min
Get Together
Kesselhaus
18:00
120min
Get Together
Maschinenhaus
18:00
120min
Get Together
Palais Atelier
18:00
120min
Get Together
Frannz Salon
09:30
09:30
70min
Advanced Search Plays with GraphQL
Stanimira Vlaeva

This demo-heavy workshop scores a hat trick by combining Apache Lucene, MongoDB, and GraphQL to easily build search functionality across data collections and 3rd party APIs into applications.

Frannz Salon
09:30
20min
Avoiding Anti-patterns in Technical Communication
Sophie Watson

Communicating technical knowledge effectively is a core skill for practitioners, but one which is often neglected. We’ll give practical advice on how to (and not to!) communicate technical ideas.

Maschinenhaus
09:30
20min
What's coming next with Apache Lucene?
Uwe Schindler

This talk will discuss the ways Apache Lucene might go in the coming years. From the perspective of a full-text search engine, it looks like it is feature-complete. So what comes next?

Kesselhaus
09:30
20min
When ms matter: Maximizing query performance in CrateDB
Marija Selakovic

Achieving optimal execution plans in distributed databases is a challenging task. This talk will focus on CrateDB, a distributed SQL database, and key strategies for optimizing its query performance.

Palais Atelier
10:00
10:00
40min
Building MLOps Infrastructure at Japan's Largest C2C E-Commerce Site
Ryan Ginstrom, Teo Narboneta Zosa

The MLOps infrastructure we built to support ML in search at Mercari, Japan’s largest C2C e-commerce platform.

Kesselhaus
10:00
40min
Connect GPT with your data: Retrieval-augmented Generation
Malte Pietsch

Learn how to build with LLMs, like ChatGPT, and avoid typical pitfalls like hallucination and outdated information. Accompanied by practical code examples using the open source framework Haystack.

Maschinenhaus
10:00
40min
Deep dive into an Elasticsearch plugin for query-time joins
Stéphane Campinas

Siren Federate is an Elasticsearch plugin for joining inverted indices at query-time. Learn in this talk about its inner workings and how it complements features of Elasticsearch like runtime fields.

Palais Atelier
10:40
10:40
20min
BREAK
Kesselhaus
10:40
20min
BREAK
Maschinenhaus
10:40
20min
BREAK
Palais Atelier
10:40
20min
BREAK
Frannz Salon
11:00
11:00
40min
A Fresh Start? The Path Toward Apache Solr's v2 API
Jason Gerlowski

Modernization efforts face particular hurdles in large, established OSS projects. Come learn about the community and technical challenges encountered on Apache Solr's path towards revamped HTTP APIs.

Maschinenhaus
11:00
20min
From keyword to vector
Byron Voorbach

During this talk, I will take you on my over-a-decade-long journey in search. Starting from having witnessed the inception of Elasticsearch to my current endeavors with Weaviate, I will share my first-hand experience of the evolution, challenges, and lessons learned along the way.

Palais Atelier
11:00
40min
How to Implement Online Search Quality Evaluation with Kibana
Ilaria Petreti, Anna Ruggero

Conducting online testing is crucial for assessing a model’s performance in a real-world scenario. This talk explores a customized approach for evaluating ranking models using Kibana.

Frannz Salon
11:00
40min
Synthetic data: when, why, and how
William Benton

This talk will cover several use cases in which generating synthetic data is useful (or even essential) and introduce a toolbox of practical techniques for synthesizing data in these situations.

Kesselhaus
11:30
11:30
20min
Semantic vs keyword search as context for GPT
Tudor Golubenco

If you want to build a chat bot like ChatGPT on your own data, you need to use search to provide the context. Usually semantic search is used, but we've found that keyword search has some pros.

Palais Atelier
11:50
11:50
40min
Apache Kafka® & ksqlDB in Action, Build a Streaming Pipeline
Geetha Anne

In this talk, we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using Kafka Connect and ksqlDB.

Frannz Salon
11:50
40min
Rethinking Autoscaling for Apache Solr using Kubernetes
Houston Putman

Apache Solr’s built-in autoscaling is gone, but the need for autoscaling persists. Using Kubernetes’ HPA, the Solr Operator and new Solr APIs, we re-introduce autoscaling for Solr on Kubernetes.

Maschinenhaus
11:50
40min
Search saves lives: solving healthcare problems with search
Charlie Davies

During COVID the pressure was on for search. I’ll discuss the challenges of building a search engine matching people to COVID test facilities and how the lessons learned can solve healthcare issues.

Kesselhaus
12:00
12:00
40min
Alexa, is The Smart Home vision failing?
Steve Loughran

Amazon's Alexa team has lost billions. Google's and Apple's hubs aren't great successes. Is the smart home failing? How can you keep your lights on when they depend on cloud infrastructure to work?

Palais Atelier
12:40
12:40
80min
LUNCH
Kesselhaus
12:40
80min
LUNCH
Maschinenhaus
12:40
80min
LUNCH
Palais Atelier
12:40
80min
LUNCH
Frannz Salon
14:00
14:00
40min
A Crash Course in Error Handling for Streaming Data Pipeline
Stefan Sprenger

Learn how to handle errors in streaming data pipelines using concepts such as dead-letter queues.

Palais Atelier
14:00
40min
ChatGPT is lying, how can we fix it?
Kacper Łukawski

Large Language Models are great at grammar but tend to confabulate. Building a reliable knowledge base might be a way to solve this. Here is how.

Kesselhaus
14:00
40min
Platform Engineering is All About Product
Gal Bashan

“Platform Engineering,” the latest buzzword, means building an internal platform to improve your SDLC in a way your developers will want to use. Can this be done with engineering skills alone?

Maschinenhaus
14:00
40min
Using Dense Vector search at the EU Publications Office
Phil Lewis

How dense vector functionality was used to provide several ‘Google-like’ capabilities such as Extractive Answers and knowledge graph search over a large dataset at the EU Publications Office.

Frannz Salon
14:50
14:50
40min
Cross Data Center Replication in Solr - A new approach
Anshum Gupta, Mark Miller

Learn about the motivation that led to the development of the new Cross Data Center (XDC) Replication module in Apache Solr and discover the capabilities it offers for disaster readiness.

Kesselhaus
14:50
40min
Fact Checking Rocks: how to build a fact-checking system
Stefano Fiorucci

In this infodemic era, fact-checking is becoming a vital task.
In this talk, we’ll discover how to build a simple fact-checking system for rock music, leveraging the power of open-source libraries.

Palais Atelier
14:50
40min
Hadoop Vectored IO: your data just got faster!
Steve Loughran

We are introducing a new Hadoop Filesystem API called "vectored read", with which we can achieve significant speedups for all big data applications, especially on cloud storage such as S3 and ABFS.

Maschinenhaus
14:50
40min
Searching large data sets in (near) constant time
Torsten Bøgh Köster, Dennis Berger

Tackle large search results by estimating hit count, interpolating a first-phase ranking and limiting the returned result set to the most relevant documents in a multi-million document index.

Frannz Salon
15:30
15:30
30min
BREAK
Kesselhaus
15:30
30min
BREAK
Maschinenhaus
15:30
30min
BREAK
Palais Atelier
15:30
30min
BREAK
Frannz Salon
16:00
16:00
40min
Column-level lineage is coming to the rescue
Paweł Leszczyński, Maciej Obuchowski

How are the columns containing sensitive data used across the data ecosystem? What input columns were used to produce a given report field? OpenLineage can answer those questions automatically.

Kesselhaus
16:00
40min
Highly Available Search at Shopify
Khosrow Ebrahimpour

This talk shares the story of how Shopify implemented seamless storage autoscaling for Elasticsearch that powers search for millions of merchants without data loss.

Frannz Salon
16:00
40min
Learning to hybrid search
Roman Grebennikov, Vsevolod Goloviznin

Combining BM25, neural embeddings and customer behavior with Learning-to-Rank into an ultimate ranking ensemble, with examples on the Amazon ESCI e-commerce search dataset.

Palais Atelier
16:00
40min
Scalable distributed messaging&streaming with Apache Pulsar
Julien Jakubowski

In this session, you'll discover seven Apache Pulsar features that enable you to build amazing event-driven applications and how Apache Pulsar differs from traditional message brokers.

Maschinenhaus
16:50
16:50
20min
Catch the fraud — with observability and analytics
Philipp Krenn

This is the story of how to catch cheaters by combining observability and analytics data through the power of search.

Maschinenhaus
16:50
20min
Tiny Flink — Minimizing the memory footprint of Apache Flink
Robert Metzger

We will explore options to run Apache Flink with a very low resource footprint, allowing users to run full streaming SQL queries or custom streaming applications on JVMs with less than 500 MB

Kesselhaus
17:15
17:15
15min
Closing Session
Kesselhaus