Berlin Buzzwords 2023 :: pretalx

Sunday, June 18, 2023

Monday, June 19, 2023

Tuesday, June 20, 2023

15:00

Barcamps are informal sessions, a kind of "un-conference", with a schedule decided on the day.

09:15

Opening Session

Join us as we kick off Berlin Buzzwords 2023

09:35

What defines the “open” in “open AI”?

This talk focuses on unpacking this year’s big buzzwords of “open AI” and “responsible AI” to highlight the range of (sometimes contradictory) activities that exist under these umbrella terms and how

10:15

20min

BREAK

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

10:40

Declarative Data Collections for Portable Parallelism

This talk introduces a novel programming model: the user declares data collections along with their properties, and these declarations can be transparently ported to multiple platforms, including GPUs.

Migrate Data, <Mesh> in mind

For quite some time, Hadoop served as the data warehouse for Kleinanzeigen. In this presentation, our objective is to provide an overview of our approach, which involves implementing a cloud-based data pipeline with the help of dbt and Airflow.

Vectorize Your Open Source Search Engine

Fascinated by vector search but don't know where to start?
Join us to crack the code and leverage the potential of vector search to delight your users.

11:10

How to train your general purpose document retriever model

Tom Veasey, Quentin Herreros

A practical guide for training learned sparse models to outperform BM25 on zero-shot document retrieval tasks

Kaldb: serverless lucene at petabyte scale

In this talk, we share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at PB scale.

Supercharging your transformers with synthetic query generation and lexical search

This talk will explore dramatic gains in ranking performance from small transformer models, fine-tuned with synthetic query generation and combined with lexical search, and will equip the audience to pursue the same approach using open-source tools.

Using TensorFlow in a Solr Query Parser

Radu Gheorghe, Rafał Kuć

Tutorial on writing Solr query parsers that use TensorFlow for Java to augment queries.

12:00

Apache Airflow in Production - Bad vs Best Practices

This talk will explore the bad and best practices when deploying Apache Airflow in a production environment. From common pitfalls such as misconfigured tasks and lack of scalability,

Boosting Ranking Performance with Minimal Supervision

Jo Kristian Bergum

Using generative Large Language Models (LLMs) to generate synthetic labeled data to train in-domain ranking models. Distilling the knowledge and power of generative LLMs into effective ranking models.

ClickHouse: what is behind the fastest columnar database

Columnar databases seem to be full of mysteries and confusion. In this introduction to ClickHouse, we'll take apart its building blocks to see how it achieves its remarkable performance.

When Probably is Good Enough

Probabilistic data structures give developers room to massively cut down on space requirements while sacrificing a bit of accuracy, so when is probably good enough?

12:40

80min

LUNCH

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

14:00

A Kafka Client’s Request: There and Back Again

Understand how data moves into and out of Apache Kafka® by taking a look at the producer and consumer request life cycle. Follow a request from an initial call to send() or poll(), all the way to disk

Cooking up a new search system: Recipe search at Cookpad

How we successfully transitioned the search system for the world's largest recipe sharing platform to a modern stack – including the successes, failures, team structures, and processes along the way.

Introducing Multi-valued Vector Fields in Apache Lucene

Alessandro Benedetti

Supporting multiple vectors in a field dedicated to k-nearest-neighbors search has long been a fundamental problem for Apache Lucene.
This talk describes how it was finally designed and implemented.

Tip of the Iceberg

Fokko Driesprong

Apache Iceberg is an open table format that has wide support among open-source and cloud vendors. After this talk, you'll be comfortable with all the concepts and how to use Iceberg.

14:50

Big data in the service of reliable news

Benjamin Dauvissat, Radu Pop

Data vs. fake news: using available data to offer a critical view of the world

Joining Dozens of Data Streams in Distributed Stream Processing Systems

This talk will explore the techniques and best practices for joining dozens of data streams, focusing on different joining mechanisms, such as binary joins and delta joins, as well as pros and cons.

No Mean Feat: Upgrading a Customized Solr to Upstream Solr

Shikhar Srivastava

Learn how the News Search Infrastructure Team at Bloomberg migrated from a customized implementation of Apache Solr to the upstream Apache Solr

Privacy-Preserving Web Search

An ethical overview of how a privacy-focused search engine has to adapt its behavior from crawling to ranking web documents without knowing anything about the user and still be as relevant as possible

15:20

Building On-Ramps for Non-Code Contributors in Open Source

Natali Vlatko, Celeste Horgan

Open source software is so much more than code – docs, community and infra need maintaining. How do you attract and keep non-code contributors? Let two experienced practitioners show you the way!

Laptop-sized ML for Text, with Open Source

Advanced ML models for text may need hundreds of machines, but with open source tools and pre-trained models, you can do a lot on just your laptop or a Docker container. Discover what and how!

Model Fine-tuning For Search: From Algorithms to Infra

Bo Wang, Maximilian Werk

Deep learning for search has become a hot topic, but pre-trained neural nets do not work as well as expected. We will discuss the algorithms behind model fine-tuning and how to scale it up.

Towards a decentralized and collaborative search engine

Aline Paponaud, Lucian Precup

In this session we will share our vision towards an alternative, decentralized and collaborative search engine, from social considerations to technical implementation.

16:00

30min

BREAK

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

16:30

How to not kill people

Lars Albertsson

As AI grows, software manages more risks to humans. Moving fast and breaking things won't do. We will look at aviation to learn how successful risk management structures might look in software & AI.

Ingesting over 4 million rows a second on a single instance

When we set out to write a fast open source time series database, we realised we would need every trick in the book to make it as performant as possible. This talk will show what's inside.

ML with Domain-Specific Ontology for IT Security Industry

Qi Wu, Bertram Sändig

The BSI provides current data on acute IT threat situations. We developed a system for detecting threats: crawling, automatic analysis with NER, NEL, provision and use of dedicated tools for evaluating

Who broke the build? - Using Kuttl to test and release faster

Ram Mohan Rao Chukka

No one wants to be responsible for breaking the build. But what can you do as a developer to avoid being the bad guy? How can project leads enable their teams to reduce the occurrence of broken builds?

17:20

Building Real-Time Applications: Cyclist Crash Detection

Tomáš Neubauer

In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.

Creating chaos in containers

Maish Saidel-Keesing

Chaos engineering is hard; in containers it is even harder.
This session will walk attendees through the key considerations and get them started on their way to making more resilient applications in the cloud.

The Debate Returns (with more vectors) Which Search Engine?

It's that old question: which search engine should I choose for my project? Elasticsearch, Solr, OpenSearch (all based on Lucene), or Vespa, or maybe one of the new vector search engines?

18:00

120min

Get Together

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

09:30

Advanced Search Plays with GraphQL

Stanimira Vlaeva

This demo-heavy workshop scores a hat trick by combining Apache Lucene, MongoDB, and GraphQL to easily build search functionality across data collections and 3rd party APIs into applications.

Avoiding Anti-patterns in Technical Communication

Communicating technical knowledge effectively is a core skill for practitioners, but one which is often neglected. We’ll give practical advice on how to (and not to!) communicate technical ideas.

What's coming next with Apache Lucene?

This talk will discuss the directions Apache Lucene might take in the coming years. From the perspective of a full-text search engine, it looks feature-complete. So what comes next?

When ms matter: Maximizing query performance in CrateDB

Marija Selakovic

Achieving optimal execution plans in distributed databases is a challenging task. This talk will focus on CrateDB, a distributed SQL database, and on key strategies for optimizing its query performance.

10:00

Building MLOps Infrastructure at Japan's Largest C2C E-Commerce Site

Ryan Ginstrom, Teo Narboneta Zosa

The MLOps infrastructure we built to support ML in search at Mercari, Japan’s largest C2C e-commerce platform.

Connect GPT with your data: Retrieval-augmented Generation

Learn how to build with LLMs, like ChatGPT, and avoid typical pitfalls like hallucination and outdated information. Accompanied by practical code examples using the open source framework Haystack.

Deep dive into an Elasticsearch plugin for query-time joins

Stéphane Campinas

Siren Federate is an Elasticsearch plugin for joining inverted indices at query-time. Learn in this talk about its inner workings and how it complements features of Elasticsearch like runtime fields.

10:40

20min

BREAK

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

11:00

A Fresh Start? The Path Toward Apache Solr's v2 API

Jason Gerlowski

Modernization efforts face particular hurdles in large, established OSS projects. Come learn about the community and technical challenges encountered on Apache Solr's path towards revamped HTTP APIs.

From keyword to vector

During this talk, I will take you on my over-a-decade-long journey in search. Starting from having witnessed the inception of Elasticsearch to my current endeavors with Weaviate, I will share my first-hand experience of the evolution, challenges, and lessons learned along the way.

How to Implement Online Search Quality Evaluation with Kibana

Ilaria Petreti, Anna Ruggero

Conducting online testing is crucial for assessing a model’s performance in a real-world scenario. This talk explores a customized approach for evaluating ranking models using Kibana.

Synthetic data: when, why, and how

This talk will cover several use cases in which generating synthetic data is useful (or even essential) and introduce a toolbox of practical techniques for synthesizing data in these situations.

11:30

Semantic vs keyword search as context for GPT

Tudor Golubenco

If you want to build a chat bot like ChatGPT on your own data, you need to use search to provide the context. Usually semantic search is used, but we've found that keyword search has some pros.

11:50

Highly Available Search at Shopify

Khosrow Ebrahimpour

This talk shares the story of how Shopify implemented seamless storage autoscaling for Elasticsearch that powers search for millions of merchants without data loss.

Rethinking Autoscaling for Apache Solr using Kubernetes

Apache Solr’s built-in autoscaling is gone, but the need for autoscaling persists. Using Kubernetes’ HPA, the Solr Operator and new Solr APIs, we re-introduce autoscaling for Solr on Kubernetes.

Search saves lives: solving healthcare problems with search

Chris Hutchinson

During COVID, the pressure was on for search. I’ll discuss the challenges of building a search engine matching people to COVID test facilities and how the lessons learned can solve healthcare issues.

12:00

Alexa, is The Smart Home vision failing?

Amazon's Alexa team has lost billions. Google's and Apple's hubs aren't great successes. Is the smart home failing? How can you keep your lights on when they depend on cloud infrastructure to work?

12:40

80min

LUNCH

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

14:00

A Crash Course in Error Handling for Streaming Data Pipeline

Stefan Sprenger

Learn how to handle errors in streaming data pipelines using concepts such as dead-letter queues.

ChatGPT is lying, how can we fix it?

Kacper Łukawski

Large Language Models are great at grammar but tend to confabulate. Building a reliable knowledge base might be a way to solve this. Here is how.

Platform Engineering is All About Product

“Platform Engineering,” the latest buzzword, means building an internal platform to improve your SDLC in a way your developers will want to use. Can this be done with engineering skills alone?

Using Dense Vector search at the EU Publications Office

How dense vector functionality was used to provide several ‘Google-like’ capabilities such as Extractive Answers and knowledge graph search over a large dataset at the EU Publications Office.

14:50

Cross Data Center Replication in Solr - A new approach

Anshum Gupta, Mark Miller

Learn about the motivation that led to the development of the new Cross Data Center (XDC) Replication module in Apache Solr and discover the capabilities it offers to make Solr disaster-ready.

Fact Checking Rocks: how to build a fact-checking system

Stefano Fiorucci

In this infodemic era, fact-checking is becoming a vital task.
In this talk, we’ll discover how to build a simple fact-checking system for rock music, leveraging the power of open-source libraries.

Hadoop Vectored IO: your data just got faster!

We are introducing a new Hadoop Filesystem API called "vectored read", with which we can achieve significant speedups for all big data applications, especially on cloud storage like S3 and ABFS.

Searching large data sets in (near) constant time

Torsten Bøgh Köster, Dennis Berger

Tackle large search results by estimating hit count, interpolating a first-phase ranking and limiting the returned result set to the most relevant documents in a multi-million document index.

15:30

30min

BREAK

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

16:00

Column-level lineage is coming to the rescue

Paweł Leszczyński, Maciej Obuchowski

How are the columns containing sensitive data used across the data ecosystem? What input columns were used to produce a given report field? OpenLineage can answer those questions automatically.

Learning to hybrid search

Roman Grebennikov, Vsevolod Goloviznin

Combining BM25, neural embeddings and customer behavior with Learning-to-Rank into an ultimate ranking ensemble, with examples on the Amazon ESCI e-commerce search dataset.

Scalable distributed messaging&streaming with Apache Pulsar

Julien Jakubowski

In this session, you'll discover seven Apache Pulsar features that enable you to build amazing event-driven applications and how Apache Pulsar differs from traditional message brokers.

16:50

Catch the fraud — with observability and analytics

This is the story of how to catch cheaters by combining observability and analytics data through the power of search.

Tiny Flink — Minimizing the memory footprint of Apache Flink

We will explore options to run Apache Flink with a very low resource footprint, allowing users to run full streaming SQL queries or custom streaming applications on JVMs with less than 500 MB

17:15

Closing Session