Berlin Buzzwords 2024

09:30
09:30
5min
Opening Session

Join us as we kick off Berlin Buzzwords 2024!

Kesselhaus
09:35
09:35
45min
The Paradox of Open: Can Digital Commons Offer a Way Forward?
Zuzanna Warso

At the turn of the century, the internet ushered in the era of the digital commons - a realm of collectively created and managed resources that are open to the public. Over the past decade, the digital landscape has undergone a profound transformation.

Kesselhaus
10:20
10:20
20min
Breakfast Break
Kesselhaus
10:20
20min
Breakfast Break
Maschinenhaus
10:20
20min
Breakfast Break
Palais Atelier
10:20
20min
Breakfast Break
Frannz Salon
10:40
10:40
20min
Accelerating TopK Queries
Juliane Waack

TopK Queries are SQL queries that contain both an order by and a limit clause. The optimization presented in this talk uses runtime information to skip scanning data that could not be a part of the final result, thereby significantly accelerating these types of queries.
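
As a rough illustration of the idea in this abstract (not the speaker's actual implementation), the sketch below answers an `ORDER BY value DESC LIMIT k` query over chunked data. It assumes each chunk carries a precomputed maximum, similar to per-block statistics; the function name `top_k_desc` and the chunk layout are hypothetical. Once k candidates are known, any chunk whose maximum cannot beat the current k-th best value is skipped without being scanned.

```python
import heapq


def top_k_desc(chunks, k):
    """Return the k largest values and the number of chunks skipped.

    `chunks` is a list of (chunk_max, values) pairs. chunk_max stands in
    for per-chunk statistics; this layout is an assumption of the sketch.
    """
    heap = []  # min-heap holding the best k values seen so far
    skipped = 0
    for chunk_max, values in chunks:
        # With k candidates collected, heap[0] is the current k-th best.
        # A chunk whose maximum does not exceed it cannot change the result.
        if len(heap) == k and chunk_max <= heap[0]:
            skipped += 1
            continue
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), skipped


chunks = [(9, [9, 7, 8]), (5, [5, 1, 2]), (10, [10, 3, 4])]
result, skipped = top_k_desc(chunks, k=2)
```

Here the second chunk is never scanned, because its maximum (5) is below the k-th best value already found in the first chunk.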

Frannz Salon
10:40
20min
Better search relevance using Learning to Rank at mobile.de
Manish Saraswat

At mobile.de we are striving to provide a better and faster search. We use a backend ML system to learn changing user interests and optimise the search experience. We discuss our current search relevance ranking framework, which is based on learning to rank with XGBoost, and how it ranks millions of searches daily.

Palais Atelier
10:40
20min
Heap sizing and GC tuning for Solr and friends
Radu Gheorghe, Rafał Kuć

We’re taking a stab at a definitive - yet nuanced! - answer to “how much heap do I need?” and “is GC a problem and how can I fix it?” for Solr/Elasticsearch/OpenSearch deployments.

Maschinenhaus
10:40
20min
Synergy of Signals: Traffic Logs Meet LLM Labels
Stefana Serban

Discover how melding user-traffic signals with LLM-derived relevance labels can significantly improve learning to rank models. This talk unveils a novel approach to enhancing search relevance for query-product pairs, offering a glimpse into the future of e-commerce search technology.

Kesselhaus
11:10
11:10
40min
Brick-by-Brick: Exploring the Elements of Apache Kafka®
Danica Fine

Let’s rebuild the world of Apache Kafka® brick-by-brick through the lens of LEGO®. You’ll explore Kafka’s inventory of pieces, learn the ins and outs of core elements, and see just how they ‘click’ together with other APIs and tools. The only question left is: what will YOU make with Kafka?

Palais Atelier
11:10
40min
From Natural Language to Structured Solr Queries using LLMs
Ilaria Petreti, Anna Ruggero

We explore the use of AI, especially NLP techniques and LLMs, to enhance Apache Solr data accessibility. We propose translating natural language queries into structured Solr queries using LLMs and metadata to improve search and user experience. We’ll discuss the results and future directions.

Maschinenhaus
11:10
40min
Learning to Rank for Reddit Search - A Project Retro
Doug Turnbull, Charles Njoroge

The successes (and failures) of applying Machine Learning to improve Reddit's search ranking.

Kesselhaus
11:10
40min
`New` Workflow Orchestrator in town: "Apache Airflow 2.x"
Jarek Potiuk

What's the role of orchestrators in the modern data stack? Are you trying new orchestrators? Do we need orchestrators at all? Or has what an "orchestrator" can do for you been redefined?
Meet the "new" orchestrator in town: "Apache Airflow 2.x", with ways of orchestration you had not realized were possible.

Frannz Salon
12:00
12:00
40min
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Alessandro Benedetti

Hybrid search combines traditional keyword-based search with vector-based search.
The result sets are merged and a single ranked list of items is returned.
Reciprocal Rank Fusion is one of the most popular algorithms for such a task.
This talk presents the work done to bring it to Apache Solr.
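
For readers unfamiliar with the algorithm, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the Apache Solr implementation presented in the talk; the function name and the constant `k = 60` (a commonly used default) are assumptions made for illustration. Each document scores the sum of `1 / (k + rank)` over every ranked list it appears in, and the merged list is sorted by that score.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant; 60 is a common choice (an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id
    # so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]  # e.g. results of a keyword query
vector_hits = ["b", "c", "a"]   # e.g. results of a vector query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because only ranks are used, the two result sets can be fused without normalizing their raw scores against each other.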

Palais Atelier
12:00
40min
Practical introduction to OpenTelemetry tracing
Nicolas Fränkel

OpenTelemetry is a standard for tracing across multiple components. Let’s see how to set it up across different stacks.

Frannz Salon
12:00
40min
S3 as the state store for stream processing systems
Yingjun Wu

S3 is cheap, but S3 is slow. Stream processing systems want to maintain internal state at low cost, but cannot tolerate high latency. How could we build a stream processing system on top of S3? I'll tell you what we learned over the last three years.

Maschinenhaus
12:00
40min
cuVS and Lucene: GPU-based Vector Search
Corey J. Nolet, Vivek Narang

This talk focuses on how we leveraged Nvidia’s open source cuVS library to accelerate Apache Lucene’s vector search capabilities on the GPU. cuVS contains several algorithms for approximate nearest neighbors and clustering on the GPU and we show how it pairs well with Lucene, which has long been the

Kesselhaus
12:40
12:40
80min
Lunch
Kesselhaus
12:40
80min
Lunch
Maschinenhaus
12:40
80min
Lunch
Palais Atelier
12:40
80min
Lunch
Frannz Salon
14:00
14:00
40min
Back to the Future! Time Travel with Bitemporal Databases
Tim Zöller

The talk explores managing transaction and validity times in data storage, using examples like retroactive creation of shipping documents. It highlights how bitemporal databases aid in this process and discusses the complexities of implementing such solutions.

Frannz Salon
14:00
40min
End-to-end pipeline agility
Lars Albertsson

We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.

Palais Atelier
14:00
40min
Fixin the Hard Bits of Event Processing with Restate & Kafka
Stephan Ewen

Discover Restate: A new tool for event-driven apps, blending Kafka's power with simplicity for handling complex transactions, async tasks, and microservices. Dive into its unique approach to tackle scalability, resilience, debuggability, and state management challenges in our session.

Kesselhaus
14:00
40min
Large language models are not a paradigm shift
William Benton

Many of the putatively novel challenges of building systems around LLMs are analogous to problems we've solved for conventional ML systems. This talk will show you why the things you already know about building ML systems are still relevant for LLM systems — and where the true novelty of LLMs lies.

Maschinenhaus
14:50
14:50
20min
End-to-End Encryption for Streaming Data Pipelines
Hans-Peter Grahsl

This talk explains what it takes to bring end-to-end encryption to streaming data pipelines built on top of Apache Kafka and Flink. A live demo illustrates how to encrypt/decrypt sensitive payload fields by means of single message transformations and user-defined functions without any custom code.

Palais Atelier
14:50
20min
Monitoring your home, with DevOps observability tools
Nick Burch

Looking to reduce your home energy consumption, to move to more ecological heating, or just want to know what's going on in your house? You need data, and graphs! Learn how cheap hardware plus devops observability tools let you do that!

Maschinenhaus
14:50
20min
Moving from Offline to Online Machine Learning with River
Tun Shwe

The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach.

Frannz Salon
14:50
20min
Shattering the Limits of Search with Domain Specific Computing
Ohad Levi

The demand for advanced search and data retrieval capabilities is growing exponentially: The rise of AI applications, along with the unprecedented scaling of data is leading to workloads that are pushing traditional software based search to its limits.
As cloud computing costs are skyrocketing and queries are becoming more complex, organizations are often forced to compromise on result relevance given the strict real-time latency requirement of 100ms.
To address this issue, a new dimension of search must be introduced: domain-specific computing; it focuses on designing a dedicated chip for search, to consistently achieve ten times faster search at a billion-scale, all at a fraction of the infrastructure cost.

Kesselhaus
15:20
15:20
40min
Evolving Yelp's Streaming at Scale
Ashish Khatkar

In this talk we detail how Yelp evolved its Streaming Data Platform at scale while supporting its growing business needs coupled with the fast changing streaming landscape.

Palais Atelier
15:20
40min
Flink's SQL Engine: Let's open the engine room!
Timo Walther

Flink's SQL engine is the workhorse behind many event processing platforms. We take a deep look into the internals and take the stack apart! Optimizer phases, streaming primitives, watermarks, CDC, and upsert keys. Let me give you a feeling for the power of a simple streaming SQL query.

Kesselhaus
15:20
40min
Improve LLM-based Applications with Fallback Mechanisms
Bilge Yücel

This session explores different fallback strategies for RAG applications, enabling you to build robust LLM systems for various scenarios, using the open source LLM framework Haystack.

Maschinenhaus
15:20
40min
Improving Search @scale with efficient query experimentation
Andreas Wagner

From relevance guesswork to granular search experimentation at scale. Evaluate modifications to your search, such as incorporating synonyms, adjusting field-boosts, adding vector search, or product assortment modifications, on real data rather than relying solely on intuition.

Frannz Salon
16:00
16:00
30min
Coffee Break
Kesselhaus
16:00
30min
Coffee Break
Maschinenhaus
16:00
30min
Coffee Break
Palais Atelier
16:00
30min
Coffee Break
Frannz Salon
16:30
16:30
40min
Can ChatGPT build a Data Platform faster than a developer?
Chloé Caron

Gear up for a speed competition: it's ChatGPT versus yours truly, racing to build a data visualisation platform from scratch. Who will win? We’ll be taking on the challenge of reconstructing an energy generation visualisation platform, drawing inspiration from the likes of GridWatch.

Kesselhaus
16:30
40min
Let's Do Data Lineage in Kafka, Flink and Druid!
Hellmar Becker

How do you track data lineage in an open-source, realtime analytics pipeline? I'll show in a demo!

Maschinenhaus
16:30
40min
Rediscover your keyword search: Expand, Enrich and Rewrite
Praveen Mohan Prasad, Hajer Bouafif

Dense Vector Search is not the only route to improve your search relevance. Empower your existing sparse keyword search with semantic search capabilities by leveraging text expansion, metadata enrichment, and query re-writing techniques.

Palais Atelier
16:30
40min
Remote work is here to stay - and what's next?
Radovan Bacovic

A great chance to hear about the future of work from the biggest all-remote company in the world. Arm yourself with a best-practices toolbox of async work tips and tricks, and draw on the experience of distributed work leaders.

Frannz Salon
17:20
17:20
40min
Advancements in Evaluating Large Language Model Applications
Petr Polezhaev

This talk will discuss approaches to evaluate LLMs at the end-to-end and task levels, focusing on use cases such as question-answering (RAG). We will also cover metric selection and ways to generate datasets using LLMs.

Frannz Salon
17:20
40min
Build your 8-bit computer from scratch
Olivier HUBER

Yes, it's possible to create an 8-bit computer from scratch on breadboards at home with basic components, and that's what I will demonstrate live with lots of blinking LEDs ;)

Kesselhaus
17:20
40min
Can Apache Pinot replace your OLTP database?
Chinmay Soman

Apache Pinot is a really fast database that's powering real-time applications across various industries, much akin to how people use OLTP databases. So what's the difference? Are there scenarios where Pinot can replace an existing OLTP database? In this talk we will explore this idea.

Palais Atelier
17:20
40min
How we isolate streaming ingest from search using RocksDB
Igor Canadi

Learn a new cloud architecture that separates streaming ingest compute from query compute for real-time applications and how it addresses the age-old issue of compute contention.

Maschinenhaus
18:00
18:00
180min
Get-Together

Join us for snacks & drinks in a relaxed atmosphere. Meet old and new friends, business contacts or get to know other participants of Berlin Buzzwords! The Get-Together takes place directly in front of Palais Kulturbrauerei.

Kesselhaus
09:30
09:30
20min
Harnessing Spare Cores to Breeze Through Cloud Compute
Gergely Daroczi

Spare Cores, a Python-based open-source ecosystem, offers standardized inventory and performance evaluations for compute resources of public cloud and server providers, helping with selecting and starting the optimal instance type for containerized tasks, such as ML model training or ETL processes.

Frannz Salon
09:30
20min
Learning to Apply Generative AI at Enterprise Level
Sebastian Arnold

GenAI is transforming organizations rapidly. But how can we scale individual experimentation to enterprise-grade business products? This talk will cover learnings from hundreds of use cases in pharma about user intentions, information access patterns and quality metrics.

Kesselhaus
09:30
20min
Streaming DataFrames: A New Way to Process Streaming Data
Tomáš Neubauer

Introducing an open source library in Python: Quix Streams. It solves all the complexities of stream processing in a cloud native package with a familiar Pandas DataFrame API interface. This library lets you work with streaming data similarly to static data in your Jupyter Notebook!

Maschinenhaus
09:30
20min
The Unsung Hero of Vector Database -- Metric Learning
Sonam Pankaj

We all know there are several vector databases, and Andrej Karpathy once said that even an array can do the same job, which is true, but not the whole truth. Vector databases provide the infrastructure to store embeddings, but how those embeddings are made is the most innovative part of all.

Palais Atelier
10:00
10:00
40min
A Practical Approach To Semantic Search
Kentaro Takiguchi

Exploring and enhancing lexical search and semantic search in practical scenarios, we assess various optimization methods and their varied effects on metrics. The focus is on the integration of semantic search into an established lexical search system, addressing potential challenges and pitfalls.

Maschinenhaus
10:00
40min
Deep Learning plays Handball
Pere Urbon Bayes

Curious about the unexpected places where deep learning is being used these days? Not everything is ChatGPT, AI for search, or fraud detection in e-commerce. In this talk we will review, and build together, state-of-the-art DL architectures like the ones used by the Bundesliga. Help your local team with AI!

Kesselhaus
10:00
40min
Elevating AI Applications with OpenSearch's Flow Framework and RAG Tool
Mingshi Liu, Owais Kazi

This talk introduces OpenSearch's Flow Framework and RAG (Retrieval-Augmented Generation) Tool, enabling developers to create AI-augmented agents, queries, and ingestion flows, integrate ML models, and streamline app development through a no-code/low-code builder.

Palais Atelier
10:00
40min
Streaming doesn’t have to be hard
Chloe He

Nearly half of data scientists find it difficult to adopt streaming technologies. The audience will learn about the technological barrier and the operational burden that inspired us to look for better solutions, and how we built a stream-batch unified Python dataframe API.

Frannz Salon
10:40
10:40
30min
Coffee Break
Kesselhaus
10:40
30min
Coffee Break
Maschinenhaus
10:40
30min
Coffee Break
Palais Atelier
10:40
30min
Coffee Break
Frannz Salon
11:10
11:10
40min
Advanced Retrieval-Augmented Generation Techniques
Zain Hasan

I will discuss advanced retrieval-augmented generation (RAG) techniques such as hybrid search, re-ranking, query generation, and semantic text chunking.

Frannz Salon
11:10
40min
Apache Lucene: From Text Indexing to Artificial Intelligence
Lucian Precup

Celebrating Apache Lucene's 22nd anniversary, this conference explores its pivotal role in search and data technologies, from powering platforms like Solr, Elasticsearch and OpenSearch to recent AI synergies with vector indexing/search. Discover Lucene's evolution and future horizons.

Kesselhaus
11:10
40min
Kafka on the Fly: A Serverless Approach to Data Streaming
Desmond Obisi

This talk explores serverless Kafka and real-time data streaming with Node.js microservices. I will delve into Upstash's serverless Kafka architecture and how it can be leveraged to build scalable and cost-effective solutions for microservices.

Palais Atelier
11:10
40min
Under the hood of vector search with JVector
Joel Knighton

This is a speedrun through the past ten years of R&D on approximate nearest-neighbor search algorithms and vector databases, covering the major advances, the current state of the art, and possible future directions. This will be informed by Joel's experience contributing to JVector, a Java embedded vector search library.

Maschinenhaus
12:00
12:00
40min
Enhancing RAG with Neo4j Knowledge Graph
Djordje Benn-Maksimovic

Explore how leveraging a knowledge graph combined with vector indices can enhance semantic search and retrieval-augmented generation (RAG), enabling global insights beyond document boundaries.

Frannz Salon
12:00
40min
Hardcoding airpods (and other stories from NLP in insurance)
Murhaf Fares, Emanuele Lapponi

Curious about NLP beyond the startup hype? Join us to explore NLP in a 'traditional' setting. Tackle challenges like data scarcity and domain specificity using, for example, data augmentation and zero-shot classification, and learn some tips and tricks to address concrete and relatable NLP problems.

Maschinenhaus
12:00
40min
Lessons learned writing 10+ Kubernetes Operators
Lars Francke, Jannik Heyl

I'll talk about all the mistakes we made and all the lessons we learned developing a data platform on top of a dozen or so Kubernetes operators so you don't have to go through the same ordeal.

Palais Atelier
12:00
40min
Recognizing the driving forces of impactful OS projects
Maryblessing Okolie

This talk aims to challenge the usual "code-centric" narrative around open source projects. Impact doesn't rest solely on lines of code but thrives within a diverse ecosystem of non-technical contributions.

Kesselhaus
12:40
12:40
80min
Lunch
Kesselhaus
12:40
80min
Lunch
Maschinenhaus
12:40
80min
Lunch
Palais Atelier
12:40
80min
Lunch
Frannz Salon
14:00
14:00
40min
Jina Embeddings V2: From Raw Data to Bilingual Hybrid Search
Bo Wang, Isabelle Mohr

Embeddings transform text into numerical vectors, capturing semantic relationships. This talk explores the data preparation, training processes, evaluation, and a demo of Jina Embeddings V2 in a hybrid search pipeline, showcasing its practical applications.

Palais Atelier
14:00
40min
Open-Source Generative AI: A Product Manager’s Blueprint
Saahil Ognawala

We explore the evolving role of product management in open-source generative AI, focusing particularly on Retrieval Augmented Generation's (RAG) industry impact. We cover community-commercial balance, innovative AI product discovery, and sustainable monetization strategies.

Frannz Salon
14:00
40min
Robust AI Search Ranking for Radical C2C Marketplace Growth
Teo Narboneta Zosa, Chingis Oinar

Key practical insights across dataset construction, custom metrics, model building, and model de-biasing techniques behind the success of our real-time AI search ranking system which drives significant business impact and now serves as a centerpiece of Mercari’s Search & Discovery platform.

Kesselhaus
14:00
40min
Standing on the Shoulders of Giants
Bryan Burkholder, Varun Thacker

You've heard the wisdom of choosing boring technology, but how do you balance this with new usage patterns and ever-increasing scale? Learn how Slack developed a petabyte-scale log search engine with a team of just three engineers, supporting over one million queries per day.

Maschinenhaus
14:50
14:50
40min
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Daniele Antuzi

This talk presents an experimental serverless Solr indexer where the records from multiple database tables are merged using the map-reduce approach. The merge is implemented in AWS using Step Function Distributed Map to orchestrate many lambda functions.

Palais Atelier
14:50
40min
Sequential Testing Simplified with Basemath
Alexander Weiss, Konrad Richter

Learn how sequential testing has cut our experiment durations by nearly 40%. We'll delve into the core principles, tackle real-world implementation issues, and unveil Basemath, our Python-driven open-source tool. Gain insights on refining your experimental analysis for smarter feature deployment.

Frannz Salon
14:50
40min
The Power of the Bonus Card: Road to Personalised Search
Luuk Kaandorp, Vincent Peijnenburg

Everyone is unique, and that is especially true for what they eat. At Albert Heijn, we are moving away from popularity-based search for the broader customer base, in favor of tailoring search to our unique customers' tastes. We will tell you all about this transition.

Kesselhaus
14:50
40min
The “C” in CI/CD is not for “Closed”
Josh Reed

Most of the foundational tools we use in this industry are open source. But CI/CD is a domain where open source has stagnated and most innovation is happening in proprietary code. This talk discusses why this is, why it's dangerous, and what we can do about it.

Maschinenhaus
15:30
15:30
30min
Coffee Break
Kesselhaus
15:30
30min
Coffee Break
Maschinenhaus
15:30
30min
Coffee Break
Palais Atelier
15:30
30min
Coffee Break
Frannz Salon
16:00
16:00
40min
Applications of Tries in Apache Cassandra
Branimir Lambov

Log-structured merge-tree databases rely on a combination of in-memory buffers and hierarchies of immutable on-disk structures, which are two cases where a trie data structure is most efficient. Join us for a glimpse at the ways in which Apache Cassandra applies and benefits from this combination.

Kesselhaus
16:00
40min
From Text to Context: How We Introduced Hybrid Search
Ansgar Gruene, Dharin Shah, Alexander Butenko

We present the lessons learnt from improving the search of our global online marketplace with 20 million products sold per year. We successfully moved from a traditional word-match based approach (BM25) to a modern hybrid solution by adding a semantic vector model which we fine-tuned to our domain.

Maschinenhaus
16:00
40min
Search in the Cloud: separation of compute and storage
Ilan Ginzburg

Modifying SolrCloud to separate compute and storage presents challenges but is cost effective and improves scalability and availability. Nodes become stateless and durability is delegated to the shared storage implementation.
Salesforce runs such clusters at scale. The code is open source.

Palais Atelier
16:00
40min
Taming the cost of Kafka workloads in the cloud
Stefan Sprenger

This talk equips developers with everything they need to understand and optimize the cost of Kafka workloads in the era of cloud computing.

Frannz Salon
16:50
16:50
20min
A journey in geospatial timeseries
Nils Larsgård

Have you unlocked the geospatial powers of PostgreSQL?
This talk will show 3 neat ways to use PostgreSQL effectively to handle geospatial timeseries in a new domain. The data in question is live marine traffic from Norwegian territory.

Kesselhaus
16:50
20min
Comparing vector implementations in generic databases
Tudor Golubenco

We've recently seen a boom of specialized vector databases. At the same time, almost all popular database projects have added support for vectors. So a lot of people are asking themselves if and when they really need a specialized vector database, and when they could use an already deployed tool.

Frannz Salon
16:50
20min
Cracking the Code: Deciphering Evaluation Essentials for RAG
Atita Arora

Explore RAG systems: Understand evaluation criteria, tools, and frameworks.
Discover essential components and compare them with existing frameworks.
Gain insights for enhanced control in your path toward RAG optimization.

Maschinenhaus
16:50
20min
Open Web Search - a platform for a free European Web Index
Michael Dinzinger

The Open Web Search Initiative (OWS.eu) is a European initiative providing an open platform to foster innovation in search and AI applications. Join us for an introduction to OWS, its public datasets and an intro to two of the open source projects that underpin it - StormCrawler and URLFrontier.

Palais Atelier
17:15
17:15
5min
Closing session

Join us as we wrap up Berlin Buzzwords 2024!

Kesselhaus