Berlin Buzzwords 2024

Sunday, June 9, 2024

Monday, June 10, 2024

Tuesday, June 11, 2024

15:00

Barcamps are informal sessions, a kind of "un-conference", with a schedule decided on the day. It is all driven by the interests and expertise of those who attend so each one is different, but ours are always great!

09:30

Opening Session

Join us as we kick off Berlin Buzzwords 2024!

09:35

The Paradox of Open: Can Digital Commons Offer a Way Forward?

At the turn of the century, the internet ushered in the era of the digital commons - a realm of collectively created and managed resources that are open to the public. Over the past decade, the digital landscape has undergone a profound transformation.

10:20

20min

Breakfast Break

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

10:40

Accelerating TopK Queries

TopK Queries are SQL queries that contain both an order by and a limit clause. The optimization presented in this talk uses runtime information to skip scanning data that could not be a part of the final result, thereby significantly accelerating these types of queries.
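The abstract does not spell out the optimization, so the following is only a minimal sketch of one way runtime information can prune a TopK scan. It assumes the data is split into chunks that each carry a precomputed maximum (a hypothetical layout, not taken from the talk) and evaluates the equivalent of `ORDER BY value DESC LIMIT k`. The names `top_k_desc` and `chunks` are illustrative.

```python
import heapq

def top_k_desc(chunks, k):
    """Return the k largest values and the number of chunks actually scanned."""
    heap = []      # min-heap holding the k largest values seen so far
    scanned = 0
    for chunk_max, values in chunks:
        # Once k rows are held, heap[0] is the current k-th largest value.
        # A chunk whose maximum does not exceed it cannot change the result.
        if len(heap) == k and chunk_max <= heap[0]:
            continue
        scanned += 1
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), scanned

# Hypothetical data: each chunk is paired with its maximum value.
data = ([90, 85, 70], [60, 10, 5], [95, 20, 1], [30, 25, 2])
chunks = [(max(vals), vals) for vals in data]
result, scanned = top_k_desc(chunks, k=3)
```

In this example only two of the four chunks are scanned, because the threshold learned at runtime rules out the other two.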

Better search relevance using Learning to Rank at mobile.de

Manish Saraswat

At mobile.de we are striving to provide a better and faster search. We use a backend ML system to learn changing user interests and optimise the search experience. We discuss our current search relevance ranking framework, which is based on learning to rank with XGBoost, and how it ranks millions of searches daily.

Heap sizing and GC tuning for Solr and friends

Radu Gheorghe, Rafał Kuć

We’re taking a stab at a definitive - yet nuanced! - answer to “how much heap do I need?” and “is GC a problem and how can I fix it?” for Solr/Elasticsearch/OpenSearch deployments.

Synergy of Signals: Traffic Logs Meet LLM Labels

Discover how melding user-traffic signals with LLM-derived relevance labels can significantly improve learning to rank models. This talk unveils a novel approach to enhancing search relevance for query-product pairs, offering a glimpse into the future of e-commerce search technology.

11:10

Brick-by-Brick: Exploring the Elements of Apache Kafka®

Let’s rebuild the world of Apache Kafka® brick-by-brick through the lens of LEGO®. You’ll explore Kafka’s inventory of pieces, learn the ins and outs of core elements, and see just how they ‘click’ together with other APIs and tools. The only question left is: what will YOU make with Kafka?

From Natural Language to Structured Solr Queries using LLMs

Anna Ruggero, Ilaria Petreti

We explore the usage of AI, especially NLP techniques and LLMs, to enhance Apache Solr data accessibility. We propose translating natural language queries into structured Solr queries using LLMs and metadata to improve search and user experience. We’ll discuss the results and future directions.

Learning to Rank for Reddit Search - A Project Retro

Doug Turnbull, Charles Njoroge

The successes (and failures) of applying Machine Learning to improve Reddit's search ranking

`New` Workflow Orchestrator in town: "Apache Airflow 2.x"

What's the role of orchestrators in the modern data stack? Are you trying new orchestrators? Do we need orchestrators at all? Or has what an "orchestrator" can do for you been redefined?
Meet the "new" orchestrator in town: "Apache Airflow 2.x", with ways of orchestration you had not realized were possible.

12:00

Hybrid Search with Apache Solr Reciprocal Rank Fusion

Alessandro Benedetti

Hybrid search combines traditional keyword-based search with vector-based search.
The result sets are merged and a single ranked list of items is returned.
Reciprocal Rank Fusion is one of the most popular algorithms for such a task.
This talk presents the work done to bring it to Apache Solr.
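As a rough illustration of the merging step described above, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the Apache Solr implementation. The function name, the document ids, and the constant `k=60` (a commonly used default for RRF) are assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a keyword query and a vector query.
keyword_hits = ["doc1", "doc2", "doc3"]
vector_hits = ["doc3", "doc1", "doc4"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because only ranks are used, the keyword and vector scores never need to be normalized against each other, which is one reason the method is popular for hybrid search.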

Practical introduction to OpenTelemetry tracing

Nicolas Fränkel

OpenTelemetry is a standard for tracing across multiple components. Let’s see how to set it up across different stacks.

S3 as the state store for stream processing systems

S3 is cheap, but S3 is slow. Stream processing systems want to maintain internal state at low cost, but cannot tolerate high latency. How could we build a stream processing system on top of S3? I'll tell you what we learned over the last three years.

cuVS and Lucene: GPU-based Vector Search

Corey J. Nolet, Vivek Narang

This talk focuses on how we leveraged Nvidia’s open source cuVS library to accelerate Apache Lucene’s vector search capabilities on the GPU. cuVS contains several algorithms for approximate nearest neighbors and clustering on the GPU and we show how it pairs well with Lucene, which has long been the

12:40

80min

Lunch

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

14:00

Back to the Future! Time Travel with Bitemporal Databases

The talk explores managing transaction and validity times in data storage, using examples like retroactive creation of shipping documents. It highlights how bitemporal databases aid in this process and discusses the complexities of implementing such solutions.

End-to-end pipeline agility

Lars Albertsson

We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.

Fixin the Hard Bits of Event Processing with Restate & Kafka

Discover Restate: A new tool for event-driven apps, blending Kafka's power with simplicity for handling complex transactions, async tasks, and microservices. Dive into its unique approach to tackle scalability, resilience, debuggability, and state management challenges in our session.

Large language models are not a paradigm shift

Many of the putatively novel challenges of building systems around LLMs are analogous to problems we've solved for conventional ML systems. This talk will show you why the things you already know about building ML systems are still relevant for LLM systems — and where the true novelty of LLMs lies.

14:50

End-to-End Encryption for Streaming Data Pipelines

Hans-Peter Grahsl

This talk explains what it takes to bring end-to-end encryption to streaming data pipelines built on top of Apache Kafka and Flink. A live demo illustrates how to encrypt/decrypt sensitive payload fields by means of single message transformations and user-defined functions without any custom code.

Monitoring your home, with DevOps observability tools

Looking to reduce your home energy consumption, to move to more ecological heating, or just want to know what's going on in your house? You need data, and graphs! Learn how cheap hardware plus devops observability tools let you do that!

Moving from Offline to Online Machine Learning with River

The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach.

Shattering the Limits of Search with Domain Specific Computing

The demand for advanced search and data retrieval capabilities is growing exponentially: the rise of AI applications, along with the unprecedented scaling of data, is leading to workloads that are pushing traditional software-based search to its limits.
As cloud computing costs skyrocket and queries become more complex, organizations are often forced to compromise on result relevance to meet strict real-time latency requirements of 100 ms.
To address this issue, a new dimension of search must be introduced: domain-specific computing. It focuses on designing a dedicated chip for search to consistently achieve ten times faster search at billion scale, all at a fraction of the infrastructure cost.

15:20

Evolving Yelp's Streaming at Scale

In this talk we detail how Yelp evolved its Streaming Data Platform at scale while supporting its growing business needs coupled with the fast changing streaming landscape.

Flink's SQL Engine: Let's open the engine room!

Flink's SQL engine is the workhorse behind many event processing platforms. We take a deep look into the internals and take the stack apart! Optimizer phases, streaming primitives, watermarks, CDC, and upsert keys. Let me give you a feeling for the power of a simple streaming SQL query.

Improve LLM-based Applications with Fallback Mechanisms

This session explores different fallback strategies for RAG applications, enabling you to build robust LLM systems for various scenarios using the open source LLM framework Haystack.

Improving Search @scale with efficient query experimentation

From relevance guesswork to granular search experimentation at scale. Evaluate modifications to your search, such as incorporating synonyms, adjusting field-boosts, adding vector search, or product assortment modifications, on real data rather than relying solely on intuition.

16:00

30min

Coffee Break

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

16:30

Can ChatGPT build a Data Platform faster than a developer?

Gear up for a speed competition: it's ChatGPT versus yours truly, racing to build a data visualisation platform from scratch. Who will win? We’ll be taking on the challenge of reconstructing an energy generation visualisation platform, drawing inspiration from the likes of GridWatch.

Let's Do Data Lineage in Kafka, Flink and Druid!

How do you track data lineage in an open-source, realtime analytics pipeline? I'll show in a demo!

Rediscover your keyword search: Expand, Enrich and Rewrite

Praveen Mohan Prasad, Hajer Bouafif

Dense Vector Search is not the only route to improve your search relevance. Empower your existing sparse keyword search with semantic search capabilities by leveraging text expansion, metadata enrichment, and query re-writing techniques.

Remote work is here to stay - and what's next?

Radovan Bacovic

A great chance to hear about the future of work from the biggest all-remote company in the world. Arm yourself with a best-practices toolbox of async work tips and tricks, and draw on the experience of distributed-work leaders.

17:20

Advancements in Evaluating Large Language Model Applications

This talk will discuss approaches to evaluate LLMs at the end-to-end and task levels, focusing on use cases such as question-answering (RAG). We will also cover metric selection and ways to generate datasets using LLMs.

Build your 8-bit computer from scratch

Yes it's possible to create an 8-bit computer from scratch on breadboards at home with basic components and that's what I will demonstrate live with lots of blinking LEDs ;)

Can Apache Pinot replace your OLTP database?

Apache Pinot is a really fast database that's powering real-time applications across various industries, much akin to how people use OLTP databases. So what's the difference? Are there scenarios where Pinot can replace an existing OLTP database? In this talk we will explore this idea.

How we isolate streaming ingest from search using RocksDB

Learn a new cloud architecture that separates streaming ingest compute from query compute for real-time applications and how it addresses the age-old issue of compute contention.

18:00

Join us for snacks & drinks in a relaxed atmosphere. Meet old and new friends, business contacts or get to know other participants of Berlin Buzzwords! The Get-Together takes place directly in front of Palais Kulturbrauerei.

09:30

Harnessing Spare Cores to Breeze Through Cloud Compute

Gergely Daroczi

Spare Cores, a Python-based open-source ecosystem, offers standardized inventory and performance evaluations for compute resources of public cloud and server providers, helping with selecting and starting the optimal instance type for containerized tasks, such as ML model training or ETL processes.

Learning to Apply Generative AI at Enterprise Level

Sebastian Arnold

GenAI is transforming organizations rapidly. But how can we scale individual experimentation to enterprise-grade business products? This talk will cover learnings from hundreds of use cases in pharma about user intentions, information access patterns and quality metrics.

Streaming DataFrames: A New Way to Process Streaming Data

Tomáš Neubauer

Introducing an open source library in Python: Quix Streams. It solves all the complexities of stream processing in a cloud native package with a familiar Pandas DataFrame API interface. This library lets you work with streaming data similarly to static data in your Jupyter Notebook!

The Unsung Hero of Vector Database -- Metric Learning

We all know there are several vector databases, and Andrej Karpathy once said that even an array can do the same job: true, but not entirely. Vector databases provide the infrastructure to store embeddings, but how those embeddings are made is the most innovative part of all.

10:00

A Practical Approach To Semantic Search

Kentaro Takiguchi

Exploring and enhancing lexical search and semantic search in practical scenarios, we assess various optimization methods and their varied effects on metrics. The focus is on the integration of semantic search into an established lexical search system, addressing potential challenges and pitfalls.

Deep Learning plays Handball

Pere Urbon Bayes

Curious about where Deep Learning is unexpectedly being used these days? Because not everything is ChatGPT, AI for search, or fraud detection in e-commerce, this talk will review, and build together, state-of-the-art DL architectures such as the ones used by the Bundesliga. Help your local team with AI!

Elevating AI Applications with OpenSearch's Flow Framework and RAG Tool

Owais Kazi, Mingshi Liu

This talk introduces OpenSearch's Flow Framework and RAG (Retrieval-Augmented Generation) Tool, enabling developers to create AI-augmented agents, queries, and ingestion flows, integrate ML models, and streamline app development through a no-code/low-code builder.

Streaming doesn’t have to be hard

Nearly half of data scientists find it difficult to adopt streaming technologies. The audience will learn about the technological barrier and the operational burden that inspired us to look for better solutions, and how we built a stream-batch unified Python dataframe API.

10:40

30min

Coffee Break

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

11:10

Advanced Retrieval-Augmented Generation Techniques

I will discuss advanced retrieval-augmented generation (RAG) techniques such as hybrid search, re-ranking, query generation, and semantic text chunking.

Apache Lucene: From Text Indexing to Artificial Intelligence

Celebrating Apache Lucene's 22nd anniversary, this talk explores its pivotal role in search and data technologies, from powering platforms like Solr, Elasticsearch and OpenSearch to recent AI synergies with vector indexing/search. Discover Lucene's evolution and future horizons.

Kafka on the Fly: A Serverless Approach to Data Streaming

This talk explores serverless Kafka and real-time data streaming with Node.js microservices. I will delve into Upstash's serverless Kafka architecture and how it can be leveraged to build scalable and cost-effective solutions for microservices.

Under the hood of vector search with JVector

This is a speedrun through the past ten years of R&D on approximate nearest-neighbor search algorithms and vector databases, covering the major advances, the current state of the art, and possible future directions. This will be informed by Joel's experience contributing to JVector, a Java embedded vector search library.

12:00

Enhancing RAG with Neo4j Knowledge Graph

Djordje Benn-Maksimovic

Explore how leveraging a knowledge graph combined with vector indices can enhance semantic search and retrieval-augmented generation (RAG), enabling global insights beyond document boundaries.

Hardcoding airpods (and other stories from NLP in insurance)

Murhaf Fares, Emanuele Lapponi

Curious about NLP beyond the startup hype? Join us to explore NLP in a 'traditional' setting. Tackle challenges like data scarcity and domain specificity using, for example, data augmentation and zero-shot classification, and learn some tips and tricks to address concrete and relatable NLP problems.

Lessons learned writing 10+ Kubernetes Operators

Lars Francke, Jannik Heyl

I'll talk about all the mistakes we made and all the lessons we learned developing a data platform on top of a dozen or so Kubernetes operators so you don't have to go through the same ordeal.

Recognizing the driving forces of impactful OS projects

Maryblessing Okolie

This talk aims to challenge the usual "code-centric" narrative around open source projects. It argues that impact doesn't rest solely on lines of code but thrives within a diverse ecosystem of non-technical contributions.

12:40

80min

Lunch

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

14:00

Jina Embeddings V2: From Raw Data to Bilingual Hybrid Search

Isabelle Mohr, Bo Wang

Embeddings transform text into numerical vectors, capturing semantic relationships. This talk explores the data preparation, training processes, evaluation, and a demo of Jina Embeddings V2 in a hybrid search pipeline, showcasing its practical applications.

Open-Source Generative AI: A Product Manager’s Blueprint

Saahil Ognawala

We explore the evolving role of product management in open-source generative AI, focusing particularly on the industry impact of Retrieval Augmented Generation (RAG). We cover community-commercial balance, innovative AI product discovery, and sustainable monetization strategies.

Robust AI Search Ranking for Radical C2C Marketplace Growth

Chingis Oinar, Teo Narboneta Zosa

Key practical insights across dataset construction, custom metrics, model building, and model de-biasing techniques behind the success of our real-time AI search ranking system which drives significant business impact and now serves as a centerpiece of Mercari’s Search & Discovery platform.

Standing on the Shoulders of Giants

Bryan Burkholder, Varun Thacker

You've heard the wisdom of choosing boring technology, but how do you balance this with new usage patterns and ever-increasing scale? Learn how Slack developed a petabyte-scale log search engine with a team of just three engineers, supporting over one million queries per day.

14:50

Blazing-Fast Serverless MapReduce Indexer for Apache Solr

This talk presents an experimental serverless Solr indexer where the records from multiple database tables are merged using the map-reduce approach. The merge is implemented in AWS using Step Function Distributed Map to orchestrate many lambda functions.

Sequential Testing Simplified with Basemath

Alexander Weiss, Konrad Richter

Learn how sequential testing has cut our experiment durations by nearly 40%. We'll delve into the core principles, tackle real-world implementation issues, and unveil Basemath, our Python-driven open-source tool. Gain insights on refining your experimental analysis for smarter feature deployment.

The Power of the Bonus Card: Road to Personalised Search

Luuk Kaandorp, Vincent Peijnenburg

Everyone is unique, and that is especially true for what they eat. At Albert Heijn, we are moving away from popularity-based search for the broader customer base, in favor of tailoring search to our unique customers' tastes. We will tell you all about this transition.

The “C” in CI/CD is not for “Closed”

Most of the foundational tools we use in this industry are open source. But CI/CD is a domain where open source has stagnated and most innovation is happening in proprietary code. This talk discusses why this is, why it's dangerous, and what we can do about it.

15:30

30min

Coffee Break

Kesselhaus, Maschinenhaus, Palais Atelier, Frannz Salon

16:00

Applications of Tries in Apache Cassandra

Branimir Lambov

Log-structured merge-tree databases rely on a combination of in-memory buffers and hierarchies of immutable on-disk structures, which are two cases where a trie data structure is most efficient. Join us for a glimpse at the ways in which Apache Cassandra applies and benefits from this combination.

From Text to Context: How We Introduced Hybrid Search

Ansgar Gruene, Dharin Shah, Alexander Butenko

We present the lessons learnt from improving the search of our global online marketplace with 20 million products sold per year. We successfully moved from a traditional word-match based approach (BM25) to a modern hybrid solution by adding a semantic vector model which we fine-tuned to our domain.

Search in the Cloud: separation of compute and storage

Modifying SolrCloud to separate compute and storage presents challenges but is cost effective and improves scalability and availability. Nodes become stateless and durability is delegated to the shared storage implementation.
Salesforce runs such clusters at scale. The code is open source.

Taming the cost of Kafka workloads in the cloud

Stefan Sprenger

This talk equips developers with everything they need to understand and optimize the cost of Kafka workloads in the era of cloud computing.

16:50

A journey in geospatial timeseries

Have you unlocked the geospatial powers of PostgreSQL?
This talk will show three neat ways to use PostgreSQL effectively to handle geospatial timeseries in a new domain. The data in question is live marine traffic from Norwegian territory.

Comparing vector implementations in generic databases

Tudor Golubenco

We've recently seen a boom of specialized vector databases. At the same time, almost all popular database projects have added support for vectors. So a lot of people are asking themselves if and when they really need a specialized vector database, and when they could use an already deployed tool.

Cracking the Code: Deciphering Evaluation Essentials for RAG

Explore RAG systems: Understand evaluation criteria, tools, and frameworks.
Discover essential components and compare them with existing frameworks.
Gain insights for enhanced control in your path toward RAG optimization.

Open Web Search - a platform for a free European Web Index

Michael Dinzinger

The Open Web Search Initiative (OWS.eu) is a European initiative providing an open platform to foster innovation in search and AI applications. Join us for an introduction to OWS, its public datasets and an intro to two of the open source projects that underpin it - StormCrawler and URLFrontier.

17:15

Closing session

Join us as we wrap up Berlin Buzzwords 2024!