Berlin Buzzwords 2025

0.19 Berlin Buzzwords 2025 bbuzz25 2025-06-15 2025-06-17 3 00:05 https://program.berlinbuzzwords.de https://program.berlinbuzzwords.de/media/bbuzz25/img/bbuzz-logo_g9fhkQQ.svg Europe/Berlin Palais Atelier Barcamp #BBuzz 2025-06-15T14:30:00+02:00 14:30 03:00 Barcamps are informal sessions, a kind of "un-conference", with a schedule decided on the day. It is all driven by the interests and expertise of those who attend so each one is different, but ours are always great! bbuzz25-68453-barcamp Nick Burch en Although the barcamp doesn't have a strict schedule, it won't be completely devoid of structure! #bbuzz barcamps are dynamic events, focused on the overall Berlin Buzzwords topics, tackling the same challenges but in a different format. At the barcamp each session runs for 30 minutes giving enough time to get into the meat of a topic, but without a chance of anyone getting bored. These are participatory sessions and more inclusive than regular conference talks, with everyone taking part. You can help by leading the session, by giving some insights, by asking some great questions, or maybe just with your enthusiasm. The barcamp will be coordinated and moderated by Nick Burch. Registration starts from 2:30pm false https://program.berlinbuzzwords.de/bbuzz25/talk/E9FJMZ/ Kesselhaus Opening Session #BBuzz 2025-06-16T09:30:00+02:00 09:30 00:05 Join us as we kick off Berlin Buzzwords 2025! 
bbuzz25-68466-opening-session Berlin Buzzwords Team en false https://program.berlinbuzzwords.de/bbuzz25/talk/3NRGVL/ Kesselhaus Unpacking Digital Sovereignty: How to avoid fueling the nationalist rise Keynote 2025-06-16T09:35:00+02:00 09:35 00:45 This talk shows that digital sovereignty is prone to open the door to a nationalist agenda which favours the power concentration that led to Big Tech, and it easily slips into alt-right narratives that put colonising space over the needs of most of our planet's population. bbuzz25-70692-unpacking-digital-sovereignty-how-to-avoid-fueling-the-nationalist-rise /media/bbuzz25/submissions/JA9LZL/Entwurf_2_3_RoM5BZG.png Aline Blankertz en With the new ruling coalition of Trump, Musk and Big Tech, European digital sovereignty seems to be the widely accepted solution to break free from US companies that are used to threaten European governments. The EU is reviving industrial policy and is willing to invest a lot of money into digital sovereignty - but which problems exactly should it address, and what should be part of the solution? This talk shows that digital sovereignty is prone to open the door to a nationalist agenda which favours the power concentration that led to Big Tech, and it easily slips into alt-right narratives that put colonising space over the needs of most of our planet's population. After an overview of current approaches to digital sovereignty and the role of open-source within those, I will discuss what is needed to reclaim digital sovereignty to defend and strengthen democratic practice both in and through technologies. false https://program.berlinbuzzwords.de/bbuzz25/talk/JA9LZL/ Kesselhaus Which GPU for Local LLMs? Short Talk 2025-06-16T10:40:00+02:00 10:40 00:20 You’re using local LLMs. For example, to power RAG. You want to deploy them in production, but you don’t know where: which type of GPU? How large should it be? Should you use a larger model but quantize more aggressively? 
Our benchmark results and their interpretation will give you some answers. bbuzz25-61401-which-gpu-for-local-llms /media/bbuzz25/submissions/AWJYJH/Entwurf_1_52_2yyDZne.png Radu Gheorghe, Rafał Kuć en It’s easy to offload the LLM - in solutions such as RAG - to external services like OpenAI. This is great for PoCs, but if you have a lot of requests, a local LLM makes more sense from both a cost and a latency point of view. Especially in the context of RAG, where the query itself adds latency and the context to be shifted can be significant. For this session, we’ll use llama.cpp - which supports inference on many models for many platforms - and benchmark some LLMs on various GPUs. We’ll focus on cost, throughput (tokens/s), and memory usage when presenting results. Memory usage is the same for the same model, but we’ll explore quantization and how it influences throughput, especially since we can fit a larger context. A larger context means we can process more queries in parallel. Participants will get a better sense of how to deploy their RAG/LLM in production from a hardware, model, and quantization perspective. false https://program.berlinbuzzwords.de/bbuzz25/talk/AWJYJH/ Kesselhaus Shipping Lucene 10.0, 25 years in the making Talk 2025-06-16T11:10:00+02:00 11:10 00:40 The fascinating journey towards releasing version 10.0 of the popular Java search engine Apache Lucene. An inspiring and challenging venture seen through the eyes of its release manager, made possible by the vibrant Lucene community, culminated in deploying the new major to production in record time. bbuzz25-65576-shipping-lucene-10-0-25-years-in-the-making /media/bbuzz25/submissions/TDNGRG/Entwurf_1_9_l2rd4LS.png Luca Cavanna, Adrien Grand en Preparation and a real team effort: that’s what it takes. Releasing a major is an involved process, especially when it comes to a 25-year-old project with such a wide and diverse user base as Lucene. 
This talk will cover the purpose of shipping a new major version, the implications and benefits that derive from it for Lucene users, as well as specifics of the 10.0 release process. We will go through the ups and downs of the release manager as well as the team effort that it took to pull it off: bugs and performance regressions were uncovered in the process. Four release candidates were built along the way. We will expand on how the team performed thorough testing and benchmarking, which contributed to the success of the release, culminating in the deployment of Lucene 10.0 to production in record time. false https://program.berlinbuzzwords.de/bbuzz25/talk/TDNGRG/ Kesselhaus Building a knowledge graph for climate policy Talk 2025-06-16T12:00:00+02:00 12:00 00:40 At Climate Policy Radar, we're building an open-source knowledge graph for climate policy. In this talk, we'll share how we combine in-house expertise with scalable data infrastructure to identify key concepts in thousands of global climate policy documents. We'll also touch on ontology design, equitable evaluation, and the climate impacts of AI. bbuzz25-65389-building-a-knowledge-graph-for-climate-policy /media/bbuzz25/submissions/YJHRK8/Entwurf_1_7_jeHQSwb.png Harrison Pim, Fred O'Loughlin en We'll take you on a technical deep-dive into how we've built and scaled a knowledge graph which maps the relationships between thousands of climate policy concepts, and identifies where those concepts appear in our corpus of climate policy and other climate-relevant documents. We'll share the high-level methodology, infrastructure decisions, and evaluation framework which have allowed our small team to process millions of passages of text while maintaining high standards for fairness and accuracy. After covering the basics of what a knowledge graph is, and why you might want to build one, we'll cover: 1. 
**Knowledge Graph Architecture & Methodology** - An ontology which can handle the complexity of the climate policy domain - Interoperability considerations with existing sub-domain taxonomies - Why we're building in the open with Wikibase - The value of real human expertise 2. **Classifier Development & Evaluation** - A common model for classifiers, which can encompass a range of architectures from straightforward regexes, to fine-tuned BERT-based models, to optimised calls to third-party LLMs - Sampling strategies for building representative evaluation datasets - Quantitative metrics vs qualitative vibe-checks for classifier selection 3. **Production Infrastructure & Scaling** - A modular pipeline design separating model management, inference, and indexing - Prefect-based orchestration for distributed inference - Infrastructure as code with Pulumi - Planned integration with our existing search and RAG systems The audience should leave the talk with a clear understanding of: - Practical considerations when building domain-specific, high-impact knowledge graphs - Methods for evaluating NLP classifier performance in technical domains - Approaches to scaling inference pipelines, from local experimentation to routine cloud-based deployments - How we plan to use our knowledge graph to power a climate policy research platform, including integrations with RAG and other LLM-driven systems This talk should be particularly stimulating for data scientists and engineers working on information retrieval systems, knowledge graphs, or other high-impact natural language processing systems. false https://program.berlinbuzzwords.de/bbuzz25/talk/YJHRK8/ Kesselhaus Accelerating QuestDB: Lessons from a 6x Performance Boost Talk 2025-06-16T14:00:00+02:00 14:00 00:40 In this talk, I share our journey in making QuestDB, an Apache 2.0-licensed open-source time-series database, a significantly faster analytical database. 
In this session, I'll walk through how we identified opportunities for improvement, the key changes we implemented, and how those changes delivered dramatic performance improvements. bbuzz25-64984-accelerating-questdb-lessons-from-a-6x-performance-boost /media/bbuzz25/submissions/8ZGVVT/Entwurf_1_50_ESg6IOv.png Javier Ramirez en In this talk, I share our journey in making QuestDB, an Apache 2.0-licensed open-source time-series database, a significantly faster analytical database. Over the course of just one year, we achieved query performance gains of up to 6x by implementing specialised data structures, SIMD-based optimisations, scalable aggregation algorithms, and parallel execution pipelines. QuestDB is designed for high-performance ingestion—processing millions of rows per second—and efficient queries over billions of rows. While it excelled in time-based queries, we found that certain generic analytical queries were slower than expected. In this session, I'll walk through how we identified opportunities for improvement, the key changes we implemented, and how those changes delivered dramatic performance improvements in a relatively short timeframe. I’ll demonstrate before-and-after queries to showcase the impact of these optimisations. All the code is freely available in QuestDB's GitHub repository for anyone to explore or contribute to. false https://program.berlinbuzzwords.de/bbuzz25/talk/8ZGVVT/ Kesselhaus Self-hosting AI LLMs - a beginners guide Short Talk 2025-06-16T14:50:00+02:00 14:50 00:20 Want to avoid cloud-hosted AIs, and run your LLMs on your own systems, but not sure where to start? Or even what everything means? Join us to see how easy it can be, and what a beginner needs to know! bbuzz25-65564-self-hosting-ai-llms-a-beginners-guide /media/bbuzz25/submissions/CU8ZPP/Entwurf_1_41_i8wpbU7.png Nick Burch en There are some great cloud-hosted AI systems out there, but they aren't right for everyone. Maybe it's cost, or environmental impact. 
Data sovereignty, privacy or control. Whatever your reason, it can be daunting to figure out how to get started. Luckily, it doesn't have to be! We'll guide you through the key terms to know, where to find good AI / LLM models (it isn't github!), and how to run them. We'll see what influences memory and processing needs, and how fine-tuning can help. And even what that is! We'll help kick-start your journey to running models you control on your own hardware. We'll even have some live demos of models running on a laptop! false https://program.berlinbuzzwords.de/bbuzz25/talk/CU8ZPP/ Kesselhaus Harnessing AI to strengthen trustworthy information Talk 2025-06-16T15:20:00+02:00 15:20 00:40 AI can (also) enhance fact-checking and news classification. We developed a platform integrating search, an intelligent assistant, and a RAG system to support reliable journalism. By leveraging diverse data and analytics, we empower everyone with insights for accuracy and transparency, fostering collaboration for trustworthy information. bbuzz25-65583-harnessing-ai-to-strengthen-trustworthy-information Lucian Precup, Giovanna Monti en Misinformation spreads rapidly in the digital age, making it increasingly difficult to ensure information integrity. Brandolini’s Law highlights this challenge, but AI presents new opportunities to support fact-checking and news classification. In this session, we will share insights from our work on developing an AI-driven platform that integrates search capabilities, an intelligent assistant, and a Retrieval-Augmented Generation (RAG) system. By leveraging diverse data sources and advanced analytics, our approach empowers journalists and editors with tools to enhance accuracy and transparency in reporting. Attendees will gain an understanding of how AI can provide actionable insights, streamline fact-checking, and promote responsible journalism. 
While this technology continues to evolve, we aim to foster discussion on AI’s role in strengthening critical thinking and building a more trustworthy information ecosystem. false https://program.berlinbuzzwords.de/bbuzz25/talk/XFHXYP/ Kesselhaus Precision farming powered by K3s and TensorRT Talk 2025-06-16T16:30:00+02:00 16:30 00:40 Aurea Imaging is an AgTech scaleup focusing on precision farming in apple orchards. We've built the TreeScout, an edge device mounted on a tractor that unlocks the potential of each tree. We used an innovative technology stack to meet the requirements of an outdoor rural setup. Our journey was full of failures, learnings and ongoing challenges. bbuzz25-61408-precision-farming-powered-by-k3s-and-tensorrt /media/bbuzz25/submissions/AYHCBK/Entwurf_1_53_e2OLdPe.png Wieneke Keller, Sebastian Lenartowicz en Aurea Imaging is an AgTech scaleup focusing on precision farming in orchards. By retrofitting our TreeScout sensor package to their tractors, farmers are able to collect data about their orchard down to the tree level as they perform other tasks. Using onboard stereo cameras, the tractor’s high-precision GPS, a machine-vision pipeline running in real time on the device, and cloud-based analytics, this data is turned into maps used by other agricultural machines that enable the grower to utilize less labour and chemical products to produce more food. To run the edge part of this process, we opted against a traditional embedded/edge architecture based on Robot Operating System (ROS) and C/C++, and instead chose to build Python microservices orchestrated with K3s. This has brought the usual benefits of cloud-native tooling, but building cloud-native software for a far-edge, occasionally-airgapped application in an ecosystem based on decades-old standards comes with a myriad of challenges not faced in traditional cloud environments. 
Overall, Kubernetes on edge has brought us a very high development velocity and a reliable, maintainable codebase, and we hope to both explore the challenges this brings as well as inspire others to try this approach for their next edge project. On top of this we are running object detection models in tensorRT which are able to detect tree specifics at a driving speed of 8km/hr. false https://program.berlinbuzzwords.de/bbuzz25/talk/AYHCBK/ Kesselhaus End-to-End Semantic Search with Apache Solr 9.8 LLM Module Talk 2025-06-16T17:20:00+02:00 17:20 00:40 Apache Solr 9.8 introduces the LLM module opening the doors of end-to-end natural language query support through vector-backed semantic search (K Nearest Neighbors).  This talk explores the open source contribution from both the indexing and query angles and what’s coming next for Solr in terms of integrations with Large Language Models. bbuzz25-65295-end-to-end-semantic-search-with-apache-solr-9-8-llm-module /media/bbuzz25/submissions/7NBQPB/Entwurf_1_57_ucliWR2.png Alessandro Benedetti en Dense vector search was introduced in Apache Solr 9.0 in 2022 and since then it has received substantial adoption from the community. Text vectorisation had to happen outside Solr, as there was no support to encode text to vector within the search engine transparently. Apache Solr 9.8 changes this, introducing a module that allows interaction with well-known large language model providers such as OpenAI, Cohere, HuggingFace and Mistral AI via the open-source library  LangChain4j. Expect to learn how to configure Solr to access external text vectorisation services and use them to encode and run your queries through the 'knn_text_to_vector' query parser and vectorise your documents’ textual fields through the 'Text To Vector Update Request Processor'.  This is a foundational enabler that speeds up the design and development of end-to-end semantic search solutions.  
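As a rough illustration of the query side mentioned above, the following sketch builds a `knn_text_to_vector` local-params query and a select URL. The parameter names `model`, `f` and `topK` are given as we understand them from the Solr 9.8 reference guide; the collection `products`, the field `vector`, the model name `my-embedding-model` and the base URL are hypothetical placeholders, and nothing here is taken from the talk itself.

```python
from urllib.parse import urlencode


def knn_text_query(text, model, field, top_k=10):
    """Build a local-params query string for the knn_text_to_vector parser.

    `model` must name a model already registered in Solr's LLM module and
    `field` a dense vector field; both values used below are hypothetical.
    """
    return f"{{!knn_text_to_vector model={model} f={field} topK={top_k}}}{text}"


def select_url(base_url, collection, query):
    """Build a /select URL with the query URL-encoded as the q parameter."""
    return f"{base_url}/{collection}/select?{urlencode({'q': query})}"


q = knn_text_query(
    "waterproof hiking boots",
    model="my-embedding-model",  # hypothetical model name
    field="vector",              # hypothetical vector field
    top_k=5,
)
url = select_url("http://localhost:8983/solr", "products", q)
print(q)
print(url)
```

Solr, not the client, sends the query text to the configured embedding provider, so the client only ever passes plain text.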
The talk wraps up with future directions and how the introduction of the LLM module opens the doors for exciting new integrations. Join us as we dive into the AI future of Apache Solr! false https://program.berlinbuzzwords.de/bbuzz25/talk/7NBQPB/ Kesselhaus Get-Together #BBuzz 2025-06-16T18:00:00+02:00 18:00 03:00 Join us for food and drinks at Palais Kulturbrauerei! bbuzz25-60009-get-together Berlin Buzzwords Team en Berlin Buzzwords is a good place to meet great people, and while there is ample time to chat and discuss during the conference, we also want to provide an opportunity for everyone to meet up outside of the regular conference program with our Get-Together on Monday! Our Get-Together is a casual meetup in a relaxed atmosphere and provides a perfect opportunity to meet old and new friends and business contacts, or to get to know other participants of Berlin Buzzwords. Thanks to our partner Search Guard there will be food and drinks available: we will offer a range of vegetarian and vegan food options as well as alcoholic and non-alcoholic drinks. false https://program.berlinbuzzwords.de/bbuzz25/talk/RAWGVW/ Maschinenhaus Zero to Scale: Telemetry pipeline with Apache Cassandra Short Talk 2025-06-16T10:40:00+02:00 10:40 00:20 Picture billions of messages pouring in daily from thousands of data providers around the globe, which are then processed and published to customers. How can one design a telemetry system to capture, publish, and then index essential information about the data flowing through the system to give internal teams visibility to aid in troubleshooting? bbuzz25-65332-zero-to-scale-telemetry-pipeline-with-apache-cassandra /media/bbuzz25/submissions/YR3T98/Entwurf_1_10_F5lsUHa.png Shikhar Srivastava, Nomin-Erdene Oyun en A core part of our business is to receive and then process humongous amounts of financial data from all over the globe. 
This pipeline scales to tens of billions of pricing messages every day, in which each message carries highly valuable information. Getting visibility into what was sent to us versus what was published to our customers is of utmost importance to enable internal teams to quickly troubleshoot issues reported by any of the data providers. But how do we capture essential information about the data flowing through such a massive and high throughput system scaling to more than tens of thousands of processes running on close to a hundred machines, in which traffic peaks at more than a million messages per minute? In this talk, we will talk about how we built a high throughput telemetry system for streaming, storing, and searching such a high volume of data, starting from scratch using open source technologies like ZeroMQ, Apache Kafka, Kubernetes, and Apache Cassandra. You will gain valuable insights into the system’s design and performance, as well as the lessons we learnt along the way. We will cover everything from schema design and load testing to incremental deployment in order to manage such high data throughput. false https://program.berlinbuzzwords.de/bbuzz25/talk/YR3T98/ Maschinenhaus AI-Powered Search Results Navigation with LLMs & JSON Schema Talk 2025-06-16T11:10:00+02:00 11:10 00:40 Struggling to identify relevant filters among too many facets and frustrating results navigation? We explore an AI Filter Assistant for statistical data (SDMX) showing how LLMs can be leveraged to suggest the best filters for your natural language query, helping you refine the results in Apache Solr. We share wins, fails, and lessons learned. 
bbuzz25-65361-ai-powered-search-results-navigation-with-llms-json-schema /media/bbuzz25/submissions/3PXQZ8/Entwurf_1_55_ELypWJC.png Anna Ruggero, Edward Lambe, Ilaria Petreti en In this talk, we explore an AI-powered Filter Assistant, designed for the Statistical Data and Metadata eXchange (SDMX) to improve User eXperience in navigating search results efficiently and effectively. We discuss how LLMs enhance filter suggestions by analyzing both user queries and indexed data. On the architecture side, we break down: 1) Data retrieval – how we collected and processed the input SDMX data to build taxonomies used by the model to reconcile the concepts in the natural language query 2) API structure – a deep dive into our endpoints, what they do, and the responses they return. 3) Model choice – the process of identifying the best LLM for the task, including our motivations and studies 4) Structured output & JSON Schema – key benefits, limitations, and lessons learned from extensive testing. We showcase different test results and insights on what works best. 5) Solr query optimization – how to integrate the assistant’s output into a search query, using different boolean strategies to handle the refinement of both too-many and zero-result scenarios. Expect real-world insights, practical takeaways, and a discussion on the future of AI-driven filtering! false https://program.berlinbuzzwords.de/bbuzz25/talk/3PXQZ8/ Maschinenhaus Airflow 3 - the new beginning Talk 2025-06-16T12:00:00+02:00 12:00 00:40 If you have been living under a rock and have not heard that Airflow 3 is out and solves most of the pain points you had with Airflow 2, this talk is for you. You will learn how you can boost your Data Engineering and AI/ML workflows (without having to rewrite your DAGs) with what the Airflow community has worked on for the last 12 months. bbuzz25-64930-airflow-3-the-new-beginning /media/bbuzz25/submissions/HS8PFX/Entwurf_1_28_skuftIR.png Jarek Potiuk en Airflow 3 is out. 
In spring last year, the Airflow community concluded that, to respond to the needs of a number of our users, we had to reinvent ourselves again and release Airflow 3. Berlin Buzzwords comes at the right time and place to talk about it: by then, Airflow 3 will have been out for a while, and we will know not only what we planned but also how our users already use Airflow 3 and what benefits it brings them. In this talk you will learn from Jarek, one of the top maintainers of Airflow, all you want to know about Airflow 3: * why we decided to move to Airflow 3 * the architectural changes and improvements that lay the foundation for Airflow 3 being modern and applicable to more workflows * the features you always wanted and can use now: versioning, enterprise-level security isolation, better dependency management, execution isolation, datasets as first-class citizens, a modern React-based UI, schedulable and UI-controlled backfills, ML/AI workflows including inference workflows, an almost-streaming experience, and more * what early users of Airflow 3 say and how their workflow management improved * what is coming next (yes! we are not nearly done yet and more things are coming!) Last year, I presented "new orchestrator in town" - Airflow 2, but that was only a Prelude. If you want to hear the whole Symphony, come to see the talk. false https://program.berlinbuzzwords.de/bbuzz25/talk/HS8PFX/ Maschinenhaus What you see is what you mean; intent based ecommerce search Talk 2025-06-16T14:00:00+02:00 14:00 00:40 Intent-based clustering is our approach to overcome some limitations of modern hybrid search systems. We show how an upfront LLM-supported in-depth query understanding can leverage steps like retrieval, clustering, validation and presentation. We address various aspects from prototype to production in a large-scale high-volume e-commerce search. 
bbuzz25-65409-what-you-see-is-what-you-mean-intent-based-ecommerce-search /media/bbuzz25/submissions/PQJRNM/Entwurf_1_24_EDpDfbK.png Dennis Berger, Marco Petris, Volker Carlguth en In today's e-commerce landscape, hybrid search systems represent the market standard, yet they often struggle with highly inspirational queries, lack of precision in semantic recall, and presentation of diversity. Ambiguity in search terms can lead to disorganized results or overshadowed relevant products. This session will demonstrate how an upfront LLM-supported in-depth query understanding can increase recall meaningfully without sacrificing precision. We show how LLM-driven query understanding can enhance retrieval even for highly inspirational searches and how a final alignment between retrieved products, the query intent and the query context can compensate for indexing shortcomings. Additionally, we'll explore how query intent-based clustering can visually organize results by disambiguating meanings, thus providing users with a clearer path to relevant products. Our discussion will take attendees through our journey from prototype to production-ready implementation, sharing insights and challenges encountered in large-scale e-commerce environments with high query volume. Intent-based clustering represents a novel approach in e-commerce, reducing the "paradox of choice" and aiding customers in navigating product diversity. We'll explore various presentation forms for query intent-based clusters and share findings from UX tests that evaluated these approaches. Join us to gain a comprehensive understanding of how these advanced techniques can transform search experiences and drive better outcomes in e-commerce settings. 
false https://program.berlinbuzzwords.de/bbuzz25/talk/PQJRNM/ Maschinenhaus Performance Tuning Apache Solr for Dense Vectors Short Talk 2025-06-16T14:50:00+02:00 14:50 00:20 While powerful, dense vector search is not a plug-and-play feature that will scale straight out-of-the-box, particularly when it comes to extracting the maximum performance from limited compute resources. Come learn how we tuned dense vector indexes for our 100M+ document dataset, and drastically sped up our queries. bbuzz25-65399-performance-tuning-apache-solr-for-dense-vectors /media/bbuzz25/submissions/WLNLKY/Entwurf_1_1_TEzi9gQ.png Kevin Liang en With the recent boom in AI, many organizations are in the process of building semantic search stacks from scratch powered by Apache Solr and dense vectors. What many quickly learn when dealing with dense vectors is just how heavy the compute requirements are for vector search compared with lexical search. If not well-tuned, vector search query latency can quickly skyrocket, even with an otherwise reasonably sized dataset. We experienced this pain firsthand when we started vectorizing a 100M+ document dataset. While one can certainly approach this problem head-on by throwing hardware resources at it, this is neither a cheap nor fully-effective solution. This talk will cover a brief introduction to how Apache Solr/Lucene builds dense vector indexes, the journey of how we optimized our dense vector setup, as well as highlight the pitfalls/best practices we learned. Whether you’re a company building out full RAG pipelines or an enthusiast playing around with a novel alternative to standard lexical search, you’re going to want to squeeze the most performance out of your limited compute resources. Let us help you hit the ground running. 
false https://program.berlinbuzzwords.de/bbuzz25/talk/WLNLKY/ Maschinenhaus Anatomy of Table-Level Locks in PostgreSQL Talk 2025-06-16T15:20:00+02:00 15:20 00:40 This talk explains locking mechanisms (MVCC, lock queue) in PostgreSQL, focusing on table-level locks that are acquired by Data Definition Language (DDL) operations. If not managed well, schema changes can result in downtime. Not all operations require the same level of locking, and PostgreSQL offers tools and techniques to minimize locking impact. bbuzz25-64873-anatomy-of-table-level-locks-in-postgresql /media/bbuzz25/submissions/DXNPMT/Entwurf_1_44_A5n8liv.png Gülçin Yıldırım Jelinek en In PostgreSQL, managing schema changes without downtime can be a challenging task. Table-level locks, which control access during Data Definition Language (DDL) operations like ALTER or DROP TABLE, can result in unintended application slowdowns or even service interruptions when not fully understood. This talk will provide a comprehensive dive into table-level locking and lock queueing in PostgreSQL, helping attendees gain the insights they need to perform efficient schema changes. We’ll start by explaining the various types of table-level locks in PostgreSQL, such as Access Share, Exclusive, and Access Exclusive, and how they are automatically managed during common DDL operations. Then, we’ll break down lock queueing: how PostgreSQL organizes lock requests, what happens when transactions wait for locks, and how deadlocks can arise in complex environments. Next, we’ll focus on practical approaches to managing table-level locks for near-zero downtime. Attendees will learn techniques to minimize locking impact, including understanding lock conflicts, using online schema migration patterns, and identifying lock-heavy queries. We’ll introduce open-source tools like pgroll, which utilizes the expand/contract pattern to make schema changes in small, lock-free steps. 
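To make lock conflicts and lock queueing concrete, here is a small Python model of PostgreSQL's table-level lock modes. The conflict table follows the one in the PostgreSQL documentation on explicit locking; the `LockQueue` class and its method names are hypothetical and greatly simplify the real lock manager (a single table, no deadlock detection, no release or wake-up logic). It is a sketch for intuition, not material from the talk.

```python
# Table-level lock modes, weakest to strongest, as named in the PostgreSQL docs.
MODES = [
    "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
    "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
]

# For each mode, the set of modes it conflicts with (the relation is symmetric).
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW SHARE": {"EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                      "ACCESS EXCLUSIVE"},
    "SHARE UPDATE EXCLUSIVE": {"SHARE UPDATE EXCLUSIVE", "SHARE",
                               "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                               "ACCESS EXCLUSIVE"},
    "SHARE": {"ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE", "SHARE ROW EXCLUSIVE",
              "EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "SHARE ROW EXCLUSIVE": {"ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE", "SHARE",
                            "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                            "ACCESS EXCLUSIVE"},
    "EXCLUSIVE": set(MODES) - {"ACCESS SHARE"},
    "ACCESS EXCLUSIVE": set(MODES),
}


def conflicts(requested, other):
    """Return True if the two lock modes cannot be held at the same time."""
    return other in CONFLICTS[requested]


class LockQueue:
    """Toy lock queue for one table (hypothetical, heavily simplified)."""

    def __init__(self):
        self.granted = []  # (session, mode) pairs currently holding a lock
        self.waiting = []  # (session, mode) pairs queued in arrival order

    def request(self, session, mode):
        """Grant the lock if nothing ahead conflicts, otherwise enqueue it."""
        # A request waits behind conflicting *waiters* too, not only behind
        # conflicting lock holders. This is what makes queued DDL so disruptive.
        ahead = self.granted + self.waiting
        if any(conflicts(mode, held) for owner, held in ahead if owner != session):
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True


queue = LockQueue()
select_1 = queue.request("session-1", "ACCESS SHARE")      # long-running SELECT
alter = queue.request("session-2", "ACCESS EXCLUSIVE")     # ALTER TABLE must wait
select_2 = queue.request("session-3", "ACCESS SHARE")      # new SELECT queues behind it
print(select_1, alter, select_2)
```

The third request is compatible with the lock already held, yet it still waits because it conflicts with the queued `ACCESS EXCLUSIVE` request. This is why a common mitigation is to set a short `lock_timeout` before running DDL, so that a blocked schema change gives up instead of stalling all later queries.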
By the end of this session, attendees will be equipped with practical strategies and knowledge to control lock behavior during schema changes, ensuring data integrity and reducing operational disruptions. This talk will provide the tools needed to manage PostgreSQL schema changes with confidence and minimal impact on production environments. false https://program.berlinbuzzwords.de/bbuzz25/talk/DXNPMT/ Maschinenhaus How I Sidestepped ‘Being Glue’ Talk 2025-06-16T16:30:00+02:00 16:30 00:40 We all do things in our day to day work that are deemed ‘non-promotable’ - these are tasks that are crucial for project success but might not get you promoted. This is commonly known as glue work, a term coined by Tanya Reilly. Being glue doesn’t mean an end to your career, and it isn’t something that you can’t recover from. bbuzz25-58330-how-i-sidestepped-being-glue /media/bbuzz25/submissions/U3GWMN/Entwurf_1_11_RaqHexO.png Fatima Taj en Abstract: We all do things in our day to day work that are deemed ‘non-promotable’ - these are tasks that are crucial for project success but might not get you promoted. This is commonly known as glue work, a term coined by Tanya Reilly. Your first instinct about this could be to drop these tasks immediately, but this might not be the best approach. Being glue doesn’t mean an end to your career, and it isn’t something that you can’t recover from. During this talk, I will use a personal experience to illustrate how I narrowly avoided the trap of being permanently stuck with glue work and how to salvage a situation where you might be in a similar predicament. Outline: 1. Introduce the concept of glue work 2. Talk about the specific work I was doing and why I started getting concerned - I didn’t want to do work that wasn’t deemed promotable - As a woman in engineering, I needed to be extra careful about not doing ‘administrative’ or ‘secretarial’ work 3. How I dealt with it - Realizing that this could become an issue for my career progression. 
- Instead of being reactive and immediately stopping doing these tasks, thinking things through. - Dealing with it proactively: I immediately talked to my manager and tried understanding how they perceived this work to be. - Talked to a mentor at my company (a staff engineer) on what their take was. They helped identify a couple of things: a. If the work that you’re doing isn’t being reflected in your performance evaluations, then that’s a red flag. b. Try to think about why you’re constantly doing this work? Is it because roles aren’t well defined in your company? They talked about how this job was being done by the project lead in other teams. This helped me realize that the project lead role was very loosely defined within my team, which was part of the reason I was experiencing this. 4. Long term fixes: - Moving forward, make sure I was as heavily invested in the implementation stages of the project (which requires more technical skills than soft skills). - Knowledge sharing and mentoring other engineers so the glue work I was doing could be done on a rotational basis within the team. Key takeaways: 1. Learn to reflect on your day to day work and identify ‘glue work’. 2. Instead of being reactive to ‘glue work’, learn to reflect on how you can mitigate the risks that come with glue work and stay on track with getting promoted. 3. Learn how to leverage your manager, and team members to reduce the ‘glue work’ you’re doing. false https://program.berlinbuzzwords.de/bbuzz25/talk/U3GWMN/ Maschinenhaus Melting Icebergs: Direct access to Kafka Data via Iceberg Talk 2025-06-16T17:20:00+02:00 17:20 00:40 Data in organizations is traditionally split between operational and analytical estates. Join us for an account of our journey combining Apache Kafka and Apache Iceberg to create a solution that addresses both estates with one data source. 
bbuzz25-62561-melting-icebergs-direct-access-to-kafka-data-via-iceberg /media/bbuzz25/submissions/GLM8BQ/Entwurf_1_25_ChXWvpk.png Tom Scott, Roman Kolesnev en An organisation's data has traditionally been split between the operational estate, for daily business operations, and the analytical estate for after-the-fact analysis and reporting. The journey from one side to the other is today a long and tortuous one. But does it have to be? In the modern data stack Apache Kafka is your de facto standard operational platform and Apache Iceberg has emerged as the champion of table formats to power analytical applications. Can we leverage the best of Iceberg and Kafka to create a powerful solution greater than the sum of its parts? Yes, we can, and we did! This isn't a typical story of connectors, ELT, and separate data stores. We've developed an advanced projection of Kafka data in an Iceberg-compatible format, allowing direct access from warehouses and analytical tools. In this talk, we'll cover: * How we presented Kafka data for Iceberg processors without moving or transforming data upfront—no hidden ETL! * Integrating Kafka's ecosystem into Iceberg, leveraging Schema Registry, consumer groups, and more. * Meeting Iceberg's performance and cost reduction expectations while sourcing data directly from Kafka. Expect a technical deep dive into the protocols, formats, and services we used, all while staying true to our core principles: * Kafka as the single source of truth—no separate stores. * Analytical processors shouldn't need Kafka-specific adjustments. * Operational performance must remain uncompromised. * Kafka's mature ecosystem features, like ACLs and quotas, should be reused, not reinvented. Join us for a thrilling account of the highs and lows of merging two data giants and stay tuned for the surprise twist at the end! 
false https://program.berlinbuzzwords.de/bbuzz25/talk/GLM8BQ/ Palais Atelier Mixture of Encoders: A Vector-Native Approach to Search Short Talk 2025-06-16T10:40:00+02:00 10:40 00:20 Mixture of Encoders is a vector-native alternative that models both structured and unstructured data in a unified embedding space. We will introduce the method, show how it powers natural language search and real-time recommendations, and share open-source tools and benchmarks for replacing complex hybrid stacks. bbuzz25-70522-mixture-of-encoders-a-vector-native-approach-to-search /media/bbuzz25/submissions/3XUUYV/Entwurf_1_56_HEamUyl.png Filip Makraduli, Ben Gutkovich en Filters, hybrid search, rank fusion, re-ranking. Most retrieval systems today are stitched together from separate components, each tuned in isolation. There is no systematic way to integrate structured data into vector search. Ask anyone maintaining a mature Elasticsearch deployment with 100+ boosts and hand-written scoring logic whether they can still evaluate retrieval quality end to end and iterate quickly. The answer is almost always no. To address this, you need models that understand both your unstructured and your structured data. That includes numeric, categorical, relational, spatial, and temporal metadata, all of which are critical for powering modern search, recommendations, and agentic retrieval systems. These signals drive both end-user precision and business impact. At M&S (Marks and Spencer), we solved this problem using a set of custom pipelines, but the process required significant development effort and lacked a unified framework. There is a better way. We call our approach the Mixture of Encoders. It is a vector-native alternative to hybrid search that brings structure to retrieval by embedding each data type with a specialised encoder and composing them into a unified vector space. Text, images, categories, numerical features, and contextual signals all become searchable through a single query. 
This enables nuanced, real-time retrieval across modalities without relying on filters or post-processing stages. In this talk, we will introduce the technique and show how it supports natural language query decomposition, dynamic modality weighting, and session-aware ranking, all within a single retrieval step. We will share how this approach has been deployed in production, powering retrieval in high-churn environments and contributing over $10M in incremental revenue through improved discovery and recommendation quality. To support adoption, we are also releasing open source datasets for benchmarking real-world information retrieval tasks, along with open source demo implementations that show how to apply the Mixture of Encoders to your own data and use cases. --- This talk is sponsored by <a href="https://www.superlinked.com">Superlinked</a>. false https://program.berlinbuzzwords.de/bbuzz25/talk/3XUUYV/ Palais Atelier Quiet on Set: Building an On-Air Sign with Open Source Tech Talk 2025-06-16T11:10:00+02:00 11:10 00:40 Learn how to build a custom On-Air sign using Apache Kafka®, Apache Flink®, and Apache Iceberg™! See how to capture events like Zoom meetings and camera usage with Python, process data with FlinkSQL, analyze trends from your Iceberg tables, and bring it all together with a practical IoT project that easily scales out. bbuzz25-65485-quiet-on-set-building-an-on-air-sign-with-open-source-tech /media/bbuzz25/submissions/HXUTN9/Entwurf_1_30_7SHTJT0.png Danica Fine en While many of us have adapted to work from home life, one major problem remains: finding an easy way to keep folks in your home away from your workspace when you’re on an important call. Dust off your Raspberry Pi––let’s build a custom on-air sign with Apache Kafka®, Apache Flink®, and Apache Iceberg™! We’ll begin by writing Python scripts to capture key events––such as when a Zoom meeting is running and when a camera is being used––and produce it into Kafka. 
The live data are then consumed by a Raspberry Pi script to drive the operation of a custom designed on-air sign. From there, you’ll be introduced to the ins and outs of FlinkSQL for stream processing as we wrangle the data into a better format for downstream use. And, finally, we’ll see Iceberg in action and learn how to use query engines to analyze meeting and recording trends. By the end of the session, you’ll be well-acquainted with this powerful trio of open source technologies and know how you could use the same scaffolding and scale out a simple, at-home project to millions of users and simultaneous events. false https://program.berlinbuzzwords.de/bbuzz25/talk/HXUTN9/ Palais Atelier Performance & Fault Tolerance: Building a Modern Database Talk 2025-06-16T12:00:00+02:00 12:00 00:40 What are some key concepts and design decisions behind modern, scalable, highly performant databases? Learn how a database delivers sub-millisecond 99th-percentile latency at throughputs of millions of operations per second, at scale, and how you can use it. bbuzz25-61593-performance-fault-tolerance-building-a-modern-database /media/bbuzz25/submissions/LQK7AE/Entwurf_1_20_Wv4fpRz.png Guy Shtub en What are some key concepts and design decisions behind modern, scalable, highly performant databases? ScyllaDB was initially inspired by Apache Cassandra. Instead of using Java, it’s written in C++, which gives it fine control over hardware and operating system resources. These design decisions, using a shard-per-core architecture and more, allow it to achieve 10x the performance of other databases, with sub-millisecond 99th-percentile latency at throughputs of millions of operations per second. In this deep dive, you will learn about the core concepts and architecture of a modern, close-to-the-hardware, distributed, scalable, fault-tolerant database and see some examples of its use. The talk covers autotuning, scalability, elasticity, high availability, and more. 
false https://program.berlinbuzzwords.de/bbuzz25/talk/LQK7AE/ Palais Atelier A decade of lessons in Open Source licensing Talk 2025-06-16T14:00:00+02:00 14:00 00:40 Drawing from experience reviewing over 1,000 open-source releases, I'll address common misconceptions, frequent compliance issues, and the evolution of policies to mitigate these challenges. Attendees will gain practical insights to ensure smoother project releases and foster a compliant, collaborative, open-source community. bbuzz25-64372-a-decade-of-lessons-in-open-source-licensing /media/bbuzz25/submissions/SEM8GV/Entwurf_1_29_8uoUW1l.png Justin Mclean en In this talk, I will provide insights into how developers and community members in the open-source community navigate legal and licensing policies. Over the past decade, I have reviewed over 1,000 open-source releases for compliance with various licensing and distribution policies. I will discuss common misconceptions that open-source community members have about licensing, highlight frequent issues encountered during releases, and share how our policies and processes have evolved over time to help catch these issues. I will also outline areas that need improvement to align with emerging legislation and industry standards. This talk aims to equip developers and organisations with practical knowledge to navigate the legal landscape of open-source software more effectively, ensuring smoother project releases and fostering a more collaborative and legally compliant community. false https://program.berlinbuzzwords.de/bbuzz25/talk/SEM8GV/ Palais Atelier gamma_flow: Denoise, classify and disentangle spectral data! Short Talk 2025-06-16T14:50:00+02:00 14:50 00:20 gamma_flow is an open-source Python package for real-time spectral data analysis. Designed for speed and efficiency, it avoids large models, opting instead for a novel supervised dimensionality reduction approach. 
This enables seamless denoising, classification, and disentangling of single-label or multi-label spectra. bbuzz25-64383-gammaflow-denoise-classify-and-disentangle-spectral-data /media/bbuzz25/submissions/YAX9UA/Entwurf_1_5_szDyi0t.png Viola Rädle, Raphael Franke en In many research fields, spectral measurements help to assess material properties. In this context, an area of interest for many researchers is the classification (automated labelling) of the measured spectra. Additionally, there may be a need to decompose multi-label spectra (linear combinations of different substances) and identify their constituents. As proprietary spectral analysis software is often limited in its functionality and adaptability, a Python package was developed and will be presented in this talk. gamma_flow (Guided Analysis of Multi-label spectra by Matrix Factorization for Lightweight Operational Workflows) includes: - classification of test spectra to predict their constituents - denoising of test spectra for better recognizability - outlier detection to evaluate the model's applicability to test spectra It is based on a dimensionality reduction model that constitutes a novel, supervised approach to non-negative matrix factorization (NMF). Hence, it exploits and adapts conventional data science methods rather than using extensive, energy-intensive models like neural networks. This results in a fast, robust and reliable automated analysis, leading to classification accuracies above 90%. false https://program.berlinbuzzwords.de/bbuzz25/talk/YAX9UA/ Palais Atelier From Search to Insight: Leveraging OpenSearch for Scalable, AI-Driven Search Experiences Talk 2025-06-16T15:20:00+02:00 15:20 00:40 Modern applications demand search capabilities that go beyond basic text matching—they need to be fast, accurate, personalized, and context-aware. 
This session demonstrates how OpenSearch's latest AI/ML enhancements and engine improvements enable organizations to build intelligent, scalable search experiences that meet these evolving needs. bbuzz25-71750-from-search-to-insight-leveraging-opensearch-for-scalable-ai-driven-search-experiences /media/bbuzz25/submissions/TBYJME/Entwurf_2_wi0wu7g.png Saurabh Singh en Observability, log analytics, GenAI systems, and RAG pipelines must query massive volumes of semantic embeddings to retrieve relevant content instantly. Today’s search systems often fall short in handling high-dimensional vector data and similarity search. OpenSearch 3.0 brings significant architectural improvements to address these challenges. Integrating Apache Lucene 10 and JVM 21, the platform delivers 20% faster queries than its 2.x predecessor and 10× the throughput of 1.x versions. New features like GPU-accelerated vector indexing and concurrent segment search dramatically improve k-NN query performance while reducing operational costs. The platform's expanded AI capabilities now include an advanced Vector Engine for approximate k-NN searches and neural sparse search for efficient text indexing. These innovations, combined with optimized embedding ingestion and query-time pruning, enable organizations to build performant, cost-effective search solutions that scale with their needs. We'll explore practical applications of these features, demonstrating how OpenSearch 3.0 powers the next generation of AI-driven search experiences. --- This session is sponsored by <a href="https://opensearch.org">OpenSearch</a>. false https://program.berlinbuzzwords.de/bbuzz25/talk/TBYJME/ Palais Atelier Qumat: Apache Mahout Quantum Compute Talk 2025-06-16T16:30:00+02:00 16:30 00:40 In this talk we present current progress on Mahout's new quantum compute layer named Qumat. We will give an overview of the project, explain why we built Qumat, and show its current state. 
We will present a demo of Qumat in action, and end with calls to action for researchers and engineers interested in using it and contributing to the project. bbuzz25-64496-qumat-apache-mahout-quantum-compute /media/bbuzz25/submissions/SBVMSZ/Entwurf_1_27_Q9g9miP.png Andrew Musselman, Trevor Grant en Following Mahout's core values of interoperability and providing tools for matrix arithmetic at scale, we have added a new layer (qumat) alongside our existing distributed matrix math framework (Samsara), that allows quantum researchers and developers to write code once and run it on any back-end available. As with distributed compute systems like Spark and Flink, moving from one platform to another typically requires a complete code rewrite. This is prohibitive in most cases, but Samsara allows machine learning researchers and developers one unified interface to write code once and port instantly to another platform if it is deemed necessary. Similarly for quantum computing, multiple vendors (IBM, GCP, and AWS to name a few) have their own libraries for accessing their cloud quantum compute services, such as qiskit, cirq, and braket. To give the same flexibility in the quantum area, qumat corrals all these libraries under one interface, allowing users to focus on building circuits and writing algorithms rather than adapting to one particular library. false https://program.berlinbuzzwords.de/bbuzz25/talk/SBVMSZ/ Palais Atelier "Do What I Mean": The History of AI and Program Synthesis Talk 2025-06-16T17:20:00+02:00 17:20 00:40 We think of generating source code from a prompt as an AI-powered feature of modern IDEs, but the general problem has a rich history in research efforts and domain-specific programming systems. In this session, you'll learn about the history of program synthesis, its relationship to the history of AI, and what lessons this history has for us today. 
bbuzz25-65581-do-what-i-mean-the-history-of-ai-and-program-synthesis /media/bbuzz25/submissions/ZGPRFQ/Entwurf_1_12_6THE0e4.png William Benton en Many programmers today rely on an AI-powered assistant in their editor, and many savvy users of language models have observed that LLMs are often better at writing code to solve a problem than at reasoning directly. However, relatively few developers know that generating correct programs from human specifications has been an active research area for over fifty years. In this session, you'll learn about the fascinating history of this cross-disciplinary effort and see how it brings together topics from statistical machine learning, classical symbolic AI, programming language theory, program verification, and combinatorial search. We'll cover fundamental approaches, challenges, and historically-important applications; we'll also show some interesting parallels between the history of AI systems and the history of program synthesis. We'll conclude with vital lessons from the history of program synthesis that can inform how we should build tomorrow's coding assistants — and how we can better use the ones available to us today. false https://program.berlinbuzzwords.de/bbuzz25/talk/ZGPRFQ/ Frannz Salon Going Local-First: A Primer Short Talk 2025-06-16T10:40:00+02:00 10:40 00:20 The local-first paradigm promises transformative benefits — user-owned data, seamless offline capabilities, and instant interactions. But how do you get started? In this lightning talk, we’ll cover the key concepts and show you how to begin your local-first journey. bbuzz25-63488-going-local-first-a-primer /media/bbuzz25/submissions/H38JKG/Entwurf_1_26_d4yvXlk.png Miloš Sutanovac en Not every new trend in web development is destined to stick, but the local-first paradigm feels different. It’s about more than offline capability — it’s a shift toward user-owned data, instant interactions, and apps that work seamlessly, regardless of network conditions. 
This talk offers a short, approachable introduction to building local-first applications today, covering core principles, architectures, and tools. false https://program.berlinbuzzwords.de/bbuzz25/talk/H38JKG/ Frannz Salon The Dark Secrets of Stream Processing Talk 2025-06-16T11:10:00+02:00 11:10 00:40 Stream processing systems promise fresh results, strong consistency, and S3-based cost savings. But pitfalls exist: * Backfilling takes too long due to incremental state maintenance. * Consistency causes system stalls during joins. * S3 costs spike with cache misses. This talk explores these issues, mitigations, and hard truths. bbuzz25-59030-the-dark-secrets-of-stream-processing /media/bbuzz25/submissions/EGQNC8/Entwurf_1_46_ZYio5D7.png Yingjun Wu en Stream processing systems seem magical: they deliver much fresher results compared to batch processing, promise the highest levels of consistency, and leverage S3 to reduce state storage costs. But is it too good to be true? In the world of data systems, there’s no such thing as a free lunch. Every benefit comes with trade-offs. Here are three pitfalls that new stream processing practitioners often overlook: * Backfilling Takes Forever Stream processing systems continuously maintain internal states to enable incremental computation. However, this comes at a cost: bootstrapping a streaming job—or creating a materialized view in the database context—can take an arbitrarily long time. The larger the historical data or the more complex the processing, the worse this problem becomes. * Consistency Isn't Free Many stream processing systems claim to offer "strong" consistency, even across multiple streaming jobs. However, this level of consistency has a price: system stalls during events like join amplifications or dependency mismatches. These bottlenecks can significantly impact real-time performance and overall system reliability. 
* S3 is Cost-Effective, Until It’s Not Modern stream processing systems often use S3 as the primary storage for maintaining states, as it promises lower costs compared to in-memory or on-disk alternatives. But here’s the catch: S3 access costs can skyrocket when cache misses are too frequent. What starts as a cost-saving measure can quickly turn into a major expense. In this talk, I’ll dive deep into these three pitfalls, explaining their causes, possible mitigations, and the hard truths about unsolvable challenges. I’ll share real-world examples of how these issues manifest and the “bloody facts” of how they can bite even the most experienced practitioners. false https://program.berlinbuzzwords.de/bbuzz25/talk/EGQNC8/ Frannz Salon State of native access in Apache Lucene Talk 2025-06-16T12:00:00+02:00 12:00 00:40 Lucene 10 came out last year in October. One of the changes was the new minimum requirement of Java 21, which appears to allow new features such as native access to the file system cache and therefore better memory mapping. Is it as easy as it sounds? bbuzz25-65505-state-of-native-access-in-apache-lucene /media/bbuzz25/submissions/9BQWQ8/Entwurf_1_51_7VFf8Va.png Uwe Schindler en This talk will discuss the new features of Java 21 and how they can be used in Lucene 10. But it will also show the challenges that come from the fact that many of the Lucene-relevant APIs are preview-only in that version. Uwe will introduce the mechanisms used to provide access to the native layer of Java 21, as well as their limitations. New features implemented using native APIs are `madvise`/`fadvise` kernel hints for the file system, but also preloading of pages required during search. The limitations of the current system make it impossible to integrate modern Java APIs into the public API of Lucene, so Lucene 10 still has the same limitations as previous versions with regard to Java preview features. 
Discussions are ongoing to change the main branch to align with the "[tip & tail](https://openjdk.org/jeps/14)" model of development used in the OpenJDK community, to start integrating types like `MemorySegment` as first-class citizens of Lucene's public API in the main (development) branch, and to release new versions with new minimum requirements more often. One example of this is a PR to offload calculations to a graphics card, which was submitted to the Lucene project at the time of writing this proposal. false https://program.berlinbuzzwords.de/bbuzz25/talk/9BQWQ8/ Frannz Salon When StatefulSets are not enough Talk 2025-06-16T14:00:00+02:00 14:00 00:40 K8s StatefulSets present significant hurdles for scaling and migrating large-scale cloud database workloads. We'll cover scaling strategies beyond vanilla StatefulSets and share lessons on executing zero-downtime live migrations using custom controllers, durable execution workflows, and tackling complex synchronization problems in ClickHouse Cloud. bbuzz25-64833-when-statefulsets-are-not-enough /media/bbuzz25/submissions/DARWF8/Entwurf_1_42_rTgALHZ.png Manish Gill en This is a densely packed technical talk that teaches you Auto Scaling architecture, Kubernetes StatefulSets and their limitations, various scaling strategies and StatefulSet alternatives. We also look at building custom Kubernetes controllers for the purpose of changing our orchestration code-path, and investigate leveraging durable execution workflows like Temporal for managing zero downtime migrations. You will understand the pros and cons of Break-First and Make-First scaling models and which to use when. We focus on the challenges that prevent doing Make-First with traditional StatefulSets. We discuss open source projects such as Advanced StatefulSets, OpenKruise and a custom Multi-StatefulSet approach. We go into the story of moving from one mode of orchestrating StatefulSets to another via a Live Migration, without breaking the running queries. 
Finally we end with some ClickHouse specific problems we encountered during the migrations and how we solved them. false https://program.berlinbuzzwords.de/bbuzz25/talk/DARWF8/ Frannz Salon Reproducibility in Embedding Benchmarks Short Talk 2025-06-16T14:50:00+02:00 14:50 00:20 Reproducibility in embedding benchmarks is challenging, especially with embedding models that are instruction-tuned and increasingly large. Learn how MTEB tackles prompt variability, scaling issues, and large datasets to ensure fair and consistent evaluations, setting a standard for benchmarking in embeddings. bbuzz25-63251-reproducibility-in-embedding-benchmarks /media/bbuzz25/submissions/J93LCV/Entwurf_1_32_1fElk11.png Isaac Chung en Reproducibility in embedding benchmarks is no small feat. Prompt variability, growing computational demands, and evolving tasks make fair comparisons a challenge. The need for robust benchmarking has never been greater. The Massive Text Embedding Benchmark (MTEB) addresses these challenges with a standardized, open-source framework for evaluating text embedding models. Covering diverse tasks like clustering, retrieval, and classification, MTEB ensures consistent and reproducible results. Extensions like MMTEB (multilingual) and MIEB (image) further expand its capabilities. In this talk, we’ll explore the quirks and complexities of benchmarking embedding models, such as prompt sensitivity, scaling issues, and emergent behaviors. We’ll show how MTEB simplifies reproducibility, making it easier for researchers and industry practitioners to measure progress, choose the right models, and push the boundaries of embedding performance. false https://program.berlinbuzzwords.de/bbuzz25/talk/J93LCV/ Frannz Salon Best Practices for Running Databases on Kubernetes Talk 2025-06-16T15:20:00+02:00 15:20 00:40 Running open source databases on Kubernetes? Learn best practices for high availability, security, backups, and disaster recovery. 
Discover key pitfalls to avoid and see how Operators simplify database management for MySQL, MongoDB, and PostgreSQL in Kubernetes environments. bbuzz25-65493-best-practices-for-running-databases-on-kubernetes /media/bbuzz25/submissions/CMBUQB/Entwurf_1_40_i3Hx0tw.png Peter Zaitsev en So you’re looking to run your Open Source Database on Kubernetes. What best practices should you follow and what pitfalls should you avoid? In this presentation we will look at how to run stateful applications on Kubernetes overall as well as what is particularly important for databases - we will cover high availability, security, backups and disaster recovery. Finally we will show how these practices can be implemented with Percona Operators for MySQL, MongoDB, and PostgreSQL, one of the leading solutions to run Open Source Databases on Kubernetes. false https://program.berlinbuzzwords.de/bbuzz25/talk/CMBUQB/ Frannz Salon Apache Iceberg ingestion with Apache NiFi Talk 2025-06-16T16:30:00+02:00 16:30 00:40 With Apache NiFi, a multimodal data pipelining tool, you can assemble existing and/or custom Java & Python processors into a variety of flows. Watch a rich data pipeline be constructed from Kafka, stored using the Apache Iceberg table format and consumed from Trino. bbuzz25-62364-apache-iceberg-ingestion-with-apache-nifi /media/bbuzz25/submissions/YGPGNY/Entwurf_1_6_Oju7YbT.png Lester Martin en A cornerstone requirement of an Icehouse (Iceberg + Trino) is data ingestion. One approach is to leverage Apache NiFi. NiFi, a multimodal data pipelining tool, has a multitude of processors that can be assembled into a flow to address your specific scenarios. NiFi's low-code/no-code approach allows data engineers to rapidly build, deploy, and monitor their data ingestion & transformation pipelines. NiFi also allows custom processor development with a variety of languages, including Java and Python. 
This presentation will iterate through a few common approaches and ultimately demonstrate a rich data pipeline that sources data from Kafka, performs typical transformation processing (including enrichment), and loads data into a high-performance Iceberg table that will be consumed via Trino. false https://program.berlinbuzzwords.de/bbuzz25/talk/YGPGNY/ Frannz Salon Exploring reranking depth in modern search pipelines Talk 2025-06-16T17:20:00+02:00 17:20 00:40 The use of semantic reranking on top of a ‘cheaper’ retrieval step is common in modern search applications. The reranking depth represents the number of documents that we select to retrieve and feed into the reranking model. We experiment with different models and datasets and we present our findings including some counterintuitive ones. bbuzz25-65379-exploring-reranking-depth-in-modern-search-pipelines /media/bbuzz25/submissions/JPFG8A/Entwurf_1_34_UKGZ8p2.png Athanasios Papaoikonomou en The use of semantic reranking on top of a ‘cheaper’ retrieval step is becoming more and more common in modern search applications. It offers a different cost-quality profile from semantic retrieval, trading indexing-time compute for retrieval-time compute. The depth represents the number of documents that we select to retrieve and feed into the reranking model in order to optimise their ordering. Intuitively, there is a “natural” trade-off between the uplift we can achieve by operating on an increased pool of candidates and the associated cost of running “expensive” semantic rerankers for longer. In this presentation we start by investigating the behaviour of different models across different scenarios and we present our observations including some counterintuitive ones. Then, we attempt to explain the emergence of certain patterns and finally we revisit the “efficiency vs effectiveness” trade-off from two different perspectives. 
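The role of the reranking depth described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's experimental setup: `rerank_with_depth`, the `(doc_id, score)` candidate format, and the toy scoring function are hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


def rerank_with_depth(
    candidates: List[Tuple[str, float]],
    rerank_score: Callable[[str], float],
    depth: int,
) -> List[str]:
    """Rerank only the top `depth` first-stage candidates.

    `candidates` holds (doc_id, first_stage_score) pairs from the cheap
    retriever. Documents below the depth cut-off keep their retriever order.
    """
    # Order by the first-stage (cheap) retrieval score, best first.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    head, tail = ordered[:depth], ordered[depth:]
    # Only the head is passed to the (expensive) reranker.
    reranked_head = sorted(head, key=lambda c: rerank_score(c[0]), reverse=True)
    return [doc_id for doc_id, _ in reranked_head + tail]


# Toy stand-in for a semantic reranker (hypothetical scores).
toy_scores = {"a": 0.1, "b": 0.9, "c": 1.0}
first_stage = [("a", 3.0), ("b", 2.0), ("c", 1.0)]

shallow = rerank_with_depth(first_stage, toy_scores.__getitem__, depth=2)
deep = rerank_with_depth(first_stage, toy_scores.__getitem__, depth=3)
```

In this sketch the reranker is called once per document in the head, so its cost grows linearly with `depth`, while any quality gain depends on whether relevant documents actually sit within the first `depth` retriever results.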
Here is an outline of the talk: First, we analyse the retrieval performance as a function of the reranking depth and we identify three main patterns: - Fast increase followed by saturation: this is the most common scenario where larger reranking depth leads to increased performance - Fast increase to a maximum then decay: this is the first “counter-intuitive” result where reranking is beneficial until a certain depth but then performance degrades - Steady decay: this is the case where the reranker actually worsens the ordering of the results provided by the retriever - it’s the least common scenario but still a counter-intuitive result Second, we dive into these three classes and we attempt to explain the observed behaviour. For the first pattern we design a curve fitting procedure which provides a surprisingly good fit. For the other two cases we discuss some potential underlying causes for the performance decline. Third, we connect our findings to existing works in industry or academia and we highlight some of the dataset characteristics that seem relevant to the observed results. Fourth, we show how the interplay of the scores between positive (or relevant) and negative (or irrelevant) documents can explain the emergence of the patterns. Finally, we revisit the “efficiency vs effectiveness” trade-off. We start with a “latency-free” analysis where we focus only on the evolution of our performance metric and examine the possibility of using a smaller reranking depth without losing much of the gains. We also show how this correlates with the recall performance of the first-stage retriever. Then, we incorporate the latency cost in order to present a more realistic scenario and explain the trade-offs under different budget constraints. This talk is relevant to the audience because: - Retrieval performance remains critical in modern applications such as RAG. 
- Highlights the importance of domain-specific evaluation false https://program.berlinbuzzwords.de/bbuzz25/talk/JPFG8A/ Kesselhaus Breaking Search For Fun and Profit Short Talk 2025-06-17T09:30:00+02:00 09:30 00:20 With a little experience it's easy to find site search queries that don't work. With live examples picked from a variety of high-profile websites, I'll show you how to easily break search - and discuss what we mean by 'broken', the different kinds of failure and what they reveal about the underlying search engine and how we might improve it. bbuzz25-65316-breaking-search-for-fun-and-profit /media/bbuzz25/submissions/HYMXUP/Entwurf_1_31_Fwr1Nmu.png Charlie Hull en Even the most sophisticated search system will fail in some cases - we simply can't predict all the possible queries the user might try, what they are actually trying to achieve and how they might express their needs. Let's try together to break search on a number of well-known websites - but not just for fun! There are different kinds of 'broken' - zero result searches, irrelevant results, system errors - we'll describe each of these and show examples, creating a classification system for search failures. We'll then talk about what underlying issues with search we might be able to reveal and how they could be fixed to improve overall search quality. false https://program.berlinbuzzwords.de/bbuzz25/talk/HYMXUP/ Kesselhaus Hybrid search on hybrid models, at scale Talk 2025-06-17T10:00:00+02:00 10:00 00:40 We present an extensible hybrid search solution using Elasticsearch, built on a multi-index architecture and allowing the integration of multiple embedding models. Our approach addresses the challenges of searching a vast and heterogeneous collection, using different chunking granularity and offering an alternative to reciprocal rank fusion. 
bbuzz25-64799-hybrid-search-on-hybrid-models-at-scale /media/bbuzz25/submissions/UAFNDR/Entwurf_1_15_5PEtBSZ.png Radu Pop, Pietro Mele en Over the last few years we have been pushing our full-text search solution for the French Audiovisual Institute to its limits. However, some areas of our immense corpus are still inaccessible, either because the multimedia content lacks textual annotations, or because the automatic transcriptions are not sufficient on their own for an efficient full-text search. Semantic search appears as a natural complement, but scaling the implementation reveals specific challenges in capacity planning and chunking strategies to accommodate different embedding models. Moreover, when it comes to merging the benefits of both text and vector search methods, the success of the hybrid search approach relies essentially on the reranking algorithm. To address this, we developed an alternative to reciprocal rank fusion based on our needs, specifically tailored for a multi-index architecture and integrating multiple embedding sets. In this talk, we share our experience in building an extensible hybrid search solution, covering everything from complex functional modeling to cluster architecture design. Attendees will gain practical insights into handling billions of vectors in real-world scenarios, such as within large graph data structures. Additionally, we will explore the challenges of hybrid reranking, discussing the limitations of standard fusion techniques and the rationale behind our novel approach. While relevance evaluation is still ongoing, our modular architecture enables continuous iteration, ensuring adaptability to the rapid evolution of embedding models and vector optimizations. This flexibility positions our solution to remain at the forefront of large-scale semantic search, balancing precision, scalability, and efficiency. 
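For readers unfamiliar with the baseline this abstract refers to, here is a minimal sketch of standard reciprocal rank fusion (RRF). It is not the alternative presented in the talk, which the abstract does not describe. The function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` (a commonly used default) are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with standard RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. A higher fused score means a better position.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: one from a full-text index, one from a vector index.
lexical_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

RRF only looks at rank positions, never at raw scores, which is why it is a popular default for combining lexical and vector results whose scores are not comparable.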
false https://program.berlinbuzzwords.de/bbuzz25/talk/UAFNDR/ Kesselhaus miniCOIL: Sparse Neural Retrieval Done Right Talk 2025-06-17T11:10:00+02:00 11:10 00:40 In this talk, we present miniCOIL — our attempt to make a sparse neural retrieval model as it should be — combining the benefits of dense and lexical retrieval without propagating their drawbacks. We will share how to design and train a lightweight model that is performant on out-of-domain data and demonstrate its capabilities. bbuzz25-65394-minicoil-sparse-neural-retrieval-done-right /media/bbuzz25/submissions/YQJEPH/Entwurf_1_8_fcfCyVo.png Evgeniya Sukhodolskaya en Production search solutions often need the benefits of exact matching and semantic similarity — who wouldn’t want to have it all? The most famous go-to approach is hybrid search, which combines old but gold lexical methods with dense retrieval models. Hybrid search is famous for a reason; however, due to its dual component nature, while taking the best of both worlds it also takes the worst, propagating all the intricacies of vector search (heavy vectors, capricious indexes) and the limitations of lexical approaches (low recall). A less famous solution is sparse neural retrieval — models that make exact matching semantically aware and can distinguish “a fruit bat” from “a baseball bat”. You might know sparse neural retrieval for SPLADE, a leader in sparse neural benchmarks & a heavy model creating not-so-sparse vectors with its query/document extension mechanisms. Sparse neural retrieval seems pitch-perfect from afar: inverted indices and semantic understanding combined. It’s perhaps overlooked since many attempts to make it lightweight & performant on out-of-domain data failed. miniCOIL is our shot at giving sparse neural retrieval the attention it deserves — a lightweight model that understands words’ meaning within the context, is performant on out-of-domain datasets and is easy to adapt to custom data. 
In this talk, after an introduction to the context of sparse neural retrieval, we will show the architecture behind miniCOIL and demonstrate its capabilities. false https://program.berlinbuzzwords.de/bbuzz25/talk/YQJEPH/ Kesselhaus FlinkCDC: Streamlining your data analytics pipelines Talk 2025-06-17T12:00:00+02:00 12:00 00:40 Change Data Capture (CDC) is a powerful technique that enables organisations to react to data changes in real-time. In this talk we will explore FlinkCDC, a component of Apache Flink, and demonstrate how it leverages Flink's robust stream processing capabilities to implement the CDC pattern. bbuzz25-61248-flinkcdc-streamlining-your-data-analytics-pipelines /media/bbuzz25/submissions/GFBXSY/Entwurf_1_23_hdfAic6.png Muhammet Orazov en Through a practical demo, we'll see how FlinkCDC efficiently captures, transforms and loads data changes across many systems with minimal latency, enabling seamless data integration and real-time analytics. Moreover, we'll look under the hood to learn more about its fault-tolerance mechanisms. Whether you're dealing with legacy systems or building modern data architectures, you will gain insights to implement efficient, reliable, and robust CDC solutions using FlinkCDC. false https://program.berlinbuzzwords.de/bbuzz25/talk/GFBXSY/ Kesselhaus Mastering real-time anomaly detection with open source tools Talk 2025-06-17T14:00:00+02:00 14:00 00:40 With data moving faster than ever, detecting problems as they happen is crucial. This talk covers how to build a real-time anomaly detection system using Apache Kafka for streaming, Apache Flink for processing, and AI for pattern recognition. Plus, we’ll explore Apache Iceberg for storing historical data to refine models. bbuzz25-64828-mastering-real-time-anomaly-detection-with-open-source-tools /media/bbuzz25/submissions/CHAZHA/Entwurf_1_39_nRhBTK5.png Olena Kutsenko en Detecting problems as they happen is essential in today’s fast-moving world. 
This talk shows how to build a simple, powerful system for real-time anomaly detection. We’ll use Apache Kafka for streaming data, Apache Flink for processing it, and AI to find unusual patterns. Whether it’s spotting fraud, monitoring systems, or tracking IoT devices, this solution is flexible and reliable. First, we’ll explain how Kafka helps collect and manage fast-moving data. Then, we’ll show how Flink processes this data in real time to detect events as they happen. We’ll also explore how to add AI to the pipeline, using pre-trained models to find anomalies with high accuracy. Finally, we’ll look at how Apache Iceberg can store past data for analysis and model improvements. Combining real-time detection with historical data makes the system smarter and more effective over time. This talk includes clear examples and practical steps to help you build your own pipeline. It’s perfect for anyone who wants to learn how to use open-source tools to spot problems in real-time data streams. false https://program.berlinbuzzwords.de/bbuzz25/talk/CHAZHA/ Kesselhaus Go Beyond Basic RAG with Agentic Behavior Talk 2025-06-17T14:50:00+02:00 14:50 00:40 RAG revolutionized AI by merging search and generation, and agentic behavior takes this search to the next level by enabling LLMs to make decisions and call tools. This talk covers agentic behavior's key features: tool integration and reasoning, along with a live demo. bbuzz25-65991-go-beyond-basic-rag-with-agentic-behavior /media/bbuzz25/submissions/VQJL9U/Entwurf_1_17_sN7kARA.png Bilge Yücel en Retrieval-Augmented Generation (RAG) has transformed how we build Q&A systems with Large Language Models (LLMs) by combining the strengths of search and generation. However, traditional RAG workflows are static and often struggle to handle the dynamic and complex demands of real-world applications, such as answering multi-step queries, integrating external APIs, or gracefully recovering from retrieval failures. 
Agentic behavior addresses these challenges by extending RAG pipelines, enabling LLMs to make decisions, integrate tools, and dynamically adapt workflows. In this talk, we’ll explore how agentic behavior enhances pipelines. We’ll define what it means for a system to act as an “agent” and cover core concepts like routing, tool calling, and reasoning. Using hands-on examples implemented in Python, we’ll walk through practical use cases, such as integrating external APIs and solving multi-step problems. Finally, we’ll tackle challenges like transparency in complex systems and share how graph-based approaches can make these workflows more interpretable. false https://program.berlinbuzzwords.de/bbuzz25/talk/VQJL9U/ Kesselhaus Text Search on Images with Quantized ColPali Short Talk 2025-06-17T16:00:00+02:00 16:00 00:20 ColPali is revolutionary—here’s why: it combines document retrieval with a vision-based large language model, allowing you to search directly within images without needing to extract text. However, running the full model on personal hardware can be challenging due to its computational demands. That is why we’ve released a quantized version of ColPali. bbuzz25-59528-text-search-on-images-with-quantized-colpali /media/bbuzz25/submissions/UABLUX/Entwurf_1_13_MYBgw14.png Sonam Pankaj en ColPali is a late-interaction model, which means the context remains intact. It is fine-tuned on a vision LLM, PaliGemma, to perform text search on images. We brought it closer to consumer hardware by quantizing the model, so you can perform search locally on your laptop. The talk will cover: What is ColPali? What is Late-Interaction? How can you deploy it locally? 
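As a rough illustration of what late interaction means, the sketch below computes a ColBERT-style MaxSim score: each query token vector is matched against its best document vector (image patch embeddings, in ColPali's case) and the maxima are summed. This is a toy example with made-up two-dimensional vectors, not the speakers' implementation; the helper names `dot` and `maxsim_score` are hypothetical, and quantization is not shown.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction (MaxSim) score.

    For every query token vector, take the highest dot product against
    any document vector, then sum those maxima over the query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Made-up embeddings: two query tokens and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.0, 0.5]]

print(maxsim_score(query, page_a))
print(maxsim_score(query, page_b))
```

Because every token keeps its own vector until scoring time, no single pooled embedding has to summarise the whole page, which is the sense in which the context stays intact.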
false https://program.berlinbuzzwords.de/bbuzz25/talk/UABLUX/ Kesselhaus Flink Jobs as Agents – Stream Processing for Agentic AI Talk 2025-06-17T16:30:00+02:00 16:30 00:40 Apache Flink is uniquely positioned to serve as the backbone for AI agents, providing them with stream processing as a powerful new tool. We'll explore how Flink jobs can be transformed into "Agents"—autonomous, goal-driven entities that dynamically interact with data streams, trigger actions, and adapt their behavior based on real-time insights. bbuzz25-65600-flink-jobs-as-agents-stream-processing-for-agentic-ai /media/bbuzz25/submissions/X9Q3Q7/Entwurf_1_3_L3yNLiy.png Steffen Hoellinger en We’ll showcase Flink jobs as AI agents through two key stream processing & AI use cases: 1) financial planning & detection of spending anomalies, as well as 2) forecasting demand & supply chain monitoring for disruptions. AI agents need business context. We’ll discuss embedding foundation models with schema registries and data catalogs for contextual intelligence while ensuring data governance and security. We’ll integrate Apache Kafka event streams with data lakes in open-table formats like Apache Iceberg, enabling AI agents to leverage real-time and historical data for consistency and reasoning. We’ll also cover latency optimization for time-sensitive use cases while preventing hallucinations. Finally, we’ll demonstrate an open-source conversational platform on Apache Kafka, where multiple AI agents are assigned to a business process, continuously process real-time events while optimizing for their individual goals, interacting, and negotiating with each other. By combining Flink and Kafka, we can build systems that are not just reactive but proactive and predictive, paving the way for next-generation agentic AI. false https://program.berlinbuzzwords.de/bbuzz25/talk/X9Q3Q7/ Kesselhaus Closing Session #BBuzz 2025-06-17T17:15:00+02:00 17:15 00:15 Join us as we wrap up Berlin Buzzwords. 
bbuzz25-68467-closing-session Berlin Buzzwords Team en false https://program.berlinbuzzwords.de/bbuzz25/talk/JJ3FM7/ Maschinenhaus Contexts & Machines: How Document Parsing Shapes RAG results Short Talk 2025-06-17T09:30:00+02:00 09:30 00:20 How do different document parsing and chunking strategies impact RAG pipeline performance? Using real-life documents and LLM-generated question/answer pairs, we assess multiple methods – both open-source and commercial – showing that parsing quality significantly affects response accuracy and that the best approach may depend on the question type. bbuzz25-64770-contexts-machines-how-document-parsing-shapes-rag-results /media/bbuzz25/submissions/EGTSQD/Entwurf_1_48_uSx9uyh.png Alessio Vertemati, Andrea Ponti en Retrieval-Augmented Generation (RAG) pipelines have shown their effectiveness in exploring complex documents. However, their performance hinges on the quality of the retrieved context, which depends on well-structured document inputs. Real-world documents often contain unstructured elements - images, tables, multi-column text, etc. - making parsing and chunking a critical challenge. Poor document processing can degrade retrieval quality, increasing the risk of hallucinations in LLM responses. In our talk, we will report on the results of a study conducted to evaluate different PDF parsing and document chunking strategies – spanning both open-source and commercial-grade solutions – to determine their impact on RAG performance. Using a dataset of complex documents and LLM-generated question/answer pairs, we apply several evaluation metrics to quantify how different parsing techniques affect the relevance of retrieved information and response accuracy. Our findings reveal that parsing and chunking strategies significantly shape RAG output quality and that the most effective approach may depend on the nature of the queries. 
By highlighting the interplay between document processing and RAG performance, this study provides actionable insights for building more reliable knowledge retrieval systems. false https://program.berlinbuzzwords.de/bbuzz25/talk/EGTSQD/ Maschinenhaus Observability for All! Talk 2025-06-17T10:00:00+02:00 10:00 00:40 Observability is the ability to measure the state of the whole system. OpenTelemetry can be used to instrument applications and diagnose issues. But frontend instrumentation is often an afterthought. Join me as I show how OTel, RUM agents and Synthetic Monitoring can help us identify and diagnose issues in all layers of our applications. bbuzz25-64879-observability-for-all /media/bbuzz25/submissions/FNXN8K/Entwurf_1_19_K0jvY53.png Carly Richmond en ## Background Before joining Elastic as a Developer Advocate, I spent over 10 years working for a bank as a frontend engineer. I have felt the pain of trying to diagnose issues and errors in UIs using logs and diving into minified JavaScript code. In that time, the DevOps and SRE communities have established many practices to help developers instrument their applications to identify unexpected behaviour and performance issues. These practices are generally backend-focused. By combining backend tracing with frontend tracing and metrics, we can better understand how our application behaves and where the issue lies. ## Outline I will discuss how to combine logs, metrics and traces from application services with tools for frontend instrumentation, specifically: 1. An overview of key observability signals (for anyone unfamiliar with them), and why logs are insufficient in diagnosing issues in our UIs. 2. An examination of how RUM agents work, using the Elastic RUM agent as an example, and the metrics and tracing information they capture that relate to the observability pillars, including Google Core Web Vitals, latency metrics and traces. 3. 
Examples showing how front-to-back tracing can be achieved using OpenTelemetry instrumentation combined with existing RUM agents. I'll also touch on the state of the Client Implementation (RUM) approach within the CNCF OpenTelemetry community. 4. An overview of RUM metrics that can be captured to help track usage, potentially as KPIs, and how they can be collected. 5. An outline of what Synthetic Monitoring is, using Playwright and Elastic Synthetics as an example. I'll also cover how it can be used with alerting and SLOs to alert SREs of potential issues in our applications. ## Target Audience I believe the following individuals would be interested in this talk: 1. UI Developers interested in observing their applications and unsure how to instrument their applications or the tools currently available. 2. DevOps and SRE engineers looking to monitor frontends as part of a wider system estate. 3. More experienced frontend engineers or designers looking for tools to measure application performance as a regular best practice compared to ad-hoc profiling of web applications. 4. Tech leads and team leads looking for ways to be alerted to potential application issues and behaviours that impact the user experience. false https://program.berlinbuzzwords.de/bbuzz25/talk/FNXN8K/ Maschinenhaus From Culture to Open Source: Build Value-driven Communities Talk 2025-06-17T11:10:00+02:00 11:10 00:40 A great open-source community isn’t just about code—it’s about people. A strong company culture fosters engagement and growth. This session explores how values like kindness, collaboration, and developer experience (DX) shape both our company and community. Learn practical insights on fostering inclusion, engagement, and long-term impact. bbuzz25-64859-from-culture-to-open-source-build-value-driven-communities /media/bbuzz25/submissions/WYLNU9/Entwurf_1_2_F7d7j6F.png Marion Nehring, Jessie de Groot en A great open-source community isn’t just about code—it’s about people. 
We – Weaviators – believe a strong company culture of kindness and collaboration is essential for building an inclusive and engaged open-source ecosystem. Join us to learn what you can do to ensure your culture extends beyond your organization and how your values translate into external impact. Based on real-life examples and tactics, we will explore: - The connection between company culture and open-source communities. - How we foster an internal culture that supports (developer) engagement, and what you could copy from us. - Practical steps for creating a welcoming, inclusive, and sustainable community. - The role of developer experience (DX) in driving long-term participation. This session is for you—whether you're an open-source community builder, a company leader, or someone eager to learn how to create a company culture that strengthens both your organization and your community. You'll walk away with actionable insights to help you cultivate a culture that fuels open-source success. false https://program.berlinbuzzwords.de/bbuzz25/talk/WYLNU9/ Maschinenhaus What’s New in the OpenSearch Project and Ecosystem Talk 2025-06-17T12:00:00+02:00 12:00 00:40 Discover how OpenSearch powers search and observability at scale! Now part of The Linux Foundation, OpenSearch is evolving with vector search, NLP, and real-time analytics. Join this session to explore its latest innovations, performance boosts, and expanding ecosystem—directly from the project's Chief Evangelist. bbuzz25-63350-what-s-new-in-the-opensearch-project-and-ecosystem /media/bbuzz25/submissions/FTSXAX/Entwurf_1_21_POXumCN.png Dotan Horovits en Attendees who are new to OpenSearch will get a brief introduction to the project. More advanced attendees will get key updates and features. In addition, the talk will cover integrations and ecosystem growth, highlighting potential for collaboration. 
The talk will be delivered by the chief advocate of OpenSearch, and will provide an authoritative take on the project and its vision and roadmap. Main topics covered include: - Introduction to OpenSearch - Evolving OpenSearch: Key Updates and Features - Real-Time Analytics and Observability - Integrations and Ecosystem Growth - Future Roadmap and Vision false https://program.berlinbuzzwords.de/bbuzz25/talk/FTSXAX/ Maschinenhaus Cross Domain Enterprise Search - Content Diversity at Scale Talk 2025-06-17T14:00:00+02:00 14:00 00:40 This talk will focus on learnings gathered when building an enterprise search platform with multi modal content - ranging from highly domain specific content to images to unstructured content. Problems of extraction, inference and relevance will be discussed, while showcasing cross domain search at scale. bbuzz25-58999-cross-domain-enterprise-search-content-diversity-at-scale /media/bbuzz25/submissions/KT9BAG/Entwurf_1_36_H9A3GpJ.png Atri Sharma, Abhishek Singh en Cross domain search is a long-standing problem -- from the challenges of ingesting a variety of data with different structures, content, noise and extraction strategies to generating multiple ground truth golden data sets to benchmark each individual corpus's relevance. Add to that the challenge of cross domain relevance across multi modal content, with no defined mechanism to normalise scores across individual queries and different content, the challenges of domain specific terminology, of cross modal embedding generators, and of language specific handling, and the list goes on. This talk will focus on learnings from building an enterprise search system which deals with more than 10 different types of content at the same time, and scales into billions of documents. Attendees can expect to learn novel techniques involved in cross domain ranking, content curation, content extraction and natural language query processing. 
false https://program.berlinbuzzwords.de/bbuzz25/talk/KT9BAG/ Maschinenhaus Streamlining Search Quality: Search Relevance Workbench Talk 2025-06-17T14:50:00+02:00 14:50 00:40 Robust Search Evaluation is both a “must have” for any modern day Search team and an “afterthought” that never gets the team’s full attention. This is especially true with the various open source search engines. Most teams build their own data collection and eval tools. Some use standalone open source tools. We present a better solution! bbuzz25-65521-streamlining-search-quality-search-relevance-workbench /media/bbuzz25/submissions/7YSV83/Entwurf_1_58_ou1MAId.png Eric Pugh, Stavros Macrakis en In this talk we will lay out the history of Search Evaluation, why it’s critical in today’s AI powered world, and make the case for why Search Evaluation needs to be part and parcel of any modern Search Engine. We will share some lessons from building multiple Search Evaluation toolsets, including the popular open source tool Quepid, and why we felt we needed to build the Search Relevance Workbench (SRW) as an integrated suite. We will show how SRW collects user click behavior using the User Behavior Insights open standard, and how click data is combined with labeled data to measure search quality. We’ll show how you can use that information to run optimizers like Learning to Boost and Hybrid Search Optimizers that replace traditional manually tuned algorithms. You will leave understanding how SRW is different from previous tools, and how you can take advantage of it with your own search engine (not just OpenSearch) as well. false https://program.berlinbuzzwords.de/bbuzz25/talk/7YSV83/ Maschinenhaus Siphon : Modern Data Stack with SF-CH & Iceberg Short Talk 2025-06-17T16:00:00+02:00 16:00 00:20 Tired of waiting for batch jobs? See how we transformed our data pipeline using Apache Iceberg to stream quality data into Snowflake and Clickhouse simultaneously. 
Learn about our battle-tested architecture, performance gains, and how we maintain data consistency across dual analytics engines bbuzz25-62908-siphon-modern-data-stack-with-sf-ch-iceberg /media/bbuzz25/submissions/DNQ8EY/Entwurf_1_43_hucokpp.png Ved Prakash en Ever wondered how to stream data reliably to multiple warehouses without compromising data quality? We'll show you how Siphon uses Apache Iceberg's time travel and ACID properties to ensure data consistency across Snowflake and Clickhouse. Dive into our journey from batch to streaming - covering architecture evolution, data quality frameworks, and performance optimizations. We'll share our battle-tested patterns for handling schema evolution, managing data contracts, and implementing quality gates. Learn how we achieved sub-minute latency while preventing bad data from corrupting our warehouses. Perfect for data engineers and architects looking to modernize their data infrastructure with real-world proven solutions. false https://program.berlinbuzzwords.de/bbuzz25/talk/DNQ8EY/ Palais Atelier Evolution of Uber's Search Platform Talk 2025-06-17T10:00:00+02:00 10:00 00:40 Search is integral to Uber's core business and user experience. In this talk, we’ll explore the unique challenges of Search at Uber and chart the evolution of Uber’s Search Platform—from leveraging Elasticsearch to developing an in-house solution, and finally, innovating in collaboration with the OpenSearch community. bbuzz25-63347-evolution-of-uber-s-search-platform /media/bbuzz25/submissions/ZUXSBZ/Entwurf_1_14_pZZZOES.png Yupeng Fu en Search powers critical functionality across all Uber products, including product discovery in the Uber Eats app, seamless pickup and drop-off experiences in Uber Rides, and real-time geospatial matching for drivers and riders. However, this comes with unique technical challenges such as real-time updates, geospatial awareness, and semantic search at scale. 
Over the years, Uber’s Search Platform has undergone significant transformation: - Initially built entirely on Elasticsearch, we faced challenges related to scalability and feature limitations. - To address these, Uber developed a custom, in-house solution tailored to meet our unique needs. - Recognizing the importance of open standards and community-driven innovation, we later embraced OpenSearch, collaborating with its vibrant community to contribute enhancements and ensure long-term sustainability. In this talk, we will discuss: - The unique technical requirements of Search at Uber. - The architectural evolution of our platform in response to business growth and new challenges. - The strategic shift toward collaborating with the open-source ecosystem to foster innovation and scalability. false https://program.berlinbuzzwords.de/bbuzz25/talk/ZUXSBZ/ Palais Atelier Vespa.ai’s Personalized Search: Advanced Ranking & Tensor framework Talk 2025-06-17T11:10:00+02:00 11:10 00:40 Modern search demands scalable personalization. Discover Vespa’s multi-stage ranking and tensor framework for hybrid queries, multimodal retrieval and real-time ML. Learn how to deploy low-latency, high-relevance search systems at petabyte scale. bbuzz25-71568-vespa-ai-s-personalized-search-advanced-ranking-tensor-framework /media/bbuzz25/submissions/L9HNUE/Entwurf_2_1_R9kLb6R.png Piotr Kobziakowski en Today’s applications require search engines to unify text, vectors, and business logic with millisecond latency at petabyte scale. It’s not easy to balance speed, relevance, and personalization for a large user population and a billion-scale item base. Vespa.ai, the open-source engine powering Yahoo, Perplexity, Qwant, Vinted, and Spotify, addresses this through multi-stage ranking with close-to-data tensor operations and easy-to-understand custom functions. 
Vespa’s phased architecture enables high performance due to the ability to filter candidates via hybrid retrieval (text + multi vector + filters) before applying ML models for precision or logic for personalisation. Its tensor framework enables multimodal (text/image/video) and multivector queries with real-time individual personalization, scaling beyond 100k QPS with milliseconds latency. You will learn Vespa.ai configuration concepts and ideas how all the building blocks (LLMs, VLMs, embedding models, sparse and dense representations for items and users) can be connected together. --- This session is sponsored by <a href="https://vespa.ai">Vespa.ai</a>. false https://program.berlinbuzzwords.de/bbuzz25/talk/L9HNUE/ Palais Atelier How [not] to evaluate your RAG Talk 2025-06-17T12:00:00+02:00 12:00 00:40 How do you know if your RAG system is actually working? We’ll share a real-world case study on evaluating RAG in production—tackling messy data, chunking fails, and unexpected chatbot behavior—so you can measure quality with confidence. bbuzz25-65586-how-not-to-evaluate-your-rag /media/bbuzz25/submissions/JHUSGA/Entwurf_1_33_uDDYrLU.png Roman Grebennikov en Judging search relevance seems straightforward: the higher a relevant product ranks, the better your search system works. But when it comes to RAG, things get complicated—there’s no ranking, no traditional documents, just an LLM-generated response to a query. So how do you know if it’s any good? Is there an objective way to measure progress, or are you just guessing? In this talk, we’ll share a real (if not exactly glamorous) case study of building and evaluating a production RAG system for a fintech company. We’ll cover the headaches of working with a small and noisy corpus, chunking gone wrong, handling low-resource languages (plus users who think your support chatbot is their therapist), and the different frameworks (like RAGAS) to evaluate a RAG system—so you’re not flying blind. 
false https://program.berlinbuzzwords.de/bbuzz25/talk/JHUSGA/ Palais Atelier Why Chatbots Still Fail: The Hidden Pitfalls of RAG Talk 2025-06-17T14:00:00+02:00 14:00 00:40 Many knowledge chatbots and search engines use RAG. Despite their popularity, these chatbots are often worse than ChatGPT and frustrate users by failing to answer even the simplest questions. In my talk, I reveal how ineffective chunking strategies are a key culprit and demonstrate how to refine chunking to build more reliable RAG systems. bbuzz25-64475-why-chatbots-still-fail-the-hidden-pitfalls-of-rag /media/bbuzz25/submissions/DYUCFN/Entwurf_1_45_qf2yvjX.png Lewin von Saldern, Jennifer Gaubatz en Large Language Models (LLMs) have experienced a significant boom. Among the most popular use cases are intelligent knowledge search engines, or put simply - chatbots. Whether on an airline's website, facing customers, or as the new search tool in your company's intranet, chatbots are everywhere. However, many applications fall short of expectations. The system used behind many knowledge applications is called Retrieval Augmented Generation (RAG), in which a specific database is connected with an LLM. Many strategies have emerged to enhance RAG performance, but the core is often overlooked—the data itself. In my presentation I will explain RAG as the foundation of intelligent knowledge applications, its pitfalls and caveats. Using the first step of a RAG pipeline - chunking - as an example, I show what is necessary to improve the reliability and robustness of RAG systems and what you absolutely have to do before you can trust your own chatbot. false https://program.berlinbuzzwords.de/bbuzz25/talk/DYUCFN/ Palais Atelier Visual Literacy: Complex Document Retrieval with VLMs Talk 2025-06-17T14:50:00+02:00 14:50 00:40 Traditional document retrieval systems struggle with visually rich documents as they discard visual elements during text extraction. 
This talk shows how vision language models (VLMs) can address these limitations and presents a new benchmark for evaluating document retrieval systems across languages, domains, and document types. bbuzz25-65357-visual-literacy-complex-document-retrieval-with-vlms /media/bbuzz25/submissions/EHWVZJ/Entwurf_1_49_EiLw86R.png Saba Sturua, Isabelle Mohr en The field of document retrieval has traditionally relied on text-based approaches, which have served well for simple text documents but show significant limitations when dealing with visually complex documents. Many real-world documents contain crucial information embedded in diagrams, charts, plots, tables, and intricate layouts that conventional systems fail to properly process. Thus, if we query these systems with information that is only included in visual elements (for example, "How much did the average temperature in Germany increase from 1990 to 2025?"), they will fail to retrieve relevant documents even if they contain plots or charts with the exact answer. Vision Language Models (VLMs) offer a new way to approach document retrieval. By processing both text and visual elements together, these models can better understand documents as a whole, seeing how text works together with graphics and layout. This is especially useful for technical documents, research papers, financial reports, and educational materials where images and diagrams are key to understanding the content. In this talk, we will explore how VLMs can be effectively applied to document retrieval tasks. We'll explain how to fine-tune these models for handling complex documents, including important considerations for data preparation, model architecture choices, and training strategies. We'll also present a new benchmark for testing document retrieval systems across different languages, domains, and document types. 
This benchmark provides a framework for comparing traditional and VLM-based retrieval systems, enabling practitioners to make informed decisions for their specific use cases. false https://program.berlinbuzzwords.de/bbuzz25/talk/EHWVZJ/ Palais Atelier Delay accounting: an underrated feature of the Linux kernel Short Talk 2025-06-17T16:00:00+02:00 16:00 00:20 This talk delves into delay accounting, an often-overlooked feature that provides valuable insights into CPU time shortages and application latency. Attendees will learn how to leverage these kernel metrics for better performance analysis and system optimization. bbuzz25-65250-delay-accounting-an-underrated-feature-of-the-linux-kernel /media/bbuzz25/submissions/8JNKJC/Entwurf_1_59_wQr2hrY.png Nikolay Sivko en Understanding whether a process is truly starved of CPU time isn’t as simple as looking at traditional metrics like CPU usage or Load Average. Few realize that the Linux kernel has built-in mechanisms to precisely measure how long each task waits for kernel resources. This talk delves into delay accounting, an often-overlooked feature that provides valuable insights into CPU time shortages and application latency. Attendees will learn how to leverage these kernel metrics for better performance analysis and system optimization. false https://program.berlinbuzzwords.de/bbuzz25/talk/8JNKJC/ Palais Atelier Advancing Multi-Modal Search Capabilities in Search Pipeline Talk 2025-06-17T16:30:00+02:00 16:30 00:40 This talk explores the integration of machine learning inference processors in OpenSearch pipelines, focusing on multi-modal search capabilities. We demonstrate how these processors enhance ingest, search request, and response processes for text, image, and audio data, significantly improving search and analytical capabilities in a multi-modal world.
bbuzz25-60038-advancing-multi-modal-search-capabilities-in-search-pipeline /media/bbuzz25/submissions/EZYSGS/Entwurf_1_37_9WwYDhh.png Dhrubo Saha en The integration of machine learning (ML) inference processors within search pipeline architecture represents a significant advancement in search and analytics technology in OpenSearch. This presentation delves into the implementation and impact of these processors across three critical stages: ingest, search request, and search response. We begin by examining the ML inference ingest processor, which allows for real-time enrichment of data as it enters the system. This processor can generate embeddings, classify content, or extract features from various data types, including text, images, and audio. We'll demonstrate how this enhances data quality and searchability from the point of ingestion. Next, we explore the ML inference search request processor, which dynamically modifies search queries based on ML model outputs. This powerful feature enables context-aware query expansion, semantic understanding, and even cross-modal query translation. For instance, we'll show how a text query can be used to search for relevant images or how an audio input can be transformed into a text-based search. The ML inference search response processor is then discussed, highlighting its ability to rerank, filter, or augment search results using ML models. This can significantly improve result relevance, especially in multi-modal scenarios where traditional ranking algorithms may fall short. 
Throughout the presentation, we'll showcase practical examples of these processors in action, demonstrating their application in various use cases such as: visual similarity search in e-commerce catalogs; audio transcription and searchability in media archives; cross-lingual document retrieval in multilingual databases; and sentiment-based filtering in social media analytics. We'll also address the technical considerations of implementing these processors, including model selection, performance optimization, and scalability concerns. The presentation will touch upon the flexibility of using both locally hosted and externally connected ML models, allowing organizations to leverage AI capabilities within their search infrastructure. Finally, we'll discuss the future potential of this technology, including the possibility of more advanced multi-modal interactions, real-time learning models, and the integration of large language models for even more sophisticated search and analytics capabilities. This presentation aims to provide attendees with a comprehensive understanding of how ML inference processors can revolutionize multi-modal search in OpenSearch, offering insights into both the current state of the technology and its future directions. false https://program.berlinbuzzwords.de/bbuzz25/talk/EZYSGS/ Frannz Salon More Than Just The Tip Of The Iceberg Workshop 2025-06-17T09:30:00+02:00 09:30 01:10 A comprehensive workshop in which you will gain practical knowledge about how to deploy, configure, interact with and use advanced features of Apache Iceberg. Presented using a local coding environment based on Jupyter notebooks and a Docker Compose stack. bbuzz25-65548-more-than-just-the-tip-of-the-iceberg /media/bbuzz25/submissions/B9MTTQ/Entwurf_1_38_ydCUQkE.png Michal Gancarski en In recent years, several table formats for large datasets have emerged to help data engineers deal with the complexity of handling substantial amounts of data in a flexible, performant and safe way.
One of the most popular among those formats is Apache Iceberg. In this workshop, you will gain up-to-date, hands-on experience on how to work with Iceberg. Using a local coding environment based on Jupyter Notebooks and a Docker Compose stack, you are going to: 1. Learn about required components of a data processing system that uses Iceberg. 2. Practice examples of how to update and query Iceberg using several query engines and libraries. 3. Use advanced features of Iceberg, like flexible partitioning schemes, time travel or dataset branching. 4. Learn about optimisation techniques and configuration "levers" you can pull to improve the overall performance and query speed of workloads using Iceberg. 5. Peek under the hood of an Iceberg dataset, to understand its metadata and ways it improves query speed and supports data audits and lineage. This workshop is recommended for Data Engineers, Analytics Engineers and Machine Learning Engineers wanting to improve their data pipelines and data processing workflows. false https://program.berlinbuzzwords.de/bbuzz25/talk/B9MTTQ/ Frannz Salon Flavors of PostgreSQL® and you: how to choose a Postgres Talk 2025-06-17T11:10:00+02:00 11:10 00:40 Postgres continues to be widely used, and Postgres-derived closed source databases such as AlloyDB and AWS Aurora have gained popularity in recent years. In this talk, you’ll learn about the architecture of these radically different kinds of systems, what each of these companies means when they say “Postgres-compatible” and how to choose one! bbuzz25-65267-flavors-of-postgresql-and-you-how-to-choose-a-postgres /media/bbuzz25/submissions/VPXKLT/Entwurf_1_16_TZSWhir.png Celeste Horgan en Who this is for: This talk is ideally suited for engineers looking to either migrate an existing database to something new, or those wanting an overview of the Postgres-derived database landscape.
Relevance: Nearly all the major cloud computing providers provide some sort of “Postgres-compatible” relational database service, but the choice isn’t as simple as picking whichever one your cloud provider offers. Some provide deep integration for AI/ML workloads, and others are serverless databases that aren’t Postgres-related at all. Combined with low awareness of more recent additions to open source Postgres’ feature set, many developers aren’t sure how to proceed in the Postgres landscape in a way that best reflects their needs. Talk outline: - Intro: What makes Postgres Postgres-y? How has the open source community dealt with forks, rewrites and extensions over time, and how is that relevant to our discussion of ‘modern’ Postgres-derived databases? - The meat: comparing and contrasting various Postgres-derived databases, understanding their feature sets, what makes them unique and what use cases they’re particularly well suited for - Google’s AlloyDB Omni, AI/ML capabilities and columnar engines - Amazon Aurora and Neon, both serverless Postgres-compatible databases, and what we mean by "Postgres compatible" - TimescaleDB, PostGIS and other specialized extensions of Postgres, and why open source is cool and allows for infinite extensibility - And of course open source Postgres, and what makes its most recent features relevant in 2025 Conclusion: How the open source nature of Postgres has led to its continued evolution and relevancy in the data landscape, allowing it to evolve to meet new use cases like realtime data analytics and AI/ML. What the audience will learn: The feature sets of a variety of Postgres alternatives, what features are best suited for certain use cases, how some of those features (for instance, AlloyDB’s columnar engine) stack up against databases dedicated to those features (for instance, vs. ClickHouse for columnar data), and how open source project licensing affects the creation of all these new alternatives.
false https://program.berlinbuzzwords.de/bbuzz25/talk/VPXKLT/ Frannz Salon Data Quality Management: The Good, The Bad, and The Messy Talk 2025-06-17T12:00:00+02:00 12:00 00:40 Not all data is good—some is bad, and much is messy. Poor data quality affects customers, employees, and decisions. This session traces issues from symptoms to root causes and explores strategies to fix them. Managing data quality is like battling a seven-headed beast, but with the right approach, you can turn chaos into clarity. bbuzz25-65293-data-quality-management-the-good-the-bad-and-the-messy /media/bbuzz25/submissions/PNQJZQ/Entwurf_1_22_pnbKA1m.png Jan Meskens en We all recognize that high-quality data is essential for driving value in analytics, AI, and business operations. Yet, in reality, not all data is good—some is bad, and much of it is just plain messy. While organizations acknowledge the importance of data quality, choosing the right approach to improve it remains a challenge. How can you systematically turn messy, unreliable data into a trustworthy asset? In this session, we take a fresh approach to data quality management. Instead of tackling issues in isolation, we start downstream—examining the real-world symptoms of poor data quality as experienced by customers and employees. From there, we trace problems back to their upstream root causes and explore practical solutions to address them. However, resolving data quality issues is rarely straightforward—it’s like battling a seven-headed beast, where fixing one issue often reveals several others. To tackle this effectively, we introduce data quality management strategies tailored to different levels of organizational maturity. 
What You’ll Learn: ✅ The Data Quality Triangle: Symptoms, Root Causes, and Solutions ✅ Why solving data quality issues feels like battling a seven-headed beast ✅ Practical data quality management strategies for different maturity levels false https://program.berlinbuzzwords.de/bbuzz25/talk/PNQJZQ/ Frannz Salon Analysing Public Kafka Data from NASA Satellites Talk 2025-06-17T14:00:00+02:00 14:00 00:40 This session builds on the foundational OSS technologies of the modern Lakehouse (Apache Kafka, Spark, Unity Catalog and MLflow) and shows how everyone can analyze supernova data coming from NASA's satellites, analyze data streams with natural language, and plot their own map of cosmic events. bbuzz25-59390-analysing-public-kafka-data-from-nasa-satellites /media/bbuzz25/submissions/KKCRNG/Entwurf_1_35_jNyldvx.png Frank Munz en Experience how cosmic events become streaming data in this tech-focused demo, running on the Databricks Lakehouse. Using foundational OSS technologies (Apache Kafka, Apache Spark™, Unity Catalog, MLflow), we'll capture and analyze supernova data streams in real time. While this is a pure tech talk with reusable open-source code, you'll naturally grasp unified lakehouse concepts along the way. false https://program.berlinbuzzwords.de/bbuzz25/talk/KKCRNG/ Frannz Salon All the DataOps, all the paradigms Talk 2025-06-17T14:50:00+02:00 14:50 00:40 Data warehouses, lakes, lakehouses, streams, fabrics, hubs, vaults, and meshes. We sometimes choose deliberately, sometimes influenced by trends, yet often get an organic blend. But the choices have an orders-of-magnitude impact on operations cost and iteration speed. Let's dissect the paradigms and their operational aspects once and for all.
bbuzz25-65582-all-the-dataops-all-the-paradigms /media/bbuzz25/submissions/UDDJ7T/Entwurf_1_DX8hhH0.png Lars Albertsson en I have seen dozens of data platforms and noticed how architectural choices are often made without regard for the operational consequences, resulting in excessive operational burden and slow development. These choices have a huge impact on the effectiveness of data-centric organisations and separate disruptive companies from legacy enterprises. I will explain how the common operational procedures – deployment, failure handling, late data, data quality problems, bug remediation – have different impact depending on data processing paradigm, and how to handle them with minimal cost and latency where possible. I will also cover when and how to bridge between the paradigms. I will finally share some innovations that we have found to further improve development iteration speed and operational efficiency. I have found that the distinction between different data processing paradigms is often not clear, and that their differences in practice are not concisely explained anywhere. This presentation is an attempt to create that explanation. false https://program.berlinbuzzwords.de/bbuzz25/talk/UDDJ7T/ Frannz Salon How to train a fast LLM for coding tasks Talk 2025-06-17T16:00:00+02:00 16:00 00:40 Coding LLMs are now part of our daily work, making coding easier. In this talk, we share how we built an in-house LLM for AI code completion in JetBrains products, covering design choices, data preparation, training, and model evaluation. bbuzz25-66184-how-to-train-a-fast-llm-for-coding-tasks /media/bbuzz25/submissions/WFWTFL/Entwurf_1_18_MEGZKXQ.png Ivan Dolgov en In this talk, we present our approach to training a code completion model using Mellum, our new open-source model, as an example. Mellum powers in-file code completion in AI-enabled JetBrains IDEs.
We'll walk through the entire process, from designing the model and preparing the dataset — with emphasis on the permissiveness of using data — to the training process and evaluation strategies. Attendees will gain insights into state-of-the-art techniques and the challenges we faced and discover practical approaches to optimizing AI models for real-world coding environments. This talk is relevant for developers and ML Engineers interested in ML feature development and custom model training. false https://program.berlinbuzzwords.de/bbuzz25/talk/WFWTFL/