The local-first paradigm promises transformative benefits — user-owned data, seamless offline capabilities, and instant interactions. But how do you get started? In this lightning talk, we’ll cover the key concepts and show you how to begin your local-first journey.
Mixture of Encoders is a vector-native alternative that models both structured and unstructured data in a unified embedding space. We will introduce the method, show how it powers natural language search and real-time recommendations, and share open-source tools and benchmarks for replacing complex hybrid stacks.
While powerful, dense vector search is not a plug-and-play feature that will scale straight out-of-the-box, particularly when it comes to extracting the maximum performance from limited compute resources. Come learn how we tuned dense vector indexes for our 100M+ document dataset, and drastically sped up our queries.
You’re using local LLMs. For example, to power RAG. You want to deploy them in production, but you don’t know where: which type of GPU? How large should it be? Should you use a larger model but quantize more aggressively?
Our benchmark results and their interpretation will give you some answers.
Struggling to identify relevant filters among too many facets and frustrating results navigation? We explore an AI Filter Assistant for statistical data (SDMX) showing how LLMs can be leveraged to suggest the best filters for your natural language query, helping you refine the results in Apache Solr. We share wins, fails, and lessons learned.
Learn how to build a custom On-Air sign using Apache Kafka®, Apache Flink®, and Apache Iceberg™! See how to capture events like Zoom meetings and camera usage with Python, process data with FlinkSQL, analyze trends from your Iceberg tables, and bring it all together with a practical IoT project that easily scales out.
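As a flavor of the kind of logic this pipeline pushes into FlinkSQL, here is a minimal Python sketch of the On-Air decision itself: the sign is lit while any meeting or camera session is still open. The event shape and field names are illustrative assumptions, not taken from the talk.

```python
# Hypothetical sketch: derive the On-Air sign state from an ordered
# stream of start/stop events (e.g. Zoom meetings, camera usage).
def sign_state(events):
    """Return 'ON AIR' if any session has started but not yet stopped."""
    open_sessions = set()
    for event in events:  # events are assumed ordered by time
        key = (event["source"], event["session_id"])
        if event["type"] == "start":
            open_sessions.add(key)
        elif event["type"] == "stop":
            open_sessions.discard(key)
    return "ON AIR" if open_sessions else "OFF"

events = [
    {"source": "zoom",   "session_id": 1, "type": "start"},
    {"source": "camera", "session_id": 7, "type": "start"},
    {"source": "zoom",   "session_id": 1, "type": "stop"},
]
print(sign_state(events))  # camera session 7 is still open
```

In the actual project this aggregation would run continuously in FlinkSQL over Kafka topics rather than in a batch loop.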
The fascinating journey towards releasing version 10.0 of the popular Java search engine Apache Lucene: an inspiring and challenging venture seen through the eyes of its release manager, made possible by the vibrant Lucene community, and culminating in deploying the new major version to production in record time.
If you have been living under a rock and have not heard that Airflow 3 is out, and that it solves most of the pain points you had with Airflow 2, this talk is for you. You will learn how you can boost your Data Engineering and AI/ML workflows (without having to rewrite your DAGs) with what the Airflow 3 community has worked on over the last 12 months.
At Climate Policy Radar, we're building an open-source knowledge graph for climate policy. In this talk, we'll share how we combine in-house expertise with scalable data infrastructure to identify key concepts in thousands of global climate policy documents. We'll also touch on ontology design, equitable evaluation, and the climate impacts of AI.
What are some key concepts and design decisions behind modern, scalable, highly performant databases?
Learn how a database delivers sub-millisecond 99th-percentile latency at throughputs of millions of operations per second, at scale, and how you can use it.
Lucene 10 came out last year in October. One of the changes raised the minimum version requirement to Java 21, which should allow introducing new features such as native access to the file system cache and therefore better memory mapping. Is it as easy as it sounds?
Drawing from experience reviewing over 1,000 open-source releases, I'll address common misconceptions, frequent compliance issues, and the evolution of policies to mitigate these challenges. Attendees will gain practical insights to ensure smoother project releases and foster a compliant, collaborative, open-source community.
In this talk, I share our journey in making QuestDB, an Apache 2.0-licensed open-source time-series database, a significantly faster analytical database. I'll walk through how we identified opportunities for improvement, the key changes we implemented, and how those changes delivered dramatic performance improvements.
Intent-based clustering is our approach to overcoming some limitations of modern hybrid search systems. We show how upfront, LLM-supported in-depth query understanding can improve steps like retrieval, clustering, validation, and presentation. We address various aspects from prototype to production in a large-scale, high-volume e-commerce search.
K8s StatefulSets present significant hurdles for scaling and migrating large-scale cloud database workloads. We'll cover scaling strategies beyond vanilla StatefulSets and share lessons on executing zero-downtime live migrations using custom controllers, durable execution workflows, and tackling complex synchronization problems in ClickHouse Cloud.
Reproducibility in embedding benchmarks is challenging, especially with embedding models that are instruction-tuned and increasingly large. Learn how MTEB tackles prompt variability, scaling issues, and large datasets to ensure fair and consistent evaluations, setting a standard for benchmarking in embeddings.
Want to avoid cloud-hosted AIs, and run your LLMs on your own systems, but not sure where to start? Or even what everything means? Join us to see how easy it can be, and what a beginner needs to know!
Picture billions of messages pouring in daily from thousands of data providers around the globe, which are then processed and published to customers. How can one design a telemetry system to capture, publish, and then index essential information about the data flowing through the system to give internal teams visibility to aid in troubleshooting?
gamma_flow is an open-source Python package for real-time spectral data analysis. Designed for speed and efficiency, it avoids large models, opting instead for a novel supervised dimensionality reduction approach. This enables seamless denoising, classification, and disentangling of single-label or multi-label spectra.
This talk explains locking mechanisms (MVCC, lock queues) in PostgreSQL, focusing on the table-level locks acquired by Data Definition Language (DDL) operations. If not managed well, schema changes can result in downtime. Not all operations require the same level of locking, and PostgreSQL offers tools and techniques to minimize locking impact.
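One widely used technique of the kind this talk covers: bounding how long a DDL statement may wait for its table lock, so a migration fails fast instead of queuing behind a long transaction and blocking all other traffic on the table. This sketch only builds the SQL text; table and column names are made up for illustration.

```python
# Illustrative helper (not from the talk): wrap a DDL statement with a
# PostgreSQL lock_timeout guard so it errors out quickly rather than
# stalling the whole lock queue behind it.
def guarded_ddl(statement, lock_timeout_ms=2000):
    """Return a migration script that bounds the DDL's lock wait time."""
    return "\n".join([
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms';",
        f"{statement};",
        "COMMIT;",
    ])

script = guarded_ddl("ALTER TABLE orders ADD COLUMN note text")
print(script)
```

A migration runner would execute this script and retry on a lock-timeout error, ideally outside peak hours.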
Running open source databases on Kubernetes? Learn best practices for high availability, security, backups, and disaster recovery. Discover key pitfalls to avoid and see how Operators simplify database management for MySQL, MongoDB, and PostgreSQL in Kubernetes environments.
AI can (also) enhance fact-checking and news classification. We developed a platform integrating search, an intelligent assistant, and a RAG system to support reliable journalism. By leveraging diverse data and analytics, we empower everyone with insights for accuracy and transparency, fostering collaboration for trustworthy information.
With Apache NiFi, a multimodal data pipelining tool, you can assemble existing and/or custom Java & Python processors into a variety of flows. Watch a rich data pipeline be constructed from Kafka, stored using the Apache Iceberg table format and consumed from Trino.
We all do things in our day to day work that are deemed ‘non-promotable’ - these are tasks that are crucial for project success but might not get you promoted. This is commonly known as glue work, a term coined by Tanya Reilly. Being glue doesn’t mean an end to your career, and it isn’t something that you can’t recover from.
Aurea Imaging is an AgTech scaleup focusing on precision farming in apple orchards. We've built the Treescout, an edge device mounted on top of a tractor that unlocks the potential of each tree. We used an innovative technology stack to meet the requirements of an outdoor rural setup. Our journey was full of failures, learnings, and ongoing challenges.
In this talk we present current progress on Mahout's new quantum compute layer named Qumat. We will give an overview of the project, explain why we built Qumat, and show its current state. We will present a demo of Qumat in action, and end with calls to action for researchers and engineers interested in using it and contributing to the project.
We think of generating source code from a prompt as an AI-powered feature of modern IDEs, but the general problem has a rich history in research efforts and domain-specific programming systems. In this session, you'll learn about the history of program synthesis, its relationship to the history of AI, and what lessons this history has for us today.
Apache Solr 9.8 introduces the LLM module, opening the door to end-to-end natural language query support through vector-backed semantic search (K-Nearest Neighbors).
This talk explores the open source contribution from both the indexing and query angles and what’s coming next for Solr in terms of integrations with Large Language Models.
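For context, Solr's dense vector search is queried via the `knn` query parser, passing an embedding of the user's query text. The sketch below just assembles such a query string; the field name and vector values are assumptions for illustration.

```python
# Hypothetical helper: build a Solr knn query string of the form
# {!knn f=<vector_field> topK=<n>}[v1, v2, ...], where the natural
# language query has already been embedded into a float vector.
def knn_query(vector, field="embedding", top_k=10):
    values = ", ".join(str(v) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"

q = knn_query([0.12, 0.43, 0.88], top_k=5)
print(q)  # {!knn f=embedding topK=5}[0.12, 0.43, 0.88]
```

In an end-to-end setup, the LLM module would handle the embedding step so clients can send plain text instead of raw vectors.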
The use of semantic reranking on top of a ‘cheaper’ retrieval step is common in modern search applications. The reranking depth represents the number of documents that we select to retrieve and feed into the reranking model. We experiment with different models and datasets and we present our findings including some counterintuitive ones.
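The retrieve-then-rerank pattern with a configurable depth can be sketched in a few lines. The scoring functions below are toy stand-ins for a real first-stage retriever and a more expensive semantic reranking model; nothing here is from the talk's actual experiments.

```python
# Minimal sketch of two-stage search: cheap retrieval over the corpus,
# then an expensive reranker applied only to the top `depth` candidates.
def search(query, corpus, cheap_score, rerank_score, depth=50, k=10):
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    head, tail = candidates[:depth], candidates[depth:]  # only `head` pays the reranker cost
    reranked = sorted(head, key=lambda d: rerank_score(query, d), reverse=True)
    return (reranked + tail)[:k]

corpus = ["apple tart", "apple pie recipe", "banana bread"]
cheap = lambda q, d: len(set(q.split()) & set(d.split()))   # toy lexical overlap
rerank = lambda q, d: float(q in d)                         # toy "semantic" model
results = search("apple pie", corpus, cheap, rerank, depth=2, k=2)
print(results)
```

The `depth` parameter is exactly the knob the talk studies: larger values give the reranker more candidates to fix, at higher cost and, sometimes, counterintuitive quality effects.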
Data in organizations is traditionally split between operational and analytical estates. Join us for an account of our journey combining Apache Kafka and Apache Iceberg to create a solution that addresses both estates with one data source.
Join us for food and drinks at Palais Kulturbrauerei!
With a little experience it's easy to find site search queries that don't work. With live examples picked from a variety of high-profile websites, I'll show you how to easily break search - and discuss what we mean by 'broken', the different kinds of failure and what they reveal about the underlying search engine and how we might improve it.
How do different document parsing and chunking strategies impact RAG pipeline performance? Using real-life documents and LLM-generated question/answer pairs, we assess multiple methods – both open-source and commercial – showing that parsing quality significantly affects response accuracy and that the best approach may depend on the question type.
A comprehensive workshop in which you will gain practical knowledge about how to deploy, configure, interact with and use advanced features of Apache Iceberg. Presented using a local coding environment based on Jupyter notebooks and a Docker Compose stack.
Search is integral to Uber's core business and user experience. In this talk, we’ll explore the unique challenges of Search at Uber and chart the evolution of Uber’s Search Platform—from leveraging Elasticsearch to developing an in-house solution, and finally, innovating in collaboration with the OpenSearch community.
We present an extensible hybrid search solution using Elasticsearch, built on a multi-index architecture and allowing the integration of multiple embedding models. Our approach addresses the challenges of searching a vast and heterogeneous collection, using different chunking granularity and offering an alternative to reciprocal rank fusion.
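As background for the baseline this approach offers an alternative to: reciprocal rank fusion combines several ranked lists using only document ranks, with the standard constant k = 60. A minimal sketch:

```python
# Reciprocal rank fusion: each list contributes 1/(k + rank) per
# document it contains (rank is 1-based); fused score is the sum.
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["d1", "d2", "d3"]   # lexical ranking (toy)
vector = ["d3", "d1", "d4"]   # embedding ranking (toy)
fused = reciprocal_rank_fusion([bm25, vector])
print(fused)
```

Because RRF ignores the underlying scores entirely, multi-index setups with heterogeneous embedding models often look for fusion schemes that can weigh evidence more finely, which is where this talk picks up.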
Observability is the ability to measure the state of the whole system. OpenTelemetry can be used to instrument applications and diagnose issues. But frontend instrumentation is often an afterthought.
Join me as I show how OTel, RUM agents and Synthetic Monitoring can help us identify and diagnose issues in all layers of our applications.
Postgres continues to be widely used, and Postgres-derived closed-source databases such as AlloyDB and AWS Aurora have gained popularity in recent years. In this talk, you’ll learn about the architecture of these radically different kinds of systems, what each of these companies means when they say “Postgres-compatible”, and how to choose one!
Stream processing systems promise fresh results, strong consistency, and S3-based cost savings. But pitfalls exist:
- Backfilling takes too long due to incremental state maintenance.
- Consistency causes system stalls during joins.
- S3 costs spike with cache misses.
This talk explores these issues, mitigations, and hard truths.
In this talk, we present miniCOIL — our attempt to make a sparse neural retrieval model as it should be — combining the benefits of dense and lexical retrieval without propagating their drawbacks. We will share how to design and train a lightweight model that is performant on out-of-domain data and demonstrate its capabilities.
Not all data is good—some is bad, and much is messy. Poor data quality affects customers, employees, and decisions. This session traces issues from symptoms to root causes and explores strategies to fix them. Managing data quality is like battling a seven-headed beast, but with the right approach, you can turn chaos into clarity.
Change Data Capture (CDC) is a powerful technique that enables organisations to react to data changes in real-time. In this talk we will explore FlinkCDC, a component of Apache Flink, and demonstrate how it leverages Flink's robust stream processing capabilities to implement the CDC pattern.
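The core of the CDC pattern is applying a stream of change events to a downstream copy of the data. This toy sketch shows the idea; the event shape is an illustrative simplification of the Debezium-style envelopes FlinkCDC works with, not its actual API.

```python
# Toy CDC consumer: replay insert/update/delete events onto a replica.
def apply_changes(replica, events):
    for event in events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            replica[key] = event["row"]       # upsert the latest row image
        elif event["op"] == "delete":
            replica.pop(key, None)            # tombstone: drop the row
    return replica

replica = {}
events = [
    {"op": "insert", "key": 1, "row": {"name": "alice"}},
    {"op": "update", "key": 1, "row": {"name": "alicia"}},
    {"op": "insert", "key": 2, "row": {"name": "bob"}},
    {"op": "delete", "key": 2, "row": None},
]
apply_changes(replica, events)
```

FlinkCDC's value is doing this continuously and fault-tolerantly straight from database logs, with exactly-once guarantees this sketch does not attempt.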
How do you know if your RAG system is actually working? We’ll share a real-world case study on evaluating RAG in production—tackling messy data, chunking fails, and unexpected chatbot behavior—so you can measure quality with confidence.
Discover how OpenSearch powers search and observability at scale! Now part of The Linux Foundation, OpenSearch is evolving with vector search, NLP, and real-time analytics. Join this session to explore its latest innovations, performance boosts, and expanding ecosystem—directly from the project's Chief Evangelist.
This session builds on the foundational OSS technologies of the modern Lakehouse – Apache Kafka, Spark, Unity Catalog and MLflow – and shows how anyone can analyze supernova data coming from NASA's satellites, query data streams with natural language, and plot their own map of cosmic events.
This talk will focus on learnings gathered when building an enterprise search platform with multimodal content – ranging from highly domain-specific content to images to unstructured content. Problems of extraction, inference, and relevance shall be discussed, while showcasing cross-domain search at scale.
With data moving faster than ever, detecting problems as they happen is crucial. This talk covers how to build a real-time anomaly detection system using Apache Kafka for streaming, Apache Flink for processing, and AI for pattern recognition. Plus, we’ll explore Apache Iceberg for storing historical data to refine models.
Many knowledge chatbots and search engines use RAG. Despite their popularity, these chatbots are often worse than ChatGPT and frustrate users by failing to answer even the simplest questions. In my talk, I reveal how ineffective chunking strategies are a key culprit and demonstrate how to refine chunking to build more reliable RAG systems.
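To make the failure mode concrete, here are two chunking strategies of the kind such talks contrast: naive fixed-width chunking, which happily splits mid-sentence and strands answers across chunk boundaries, versus a greedy sentence-aware variant. Both functions are illustrative sketches, not any speaker's production code.

```python
# Naive fixed-width chunking: ignores sentence boundaries entirely.
def fixed_chunks(text, size=200):
    return [text[i:i + size] for i in range(0, len(text), size)]

# Greedy sentence-aware chunking: pack whole sentences per chunk.
def sentence_chunks(text, max_chars=200):
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 2 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks("First sentence. Second sentence. Third.", max_chars=20))
```

Real systems layer on overlap, structure-aware splitting (headings, tables), and token-based limits, but even this simple contrast shows why chunk boundaries drive retrieval quality.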
Data warehouses, lakes, lakehouses, streams, fabrics, hubs, vaults, and meshes. We sometimes choose deliberately, sometimes influenced by trends, yet often get an organic blend. But the choices have orders of magnitude in impact on operations cost and iteration speed. Let's dissect the paradigms and their operational aspects once and for all.
RAG revolutionized AI by merging search and generation, and agentic behavior takes this search to the next level by enabling LLMs to make decisions and call tools. This talk covers agentic behavior's key features: tool integration and reasoning, along with a live demo.
Robust Search Evaluation is both a “must have” for any modern-day Search team and an “afterthought” that never gets the team’s full attention. This is especially true with the various open source search engines. Most teams build their own data collection and eval tools. Some use standalone open source tools. We present a better solution!
Traditional document retrieval systems struggle with visually rich documents as they discard visual elements during text extraction. This talk shows how vision language models (VLMs) can address these limitations and presents a new benchmark for evaluating document retrieval systems across languages, domains, and document types.
Mercari is a Japan-based second-hand e-commerce marketplace. For a long time, we have relied on Elasticsearch for retrieval and DNN Learning to Rank for ranking. With the development of deep learning and LLMs, we re-architected our search system. In this talk, I would like to share how we did it and how we convinced internal stakeholders to take on the new technology.
This talk delves into delay accounting, an often-overlooked feature that provides valuable insights into CPU time shortages and application latency. Attendees will learn how to leverage these kernel metrics for better performance analysis and system optimization.
Tired of waiting for batch jobs? See how we transformed our data pipeline using Apache Iceberg to stream quality data into Snowflake and ClickHouse simultaneously. Learn about our battle-tested architecture, performance gains, and how we maintain data consistency across dual analytics engines.
ColPali is revolutionary—here’s why: it combines document retrieval with a vision-based large language model, allowing you to search directly within images without needing to extract text. However, running the full model on personal hardware can be challenging due to its computational demands. And thus we’ve released a quantized version of ColPali.
We explore the integration of machine learning inference processors in OpenSearch pipelines, focusing on multi-modal search capabilities. We demonstrate how these processors enhance ingest, search request, and response processing for text, image, and audio data, significantly improving search and analytical capabilities in a multi-modal world.
Apache Flink is uniquely positioned to serve as the backbone for AI agents, providing them with stream processing as a powerful new tool. We'll explore how Flink jobs can be transformed into "Agents"—autonomous, goal-driven entities that dynamically interact with data streams, trigger actions, and adapt their behavior based on real-time insights.
Coding LLMs are now part of our daily work, making coding easier. In this talk, we share how we built an in-house LLM for AI code completion in JetBrains products, covering design choices, data preparation, training, and model evaluation.
As real-time data processing becomes increasingly critical across industries, the need for reliable, transparent, and actionable data lineage strategies has never been greater. Understanding where your data comes from, how it flows through your systems, and how it is transformed along the way is essential for maintaining trust and ensuring compliance.