Berlin Buzzwords 2024
Join us as we kick off Berlin Buzzwords 2024!
At the turn of the century, the internet ushered in the era of the digital commons - a realm of collectively created and managed resources that are open to the public. Over the past decade, the digital landscape has undergone a profound transformation.
TopK queries are SQL queries that contain both an ORDER BY and a LIMIT clause. The optimization presented in this talk uses runtime information to skip scanning data that could not be a part of the final result, thereby significantly accelerating these types of queries.
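As a rough illustration of the idea (not the implementation presented in the talk), the sketch below evaluates the equivalent of `ORDER BY value DESC LIMIT k` over data split into chunks. The function name `top_k_desc`, the chunk layout, and the use of a per-chunk maximum as stand-in metadata are all assumptions made for this example.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_desc(chunks: Sequence[Sequence[int]], k: int) -> Tuple[List[int], int]:
    """Return the k largest values and the number of chunks actually scanned.

    Hypothetical sketch: each chunk stands in for a block of rows whose
    maximum value would normally come from precomputed metadata.
    """
    if k <= 0:
        return [], 0
    heap: List[int] = []  # min-heap holding the current top-k candidates
    scanned = 0
    for chunk in chunks:
        if not chunk:
            continue
        chunk_max = max(chunk)  # assumption: available cheaply as chunk statistics
        # Runtime pruning: once k values are held, a chunk whose maximum cannot
        # beat the current k-th best value cannot change the result.
        if len(heap) == k and chunk_max <= heap[0]:
            continue
        scanned += 1
        for value in chunk:
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True), scanned
```

The threshold `heap[0]` only becomes known while the query runs, which is why this kind of skipping relies on runtime information rather than on static planning.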
At mobile.de we strive to provide a better and faster search. We use a backend ML system to learn changing user interests and optimise the search experience. Based on learning to rank using XGBoost, we discuss our current search relevance ranking framework and how it ranks millions of searches daily.
We’re taking a stab at a definitive - yet nuanced! - answer to “how much heap do I need?” and “is GC a problem and how can I fix it?” for Solr/Elasticsearch/OpenSearch deployments.
Discover how melding user-traffic signals with LLM-derived relevance labels can significantly improve learning to rank models. This talk unveils a novel approach to enhancing search relevance for query-product pairs, offering a glimpse into the future of e-commerce search technology.
Let’s rebuild the world of Apache Kafka® brick-by-brick through the lens of LEGO®. You’ll explore Kafka’s inventory of pieces, learn the ins and outs of core elements, and see just how they ‘click’ together with other APIs and tools. The only question left is: what will YOU make with Kafka?
We explore the usage of AI, especially NLP techniques and LLMs, to enhance Apache Solr data accessibility. We propose translating natural language queries into structured Solr queries using LLMs and metadata to improve search and user experience. We’ll discuss the results and future directions.
The successes (and failures) of applying Machine Learning to improve Reddit's search ranking
What's the role of orchestrators in the Modern Data Stack? Trying new orchestrators? Do we need orchestrators at all? Or has what an "orchestrator" can do for you been redefined?
Meet the "new" orchestrator in town: "Apache Airflow 2.x", with ways of orchestration you had not realized were possible.
Hybrid search combines traditional keyword-based search with vector-based search.
The result sets are merged and a single ranked list of items is returned.
Reciprocal Rank Fusion is one of the most popular algorithms for such a task.
This talk presents the work done to bring it to Apache Solr.
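For readers unfamiliar with the algorithm, here is a minimal sketch of Reciprocal Rank Fusion in its commonly described form: each document scores the sum of 1 / (k + rank) over every result list it appears in. This is not the Apache Solr implementation discussed in the talk; the function name, the document IDs, and the tie-breaking rule are assumptions for this example, and k = 60 is the constant often used in the literature.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID (an assumption
    # made here only to keep the output deterministic).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # hypothetical keyword-search results
vector_hits = ["c", "a", "d"]   # hypothetical vector-search results
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because only ranks are used, the keyword and vector scores never need to be normalized against each other, which is one reason the method is popular for hybrid search.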
OpenTelemetry is a standard for tracing across multiple components. Let’s see how to set it up across different stacks.
S3 is cheap, but S3 is slow. Stream processing systems want to maintain internal state at low cost, but cannot tolerate high latency. How could we build a stream processing system on top of S3? I'll tell you what we learned over the last three years.
This talk focuses on how we leveraged Nvidia’s open source cuVS library to accelerate Apache Lucene’s vector search capabilities on the GPU. cuVS contains several algorithms for approximate nearest neighbors and clustering on the GPU and we show how it pairs well with Lucene, which has long been the
The talk explores managing transaction and validity times in data storage, using examples like retroactive creation of shipping documents. It highlights how bitemporal databases aid in this process and discusses the complexities of implementing such solutions.
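To make the two time axes concrete, here is a deliberately simplified sketch, not the approach from the talk. It assumes each version of a record stays valid until a version with a later `valid_from` supersedes it; the names `Version`, `as_of`, `valid_from`, and `recorded_at` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    value: str
    valid_from: date   # validity time: when the fact became true in the real world
    recorded_at: date  # transaction time: when the database learned about it


def as_of(history: List[Version], valid_at: date, known_at: date) -> Optional[str]:
    """Answer: what did we believe on `known_at` about the state at `valid_at`?"""
    candidates = [
        v for v in history
        if v.valid_from <= valid_at and v.recorded_at <= known_at
    ]
    if not candidates:
        return None
    # The most recent validity start wins; a later recording of the same
    # validity start acts as a correction and overrides the earlier one.
    return max(candidates, key=lambda v: (v.valid_from, v.recorded_at)).value


# A shipping document created retroactively: the goods shipped on 10 January,
# but the document was only recorded on 15 January.
history = [
    Version("created", date(2024, 1, 5), date(2024, 1, 5)),
    Version("shipped", date(2024, 1, 10), date(2024, 1, 15)),
]
```

Asking about 12 January with the knowledge of 12 January yields "created", while asking the same question with the knowledge of 16 January yields "shipped". Keeping both answers reproducible is what the second time axis is for.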
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
Discover Restate: A new tool for event-driven apps, blending Kafka's power with simplicity for handling complex transactions, async tasks, and microservices. Dive into its unique approach to tackle scalability, resilience, debuggability, and state management challenges in our session.
Many of the putatively novel challenges of building systems around LLMs are analogous to problems we've solved for conventional ML systems. This talk will show you why the things you already know about building ML systems are still relevant for LLM systems — and where the true novelty of LLMs lies.
This talk explains what it takes to bring end-to-end encryption to streaming data pipelines built on top of Apache Kafka and Flink. A live demo illustrates how to encrypt/decrypt sensitive payload fields by means of single message transformations and user-defined functions without any custom code.
Looking to reduce your home energy consumption, to move to more ecological heating, or just want to know what's going on in your house? You need data, and graphs! Learn how cheap hardware plus devops observability tools let you do that!
The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach.
The demand for advanced search and data retrieval capabilities is growing exponentially: the rise of AI applications, along with the unprecedented scaling of data, is leading to workloads that are pushing traditional software-based search to its limits.
As cloud computing costs skyrocket and queries become more complex, organizations are often forced to compromise on result relevance given the strict real-time latency requirements of 100ms.
To address this issue, a new dimension of search must be introduced: domain-specific computing; it focuses on designing a dedicated chip for search, to consistently achieve ten times faster search at a billion-scale, all at a fraction of the infrastructure cost.
In this talk we detail how Yelp evolved its Streaming Data Platform at scale while supporting its growing business needs coupled with the fast changing streaming landscape.
Flink's SQL engine is the workhorse behind many event processing platforms. We take a deep look into the internals and take the stack apart! Optimizer phases, streaming primitives, watermarks, CDC, and upsert keys. Let me give you a feeling for the power of a simple streaming SQL query.
This session explores different fallback strategies for RAG applications, enabling you to build robust LLM systems for various scenarios, using the open source LLM framework Haystack.
From relevance guesswork to granular search experimentation at scale. Evaluate modifications to your search, such as incorporating synonyms, adjusting field-boosts, adding vector search, or product assortment modifications, on real data rather than relying solely on intuition.
Gear up for a speed competition, it's ChatGPT versus yours truly, racing to build a data visualisation platform from scratch at top speed. Who will win? We’ll be taking on the challenge of reconstructing an energy generation visualisation platform, drawing inspiration from the likes of GridWatch.
How do you track data lineage in an open-source, real-time analytics pipeline? I'll show you in a demo!
Dense Vector Search is not the only route to improve your search relevance. Empower your existing sparse keyword search with semantic search capabilities by leveraging text expansion, metadata enrichment, and query re-writing techniques.
A great chance to hear about the future of work from the biggest all-remote company in the world. Arm yourself with a best-practices toolbox of async work tips and tricks, and draw on the experience of distributed work leaders.
This talk will discuss approaches to evaluate LLMs at the end-to-end and task levels, focusing on use cases such as question-answering (RAG). We will also cover metric selection and ways to generate datasets using LLMs.
Yes it's possible to create an 8-bit computer from scratch on breadboards at home with basic components and that's what I will demonstrate live with lots of blinking LEDs ;)
Apache Pinot is a really fast database that's powering real-time applications across various industries, much akin to how people use OLTP databases. So what's the difference? Are there scenarios where Pinot can replace an existing OLTP database? In this talk we will explore this idea.
Learn a new cloud architecture that separates streaming ingest compute from query compute for real-time applications and how it addresses the age-old issue of compute contention.
Join us for snacks & drinks in a relaxed atmosphere. Meet old and new friends, business contacts or get to know other participants of Berlin Buzzwords! The Get-Together takes place directly in front of Palais Kulturbrauerei.
Spare Cores, a Python-based open-source ecosystem, offers standardized inventory and performance evaluations for compute resources of public cloud and server providers, helping with selecting and starting the optimal instance type for containerized tasks, such as ML model training or ETL processes.
GenAI is transforming organizations rapidly. But how can we scale individual experimentation to enterprise-grade business products? This talk will cover learnings from hundreds of use cases in pharma about user intentions, information access patterns and quality metrics.
Introducing an open source library in Python: Quix Streams. It solves all the complexities of stream processing in a cloud native package with a familiar Pandas DataFrame API. This library lets you work with streaming data similarly to static data in your Jupyter Notebook!
We all know there are several vector databases, and Andrej Karpathy once said that even an array can do the same job: true, yet not quite true. Vector databases provide the infrastructure to store embeddings, but how those embeddings are made is the most innovative part of all.
Exploring and enhancing lexical search and semantic search in practical scenarios, we assess various optimization methods and their varied effects on metrics. The focus is on the integration of semantic search into an established lexical search system, addressing potential challenges and pitfalls.
Curious about the unexpected places where Deep Learning is being used these days? Because not everything is ChatGPT, AI for search, or fraud detection in e-commerce, this talk will review, and build together, state-of-the-art DL architectures such as the ones used by the Bundesliga. Help your local team with AI!
This talk introduces OpenSearch's Flow Framework and RAG (Retrieval-Augmented Generation) Tool, enabling developers to create AI-augmented agents, queries, and ingestion flows, integrate ML models, and streamline app development through a no-code/low-code builder.
Nearly half of data scientists find it difficult to adopt streaming technologies. The audience will learn about the technological barrier and the operational burden that inspired us to look for better solutions, and how we built a stream-batch unified Python dataframe API.
I will discuss advanced retrieval-augmented generation (RAG) techniques such as hybrid search, re-ranking, query generation, and semantic text chunking.
Celebrating Apache Lucene's 22nd anniversary, this conference explores its pivotal role in search and data technologies, from powering platforms like Solr, Elasticsearch and OpenSearch to recent AI synergies with vector indexing/search. Discover Lucene's evolution and future horizons.
This talk will focus on the exploration of serverless Kafka and real-time data streaming with Node.js microservices. In this talk, I will delve into Upstash's serverless Kafka architecture and how it can be leveraged to build scalable and cost-effective solutions for microservices.
This is a speedrun through the past ten years of R&D on approximate nearest-neighbor search algorithms and vector databases, covering the major advances, the current state of the art, and possible future directions. This will be informed by Joel's experience contributing to JVector, a Java embedded vector search library.
Explore how leveraging a knowledge graph combined with vector indices can enhance semantic search and retrieval-augmented generation (RAG), enabling global insights beyond document boundaries.
Curious about NLP beyond the startup hype? Join us to explore NLP in a 'traditional' setting. Tackle challenges like data scarcity and domain specificity using e.g. data augmentation and zero-shot classification, and learn some tips and tricks to address concrete and relatable NLP problems.
I'll talk about all the mistakes we made and all the lessons we learned developing a data platform on top of a dozen or so Kubernetes operators so you don't have to go through the same ordeal.
This talk aims to challenge the usual "code-centric" narrative around open source projects. A project's impact doesn't rest solely on lines of code but thrives within a diverse ecosystem of non-technical contributions.
Embeddings transform text into numerical vectors, capturing semantic relationships. This talk explores the data preparation, training processes, evaluation, and a demo of Jina Embeddings V2 in a hybrid search pipeline, showcasing its practical applications.
We explore the evolving role of product management in open-source generative AI, focusing particularly on Retrieval Augmented Generation's (RAG) industry impact. We cover community-commercial balance, innovative AI product discovery, and sustainable monetization strategies.
Key practical insights across dataset construction, custom metrics, model building, and model de-biasing techniques behind the success of our real-time AI search ranking system which drives significant business impact and now serves as a centerpiece of Mercari’s Search & Discovery platform.
You've heard the wisdom of choosing boring technology, but how do you balance this with new usage patterns and ever-increasing scale? Learn how Slack developed a petabyte-scale log search engine with a team of just three engineers, supporting over one million queries per day.
This talk presents an experimental serverless Solr indexer where the records from multiple database tables are merged using the map-reduce approach. The merge is implemented in AWS using Step Function Distributed Map to orchestrate many lambda functions.
Learn how sequential testing has cut our experiment durations by nearly 40%. We'll delve into the core principles, tackle real-world implementation issues, and unveil Basemath, our Python-driven open-source tool. Gain insights on refining your experimental analysis for smarter feature deployment.
Everyone is unique, and that is especially true for what they eat. At Albert Heijn, we are moving away from popularity-based search for the broader customer base, in favor of tailoring search to our unique customers' tastes. We will tell you all about this transition.
Most of the foundational tools we use in this industry are open source. But CI/CD is a domain where open source has stagnated and most innovation is happening in proprietary code. This talk discusses why this is, why it's dangerous, and what we can do about it.
Log-structured merge-tree databases rely on a combination of in-memory buffers and hierarchies of immutable on-disk structures, which are two cases where a trie data structure is most efficient. Join us for a glimpse at the ways in which Apache Cassandra applies and benefits from this combination.
We present the lessons learnt from improving the search of our global online marketplace with 20 million products sold per year. We successfully moved from a traditional word-match based approach (BM25) to a modern hybrid solution by adding a semantic vector model which we fine-tuned to our domain.
Modifying SolrCloud to separate compute and storage presents challenges but is cost effective and improves scalability and availability. Nodes become stateless and durability is delegated to the shared storage implementation.
Salesforce runs such clusters at scale. The code is open source.
This talk equips developers with everything they need to understand and optimize the cost of Kafka workloads in the era of cloud computing.
Have you unlocked the geospatial powers of PostgreSQL?
This talk will show three neat features of PostgreSQL and how to use them effectively to handle geospatial time series in a new domain. The data in question is live marine traffic from Norwegian territory.
We've recently seen a boom of specialized vector databases. At the same time, almost all popular database projects have added support for vectors. So a lot of people are asking themselves if and when they really need a specialized vector database, and when they could use an already deployed tool.
Explore RAG systems: Understand evaluation criteria, tools, and frameworks.
Discover essential components and compare them with existing frameworks.
Gain insights for enhanced control in your path toward RAG optimization.
The Open Web Search Initiative (OWS.eu) is a European initiative providing an open platform to foster innovation in search and AI applications. Join us for an introduction to OWS, its public datasets and an intro to two of the open source projects that underpin it - StormCrawler and URLFrontier.
Join us as we wrap up Berlin Buzzwords 2024!