Join us as we kick off Berlin Buzzwords 2023
This talk focuses on unpacking this year’s big buzzwords of “open AI” and “responsible AI” to highlight the range of (sometimes contradictory) activities that exist under these umbrella terms and how
This talk introduces a novel programming model: the user declares data collections with the properties, and these declarations can be transparently ported to multiple platforms including GPUs.
For quite some time, Hadoop served as the data warehouse for Kleinanzeigen. In this presentation, our objective is to provide an overview of our approach, which involves implementing a cloud-based data pipeline with the help of dbt and Airflow.
Fascinated by vector search but don't know where to start?
Join us to crack the code and leverage the potential of vector search to delight your users.
A practical guide for training learned sparse models to outperform BM25 on zero-shot document retrieval tasks
In this talk, we share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at PB scale.
A tutorial on writing Solr query parsers that use TensorFlow for Java to augment queries.
This talk will explore the bad and best practices when deploying Apache Airflow in a production environment. From common pitfalls such as misconfigured tasks and lack of scalability,
Using generative Large Language Models (LLMs) to generate synthetic labeled data to train in-domain ranking models. Distilling the knowledge and power of generative LLMs into effective ranking models.
Columnar databases seem to be full of mysteries and confusion. In this introduction to ClickHouse, we'll take apart its building blocks to see how it achieves its remarkable performance.
Probabilistic data structures give developers room to massively cut down on space requirements while sacrificing a bit of accuracy, so when is probably good enough?
How we successfully transitioned the search system for the world's largest recipe sharing platform to a modern stack – including the wins, fails, team structures, and processes along the way.
The rise of cloud-native technologies has revolutionised the way organisations store, process, and manage their data. We'll explore the power that Kubernetes gives us to efficiently manage them.
Supporting multiple vectors in a field dedicated to k-nearest-neighbors search has long been a fundamental problem for Apache Lucene.
This talk describes how this has finally been designed and implemented.
Apache Iceberg is an open table format that has wide support among open-source and cloud vendors. After this talk, you'll be comfortable with all the concepts and how to use Iceberg.
Data vs. fake news: using available data to offer a critical view of the world
This talk will explore the techniques and best practices for joining dozens of data streams, focusing on different joining mechanisms, such as binary joins and delta joins, as well as pros and cons.
Learn how the News Search Infrastructure Team at Bloomberg migrated from a customized implementation of Apache Solr to the upstream Apache Solr
An ethical overview of how a privacy-focused search engine has to adapt its behavior from crawling to ranking web documents without knowing anything about the user and still be as relevant as possible
Open source software is so much more than code – docs, community and infra need maintaining. How do you attract and keep non-code contributors? Let two experienced practitioners show you the way!
Advanced ML models for text may need hundreds of machines, but with open source tools and pre-trained models, you can do a lot just on your laptop or in a Docker container. Discover what and how!
Deep learning for search has become a hot topic, but pre-trained neural nets often do not perform as well as expected. We will discuss the algorithms behind model fine-tuning, and how to scale it up.
In this session we will share our vision towards an alternative, decentralized and collaborative search engine, from social considerations to technical implementation.
As AI grows, software manages more risks to humans. Moving fast and breaking things won't do. We will look at aviation to learn how successful risk management structures might look in software & AI.
When we set out to write an open source fast time series database, we realised we would need every trick in the book to make it as performant as possible. This talk will show what's inside.
The BSI provides actual data on acute IT threat situations. We developed a system for detecting threats: crawling, automatic analysis with NER, NEL, provision and use of dedicated tools for evaluating
No one wants to be responsible for breaking the build. But what can you do as a developer to avoid being the bad guy? How can project leads enable their teams to reduce the occurrence of broken builds?
Understand how data moves into and out of Apache Kafka® by taking a look at the producer and consumer request life cycle. Follow a request from an initial call to send() or poll(), all the way to disk
In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.
Chaos engineering is hard; in containers it is even harder.
This session will show attendees the considerations and get them started on their way to making more resilient applications in the cloud
It's that old question: which search engine should I choose for my project? Elasticsearch, Solr, OpenSearch (all based on Lucene), or Vespa, or maybe one of the new vector search engines?
This demo-heavy workshop scores a hat trick by combining Apache Lucene, MongoDB, and GraphQL to easily build search functionality across data collections and 3rd party APIs into applications.
Communicating technical knowledge effectively is a core skill for practitioners, but one which is often neglected. We’ll give practical advice on how to (and not to!) communicate technical ideas.
This talk will discuss the ways Apache Lucene might go in the next years. From the perspective of a full-text search engine, it looks like it is feature-complete. So what comes next?
Achieving optimal execution plans in distributed databases is a challenging task. This talk will focus on CrateDB: a distributed SQL database, and key strategies for optimizing its query performance.
The MLOps infrastructure we built to support ML in search at Mercari, Japan’s largest C2C e-commerce platform.
Learn how to build with LLMs, like ChatGPT, and avoid typical pitfalls like hallucination and outdated information. Accompanied by practical code examples using the open source framework Haystack.
Siren Federate is an Elasticsearch plugin for joining inverted indices at query-time. Learn in this talk about its inner workings and how it complements features of Elasticsearch like runtime fields.
Modernization efforts face particular hurdles in large, established OSS projects. Come learn about the community and technical challenges encountered on Apache Solr's path towards revamped HTTP APIs.
During this talk, I will take you on my over-a-decade-long journey in search. Starting from having witnessed the inception of Elasticsearch to my current endeavors with Weaviate, I will share my first-hand experience of the evolution, challenges, and lessons learned along the way.
Conducting online testing is crucial for assessing a model’s performance in a real-world scenario. This talk explores a customized approach for evaluating ranking models using Kibana.
This talk will cover several use cases in which generating synthetic data is useful (or even essential) and introduce a toolbox of practical techniques for synthesizing data in these situations.
If you want to build a chat bot like ChatGPT on your own data, you need to use search to provide the context. Usually semantic search is used, but we've found that keyword search has some pros.
In this talk, we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using Kafka Connect and ksqlDB.
Apache Solr’s built-in autoscaling is gone, but the need for autoscaling persists. Using Kubernetes’ HPA, the Solr Operator and new Solr APIs, we re-introduce autoscaling for Solr on Kubernetes.
During COVID, the pressure was on for search. I’ll discuss the challenges of building a search engine matching people to COVID test facilities and how the lessons learned can solve healthcare issues.
Amazon's Alexa team has lost billions. Google's and Apple's hubs aren't great successes. Is the smart home failing? How can you keep your lights on when they depend on cloud infrastructure to work?
Learn how to handle errors in streaming data pipelines using concepts such as dead-letter queues.
Large Language Models are great at grammar but tend to confabulate. Building a reliable knowledge base might be a way to solve this. Here is how.
“Platform Engineering,” the latest buzzword, means building an internal platform to improve your SDLC in a way your developers will want to use. Can this be done with engineering skills alone?
How dense vector functionality was used to provide several ‘Google-like’ capabilities such as Extractive Answers and knowledge graph search over a large dataset at the EU Publications Office.
Learn about the motivation that led to the development of the new Cross Data Center (XDC) Replication module in Apache Solr and discover the capabilities it offers to make Solr disaster-ready.
In this infodemic era, fact-checking is becoming a vital task.
In this talk, we’ll discover how to build a simple fact-checking system for rock music, leveraging the power of open-source libraries.
We are introducing a new Hadoop Filesystem API called "vectored read", with which we can achieve significant speedups for all big data applications, especially on cloud storage like S3 and ABFS.
Tackle large search results by estimating hit count, interpolating a first-phase ranking and limiting the returned result set to the most relevant documents in a multi-million document index.
How are the columns containing sensitive data used across the data ecosystem? What input columns were used to produce a given report field? OpenLineage can answer those questions automatically.
This talk shares the story of how Shopify implemented seamless storage autoscaling for Elasticsearch that powers search for millions of merchants without data loss.
Combining BM25, neural embeddings and customer behavior with Learning-to-Rank into an ultimate ranking ensemble, with examples on the Amazon ESCI e-commerce search dataset.
In this session, you'll discover seven Apache Pulsar features that enable you to build amazing event-driven applications and how Apache Pulsar differs from traditional message brokers.
This is the story of how to catch cheaters by combining observability and analytics data through the power of search.
We will explore options to run Apache Flink with a very low resource footprint, allowing users to run full streaming SQL queries or custom streaming applications on JVMs with less than 500 MB