Berlin Buzzwords 2025

How [not] to evaluate your RAG
2025-06-17, Palais Atelier

How do you know if your RAG system is actually working? We’ll share a real-world case study on evaluating RAG in production—tackling messy data, chunking fails, and unexpected chatbot behavior—so you can measure quality with confidence.


Judging search relevance seems straightforward: the higher a relevant product ranks, the better your search system works. But when it comes to RAG, things get complicated—there’s no ranking, no traditional documents, just an LLM-generated response to a query. So how do you know if it’s any good? Is there an objective way to measure progress, or are you just guessing?

In this talk, we’ll share a real (if not exactly glamorous) case study of building and evaluating a production RAG system for a fintech company. We’ll cover the headaches of working with a small, noisy corpus, chunking gone wrong, and handling low-resource languages (plus users who treat your support chatbot as their therapist), and we’ll compare frameworks (like RAGAS) for evaluating a RAG system, so you’re not flying blind.
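To give a flavour of that kind of evaluation, here is a minimal sketch of scoring a toy question/contexts/answer sample with RAGAS. It assumes the classic ragas evaluate() API with a Hugging Face Dataset and a configured judge-LLM API key; column names and metric imports differ between ragas versions, and the sample data is purely illustrative.

```python
# Minimal RAGAS evaluation sketch (assumes ragas's classic evaluate() API;
# column names and metric imports vary between ragas versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Toy evaluation set: each row holds a user question, the contexts retrieved
# by the RAG pipeline, the LLM-generated answer, and a reference answer.
eval_data = {
    "question": ["How do I reset my card PIN?"],
    "contexts": [[
        "You can reset your card PIN in the app under Cards > Security.",
        "PIN changes take effect immediately at ATMs.",
    ]],
    "answer": ["Open the app, go to Cards > Security, and choose 'Reset PIN'."],
    "ground_truth": ["Reset the PIN in the app via Cards > Security > Reset PIN."],
}

dataset = Dataset.from_dict(eval_data)

# Each metric is scored by an LLM judge, so credentials for the default
# judge model (e.g. OPENAI_API_KEY) must be set in the environment.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores between 0 and 1
```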


Tags:

Search, Data Science, Stories

Level:

Intermediate

A principal ML engineer and an ex-startup CTO working on modern search and recommendation problems. A pragmatic fan of open-source software, functional programming, LLMs, and performance engineering.