Berlin Buzzwords 2025

How [not] to evaluate your RAG
2025-06-17, Palais Atelier

How do you know if your RAG system is actually working? We’ll share a real-world case study on evaluating RAG in production—tackling messy data, chunking fails, and unexpected chatbot behavior—so you can measure quality with confidence.


Judging search relevance seems straightforward: the higher a relevant product ranks, the better your search system works. But when it comes to RAG, things get complicated—there’s no ranking, no traditional documents, just an LLM-generated response to a query. So how do you know if it’s any good? Is there an objective way to measure progress, or are you just guessing?

In this talk, we’ll share a real (if not exactly glamorous) case study of building and evaluating a production RAG system for a fintech company. We’ll cover the headaches of working with a small, noisy corpus, chunking gone wrong, and handling low-resource languages (plus users who treat your support chatbot as their therapist), and we’ll compare frameworks (like RAGAS) for evaluating a RAG system, so you’re not flying blind.
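To give a flavour of that kind of evaluation, here is a minimal sketch of scoring a toy question/contexts/answer sample with RAGAS. It assumes the classic ragas evaluate() API with a Hugging Face Dataset and a configured judge-LLM API key; column names and metric imports differ between ragas versions, and the sample data is purely illustrative.

```python
# Minimal RAGAS evaluation sketch (assumes ragas's classic evaluate() API;
# column names and metric imports vary between ragas versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Toy evaluation set: each row holds a user question, the contexts retrieved
# by the RAG pipeline, the LLM-generated answer, and a reference answer.
eval_data = {
    "question": ["How do I reset my card PIN?"],
    "contexts": [[
        "You can reset your card PIN in the app under Cards > Security.",
        "PIN changes take effect immediately at ATMs.",
    ]],
    "answer": ["Open the app, go to Cards > Security, and choose 'Reset PIN'."],
    "ground_truth": ["Reset the PIN in the app via Cards > Security > Reset PIN."],
}

dataset = Dataset.from_dict(eval_data)

# Each metric is scored by an LLM judge, so credentials for the default
# judge model (e.g. OPENAI_API_KEY) must be set in the environment.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores between 0 and 1
```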


Tags:

Search, Data Science, Stories

Level:

Intermediate

A principal ML engineer and an ex-startup CTO working on modern search and recommendation problems. A pragmatic fan of open-source software, functional programming, LLMs, and performance engineering.