2025-06-17 – Palais Atelier
How do you know if your RAG system is actually working? We’ll share a real-world case study on evaluating RAG in production—tackling messy data, chunking fails, and unexpected chatbot behavior—so you can measure quality with confidence.
Judging search relevance seems straightforward: the higher a relevant product ranks, the better your search system works. But when it comes to RAG, things get complicated—there’s no ranking, no traditional documents, just an LLM-generated response to a query. So how do you know if it’s any good? Is there an objective way to measure progress, or are you just guessing?
In this talk, we’ll share a real (if not exactly glamorous) case study of building and evaluating a production RAG system for a fintech company. We’ll cover the headaches of working with a small, noisy corpus, chunking gone wrong, low-resource languages (plus users who think your support chatbot is their therapist), and the frameworks (like RAGAS) you can use to evaluate a RAG system—so you’re not flying blind.
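For a sense of what framework-based evaluation looks like in practice, here is a minimal sketch using the ragas library. The 0.1-style API, metric names, and dataset fields shown are assumptions that vary by version, and the fintech question is invented for illustration:

```python
# Minimal sketch of framework-based RAG evaluation with ragas
# (0.1-style API; exact imports and metric names differ between versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical evaluation set: each row pairs a user question with the
# chatbot's answer and the retrieved context chunks it was grounded on.
eval_data = {
    "question": ["How do I reset my card PIN?"],
    "answer": ["You can reset your PIN in the app under Card settings."],
    "contexts": [
        ["To reset your PIN, open the app and go to Card settings > Reset PIN."]
    ],
    "ground_truth": ["Reset the PIN in the app via Card settings."],
}
dataset = Dataset.from_dict(eval_data)

# Each metric targets a different failure mode: faithfulness flags answers
# not supported by the retrieved chunks, answer_relevancy flags off-topic
# replies, context_precision flags noisy retrieval. Note that evaluate()
# uses an LLM judge under the hood, so an API key must be configured.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```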
Search, Data Science, Stories
Level: Intermediate
A principal ML engineer and ex-startup CTO working on modern search and recommendation problems. A pragmatic fan of open-source software, functional programming, LLMs, and performance engineering.