Berlin Buzzwords 2024

Advancements in Evaluating Large Language Model Applications
2024-06-10, Frannz Salon

This talk will discuss approaches to evaluating LLM applications at the end-to-end and task levels, focusing on use cases such as retrieval-augmented question answering (RAG). We will also cover metric selection and ways to generate evaluation datasets using LLMs.


Generative AI has gained significant popularity in recent years. With the help of proprietary and open-source LLMs, many companies, start-ups, and researchers are exploring the potential of these models to develop new products or enhance existing ones, using techniques such as prompt engineering, fine-tuning, and model chaining. LLMs are widely applied to tasks such as text summarization, translation, stylization, information extraction, question answering, and content generation. However, evaluation of these applications is often limited to human assessment, which does not scale and restricts the ability to run experiments.
In this talk, we will delve into approaches for performing evaluation at both the end-to-end and task levels, taking different types of tasks into account. We will examine several typical use cases, including retrieval-augmented question answering (RAG), and explore in detail how evaluation can be conducted for them. We will also discuss metric selection, covering both scenarios where labels/references are available and those where they are not, as well as ways to generate evaluation datasets using LLMs.
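To give a flavor of the topics above, here is a minimal Python sketch contrasting the two regimes: a reference-based metric (token-level F1, as used in SQuAD-style QA), a reference-free LLM-as-judge check for RAG answers, and LLM-based generation of labeled QA pairs. The llm() helper and the prompts are hypothetical placeholders for whatever model or client you use, not code from the talk.

# Illustrative sketch only; llm() and the prompts are hypothetical placeholders.

def llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call; replace with a real client."""
    raise NotImplementedError

def token_f1(prediction: str, reference: str) -> float:
    """Reference-based metric: token-level F1 between answer and gold label."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    """Reference-free metric: an LLM judges whether a RAG answer is
    supported by the retrieved context (no gold label needed)."""
    verdict = llm(
        "Rate from 1 to 5 how faithful the answer is to the context.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with a single digit."
    )
    return int(verdict.strip())

def generate_qa_pair(passage: str) -> tuple[str, str]:
    """Dataset generation: use an LLM to synthesize a labeled
    (question, answer) pair from a document passage."""
    question = llm(f"Write one factual question answerable from this passage:\n{passage}")
    answer = llm(f"Answer using only the passage.\nPassage: {passage}\nQuestion: {question}")
    return question, answer

Reference-based scores like token_f1() require labeled data, which the generate_qa_pair() step can bootstrap; reference-free judges like judge_faithfulness() trade that requirement for dependence on the judging model's reliability.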

Petr Polezhaev is a data scientist at SIXT, focusing on NLP and GenAI. Previously, he worked as a data scientist at technology companies, creating data products for industry and academia, including recommender systems and AI components for educational platforms.