Berlin Buzzwords 2025

Reproducibility in Embedding Benchmarks
2025-06-16, Frannz Salon

Reproducibility in embedding benchmarks is challenging, especially with embedding models that are instruction-tuned and increasingly large. Learn how MTEB tackles prompt variability, scaling issues, and large datasets to ensure fair and consistent evaluations, setting a standard for embedding benchmarking.


Reproducibility in embedding benchmarks is no small feat. Prompt variability, growing computational demands, and evolving tasks make fair comparisons a challenge. The need for robust benchmarking has never been greater.

The Massive Text Embedding Benchmark (MTEB) addresses these challenges with a standardized, open-source framework for evaluating text embedding models. Covering diverse tasks like clustering, retrieval, and classification, MTEB ensures consistent and reproducible results. Extensions like MMTEB (multilingual) and MIEB (image) further expand its capabilities.
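To give a flavour of what this looks like in practice, here is a minimal sketch of running an MTEB evaluation; the task and model names are illustrative, and the exact API may vary between MTEB versions:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Load any SentenceTransformer-compatible embedding model (model name is illustrative).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Select benchmark tasks by name, e.g. a classification task and a retrieval task.
tasks = mteb.get_tasks(tasks=["Banking77Classification", "SciFact"])

# Run the standardized evaluation; scores are written to disk for reproducibility.
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```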

In this talk, we’ll explore the quirks and complexities of benchmarking embedding models, such as prompt sensitivity, scaling issues, and emergent behaviors. We’ll show how MTEB simplifies reproducibility, making it easier for researchers and industry practitioners to measure progress, choose the right models, and push the boundaries of embedding performance.


Tags:

Search, Store, Scale

Level:

Intermediate

My focus is on making AI systems usable, scalable, and maintainable. I'm currently a Staff Data Scientist at Zendesk QA, working on LLM-powered features that see millions of conversations a day.

Previously at Clarifai, I helped build and maintain multimodal retrieval systems in production. My background is in Aerospace Engineering and Machine Learning, and I hold undergraduate (B.A.Sc. in EngSci) and graduate (M.A.Sc.) degrees from the University of Toronto.

In my spare time, I am a maintainer for MTEB, like to see the world, and do a bit of swim/bike/run racing.