How to train your general purpose document retriever model
2023-06-19, Maschinenhaus

A practical guide for training learned sparse models to outperform BM25 on zero-shot document retrieval tasks


Large language models augment traditional information retrieval (IR) approaches with both high-quality language parsing skills and knowledge external to the corpus. However, training a state-of-the-art general-purpose model for document retrieval is challenging. This talk is motivated by our experience training a high-quality retriever model, for use alone or together with BM25, to improve out-of-the-box relevance in Elasticsearch.

We chose to focus on the learned sparse model (LSM) architecture. LSMs for IR were recently popularised by SPLADE [1] and have several properties that are attractive for our purpose. They enable retrieval via inverted indices, for which Elasticsearch has a high-quality implementation in Lucene. They provide tuneable parameters which allow one to trade off accuracy against index size and query latency. They enable word-level highlighting to explain matches. Finally, they perform well in zero-shot settings.
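
To make the inverted-index point concrete, here is a minimal sketch in Python. It assumes a learned sparse model has already mapped each document and query to a dictionary of token weights. The weights, the function names (`build_inverted_index`, `search`) and the `min_weight` pruning threshold are hypothetical illustrations, not the output of SPLADE and not part of any Elasticsearch or Lucene API.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors, min_weight=0.0):
    """Map each token to a postings list of (doc_id, weight) pairs.

    Weights at or below min_weight are dropped. Raising this (assumed)
    threshold shrinks the index, typically at some cost in accuracy.
    """
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for token, weight in vector.items():
            if weight > min_weight:
                index[token].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=10):
    """Rank documents by the dot product of query and document token weights.

    Per-token contributions are returned alongside each score, since they
    are what makes word-level explanations of a match possible.
    """
    scores = defaultdict(float)
    contributions = defaultdict(dict)
    for token, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
            contributions[doc_id][token] = q_weight * d_weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score, contributions[doc_id]) for doc_id, score in ranked[:top_k]]


# Hypothetical token weights standing in for model output.
docs = {
    "d1": {"sparse": 1.5, "retrieval": 1.0, "index": 0.5},
    "d2": {"dense": 1.5, "retrieval": 1.0, "vector": 0.25},
}
query = {"sparse": 1.0, "retrieval": 0.5}

index = build_inverted_index(docs, min_weight=0.3)
results = search(index, query)
```

Only tokens that appear in the query are looked up, so the cost of a query depends on the length of the matching postings lists rather than on the size of the corpus. Pruning low weights shortens those lists, which is one simple way the accuracy versus index size trade-off can show up in practice.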

In this talk we survey LSMs and discuss how they fit into the IR landscape. We describe some of the challenges in training language models effectively. We briefly survey techniques which have been studied previously and found to improve performance both in and out of domain. These include downstream-task-aware pre-training and knowledge distillation. Finally, we give an overview of the key ingredients of our full training pipeline and the useful lessons we learned along the way.
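
As an illustration of knowledge distillation for retrieval, the sketch below implements a margin-based mean squared error loss in plain Python. This is one common formulation from the literature and is an assumption here; the abstract does not say which distillation objective the training pipeline uses. The function name `margin_mse_loss` and the example scores are hypothetical.

```python
def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of relevance scores, one per training triple
    (query, positive document, negative document). The student is trained
    to reproduce the teacher's margin between positive and negative
    documents rather than the teacher's absolute scores.
    """
    lengths = {len(student_pos), len(student_neg), len(teacher_pos), len(teacher_neg)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("score lists must be non-empty and of equal length")
    squared_errors = [
        ((s_pos - s_neg) - (t_pos - t_neg)) ** 2
        for s_pos, s_neg, t_pos, t_neg in zip(
            student_pos, student_neg, teacher_pos, teacher_neg
        )
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for two training triples.
loss = margin_mse_loss(
    student_pos=[2.0, 1.0],
    student_neg=[1.0, 1.5],
    teacher_pos=[4.0, 3.0],
    teacher_neg=[1.0, 2.0],
)
```

Matching margins rather than raw scores means the student and teacher do not need to share a score scale, which is convenient when the teacher is a different kind of model, such as a cross-encoder.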

Our goal was to consistently improve on BM25 relevance in a zero-shot setting. In particular, we set out to beat BM25 across the suite of diverse IR tasks gathered together in the BEIR benchmark [2] without using any in-domain supervision. We survey other published results on this benchmark and discuss how ours compare.

[1] SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, Formal et al.

[2] BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Thakur et al.

Tom Veasey has worked at Elastic since September 2016. He is a member of the machine learning team. He started out as a data scientist working on satellite control, phased array radar and drug discovery projects. He then had detours into Electronic Design Automation and FX derivatives pricing. He studied Physics at the University of Cambridge.

Throughout my career, I have worked on diverse subjects such as magnetic resonance imaging, infra-red sensor characterization, and predicting the carbon footprint of buildings using machine learning. I have been working on natural language processing for three years and I joined Elastic nine months ago.