Berlin Buzzwords 2025

Visual Literacy: Complex Document Retrieval with VLMs
2025-06-17, Palais Atelier

Traditional document retrieval systems struggle with visually rich documents because they discard visual elements during text extraction. This talk shows how vision language models (VLMs) can address these limitations and presents a new benchmark for evaluating document retrieval systems across languages, domains, and document types.


The field of document retrieval has traditionally relied on text-based approaches, which serve well for simple text documents but show significant limitations on visually complex ones. Many real-world documents contain crucial information embedded in diagrams, charts, plots, tables, and intricate layouts that conventional systems fail to process properly. As a result, a query whose answer appears only in visual elements (for example, "How much did the average temperature in Germany increase from 1990 to 2025?") will fail to retrieve the relevant documents, even when those documents contain plots or charts with the exact answer.
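
To make this failure mode concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than part of the talk: the two pages, the field names `extracted_text` and `chart_content`, and the token-overlap scorer are all hypothetical. In particular, `chart_content` is only a textual stand-in for information that a reader (or a VLM) would see in the figure; a real VLM-based retriever would work on the page image rather than on a text field.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical pages. `extracted_text` is what a text-only pipeline keeps;
# `chart_content` stands in for information visible only in a figure.
pages = [
    {
        "id": "climate_report_p3",
        "extracted_text": "Figure 2 shows the trend described in the previous section.",
        "chart_content": "Average temperature in Germany, 1990 to 2025",
    },
    {
        "id": "travel_guide_p1",
        "extracted_text": "Germany is a popular destination with mild temperature in spring.",
        "chart_content": "",
    },
]

def overlap_score(query, text):
    """Count the query tokens that also occur in the text."""
    return len(tokenize(query) & tokenize(text))

def rank(query, pages, fields):
    """Return page ids ordered by token overlap over the chosen fields, best first."""
    scored = [
        (overlap_score(query, " ".join(page[f] for f in fields)), page["id"])
        for page in pages
    ]
    return [page_id for _, page_id in sorted(scored, key=lambda item: -item[0])]

query = "How much did the average temperature in Germany increase from 1990 to 2025?"

# Text-only view: the chart's information was dropped during extraction.
text_only = rank(query, pages, ["extracted_text"])
# View that also "sees" the visual content (simulated here as text).
with_visual = rank(query, pages, ["extracted_text", "chart_content"])

print("text only:  ", text_only)
print("with visual:", with_visual)
```

In this toy setup, the text-only ranking puts the travel guide first, because the climate report's extracted text never mentions the terms that appear only in its chart. Once the visual content is taken into account, the climate report ranks first.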

Vision Language Models (VLMs) offer a new way to approach document retrieval. By processing both text and visual elements together, these models can better understand documents as a whole, seeing how text works together with graphics and layout. This is especially useful for technical documents, research papers, financial reports, and educational materials where images and diagrams are key to understanding the content.

In this talk, we will explore how VLMs can be effectively applied to document retrieval tasks. We'll explain how to fine-tune these models for handling complex documents, including important considerations for data preparation, model architecture choices, and training strategies. We'll also present a new benchmark for testing document retrieval systems across different languages, domains, and document types. This benchmark provides a framework for comparing traditional and VLM-based retrieval systems, enabling practitioners to make informed decisions for their specific use cases.
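
As a rough illustration of what such a comparison can look like, the sketch below scores two retrieval systems on the same queries with recall@k and nDCG@k, two commonly used retrieval metrics. This is an assumption for illustration only: the abstract does not state which metrics the benchmark uses, and the system names, query ids, document ids, and relevance judgments below are all made up.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(runs, qrels, k):
    """Average both metrics over all queries, separately for each system."""
    results = {}
    for system, rankings in runs.items():
        recalls = [recall_at_k(rankings[q], qrels[q], k) for q in qrels]
        ndcgs = [ndcg_at_k(rankings[q], qrels[q], k) for q in qrels]
        results[system] = {
            "recall": sum(recalls) / len(recalls),
            "ndcg": sum(ndcgs) / len(ndcgs),
        }
    return results

# Hypothetical relevance judgments: query id -> set of relevant document ids.
qrels = {"q1": {"d1"}, "q2": {"d4", "d5"}}

# Hypothetical ranked outputs of two systems for the same queries.
runs = {
    "text_baseline": {"q1": ["d2", "d3", "d1"], "q2": ["d4", "d6", "d7"]},
    "vlm_retriever": {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4", "d6"]},
}

results = evaluate(runs, qrels, k=2)
for system, scores in results.items():
    print(f"{system}: recall@2={scores['recall']:.3f}, nDCG@2={scores['ndcg']:.3f}")
```

Because every system is scored on the same queries and judgments, the averaged numbers are directly comparable. In practice one would also break results down by language, domain, and document type, since an overall average can hide large differences between those slices.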


Tags:

Search, Data Science

Level:

Intermediate

Isabelle is a Machine Learning Engineer at Jina AI, where she develops and trains embedding models, working closely with her team to push the boundaries of what’s possible. Passionate about knowledge sharing, she regularly gives talks on machine learning and NLP, inspiring and connecting with others in the field.

Saba is an ML Research Engineer in the Model Training team at Jina AI, where he develops state-of-the-art text and multimodal embedding models, focusing on enhancing search capabilities.