2025-06-17, Maschinenhaus
How do different document parsing and chunking strategies impact RAG pipeline performance? Using real-life documents and LLM-generated question/answer pairs, we assess multiple methods – both open-source and commercial – showing that parsing quality significantly affects response accuracy and that the best approach may depend on the question type.
Retrieval-Augmented Generation (RAG) pipelines have proven effective for exploring complex documents. However, their performance hinges on the quality of the retrieved context, which depends on well-structured document inputs. Real-world documents often contain unstructured elements (images, tables, multi-column text, etc.), making parsing and chunking a critical challenge. Poor document processing can degrade retrieval quality, increasing the risk of hallucinations in LLM responses.
In our talk, we will report on the results of a study conducted to evaluate different PDF parsing and document chunking strategies – spanning both open-source and commercial-grade solutions – to determine their impact on RAG performance. Using a dataset of complex documents and LLM-generated question/answer pairs, we apply several evaluation metrics to quantify how different parsing techniques affect the relevance of retrieved information and response accuracy. Our findings reveal that parsing and chunking strategies significantly shape RAG output quality and that the most effective approach may depend on the nature of the queries. By highlighting the interplay between document processing and RAG performance, this study provides actionable insights for building more reliable knowledge retrieval systems.
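To make the idea of comparing chunking strategies concrete, here is a minimal sketch in Python. It is not the study's pipeline: the toy document, the question/answer pairs, the word-overlap retriever, and the "answer appears in the retrieved chunk" metric are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import re
from typing import List, Tuple

# Toy document and Q/A pairs (illustrative assumptions, not study data).
DOC = (
    "The warranty period for the pump is 24 months.\n\n"
    "The maximum operating pressure of the valve is 16 bar."
)
QA_PAIRS: List[Tuple[str, str]] = [
    ("What is the warranty period for the pump?", "24 months"),
    ("What is the maximum operating pressure of the valve?", "16 bar"),
]


def fixed_size_chunks(text: str, size: int = 40) -> List[str]:
    """Cut the text every `size` characters, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str) -> List[str]:
    """Split on blank lines so that each chunk is one paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokens(s: str) -> set:
    """Lowercased alphanumeric words of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(question: str, chunks: List[str]) -> str:
    """Toy retriever: return the chunk sharing the most words with the question."""
    q = tokens(question)
    return max(chunks, key=lambda c: len(q & tokens(c)))


def hit_rate(chunks: List[str], qa_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of questions whose retrieved chunk contains the reference answer."""
    hits = sum(
        1 for question, answer in qa_pairs
        if answer.lower() in retrieve(question, chunks).lower()
    )
    return hits / len(qa_pairs)


if __name__ == "__main__":
    print("fixed-size:", hit_rate(fixed_size_chunks(DOC, 40), QA_PAIRS))
    print("paragraph: ", hit_rate(paragraph_chunks(DOC), QA_PAIRS))
```

On this toy input, the 40-character cut splits the answer "24 months" across two chunks, so no single retrieved chunk can contain it, whereas paragraph-based chunks keep each fact intact. A real evaluation would presumably swap in an actual parser, an embedding-based retriever, and richer metrics.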
Search, Data Science, Stories
Level: Intermediate
I'm a software architect and developer. I started as a developer working on a wide range of technologies, spanning networks, automation, and the web, before becoming a technology advisor for knowledge management and document management projects. I'm approaching AI from a deterministic perspective.
I am a Data Scientist with a Master's degree in Computer Science. I combine academic rigour with hands-on industrial experience, using cutting-edge technologies at the intersection of research and practice.
My research focuses on the optimisation of black-box functions using advanced Bayesian methods. From an industrial perspective, I specialise in the development of versatile machine learning solutions, with a focus on foundation models and Large Language Models (LLMs, aka what's behind ChatGPT).
I am fluent in Italian and English and can converse with AI models.