Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

Francesca Lagioia; Andrea Passerini; Burcu Sayin
2025-01-01

Abstract

Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
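The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes fixed-size character chunking, a document summary supplied by the caller (the abstract does not say here how the summary is produced), and a mismatch score computed as the share of retrieved chunks that come from a document other than the expected one. The names `Chunk`, `chunk_text`, `summary_augmented_chunks`, and `document_level_mismatch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # identifier of the source document
    text: str    # text that would be embedded and indexed


def chunk_text(text: str, size: int) -> list[str]:
    """Split text into consecutive character windows of at most `size`."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def summary_augmented_chunks(
    doc_id: str, text: str, summary: str, size: int
) -> list[Chunk]:
    """Prepend the same document-level summary to every chunk of a document.

    Assumption: the summary is simply concatenated in front of the chunk text.
    """
    return [
        Chunk(doc_id, f"{summary}\n\n{piece}")
        for piece in chunk_text(text, size)
    ]


def document_level_mismatch(retrieved: list[Chunk], gold_doc_id: str) -> float:
    """Fraction of retrieved chunks whose source is not the expected document.

    Assumption: this ratio stands in for the DRM measure named in the abstract.
    """
    if not retrieved:
        return 0.0
    wrong = sum(1 for chunk in retrieved if chunk.doc_id != gold_doc_id)
    return wrong / len(retrieved)


if __name__ == "__main__":
    chunks = summary_augmented_chunks(
        doc_id="contract_a",
        text="abcdefghij",
        summary="Summary of contract A.",
        size=4,
    )
    for chunk in chunks:
        print(repr(chunk.text))

    retrieved = [
        Chunk("contract_a", "x"),
        Chunk("contract_b", "y"),
        Chunk("contract_a", "z"),
        Chunk("contract_c", "w"),
    ]
    print(document_level_mismatch(retrieved, gold_doc_id="contract_a"))
```

Because every chunk carries the same summary, chunks from structurally similar documents become easier to tell apart at retrieval time, which is the intuition the abstract gives for the method.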
2025
Proceedings of the Natural Legal Language Processing Workshop 2025
Stroudsburg, PA, USA
Association for Computational Linguistics (ACL)
Reuter, Markus; Lingenberg, Tobias; Liepiņa, Rūta; Lagioia, Francesca; Lippi, Marco; Sartor, Giovanni; Passerini, Andrea; Sayin, Burcu
Towards Reliable Retrieval in RAG Systems for Large Legal Datasets / Reuter, Markus; Lingenberg, Tobias; Liepiņa, Rūta; Lagioia, Francesca; Lippi, Marco; Sartor, Giovanni; Passerini, Andrea; Sayin, Burcu. - (2025). ( Natural Legal Language Processing Workshop 2025 (NLLP 2025) Suzhou, China 8th November 2025) [10.18653/v1/2025.nllp-1.3].
Files in this item:

File: 2510.06999v1.pdf
Access: open access
Description: Post-print manuscript
Type: Refereed post-print (refereed author's manuscript)
License: Creative Commons
Size: 625.49 kB
Format: Adobe PDF

File: 2025.nllp-1.3.pdf
Access: open access
Type: Publisher's version (publisher's layout)
License: Creative Commons
Size: 386.91 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/465791