Models and Application of Question Retrieval for Natural Language Processing

Campese, Stefano

This thesis investigates the role of question understanding in Question Answering systems, developing methods that exploit question semantic equivalence at progressively larger scales: from individual question pairs, through equivalence clusters, to entire datasets. The first part addresses question retrieval at scale. We introduce QUADRo, a retrieval framework operating over millions of question-answer pairs, and the Question Ranking Corpus (QRC), a large-scale resource with answer-aware annotations and challenging hard negatives. We demonstrate that incorporating answers during retrieval substantially improves accuracy, as answers serve as a semantic bridge between questions that share little lexical overlap but seek the same information. To reduce annotation costs, we develop Question Ranking Pre-training (QRP), a self-supervised method that learns question equivalence patterns without labeled data, achieving significant improvements while reducing model variance by over 50\%. The second part extends pairwise equivalence to question clusters. We analyze coherence in Large Language Models, finding that a substantial portion of question clusters exhibit incoherent behavior: models answer some phrasings correctly while failing on semantically equivalent alternatives. This reveals that understanding failures, not just knowledge gaps, limit LLM performance. We introduce Question-Augmented Generation (q-RAG), which supplements prompts with retrieved similar questions, improving accuracy by up to 9 percentage points and coherence by up to 28 points. We further show that q-RAG's benefits can be distilled into model parameters through Direct Preference Optimization (DPO) and Supervised Fine-Tuning, producing standalone models with improved coherence that surpass the inference-time approach. For retrieval systems, we apply clusters to train models for consistency: the Coherence Ranking Loss improves ranking coherence by up to 30\% while simultaneously improving relevance. 
The third part lifts equivalence to the dataset level. We introduce dataset declassification, a framework that replaces proprietary questions with semantically equivalent public alternatives, enabling dataset sharing without exposing sensitive content. Models trained on fully declassified data match baseline performance (WikiQA $\Delta \approx 0$, TrecQA $|\Delta| \leq 1.2$ points), and test set declassification preserves evaluation validity when high-quality mappings exist ($|\Delta| \leq 2$ on standard benchmarks), enabling the release of ``shadow benchmarks'' for evaluation integrity. We identify boundary conditions through experiments on adversarially constructed benchmarks. Together, these contributions show that question semantic equivalence, systematically exploited at multiple scales, enables substantial improvements to QA system accuracy, consistency, and evaluation integrity.

Models and Application of Question Retrieval for Natural Language Processing / Campese, Stefano. - (2026 Apr 27), pp. 1-203.

2026-04-27


Defended on: 27 Apr 2026
Cycle: XXXVIII
Academic year: 2024-2025
Department: Ingegneria e scienza dell'Informaz (29/10/12-)
Doctoral programme: Informatica e telecomunicazioni (until academic year 2020-21, 36th cycle)
Unitn internal supervisor: Moschitti, Alessandro
Bi-nationally supervised doctoral thesis: no
Language: English
Appears in categories: 08.1 Doctoral Thesis

Files in this item:

File	Access	Type	License	Size	Format
_UNITN__Tesi_PHD_v2.pdf	Open access	Doctoral Thesis	All rights reserved	1.83 MB	Adobe PDF