Models and Application of Question Retrieval for Natural Language Processing / Campese, Stefano. - (2026 Apr 27), pp. 1-203.

Models and Application of Question Retrieval for Natural Language Processing

Campese, Stefano
2026-04-27

Abstract

This thesis investigates the role of question understanding in Question Answering systems, developing methods that exploit question semantic equivalence at progressively larger scales: from individual question pairs, through equivalence clusters, to entire datasets. The first part addresses question retrieval at scale. We introduce QUADRo, a retrieval framework operating over millions of question-answer pairs, and the Question Ranking Corpus (QRC), a large-scale resource with answer-aware annotations and challenging hard negatives. We demonstrate that incorporating answers during retrieval substantially improves accuracy, as answers serve as a semantic bridge between questions that share little lexical overlap but seek the same information. To reduce annotation costs, we develop Question Ranking Pre-training (QRP), a self-supervised method that learns question equivalence patterns without labeled data, achieving significant improvements while reducing model variance by over 50\%. The second part extends pairwise equivalence to question clusters. We analyze coherence in Large Language Models, finding that a substantial portion of question clusters exhibit incoherent behavior: models answer some phrasings correctly while failing on semantically equivalent alternatives. This reveals that understanding failures, not just knowledge gaps, limit LLM performance. We introduce Question-Augmented Generation (q-RAG), which supplements prompts with retrieved similar questions, improving accuracy by up to 9 percentage points and coherence by up to 28 points. We further show that q-RAG's benefits can be distilled into model parameters through Direct Preference Optimization (DPO) and Supervised Fine-Tuning, producing standalone models with improved coherence that surpass the inference-time approach. For retrieval systems, we apply clusters to train models for consistency: the Coherence Ranking Loss improves ranking coherence by up to 30\% while simultaneously improving relevance. 
The third part lifts equivalence to the dataset level. We introduce dataset declassification, a framework that replaces proprietary questions with semantically equivalent public alternatives, enabling dataset sharing without exposing sensitive content. Models trained on fully declassified data match baseline performance (WikiQA $\Delta \approx 0$, TrecQA $|\Delta| \leq 1.2$ points), and test set declassification preserves evaluation validity when high-quality mappings exist ($|\Delta| \leq 2$ on standard benchmarks), enabling the release of ``shadow benchmarks'' for evaluation integrity. We identify boundary conditions through experiments on adversarially constructed benchmarks. Together, these contributions show that question semantic equivalence, systematically exploited at multiple scales, enables substantial improvements to QA system accuracy, consistency, and evaluation integrity.
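As an illustration of the cluster-level notion of incoherence described in the abstract (some phrasings in an equivalence cluster answered correctly, others not), the following is a minimal sketch. It is not the thesis's evaluation code: the function name `cluster_coherence`, the input format, and the choice to report the fraction of mixed-outcome clusters are all assumptions made for this example.

```python
from collections import defaultdict


def cluster_coherence(results):
    """Summarise per-phrasing correctness over equivalence clusters.

    `results` is an iterable of (cluster_id, is_correct) pairs, one pair per
    question phrasing (hypothetical input format, assumed for this sketch).
    Returns (accuracy, incoherent_fraction), where a cluster counts as
    incoherent when it mixes correct and incorrect answers.
    """
    by_cluster = defaultdict(list)
    for cluster_id, is_correct in results:
        by_cluster[cluster_id].append(bool(is_correct))

    total = sum(len(outcomes) for outcomes in by_cluster.values())
    if total == 0:
        raise ValueError("no results given")

    # Plain accuracy over all phrasings, ignoring cluster structure.
    accuracy = sum(sum(outcomes) for outcomes in by_cluster.values()) / total

    # A cluster is incoherent if at least one phrasing is answered correctly
    # and at least one semantically equivalent phrasing is not.
    incoherent = sum(
        1 for outcomes in by_cluster.values()
        if any(outcomes) and not all(outcomes)
    )
    return accuracy, incoherent / len(by_cluster)


if __name__ == "__main__":
    toy = [
        ("c1", True), ("c1", True),    # coherent: all correct
        ("c2", True), ("c2", False),   # incoherent: mixed outcomes
        ("c3", False), ("c3", False),  # coherent: all wrong (a knowledge gap)
    ]
    print(cluster_coherence(toy))
```

In this toy input, cluster `c3` reflects a knowledge gap (every phrasing fails), whereas `c2` reflects the kind of understanding failure the abstract refers to: the model can produce the answer for one phrasing but not for an equivalent one.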
27-apr-2026
XXXVIII
2024-2025
Information Engineering and Computer Science (from 29/10/12)
Computer science and telecommunications (until academic year 2020-21, 36th cycle)
Moschitti, Alessandro
no
English
Files in this record:
_UNITN__Tesi_PHD_v2.pdf
Access: open access
Type: Doctoral Thesis
License: All rights reserved
Size: 1.83 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/483952