Asking Strategic and Informative Questions in Visual Dialogue Games: Strengths and Weaknesses of Neural Generative Models / Testoni, Alberto. - (2023 Feb 24), pp. 1-160. [10.15168/11572_370672]
Asking Strategic and Informative Questions in Visual Dialogue Games: Strengths and Weaknesses of Neural Generative Models
Testoni, Alberto
2023-02-24
Abstract
Gathering information by asking questions about the surrounding world is a hallmark of human intelligence. Modelling this ability in Natural Language Generation systems is a central challenge for building effective and reliable conversational agents. The evaluation of these systems plays a crucial role in understanding the strengths and weaknesses of current neural architectures. In the scientific community, there is an open debate about what makes generated dialogues sound natural and human-like, and no agreement on which measures to use to track progress. In the first part of the thesis, after reviewing existing metrics, we aggregate different and complementary metrics that capture surface-level linguistic features into a single score. We take different referential tasks (both multimodal and language-only) as a test-bed and investigate how the proposed metric relates to task success across the training epochs of computational models (Chapter 3). Based on our findings, on the one hand, we present a method that intervenes on the training data to improve surface-level metrics (Chapter 4), especially repetitions in the generated dialogues. On the other hand, given the limitations of surface-level metrics in capturing phenomena relevant to referential task success, we propose an approach that evaluates computational models at a deeper level, capturing the interplay between the Encoder and Decoder components.
In the second part of the thesis, we take entity hallucinations in multimodal dialogue systems as a case study to investigate the relationship between Natural Language Generation and Natural Language Understanding at a more fine-grained level (Chapter 5). Our results reveal that these two components are profoundly interconnected and influence one another. We find that hallucinations create a detrimental cascade effect across consecutive dialogue turns and are more likely to appear after negative answers, corroborating evidence from previous work on the inability of current architectures to handle negation properly. Our progressive advance towards deeper dialogue evaluation criteria leads us to study the informativeness of the questions asked to solve referential tasks. Current decoding strategies generate text word by word, according to the probabilities of the underlying language models. We argue for the need to go beyond this paradigm and inject high-level reasoning skills at decoding time. Inspired by cognitive studies on the question-asking strategies of children and adults, we propose a beam search re-ranking technique that implements a confirmation-driven strategy across dialogue turns, and we compare it against a wide variety of decoding strategies and hyperparameter configurations (Chapter 6 and Chapter 7). We show that our approach effectively improves task success and dialogue quality, both in terms of the surface-level metrics described in the first part of the thesis and of more fine-grained features such as hallucinations. To strengthen our findings and rule out the possibility that the improvements are due to biases in the model, we propose an evaluation paradigm in which human annotators receive machine-generated dialogues and have to solve the referential task. Overall, this paradigm confirms the results obtained with computational models and demonstrates that machine-generated dialogues are indeed informative enough to solve the task.
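For illustration only, the Python sketch below shows one possible shape of a confirmation-driven re-ranking of beam search candidates: each candidate question keeps its language-model score but receives a bonus when it seeks confirmation of the guesser's current best hypothesis. All names (`Candidate`, `confirmation_score`, `rerank`, `alpha`) and the keyword-overlap scoring are hypothetical simplifications for exposition, not the implementation described in the thesis.

```python
# Illustrative sketch (not the thesis code): prefer beam candidates that ask for
# confirmation of the current best hypothesis, while still respecting the
# language-model likelihood. All names and the scoring heuristic are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str        # candidate question produced by beam search
    log_prob: float  # cumulative log-probability under the language model


def confirmation_score(question: str, hypothesis_keywords: List[str]) -> float:
    """Toy proxy: fraction of hypothesis keywords mentioned in the question."""
    words = set(question.lower().split())
    if not hypothesis_keywords:
        return 0.0
    hits = sum(1 for kw in hypothesis_keywords if kw.lower() in words)
    return hits / len(hypothesis_keywords)


def rerank(candidates: List[Candidate], hypothesis_keywords: List[str],
           alpha: float = 0.5) -> List[Candidate]:
    """Blend language-model likelihood with a confirmation-driven bonus."""
    return sorted(
        candidates,
        key=lambda c: (1 - alpha) * c.log_prob
        + alpha * confirmation_score(c.text, hypothesis_keywords),
        reverse=True,
    )


if __name__ == "__main__":
    beams = [
        Candidate("is it a person?", log_prob=-2.1),
        Candidate("is it the red car on the left?", log_prob=-2.4),
    ]
    # Suppose the guesser's current best hypothesis is "the red car".
    for c in rerank(beams, ["red", "car"]):
        print(round(c.log_prob, 2), c.text)
```

The sketch only conveys the general idea of blending likelihood with a turn-level strategy signal; the confirmation criterion proposed in the thesis is defined over the dialogue history and the model's hypotheses rather than keyword overlap.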
In the last part of the thesis (Chapter 8), we broaden our perspective on what is still missing to achieve human-like dialogue systems. We present a large-scale study of the GuessWhat dataset of human-human conversations, usually exploited only to train computational models. Here, instead, we thoroughly evaluate the question-asking strategies of human players in this problem-solving task to unveil the pragmatic phenomena that characterize their conversations. Our analyses reveal that humans are far from asking optimal questions. Instead, their efficiency arises from learning to ask uninformative questions at the right moment in the dialogue: to establish common ground with the interlocutor at the start of the exchange, and to seek confirmation of their own hypotheses before deciding to end the dialogue and select the target. We believe that modelling these peculiar and effective features of human conversations in dialogue systems is an essential step toward building competent systems that meet users’ expectations and display human-like traits.
| File | Type | License | Size | Format |
|---|---|---|---|---|
| thesis.pdf (embargo until 23/02/2025) | Tesi di dottorato (Doctoral Thesis) | Tutti i diritti riservati (All rights reserved) | 9.57 MB | Adobe PDF |