
Asking Strategic and Informative Questions in Visual Dialogue Games: Strengths and Weaknesses of Neural Generative Models / Testoni, Alberto. - (2023 Feb 24), pp. 1-160. [10.15168/11572_370672]

Asking Strategic and Informative Questions in Visual Dialogue Games: Strengths and Weaknesses of Neural Generative Models

Testoni, Alberto
2023-02-24

Abstract

Gathering information by asking questions about the surrounding world is a hallmark of human intelligence. Modelling this ability in Natural Language Generation systems is a central challenge for effective and reliable conversational agents. The evaluation of these systems plays a crucial role in understanding the strengths and weaknesses of current neural architectures. In the scientific community, there is an open debate about what makes generated dialogues sound natural and human-like, and no agreement on which measures to use to track progress.

In the first part of the thesis, after reviewing existing metrics, we aggregate different and complementary metrics that capture surface-level linguistic features into a single score. We take different referential tasks (both multimodal and language-only) as test beds and ask how the proposed metric relates to task success across the training epochs of computational models (Chapter 3). Building on these findings, on the one hand we present a method that intervenes on the training data to improve surface-level metrics, especially repetitions in the generated dialogues (Chapter 4). On the other hand, given the limitations of surface-level metrics in capturing phenomena that actually improve referential task success, we propose a complementary approach that evaluates computational models at a deeper level, capturing the interplay between the Encoder and Decoder components.

In the second part of the thesis, we take entity hallucinations in multimodal dialogue systems as a case study to investigate the relationship between Natural Language Generation and Natural Language Understanding at a more fine-grained level (Chapter 5). Our results reveal that these two components are profoundly interconnected and influence one another. We find that hallucinations create a detrimental cascade effect on subsequent dialogue turns and are more likely to appear after negative answers, corroborating previous evidence that current architectures struggle to handle negation properly.

Moving towards even deeper dialogue evaluation criteria, we then study the informativeness of the questions asked to solve referential tasks. Current decoding strategies generate text word by word, according to the probabilities of the underlying language models. We argue for the need to go beyond this paradigm and inject high-level reasoning skills at decoding time. Inspired by cognitive studies on the question-asking strategies of children and adults, we propose a beam search re-ranking technique that implements a confirmation-driven strategy across dialogue turns (sketched below), and we compare it against a wide variety of decoding strategies and hyperparameter configurations (Chapters 6 and 7). We show that our approach effectively improves task success and dialogue quality, both in terms of the surface-level metrics described in the first part of the thesis and in terms of more fine-grained features such as hallucinations. To strengthen these findings and rule out the possibility that the improvements are due to biases in the model, we propose an evaluation paradigm in which human annotators receive machine-generated dialogues and have to solve the referential task. This paradigm confirms the results obtained with computational models and demonstrates that machine-generated dialogues are indeed informative enough to solve the task.
In the last part of the thesis (Chapter 8), we broaden the perspective to what is still missing on the way to human-like dialogue systems. We present a large-scale study of the GuessWhat dataset of human-human conversations, which is usually exploited only to train computational models. We carry out a thorough evaluation of the question-asking strategies of human players in this problem-solving task to unveil the pragmatic phenomena that characterize their conversations. Our analyses reveal that humans are far from asking optimal questions; instead, their efficiency arises from learning to ask uninformative questions at the right moment in the dialogue, i.e., to establish common ground with the interlocutor at the start of the exchange and to ask for confirmation of their own hypotheses before deciding to end the dialogue and select the target. We believe that modelling these peculiar yet effective features of human conversations is an essential step toward building competent dialogue systems that meet users' expectations and display human-like traits.
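The confirmation-driven re-ranking mentioned in the abstract can be pictured, at a very high level, as adding a bonus to beam-search candidates that ask for confirmation of the model's current best hypothesis before selecting the question to utter. The sketch below is a minimal illustration under that assumption; all function names and the specific bonus criterion are hypothetical and are not taken from the thesis implementation.

```python
# Illustrative sketch only: re-ranking beam-search candidates with a
# confirmation-driven bonus. Names and the bonus criterion are hypothetical.

from typing import Callable, List, Tuple


def rerank_candidates(
    candidates: List[Tuple[str, float]],
    confirmation_bonus: Callable[[str], float],
) -> str:
    """Pick the question to ask at this turn.

    candidates: (question, log-probability) pairs produced by beam search.
    confirmation_bonus: returns a positive bonus for questions that ask for
    confirmation of the current best hypothesis, 0 otherwise.
    """
    best_question, _ = max(
        candidates,
        key=lambda pair: pair[1] + confirmation_bonus(pair[0]),
    )
    return best_question


def make_bonus(current_guess: str, weight: float = 2.0) -> Callable[[str], float]:
    # Toy bonus: reward questions that mention the entity the model
    # currently believes is the target.
    return lambda question: weight if current_guess in question.lower() else 0.0


if __name__ == "__main__":
    beam_output = [
        ("is it a person?", -2.1),
        ("is it the dog on the left?", -2.6),
        ("is it red?", -2.4),
    ]
    bonus = make_bonus(current_guess="dog")
    print(rerank_candidates(beam_output, bonus))  # prefers the confirmation question
```

In this toy run, the confirmation question is chosen even though standard beam search would rank it lower by log-probability alone, which is the intuition behind injecting a turn-level strategy at decoding time.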
Defence date: 24 February 2023
Cycle: XXXV
Academic year: 2021-2022
Department: Information Engineering and Computer Science
Doctoral programme: Information and Communication Technology
Supervisor: Bernardi, Raffaella
Language: English
Files in this record:
File: thesis.pdf (Adobe PDF, 9.57 MB), under embargo until 23 February 2025
Type: Doctoral Thesis
Licence: All rights reserved


Use this identifier to cite or link to this document: https://hdl.handle.net/11572/370672