The Devil is in the Detail: A Magnifying Glass for the GuessWhich Visual Dialogue Game / Testoni, Alberto; Shekhar, Ravi; Fernández, Raquel; Bernardi, Raffaella. - ELECTRONIC. - (2019), pp. 15-24. (Paper presented at SemDial 2019, held in London, 4th-6th September 2019.)
The Devil is in the Detail: A Magnifying Glass for the GuessWhich Visual Dialogue Game
Testoni, Alberto; Shekhar, Ravi; Bernardi, Raffaella
2019-01-01
Abstract
Grounded conversational agents are a fascinating research line on which important progress has been made lately, thanks to the development of neural network models and to the release of visual dialogue datasets. The latter have been used to set up visual dialogue games, which are an interesting test bed to evaluate conversational agents. Researchers' attention is on building models of increasing complexity, trained with computationally costly machine learning paradigms that lead to higher task-success scores. In this paper, we take a step back: we use a rather simple neural network architecture and we scrutinize the GuessWhich task, the dataset, and the quality of the generated dialogues. We show that our simple Questioner agent reaches state-of-the-art performance, that the evaluation metric commonly used is too coarse to compare different models, and that high task success does not correspond to high quality of the dialogues. Our work shows the importance of running detailed analyses of the results to spot possible weaknesses of the models, rather than aiming to outperform state-of-the-art scores.
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Testoni_semdial_0005.pdf | Open access | Publisher's version (Publisher's layout) | All rights reserved | 965.44 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.