Cervone, Alessandra; Gambi, Enrico; Tortoreto, Giuliano; Stepanov, Evgeny A.; Riccardi, Giuseppe (2018). Automatically Predicting User Ratings for Conversational Systems. Electronic publication, vol. 2253, pp. 99-104. Paper presented at the CLiC-it conference held in Torino, 10th-12th December 2018.
Automatically Predicting User Ratings for Conversational Systems
Alessandra Cervone; Giuliano Tortoreto; Evgeny A. Stepanov; Giuseppe Riccardi
2018-01-01
Abstract
Automatic evaluation models for open-domain conversational agents either correlate poorly with human judgment or require expensive annotations on top of conversation scores. In this work we investigate the feasibility of learning evaluation models without relying on any further annotations besides conversation-level human ratings. We use a dataset of rated (1-5) open-domain spoken conversations between the conversational agent Roving Mind (competing in the Amazon Alexa Prize Challenge 2017) and Amazon Alexa users. First, we assess the complexity of the task by asking two experts to re-annotate a sample of the dataset and observe that the subjectivity of user ratings yields a low upper bound. Second, through an analysis of the entire dataset we show that automatically extracted features such as user sentiment, Dialogue Acts and conversation length have a significant but low correlation with user ratings. Finally, we report the results of our experiments exploring different combinations of these features to train automatic dialogue evaluation models. Our work suggests that predicting subjective user ratings in open-domain conversations is a challenging task.
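To make the modelling setup described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of predicting a 1-5 conversation rating from conversation-level features such as length, mean user sentiment, and a Dialogue Act count. The data layout, feature choices, and the ridge-regression model are assumptions made purely for illustration; the evaluation metric (Pearson correlation with human ratings) mirrors the kind of correlation analysis the abstract refers to.

```python
# Illustrative sketch only: rating prediction from conversation-level features.
# The conversation dict layout ("user_turns", "sentiment", "dialogue_acts",
# "rating") and the ridge regressor are hypothetical choices for this example.
from typing import Dict, List

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def featurize(conv: Dict) -> List[float]:
    """Map one conversation to a small feature vector.
    'sentiment' and 'dialogue_acts' stand in for the outputs of whatever
    sentiment classifier and Dialogue Act tagger are available."""
    turns = conv["user_turns"]                            # list of user utterances
    length = float(len(turns))                            # conversation length
    mean_sentiment = float(np.mean(conv["sentiment"]))    # mean user sentiment, e.g. in [-1, 1]
    n_questions = float(conv["dialogue_acts"].count("question"))  # one DA count as an example
    return [length, mean_sentiment, n_questions]


def train_and_evaluate(conversations: List[Dict]) -> float:
    """Fit a regressor on conversation features and report the Pearson
    correlation between predicted and human ratings on a held-out split."""
    X = np.array([featurize(c) for c in conversations])
    y = np.array([c["rating"] for c in conversations])    # human ratings, 1-5
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    preds = np.clip(model.predict(X_te), 1.0, 5.0)        # keep predictions on the rating scale
    r, _ = pearsonr(y_te, preds)
    return r
```

Any feature set and regressor could be swapped in here; the point is only that conversation-level ratings serve directly as the supervision signal, with no additional annotation.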
File | Access | Type | License | Size | Format
---|---|---|---|---|---
paper32.pdf | Open access | Publisher's version (Publisher's layout) | Creative Commons | 229.71 kB | Adobe PDF