Automatically Predicting User Ratings for Conversational Systems / Cervone, Alessandra; Gambi, Enrico; Tortoreto, Giuliano; Stepanov, Evgeny A.; Riccardi, Giuseppe. - ELECTRONIC. - 2253:(2018), pp. 99-104. (Paper presented at the CLiC-it conference, held in Torino, 10th-12th December 2018.)

Automatically Predicting User Ratings for Conversational Systems

Alessandra Cervone; Giuliano Tortoreto; Evgeny A. Stepanov; Giuseppe Riccardi
2018

Abstract

Automatic evaluation models for open-domain conversational agents either correlate poorly with human judgment or require expensive annotations on top of conversation scores. In this work we investigate the feasibility of learning evaluation models without relying on any further annotations besides conversation-level human ratings. We use a dataset of rated (1-5) open-domain spoken conversations between the conversational agent Roving Mind (competing in the Amazon Alexa Prize Challenge 2017) and Amazon Alexa users. First, we assess the complexity of the task by asking two experts to re-annotate a sample of the dataset, and observe that the subjectivity of user ratings yields a low upper bound. Second, through an analysis of the entire dataset, we show that automatically extracted features such as user sentiment, Dialogue Acts, and conversation length have significant but low correlations with user ratings. Finally, we report the results of our experiments exploring different combinations of these features to train automatic dialogue evaluation models. Our work suggests that predicting subjective user ratings in open-domain conversations is a challenging task.
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)
Torino
CEUR
Files in this product:

paper32.pdf
  Access: open access
  Type: Publisher's version (Publisher's layout)
  License: Creative Commons
  Size: 229.71 kB (Adobe PDF)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11572/221156
Citations
  • Scopus: 1