
Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable?

Authors: Seyed Mahed Mousavi (first author); Gabriel Roccabruna; Michela Lorandi; Giuseppe Riccardi
Publication date: 2022-01-01

Abstract

Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent, and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal includes the task design, annotator recruitment, task execution, and annotation reporting. Such a protocol and process can be used as-is, as-a-whole, in-part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) for the complex task of personalized response generation. We invite the community to use this protocol - or its future community...
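The protocol itself is specified in the paper; purely as an illustration, the sketch below shows one way per-item human judgements for the two evaluated models could be recorded and aggregated during annotation reporting. All names here (the Judgement record, the "appropriateness" dimension, the 1-5 scale) are assumptions for illustration and are not taken from the released protocol.

```python
# Illustrative sketch only: recording and aggregating human judgements
# of generated responses. Field names and the rating dimension are
# hypothetical, not the authors' released annotation schema.
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean


@dataclass
class Judgement:
    dialogue_id: str      # dialogue context being evaluated
    model: str            # e.g. "GPT-2" or "T5" (conversationally fine-tuned)
    annotator_id: str     # anonymized annotator identifier
    appropriateness: int  # hypothetical 1-5 Likert rating


def mean_rating_per_model(judgements):
    """Average the (hypothetical) rating separately for each evaluated model."""
    by_model = defaultdict(list)
    for j in judgements:
        by_model[j.model].append(j.appropriateness)
    return {model: mean(scores) for model, scores in by_model.items()}


if __name__ == "__main__":
    sample = [
        Judgement("d1", "GPT-2", "a1", 4),
        Judgement("d1", "T5", "a1", 3),
        Judgement("d1", "GPT-2", "a2", 5),
        Judgement("d1", "T5", "a2", 4),
    ]
    print(mean_rating_per_model(sample))  # e.g. {'GPT-2': 4.5, 'T5': 3.5}
```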
Year: 2022
Published in: Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Publisher: Association for Computational Linguistics (ACL)
ISBN: 9781959429128
Authors: Mousavi, Seyed Mahed; Roccabruna, Gabriel; Lorandi, Michela; Caldarella, Simone; Riccardi, Giuseppe
Citation: Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable? / Mousavi, Seyed Mahed; Roccabruna, Gabriel; Lorandi, Michela; Caldarella, Simone; Riccardi, Giuseppe. - (2022), pp. 136-147. (Paper presented at the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, GEM 2022, held as part of EMNLP 2022 in Abu Dhabi on 5 December 2022).
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/372268
Warning

The data displayed here have not been validated by the university.

Citations
  • PMC: not available
  • Scopus: 5
  • Web of Science: not available
  • OpenAlex: not available