Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable? / Mousavi, Seyed Mahed; Roccabruna, Gabriel; Lorandi, Michela; Caldarella, Simone; Riccardi, Giuseppe. - (2022), pp. 136-147. (Paper presented at the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics, GEM 2022, held as part of EMNLP 2022, Abu Dhabi, 5 December 2022).
Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable?
Seyed Mahed Mousavi (first author); Gabriel Roccabruna; Michela Lorandi; Giuseppe Riccardi
2022-01-01
Abstract
Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols, and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent, and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal covers task design, annotator recruitment, task execution, and annotation reporting. Such a protocol and process can be used as-is, as a whole, in part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) for the complex task of personalized response generation. We invite the community to use this protocol - or its future community...
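
The abstract lists "annotation reporting" as one component of the shared protocol. As a purely illustrative sketch (not the paper's actual procedure), the fragment below shows one common way such a report might aggregate crowd judgments and state inter-annotator agreement; the metric choice (Fleiss' kappa), the Likert scale, and all variable names are assumptions introduced here for illustration.

```python
# Illustrative sketch only: aggregating human judgments of generated responses
# and reporting inter-annotator agreement in a replicable way.
# The metric (Fleiss' kappa) and the 1-5 Likert scale are assumptions,
# not the protocol described in the paper.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = generated responses, columns = annotators,
# values = appropriateness scores on a 1-5 Likert scale.
ratings = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
])

# Convert the item-by-rater matrix into an item-by-category count table.
table, categories = aggregate_raters(ratings)

# Report agreement alongside per-item mean scores, so the evaluation
# can be compared and replicated across studies.
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
print(f"Mean rating per item: {ratings.mean(axis=1)}")
```

In practice, a shareable report of this kind would also state the number of annotators, their recruitment criteria, and the exact instructions shown to them, so that the numbers above can be interpreted and reproduced.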



