Fine-tuning with HED-IT: The impact of human post-editing for dialogical language models / Occhipinti, Daniela; Marchi, Michele; Mondella, Irene; Lai, Huiyuan; Dell’Orletta, Felice; Nissim, Malvina; Guerini, Marco. - ELECTRONIC. - (2024), pp. 11892-11907. (Paper presented at the ACL 2024 conference held in Bangkok, Thailand, on August 11–16, 2024) [10.18653/v1/2024.findings-acl.707].

Fine-tuning with HED-IT: The impact of human post-editing for dialogical language models

Daniela Occhipinti; Michele Marchi; Marco Guerini
2024

Abstract

Automatic methods for generating and gathering linguistic data have proven effective for fine-tuning Language Models (LMs) in languages less resourced than English. Still, while there has been emphasis on data quantity, less attention has been given to its quality. In this work, we investigate the impact of human intervention on machine-generated data when fine-tuning dialogical models. In particular, we study (1) whether post-edited dialogues exhibit higher perceived quality than the automatically generated originals; (2) whether fine-tuning with post-edited dialogues results in noticeable differences in the generated outputs; and (3) whether post-edited dialogues influence the outcomes across LMs of different parameter sizes. To this end, we created HED-IT, a large-scale dataset in which machine-generated dialogues are paired with their human post-edited versions. Using both the edited and unedited portions of HED-IT, we fine-tuned three different sizes of an LM. Results from both human and automatic evaluation show that the difference in training data quality is clearly perceived and also affects the models trained on such data. Additionally, our findings indicate that larger models are less sensitive to data quality, whereas it has a crucial impact on smaller models. These results enhance our understanding of the impact of human intervention on training data in the development of high-quality LMs.
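The experimental design described above amounts to fine-tuning the same base LM separately on the machine-generated and the post-edited portions of the dataset, holding everything else constant, and then comparing the two resulting checkpoints. The following is a minimal sketch of that setup, assuming a Hugging Face-style causal-LM pipeline; the base checkpoint, file names, and the "dialogue" field are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of the comparison described in the abstract: fine-tune the same base
# LM once on machine-generated dialogues and once on their post-edited
# counterparts, then compare the two checkpoints downstream.
# NOTE: file paths, the "dialogue" field, and the base checkpoint below are
# hypothetical; the paper's exact setup may differ.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "gpt2"  # stand-in; the paper fine-tunes three sizes of an LM


def fine_tune(train_file: str, output_dir: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Each JSONL line is assumed to hold one dialogue as plain text.
    dataset = load_dataset("json", data_files=train_file, split="train")
    dataset = dataset.map(
        lambda ex: tokenizer(ex["dialogue"], truncation=True, max_length=512),
        remove_columns=dataset.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=3),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model(output_dir)


# One run per data condition; everything else is held constant.
fine_tune("hedit_machine_generated.jsonl", "lm-unedited")
fine_tune("hedit_post_edited.jsonl", "lm-post-edited")
```

Keeping the base model, hyperparameters, and data size identical across the two runs is what allows any difference in output quality to be attributed to the human post-editing itself.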
2024
Findings of the Association for Computational Linguistics: ACL 2024
Bangkok, Thailand
Association for Computational Linguistics
979-8-89176-099-8
Occhipinti, Daniela; Marchi, Michele; Mondella, Irene; Lai, Huiyuan; Dell’Orletta, Felice; Nissim, Malvina; Guerini, Marco
Files in this product:
No files are associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/440071

Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a
  • OpenAlex: n/a