We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, Spanish, Russian, and Turkish. Together with English news articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.

MLSUM: The multilingual summarization corpus / Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo. - (2020), pp. 8051-8067. (Paper presented at the EMNLP 2020 conference, held as a virtual conference on 16th–20th November 2020).

MLSUM: The multilingual summarization corpus

Staiano, Jacopo
Last author
2020-01-01

2020
2020 Conference on Empirical Methods in Natural Language Processing: Proceedings of the Conference
Stroudsburg, PA
Association for Computational Linguistics (ACL)
978-1-952148-60-6
Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo
Files in this item:
2020.emnlp-main.647.pdf
Access: archive managers only
Type: Publisher's version (publisher's layout)
License: All rights reserved
Size: 480.35 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/363004
Citations
  • PubMed Central: not available
  • Scopus: 97
  • Web of Science (ISI): 51
  • OpenAlex: not available