MLSUM: The multilingual summarization corpus / Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo. - (2020), pp. 8051-8067. (Paper presented at the EMNLP 2020 conference, held as a virtual conference, 16th–20th November 2020).
MLSUM: The multilingual summarization corpus
Staiano, Jacopo (last author)
2020-01-01
Abstract
We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English news articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases that motivate the use of a multilingual dataset.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
2020.emnlp-main.647.pdf | Archive managers only | Publisher's version (Publisher's layout) | All rights reserved | 480.35 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.