MLSUM: The multilingual summarization corpus

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English news articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These analyses highlight existing biases, which motivate the use of a multilingual dataset.

MLSUM: The multilingual summarization corpus / Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo. - (2020), pp. 8051-8067. (Paper presented at the EMNLP 2020 conference, held as a virtual conference on 16th–20th November 2020).

Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo

2020-01-01


Date of publication: 2020
Proceedings title: 2020 Conference on Empirical Methods in Natural Language Processing: Proceedings of the Conference
Place of publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics (ACL)
ISBN: 978-1-952148-60-6
Scopus identifier: 2-s2.0-85106975936
WOS identifier: WOS:000855160708022
All authors: Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo
Citation: MLSUM: The multilingual summarization corpus / Scialom, Thomas; Dray, Paul-Alexis; Lamprier, Sylvain; Piwowarski, Benjamin; Staiano, Jacopo. - (2020), pp. 8051-8067. (Paper presented at the EMNLP 2020 conference, held as a virtual conference on 16th–20th November 2020).
Appears in types: 04.1 Paper in Proceedings

Files in this item:

File	Size	Format
2020.emnlp-main.647.pdf (Access: repository managers only; Type: Publisher's layout; License: All rights reserved)	480.35 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/363004
