Text summarization is a popular task and an active area of research in the Natural Language Processing community. It requires accounting for long input texts, a characteristic that poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually rich layouts. This layout information is highly relevant, whether to highlight salient content or to encode long-range interactions between textual passages. Yet all publicly available summarization datasets provide only plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with visual/layout information and propose four novel datasets, consistently built from scholarly resources, covering French, Spanish, Portuguese, and Korean. Further, we propose new baselines merging layout-aware and long-range models, two orthogonal approaches, and obtain state-of-the-art results, showing the importance of combining both lines of research.

LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization / Nguyen, L.; Scialom, T.; Piwowarski, B.; Staiano, J. - (2023), pp. 636-651. (Paper presented at the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, held in Dubrovnik in 2023).

LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization

Staiano J.
Last author
2023-01-01

2023
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Dubrovnik
Association for Computational Linguistics (ACL)
9781959429449
Nguyen, L.; Scialom, T.; Piwowarski, B.; Staiano, J.
Files in this record:
2023.eacl-main.46.pdf

Open access

Type: Published version (Publisher’s layout)
License: Creative Commons
Size: 1.44 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/411851
Citations
  • Scopus 2