LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization / Nguyen, L.; Scialom, T.; Piwowarski, B.; Staiano, J. - (2023), pp. 636-651. (Paper presented at the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, held in [Dubrovnik] in 2023) [10.18653/v1/2023.eacl-main.46].
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization
Staiano J. (last author)
2023-01-01
Abstract
Text summarization is a popular task and an active area of research in the Natural Language Processing community. It requires accounting for long input texts, a characteristic which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually rich layouts. Layout information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet all publicly available summarization datasets provide only plain-text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with visual/layout information and propose four novel datasets, consistently built from scholarly resources, covering French, Spanish, Portuguese, and Korean. Further, we propose new baselines merging layout-aware and long-range models, two orthogonal approaches, and obtain state-of-the-art results, showing the importance of combining both lines of research.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| 2023.eacl-main.46.pdf | Open access | Publisher's version (publisher's layout) | Creative Commons | 1.44 MB | Adobe PDF |