Annotating a broad range of anaphoric phenomena, in a variety of genres: The ARRAU Corpus / Uryupina, O.; Artstein, R.; Bristot, A.; Cavicchio, F.; Delogu, F.; Rodriguez, K. J.; Poesio, M. - In: NATURAL LANGUAGE ENGINEERING. - ISSN 1351-3249. - ELECTRONIC. - 26:1(2019), pp. 95-128. [10.1017/S1351324919000056]

Annotating a broad range of anaphoric phenomena, in a variety of genres: The ARRAU Corpus

Uryupina O.; Artstein R.; Bristot A.; Cavicchio F.; Delogu F.; Rodriguez K. J.; Poesio M.
2019

Abstract

This paper presents the second release of ARRAU, a multigenre corpus of anaphoric information created over 10 years to provide data for the next generation of coreference/anaphora resolution systems that combine different types of linguistic and world knowledge with advanced discourse modeling supporting rich linguistic annotations. The distinguishing features of ARRAU include the following: treating all NPs as markables, including non-referring NPs, and annotating their (non-)referentiality status; distinguishing between several categories of non-referentiality and annotating non-anaphoric mentions; thorough annotation of markable boundaries (minimal/maximal spans, discontinuous markables); annotating a variety of mention attributes, ranging from morphosyntactic parameters to semantic category; annotating the genericity status of mentions; annotating a wide range of anaphoric relations, including bridging relations and discourse deixis; and, finally, annotating anaphoric ambiguity. The current version of the dataset contains 350K tokens and is publicly available from LDC. In this paper, we discuss in detail all the distinguishing features of the corpus, which have so far been only partially presented in a number of conference and workshop papers, and we also discuss the development between the first release of ARRAU in 2008 and this second one.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11572/295823