We set out to uncover the unique grammatical properties of an important yet so far under-researched type of natural language text: that of short labels typically found within structured datasets. We show that such labels obey a specific type of abbreviated grammar that we call the Language of Data, with properties significantly different from the kinds of text typically addressed in computational linguistics and NLP, such as ‘standard’ written language or social media messages. We analyse orthography, parts of speech, and syntax over a large, bilingual, hand-annotated corpus of data labels collected from a variety of domains. We perform experiments on tokenisation, part-of-speech tagging, and named entity recognition over real-world structured data, demonstrating that models adapted to the Language of Data outperform those trained on standard text. These observations point in a new direction to be explored as future research, in order to develop new NLP tools and models dedicated to the Language of Data.

Exploring the Language of Data / Bella, Gábor; Gremes, Linda; Giunchiglia, Fausto. - (2020), pp. 6638-6648. (Paper presented at the COLING 2020 conference, held in Barcelona (online), 8th-13th December 2020) [10.18653/v1/2020.coling-main.582].

Exploring the Language of Data

Bella, Gábor; Giunchiglia, Fausto
2020

The 28th International Conference on Computational Linguistics: Proceedings of the Conference
s.l.
International Committee on Computational Linguistics
978-1-952148-27-9
Bella, Gábor; Gremes, Linda; Giunchiglia, Fausto
Files in this item:
File: 2020.coling-main.582.pdf
Access: open access
Type: Publisher’s version (Publisher’s layout)
License: Creative Commons
Size: 417.27 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11572/313134