Natural language text is pervasive in structured data sets (relational database tables, spreadsheets, XML documents, RDF graphs, etc.), requiring data processing operations to possess some level of natural language understanding capability. This, in turn, involves dealing with aspects of diversity present in structured data, such as multilingualism or the coexistence of data from multiple domains. Word sense disambiguation (WSD) is an essential component of natural language understanding processes. State-of-the-art WSD techniques, however, were developed to operate on single languages and on corpora such as articles, newswire, web pages, forum posts, or tweets, which differ considerably from structured data sets. In this paper we present a WSD method that is designed for the short text typically present in structured data and is applicable to multiple languages and domains. Our proof-of-concept implementation reaches an all-words F-score between 60% and 80% on both English and Italian data. We consider these very promising first results, given the known difficulty of WSD and the particularity of the targeted corpora with respect to more conventional text.
Domain-Based Sense Disambiguation in Multilingual Structured Data / Bella, Gabor; Zamboni, Alessio; Giunchiglia, Fausto. - (2016).
Domain-Based Sense Disambiguation in Multilingual Structured Data
Bella, Gabor; Zamboni, Alessio; Giunchiglia, Fausto
2016-01-01