A General Framework for Exploiting Background Knowledge in Natural Language Processing

Tymoshenko, Kateryna

doi:10.15168/11572_368094

The two key aspects of natural language processing (NLP) applications based on machine learning techniques are the learning algorithm and the feature representation of the documents, entities, or words to be processed. Until now, most approaches have exploited syntactic features, while semantic feature extraction has suffered from the low coverage of the available knowledge resources and the difficulty of matching text to ontology elements. The Semantic Web has now made available a large amount of logically encoded world knowledge, called Linked Open Data (LOD). However, extending state-of-the-art natural language applications to use LOD resources is not a trivial task for a number of reasons, including the ambiguity of natural language and the heterogeneity and ambiguity of the schemas adopted by different LOD resources. In this thesis we define a general framework for supporting NLP with semantic features extracted from LOD. The main idea behind the framework is to (i) map terms in text to the uniform resource identifiers (URIs) of LOD concepts through Wikipedia mediation; (ii) use the URIs to obtain background knowledge from LOD; and (iii) integrate the obtained knowledge as semantic features into machine learning algorithms. We evaluate the framework by means of case studies on coreference resolution and relation extraction. Additionally, we propose an approach for increasing the accuracy of the mapping step based on the "one sense per discourse" hypothesis. Finally, we present an open-source Java tool for extracting LOD knowledge through SPARQL endpoints and converting it to NLP features.
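
The three steps (i)-(iii) described in the abstract can be illustrated with a minimal Python sketch. This is not the Java tool presented in the thesis. It assumes DBpedia as the LOD resource (the abstract does not name one), assumes that a term has already been linked to a Wikipedia page title, and restricts the background knowledge to `rdf:type` statements. The function names `wikipedia_title_to_uri`, `build_type_query`, and `types_to_features` are hypothetical and chosen only for illustration. The sketch builds the SPARQL query text but does not contact an endpoint.

```python
from urllib.parse import quote

# Assumption: DBpedia is the LOD resource; its resource URIs are derived
# from Wikipedia page titles.
DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_title_to_uri(title: str) -> str:
    """Step (i): map an already-disambiguated Wikipedia title to a URI."""
    normalized = title.strip().replace(" ", "_")
    return DBPEDIA_RESOURCE_PREFIX + quote(normalized, safe="_(),'")


def build_type_query(uri: str) -> str:
    """Step (ii): build a SPARQL query for the rdf:type values of a URI."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        f"SELECT ?type WHERE {{ <{uri}> rdf:type ?type }}"
    )


def types_to_features(type_uris):
    """Step (iii): turn type URIs into binary features keyed by local name."""
    features = {}
    for type_uri in type_uris:
        # Keep only the part after the last "/" or "#" as the feature value.
        local_name = type_uri.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
        features["lod_type=" + local_name] = 1
    return features


if __name__ == "__main__":
    uri = wikipedia_title_to_uri("Barack Obama")
    print(build_type_query(uri))
    # The type list below stands in for the rows an endpoint would return.
    print(types_to_features(["http://dbpedia.org/ontology/Person"]))
```

In a full pipeline, the query string would be sent to a SPARQL endpoint and the returned type URIs passed to `types_to_features`; the resulting dictionary could then be merged into the feature vector of a coreference or relation extraction classifier.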

A General Framework for Exploiting Background Knowledge in Natural Language Processing / Tymoshenko, Kateryna. - (2012), pp. 1-151. [10.15168/11572_368094]