
End-to-End Relation Extraction via Syntactic Structures and Semantic Resources / Nguyen, Truc-Vien T.. - (2011), pp. 1-116.

End-to-End Relation Extraction via Syntactic Structures and Semantic Resources

Nguyen, Truc-Vien T.
2011-01-01

Abstract

Information Extraction (IE) aims at mapping texts into a fixed structure that represents their key information. A typical IE system tries to answer questions such as who appears in the text, what events happen, and when these events happen. The task enables significant advances in applications that require deep understanding capabilities, such as question-answering engines, dialogue systems, or the Semantic Web. Since developing extraction systems by hand demands a huge effort and a great deal of time from domain experts, our approach focuses on machine learning methods that can accurately infer an extraction model by training on a dataset. The goal of this research is to design and implement models with improved performance, either by learning combinations of different algorithms or by inventing novel structures able to exploit kinds of evidence that have not been explored in the literature.

A basic component of an IE system is named entity recognition (NER), whose purpose is to locate objects that can be referred to by names and that belong to a predefined set of categories. We approach this task by proposing a novel reranking framework that employs two learning phases to pick the best candidate. The task is cast as sequence labelling, with Conditional Random Fields (CRFs) selected as the baseline algorithm. Our research employs novel kernels based on structured and unstructured features to rerank the N-best hypotheses of the CRF baseline: the former features are generated by a polynomial kernel encoding entity features, whereas tree kernels are used to model dependencies amongst tagged candidate examples.

Relation Extraction (RE) is concerned with finding relationships between pairs of entities in texts. State-of-the-art relation extraction models are based on convolution kernels over the constituent parse tree. In our research, we employ dependency parses from dependency parsing in addition to phrase-structure parses from constituent parsing, and we define several variations of dependency parses to inject additional information into the trees. Additionally, we provide an extensive ablation over various types of kernels by combining tree, sequence, and polynomial kernels. These novel kernels are able to exploit learned correlations between phrase-structure parses and grammatical relations.
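As a rough illustration of the kernel combination sketched above, the following Python fragment (not taken from the thesis) shows how a Collins-and-Duffy-style subset-tree kernel over parse trees might be combined with a polynomial kernel over flat entity features; the tree encoding, the decay factor lam, the polynomial degree, and the mixing weight alpha are illustrative assumptions.

```python
# Minimal sketch of a composite kernel for kernel-based RE/reranking.
# Trees are nested tuples (label, [children]); leaves have no children.

def _nodes(tree):
    """Collect all internal (non-leaf) nodes of a tree."""
    label, children = tree
    out = [tree] if children else []
    for child in children:
        out.extend(_nodes(child))
    return out

def _production(node):
    """A node's production: its label plus the sequence of child labels."""
    label, children = node
    return (label, tuple(c[0] for c in children))

def tree_kernel(t1, t2, lam=0.4):
    """Count (decayed) common subset trees, in the style of Collins & Duffy."""
    memo = {}
    def delta(n1, n2):
        key = (id(n1), id(n2))
        if key in memo:
            return memo[key]
        if _production(n1) != _production(n2):
            result = 0.0
        elif all(not c[1] for c in n1[1]):      # pre-terminal node
            result = lam
        else:
            result = lam
            for c1, c2 in zip(n1[1], n2[1]):
                result *= 1.0 + (delta(c1, c2) if c1[1] and c2[1] else 0.0)
        memo[key] = result
        return result
    return sum(delta(a, b) for a in _nodes(t1) for b in _nodes(t2))

def poly_kernel(x, y, degree=3, c=1.0):
    """Polynomial kernel over flat feature vectors (lists of floats)."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

def composite_kernel(ex1, ex2, alpha=0.5):
    """Weighted combination of the structural and the flat-feature kernels."""
    (tree1, feats1), (tree2, feats2) = ex1, ex2
    return alpha * tree_kernel(tree1, tree2) + (1 - alpha) * poly_kernel(feats1, feats2)
```

Such a composite kernel could then be plugged into a kernel machine (e.g., an SVM used as a reranker or relation classifier) in place of a single kernel.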
A large amount of wide-coverage semantic knowledge exists today in large repositories of unstructured or semi-structured text documents, and the increased availability of online collaborative resources has attracted considerable attention in the Artificial Intelligence (AI) community. Nevertheless, the ability to extract this knowledge with statistical machine learning techniques is hindered by well-known problems such as heavy supervision and scalability. These drawbacks can be alleviated by applying a form of weak supervision, namely distant supervision (DS), to automatically derive explicit facts from the semi-structured part of Wikipedia. To learn relational facts from Wikipedia without any labeled example or hand-crafted pattern, we employ DS, where the relation providers are external repositories, e.g., YAGO (a huge semantic knowledge base), and the training instances are gathered from Freebase (a huge semantic database). This potentially yields larger training data and many more relations, defined in different sources. We apply state-of-the-art models for ACE RE, namely sentence-level RE (SLRE), to Wikipedia. Based on a mapping table of relations from YAGO to ACE (built according to their semantic definitions), we design a joint DS/ACE RE model and test it on ACE annotations (thus against expert linguistic annotators). Moreover, we experiment with end-to-end systems for real-world RE applications. Consequently, our RE system is applicable to any document or sentence, which is another major improvement over previous work, which, to our knowledge, does not report experiments on end-to-end SLRE.
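As a rough illustration of the distant-supervision step described above, the sketch below (not part of the thesis) pairs knowledge-base facts with sentences that mention both of a fact's arguments; the facts, the sentences, and the substring-based entity matching are hypothetical simplifications of what a real system built on YAGO/Freebase and Wikipedia would do.

```python
# Minimal sketch of distant supervision for relation extraction.
# Toy knowledge-base facts: (argument 1, relation, argument 2).
FACTS = [
    ("Barack Obama", "bornIn", "Honolulu"),
    ("Trento", "locatedIn", "Italy"),
]

# Toy corpus standing in for Wikipedia sentences.
SENTENCES = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Trento is a city in northern Italy.",
    "Honolulu hosted the conference last year.",
]

def distant_supervision(facts, sentences):
    """Label a sentence with relation r whenever it mentions both arguments
    of a known fact (e1, r, e2); the matches become training instances."""
    instances = []
    for sentence in sentences:
        for e1, relation, e2 in facts:
            if e1 in sentence and e2 in sentence:
                instances.append((sentence, e1, e2, relation))
    return instances

if __name__ == "__main__":
    for sentence, e1, e2, relation in distant_supervision(FACTS, SENTENCES):
        print(f"{relation}({e1}, {e2})  <-  {sentence}")
```

The instances produced this way are noisy, since a sentence mentioning both entities need not actually express the relation, which is why they constitute weak, distant supervision rather than gold annotations.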
2011
XXIII
2010-2011
Ingegneria e Scienza dell'Informaz (cess.4/11/12)
Information and Communication Technology
Moschitti, Alessandro
no
English
Settore INF/01 - Informatica
Files in this record:
TVNguyen_Dissertation_old.pdf — open access (accesso aperto); type: Doctoral Thesis (Tesi di dottorato); license: All rights reserved; size: 2.18 MB; format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/369167