Searching for scientific publications is a tedious task, especially when exploring an unfamiliar domain. Typical scholarly search engines produce lengthy, unstructured result lists that are difficult to comprehend, interpret and browse. An informative visual summary could convey useful information about the returned results as a whole, without the need to sift through individual publications.

The first contribution of this thesis is a novel method of representing academic search results with concise and informative topic maps. The method consists of two steps: i) extracting interrelated topics from the publication titles and abstracts, and ii) summarizing the resulting topic graph. In the first step we map the returned publications to articles and categories of Wikipedia, constructing a graph of relevant topics with hierarchical relations. In the second step we sequentially build a summary of the topic graph that represents the search results in the most informative way. We rely on sequential prediction to automatically learn to build informative summaries from examples. The summarized topic maps share most of the benefits and avoid most of the drawbacks of current methods for grouping documents, such as clustering, topic models, and predefined taxonomies. Specifically, the topic maps are dynamic and fine-grained, have flexible granularity, and contain up-to-date topics that are connected by informative relations and carry concise, meaningful labels.

The second contribution of this thesis is a method for bootstrapping domain-specific ontologies from the categories of Wikipedia. The method performs three steps: i) selecting the set of categories relevant to the domain, ii) classifying the categories into classes and individuals, and iii) classifying the sub-category relations into "subclass-of", "instance-of", "part-of" and "related-to". In each step we rely on binary classification, which makes the method flexible and easily extensible with new features. For the purpose of academic search, the proposed method advances the creation of semantically rich topic maps. More generally, the method semi-automates the construction of large-scale domain ontologies, benefiting multiple potential applications.

Providing ground truth data for structured prediction of large objects, such as topic map summaries or domain ontologies, is tedious. The last contribution of this thesis is an initial investigation into reducing the labeling effort in structured prediction tasks. First, we present a labeling interface that suggests topics to be added to the ground truth topic map summary. We modify a state-of-the-art sequential prediction method to learn iteratively from the summaries one topic at a time, while retaining the convergence guarantees. Second, we present an interactive learning method for selecting the categories of Wikipedia relevant to a given domain. The method reduces the number of required labels by actively selecting the queries to the annotator and learning one label at a time.
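The idea of building a topic map summary one topic at a time can be illustrated with a small sketch. The code below is not the learned sequential prediction method of the thesis: it swaps in a simple greedy coverage heuristic, ignores the hierarchical relations between topics, and uses hypothetical names (`greedy_topic_summary`, `topic_docs`) and made-up toy data. It only shows the general shape of a sequential summary builder, assuming each topic is associated with the set of publications mapped to it.

```python
from typing import Dict, List, Set


def greedy_topic_summary(topic_docs: Dict[str, Set[str]], size: int) -> List[str]:
    """Select up to `size` topics, one at a time.

    At each step the topic that covers the most not-yet-covered
    publications is added (a stand-in for a learned scoring policy).
    """
    covered: Set[str] = set()
    summary: List[str] = []
    remaining = dict(topic_docs)
    while remaining and len(summary) < size:
        # Iterate in sorted order so that ties are broken alphabetically.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # no remaining topic adds new publications
        summary.append(best)
        covered |= remaining.pop(best)
    return summary


# Hypothetical toy input: topic -> identifiers of publications mapped to it.
topic_docs = {
    "Machine learning": {"p1", "p2", "p3"},
    "Information retrieval": {"p3", "p4"},
    "Topic models": {"p1", "p2"},
    "Ontologies": {"p5"},
}
summary = greedy_topic_summary(topic_docs, size=3)
print(summary)
```

In this toy run, "Topic models" is skipped because its publications are already covered by "Machine learning". A learned policy, as described in the abstract, would replace the fixed coverage score with a scoring function trained from example summaries.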
Towards structured representation of academic search results / Mirylenka, Daniil. - (2015), pp. 1-124.
Towards structured representation of academic search results
Mirylenka, Daniil
2015-01-01
File | Type | License | Size | Format
---|---|---|---|---
mirylenka.pdf (open access) | Doctoral thesis | All rights reserved | 3.79 MB | Adobe PDF