Towards structured representation of academic search results

Mirylenka, Daniil

doi:10.15168/11572_367624

Searching for scientific publications is a tedious task, especially when exploring an unfamiliar domain. Typical scholarly search engines produce lengthy unstructured result lists, which are difficult to comprehend, interpret and browse. An informative visual summary could convey useful information about the returned results as a whole, without the need to sift through individual publications.

The first contribution of this thesis is a novel method of representing academic search results with concise and informative topic maps. The method consists of two steps: i) extracting interrelated topics from the publication titles and abstracts, and ii) summarizing the resulting topic graph. In the first step we map the returned publications to articles and categories of Wikipedia, constructing a graph of relevant topics with hierarchical relations. In the second step we sequentially build a summary of the topic graph that represents the search results in the most informative way. We rely on sequential prediction to automatically learn to build informative summaries from examples. The summarized topic maps share most of the benefits and avoid most of the drawbacks of current methods for grouping documents, such as clustering, topic models, and predefined taxonomies. Specifically, the topic maps are dynamic and fine-grained, have flexible granularity, and contain up-to-date topics that are connected by informative relations and carry concise, meaningful labels.

The second contribution of this thesis is a method for bootstrapping domain-specific ontologies from the categories of Wikipedia. The method performs three steps: i) selecting the set of categories relevant to the domain, ii) classifying the categories into classes and individuals, and iii) classifying the sub-category relations into "subclass-of", "instance-of", "part-of" and "related-to". In each step we rely on binary classification, which makes the method flexible and easily extensible with new features. For the purpose of academic search, the proposed method advances the creation of semantically rich topic maps. In general, the method semi-automates the construction of large-scale domain ontologies, benefiting multiple potential applications.

Providing ground truth data for structured prediction of large objects, such as topic map summaries or domain ontologies, is tedious. The last contribution of this thesis is an initial investigation into reducing the labeling effort in structured prediction tasks. First, we present a labeling interface that suggests topics to be added to the ground truth topic map summary. We modify a state-of-the-art sequential prediction method to iteratively learn from the summaries one topic at a time, while retaining the convergence guarantees. Second, we present an interactive learning method for selecting the categories of Wikipedia relevant to a given domain. The method reduces the number of required labels by actively selecting the queries to the annotator and learning one label at a time.
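
As a rough illustration of building a topic summary one topic at a time, the following Python sketch greedily adds the topic that covers the most publications not yet covered. It is not the method of the thesis: the thesis learns how to build summaries from examples via sequential prediction, whereas the coverage gain used here is a hand-written stand-in, and the hierarchical relations of the topic graph are ignored. The function name, the topic names and the paper identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_topic_summary(topic_docs: Dict[str, Set[str]], size: int) -> List[str]:
    """Pick up to `size` topics one at a time, each maximizing newly covered documents.

    The coverage gain is a hand-written stand-in for a learned scoring function.
    """
    summary: List[str] = []
    covered: Set[str] = set()
    for _ in range(size):
        best_topic, best_gain = None, 0
        # Iterate in sorted order so that ties are broken deterministically.
        for topic in sorted(topic_docs):
            if topic in summary:
                continue
            gain = len(topic_docs[topic] - covered)
            if gain > best_gain:
                best_topic, best_gain = topic, gain
        if best_topic is None:  # no remaining topic adds new documents
            break
        summary.append(best_topic)
        covered |= topic_docs[best_topic]
    return summary


# Hypothetical toy input: Wikipedia-style topics mapped to made-up paper ids.
TOPIC_DOCS = {
    "Machine learning": {"p1", "p2", "p3", "p4"},
    "Cluster analysis": {"p1", "p2"},
    "Information retrieval": {"p4", "p5", "p6"},
    "Ontology (information science)": {"p7"},
}

print(greedy_topic_summary(TOPIC_DOCS, size=2))
```

In this toy input, "Cluster analysis" is never selected once "Machine learning" is in the summary, because it covers no additional publications. A learned scoring function could instead weigh other signals, such as the relations between topics.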

Towards structured representation of academic search results / Mirylenka, Daniil. - (2015), pp. 1-124. [10.15168/11572_367624]

2015-01-01

Defended on: 2015
Cycle: XXVI
Academic year: 2013-2014
Department: Ingegneria e scienza dell'Informaz (29/10/12-)
Doctoral programme: Informatica e telecomunicazioni (until academic year 2020-21, 36th cycle)
Unitn internal supervisor: Passerini, Andrea
Bi-nationally supervised doctoral thesis: no
DOI: https://dx.doi.org/10.15168/11572_367624
Language: English
Reference SSD (valid until 24/06/2024): Sector INF/01 - Informatica
Appears in categories: 08.1 Tesi di dottorato (Doctoral Thesis)

Files in this item:

File	Size	Format
mirylenka.pdf (open access; type: Doctoral Thesis; license: All rights reserved)	3.79 MB	Adobe PDF