Providing Insight into Data Source Topics

Bergamaschi, Sonia; Ferrari, Davide; Guerra, Francesco; Simonini, Giovanni; Velegrakis, Ioannis

doi:10.1007/s13740-016-0063-6

A fundamental service for the exploitation of the modern large data sources that are available online is the ability to identify the topics of the data that they contain. Unfortunately, the heterogeneity and lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates signatures of sources and matches them against a reference vocabulary of concepts, producing a description of the topics of each source in terms of this reference vocabulary. The reference vocabulary may be provided ready-made, may be created manually, or may be created by applying our signature-generation algorithm over a well-curated data source with a clear identification of topics. In our particular case, we have used DBpedia for the creation of the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to generate semantic identifiers of the various attributes, which, combined, form a unique signature of the concepts (i.e. the topics) of the source. The generation of the identifiers is based on the entropy of the values of the attributes; thus, they are independent of naming heterogeneity of attributes or tables. Although the use of traditional information-theoretical quantities such as entropy and mutual information is not new, they may become untrustworthy due to their sensitivity to overfitting, and they require a number of samples equal to that used to construct the reference vocabulary. To overcome these limitations, we normalize and use pseudo-additive entropy measures, which automatically downweight the role of vocabulary items and property values with very low frequencies, resulting in a more stable solution than the traditional counterparts.
We have materialized our theory in a system called WHATSIT and we experimentally demonstrate its effectiveness.
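
As a rough illustration of the entropy-based identifiers described in the abstract, the following Python sketch computes the Shannon entropy of an attribute's values alongside a normalized pseudo-additive entropy. It is not the WHATSIT implementation: the abstract does not name the specific pseudo-additive measure, so the Tsallis form, the parameter q, the normalization by the maximum attainable value, and the function names (including the hypothetical source_signature) are all assumptions made for illustration. Mutual information is not covered.

```python
import math
from collections import Counter


def value_distribution(values):
    """Empirical probability of each distinct value in an attribute column."""
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(values):
    """Traditional Shannon entropy (in bits) of the attribute values."""
    return -sum(p * math.log2(p) for p in value_distribution(values))


def tsallis_entropy(values, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p^q)) / (q - 1), assumed here as the
    pseudo-additive measure. For q > 1, rare values contribute little."""
    if q == 1.0:
        raise ValueError("q must differ from 1 (q -> 1 is the Shannon limit)")
    probs = value_distribution(values)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def normalized_tsallis_entropy(values, q=2.0):
    """Tsallis entropy divided by its maximum for the same number of
    distinct values (assumed normalization), giving a score in [0, 1]."""
    n = len(set(values))
    if n <= 1:
        return 0.0  # a constant attribute carries no information
    max_entropy = (1.0 - n ** (1.0 - q)) / (q - 1.0)
    return tsallis_entropy(values, q) / max_entropy


def source_signature(table, q=2.0):
    """Hypothetical signature: sorted per-attribute entropy scores.
    Attribute names are ignored, so renaming columns leaves it unchanged."""
    return sorted(normalized_tsallis_entropy(col, q) for col in table.values())


if __name__ == "__main__":
    table = {
        "country": ["IT", "IT", "FR", "FR"],
        "constant": ["x", "x", "x", "x"],
    }
    print(shannon_entropy(table["country"]))
    print(tsallis_entropy(table["country"]))
    print(source_signature(table))
```

Because the hypothetical signature is built only from value distributions, two tables that hold the same data under different attribute names yield the same result, which mirrors the independence from naming heterogeneity claimed in the abstract.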

Providing Insight into Data Source Topics / Bergamaschi, S., Ferrari, D., Guerra, F., Simonini, G., Velegrakis, I.. - In: JOURNAL ON DATA SEMANTICS. - ISSN 1861-2032. - STAMPA. - 5:4(2016), pp. 211-228. [10.1007/s13740-016-0063-6]


2016-01-01



Date of publication	2016
Journal title	JOURNAL ON DATA SEMANTICS
Issue number and part	4
DOI	https://dx.doi.org/10.1007/s13740-016-0063-6
Scopus identifier	2-s2.0-84992386943
WOS identifier	WOS:000391188500001
Appears in types:	03.1 Journal article

Files in this item:

File	Access	Type	License	Size	Format
BergamaschiFGSV16.pdf	Open access	Refereed author's manuscript (post-print)	All rights reserved	1.64 MB	Adobe PDF
Bergamaschi2016_Article_ProvidingInsightIntoDataSource (1).pdf	Archive administrators only	Publisher's layout	All rights reserved	1.05 MB	Adobe PDF