Three Case Studies For Understanding, Measuring and Using a Compound Notion of Data Quality With Emphasis on the data Staleness Dimension / Chayka, Oleksiy. - (2012), pp. 1-100.

Three Case Studies For Understanding, Measuring and Using a Compound Notion of Data Quality With Emphasis on the data Staleness Dimension

Chayka, Oleksiy
2012-01-01

Abstract

By its nature, the term "data quality", with its generic meaning of "fitness for use", has both subjective and objective aspects. There are numerous methodologies and techniques to evaluate its subjective parts and to measure its objective parts. However, none of them is uniform enough to be applied across diverse real-world applications. Nor can such a uniform method be created, since data quality penetrates too deeply into business operations to allow "a silver bullet" for all of them: it normally spans from the representation of real-world entities or their properties as data in an information system, through data processing, to delivery to consumers. In this work, we consider three real-world use cases which entirely or partially cover those areas of data quality scope. In particular, we study the following problems: 1) how quality of data can be defined and propagated to customers in a business intelligence application for quality-aware decision making; 2) how data quality can be defined, measured and used in a web-based system operating with semi-structured data that comes from, and is intended for, both humans and machines; 3) how a data-driven (vs. system-driven) time-related data quality notion of staleness can be defined, efficiently measured and monitored in a generic information system. Thus, we expand the corresponding state of the art with Application, System and Dimension aspects of data quality. In the Application context, we propose a quality-aware architecture for a typical business intelligence application in a healthcare environment. We demonstrate the potential implications of quality issues, including intra- and inter-dimensional quality dependencies, to which data is prone from early processing stages up to the reporting level. In the part dedicated to the System, we demonstrate an approach to understanding, measuring and disseminating data quality measurement results in the context of a web-based system called the Entity Name System (ENS).
On the Dimension side, we propose a definition of data staleness that satisfies key requirements for time-related quality metrics, building on similar notions elaborated by earlier researchers. We demonstrate an approach to measuring data staleness with different statistical methods, including exponential smoothing. In our experiments, we compare their space efficiency and their accuracy in predicting data update instants, using the update histories of a sample of representative Wikipedia articles.
2012
XXIII
2011-2012
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Bouquet, Paolo
no
English
Sector INF/01 - Computer Science
Files in this item:
PhD_Thesis.pdf
Access: open access
Type: Doctoral thesis
License: All rights reserved
Size: 4.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/368390