A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces

Palpanas, Themistoklis
2012-01-01

Abstract

In Entity Resolution (ER) over highly heterogeneous, noisy, user-generated entity collections, practically all block building methods rely on redundancy to achieve high effectiveness. This practice, however, yields a high number of pairwise comparisons, which harms efficiency. Existing block processing strategies aim to discard unnecessary comparisons at no cost in effectiveness. In this paper, we systematize blocking methods for Clean-Clean ER (an inherently quadratic task) over highly heterogeneous information spaces (HHIS) through a novel framework that consists of two orthogonal layers. The effectiveness layer encompasses methods for building overlapping blocks with a small likelihood of missed matches. The efficiency layer comprises a rich variety of techniques that significantly restrict the required number of pairwise comparisons while having a controllable impact on the number of detected duplicates. We map all relevant existing methods for creating and processing blocks in the context of HHIS to our framework, and additionally propose two novel techniques: Attribute Clustering Blocking and Comparisons Scheduling. We evaluate the performance of each layer and method on two large-scale, real-world data sets and validate the excellent balance between efficiency and effectiveness that they achieve.
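The trade-off described in the abstract can be illustrated with a small sketch. It assumes a simple token-based blocking scheme, where every token shared by the two collections defines one block. This is a generic redundancy-bearing technique chosen for illustration only; it is not the paper's Attribute Clustering Blocking or Comparisons Scheduling. The names `build_blocks` and `comparisons` and the sample profiles are hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_blocks(collection_a, collection_b):
    """Illustrative token blocking: one block per token found in both collections."""
    index_a, index_b = defaultdict(set), defaultdict(set)
    for index, collection in ((index_a, collection_a), (index_b, collection_b)):
        for entity_id, profile in collection.items():
            # Attribute names are ignored, so heterogeneous schemas do not matter.
            for value in profile.values():
                for token in str(value).lower().split():
                    index[token].add(entity_id)
    shared_tokens = index_a.keys() & index_b.keys()
    return {token: (index_a[token], index_b[token]) for token in shared_tokens}


def comparisons(blocks):
    """Return the raw comparison count and the set of distinct cross-collection pairs."""
    total = sum(len(ids_a) * len(ids_b) for ids_a, ids_b in blocks.values())
    distinct = {pair for ids_a, ids_b in blocks.values() for pair in product(ids_a, ids_b)}
    return total, distinct


# Hypothetical sample data: two duplicate-free collections with different schemas.
collection_a = {
    "a1": {"name": "John Smith", "city": "Trento"},
    "a2": {"fullName": "Maria Rossi", "location": "Hannover"},
}
collection_b = {
    "b1": {"label": "smith john", "residence": "trento"},
    "b2": {"title": "Rossi", "place": "Hannover Germany"},
}

blocks = build_blocks(collection_a, collection_b)
total, distinct = comparisons(blocks)
brute_force = len(collection_a) * len(collection_b)
print(total, len(distinct), brute_force)  # prints: 5 2 4
```

In this toy input the overlapping blocks entail 5 comparisons, more than the 4 of the brute-force approach, because the same pair recurs in several blocks. Discarding the repeated pairs leaves 2 comparisons, both of which involve the matching profiles.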
2012
12
Papadakis, G.; Ioannou, E.; Palpanas, Themistoklis; Niederee, C.; Nejdl, W.
Use this identifier to cite or link to this document: https://hdl.handle.net/11572/92146