An important task in data integration and data cleaning is the identification of data that describe the same real world object, such as an event, a person, or a movie. There are various techniques to tackle this problem. The typical methodology is to collect matching evidence, such as similarities between the entity strings, and based on them generate information to link the entities. Then, using predefined thresholds, or human intervention, the entities are merged, and thus queries are executed over the resulted merged entities. In this chapter, we explain the limitations of this methodology on recently introduced data, for instance data from Web 2.0 applications, and the challenges that such data impose on the entity linkage methodology. We then propose an alternative, generic methodology that allows the use of the entity linkage information upon query processing. In particular, we define a generic data model suitable for representing the entity and linkage information as this is generated by a number of the existing entity linkage techniques. Entities are compiled on-the-fly, by effectively processing the incoming query over the representation model, and thus, query answers reflect the most probable entity solution for the specific query. We also report the results of our extensive experimental evaluation, which verify the efficiency and effectiveness of the suggested methodology.
Embracing Uncertainty in Entity Linking
Velegrakis, Ioannis
2012-01-01
Abstract
An important task in data integration and data cleaning is the identification of data that describe the same real world object, such as an event, a person, or a movie. There are various techniques to tackle this problem. The typical methodology is to collect matching evidence, such as similarities between the entity strings, and based on them generate information to link the entities. Then, using predefined thresholds, or human intervention, the entities are merged, and thus queries are executed over the resulted merged entities. In this chapter, we explain the limitations of this methodology on recently introduced data, for instance data from Web 2.0 applications, and the challenges that such data impose on the entity linkage methodology. We then propose an alternative, generic methodology that allows the use of the entity linkage information upon query processing. In particular, we define a generic data model suitable for representing the entity and linkage information as this is generated by a number of the existing entity linkage techniques. Entities are compiled on-the-fly, by effectively processing the incoming query over the representation model, and thus, query answers reflect the most probable entity solution for the specific query. We also report the results of our extensive experimental evaluation, which verify the efficiency and effectiveness of the suggested methodology.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione