Adaptive and Flexible Blocking for Approximate Record Matching

Luiz Osvaldo Evangelista, Eli Cortez, Altigran S. da Silva, Wagner Meira Jr.

In data integration tasks, records from a single dataset or from different sources must often be compared to identify those that represent the same real-world entity. The cost of this search for duplicate records grows quadratically with the number of records in the data sources, so direct approaches, such as comparing all record pairs, must be avoided. In this context, machine-learning-based blocking methods are used to find the best blocking function, a combination of low-cost rules that defines how records are blocked. This work presents a new blocking method based on machine learning. Unlike other methods, the new approach is based on genetic programming, which allows more flexible rules, and a larger number of them, to be used in defining blocking functions, leading to more effective identification of duplicate records. Experimental results with real and synthetic data show that the correctness of the genetic programming method may exceed 95% while detecting duplicate records efficiently.
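To make the idea of a blocking function concrete, the sketch below shows how a disjunction of low-cost rules can restrict comparisons to a small set of candidate pairs instead of all n(n-1)/2 pairs. It is only an illustration of blocking in general, not the genetic programming method proposed in the paper: the records, field names, and the two rules (name_prefix, same_city) are hypothetical, and the rules are hand-written here rather than learned.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"id": 0, "name": "Maria Silva", "city": "Manaus"},
    {"id": 1, "name": "Maria Sylva", "city": "Manaus"},
    {"id": 2, "name": "Joao Souza", "city": "Belo Horizonte"},
    {"id": 3, "name": "Joao Sousa", "city": "Belo Horizonte"},
    {"id": 4, "name": "Ana Costa", "city": "Recife"},
]

# Low-cost blocking rules (assumed for this sketch): each maps a record
# to a blocking key, tagged with the rule name so keys from different
# rules never collide.
def name_prefix(rec):
    return ("name_prefix", rec["name"][:3].lower())

def same_city(rec):
    return ("city", rec["city"].lower())

def candidate_pairs(recs, rules):
    """Return id pairs that share a key under at least one rule (a disjunction)."""
    blocks = defaultdict(list)
    for rec in recs:
        for rule in rules:
            blocks[rule(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Number of comparisons a direct all-pairs approach would need.
all_pairs = len(records) * (len(records) - 1) // 2
pairs = candidate_pairs(records, [name_prefix, same_city])
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

A learning-based method such as the one described in the abstract would search for the combination of rules automatically; here the combination is fixed by hand purely to show the reduction in comparisons.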

