Charles F. Goncalves, Walter Santos, Luis F. D. Flores, Matheus S. Vilela, Carla Machado, Wagner Meira Jr., Altigran Silva.
Data quality in databases is fundamental to many information management applications. One key criterion for measuring quality is the occurrence of duplicate records in a database, which motivates the development of deduplication and entity resolution techniques. In deduplication, the main challenge is the high cost of comparing every pair of records in a database. To mitigate this problem, blocking techniques reduce the number of comparisons by using fast, cheap metrics to estimate the similarity between pairs of records. In the present study, we evaluate several existing blocking techniques implemented in a distributed, parallel, and highly scalable deduplication framework. We compare them and identify the main advantages and disadvantages of parallel execution.
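To illustrate the general idea of blocking, here is a minimal Python sketch of key-based (standard) blocking. It is not taken from the paper or its framework: the record fields, the `blocking_key` function, and the choice of a surname prefix as the key are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical cheap key: first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def candidate_pairs(records):
    # Group record indices by blocking key, then pair records only within a block.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative records (invented for this sketch).
records = [
    {"surname": "Silva", "city": "Manaus"},
    {"surname": "Silvah", "city": "Manaus"},
    {"surname": "Meira", "city": "Belo Horizonte"},
    {"surname": "Santos", "city": "Belo Horizonte"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison count
print(f"{len(pairs)} candidate pair(s) instead of {all_pairs}")
```

Only records that share a key are compared, so the expensive similarity function runs on far fewer pairs. The trade-off is that true duplicates whose keys differ are never compared, which is why the choice of blocking technique matters.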
http://www.lbd.dcc.ufmg.br:8080/colecoes/waamd/2008/007.pdf