Evaluation of Parallel Blocking Techniques for Entity Resolution and Deduplication

Charles F. Goncalves, Walter Santos, Luis F. D. Flores, Matheus S. Vilela, Carla Machado, Wagner Meira Jr., Altigran Silva

Data quality in databases is fundamental to many information management applications. One key quality criterion is the occurrence of duplicated records in a database, which motivates the development of deduplication and entity resolution techniques. In deduplication, the main challenge is the high cost of comparing every pair of records in a database, which grows quadratically with the number of records. To mitigate this problem, blocking techniques reduce the number of comparisons by using fast, inexpensive metrics to identify which pairs of records are similar enough to be compared in detail. In the present study, we evaluate several existing blocking techniques implemented in a distributed, parallel, and highly scalable deduplication framework. We analyze them comparatively and identify the main advantages and disadvantages of parallel execution.
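
As an illustration of the general idea behind blocking, the following minimal sketch groups records by a cheap key and generates candidate pairs only within each group. It is not the framework evaluated in this study: the key function (first three letters of the surname), the field names, and the sample records are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical cheap key: first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def candidate_pairs(records, key=blocking_key):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[key(record)].append(idx)
    # Only records that share a block are compared with each other.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative sample data (not taken from the study).
records = [
    {"name": "Maria", "surname": "Silva"},
    {"name": "Maria", "surname": "Silvah"},
    {"name": "Joao", "surname": "Santos"},
    {"name": "Joao", "surname": "Santtos"},
    {"name": "Ana", "surname": "Flores"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
```

With these five sample records, exhaustive comparison would require 10 pairs, while the blocks leave only 2 candidate pairs for detailed comparison. The trade-off is that a true duplicate whose key differs (for example, a typo in the first three letters) would never be compared.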

