FS-Dedup - A Framework for Signature-based Deduplication in large datasets

Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser

Record deduplication is a central process in data integration. It aims to identify which records represent the same underlying entity. In general, deduplication requires user intervention to identify which pairs of records are matches or non-matches. In large datasets, however, this intervention can require considerable effort. This thesis aims to reduce the intervention required from non-specialist users, focusing on large-scale deduplication. We propose a new framework, named FS-Dedup, in which the user does not need to intervene directly in the process: the user is only asked to label a small set of pairs, which the framework selects automatically. FS-Dedup removes the need for a specialist user, so that anyone can tune the entire deduplication process. FS-Dedup uses Signature-Based Deduplication (Sig-Dedup) algorithms as a "black box". These algorithms are characterized by high efficiency and scalability on large datasets. However, they normally require a specialist user to tune them. This thesis aims to fill that gap: it presents a framework that does not demand user knowledge about the dataset or about thresholds in order to ensure effectiveness. Our approach is novel in that it addresses the entire deduplication process in large datasets.
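
To make the role of the threshold concrete, the following is a minimal Python sketch of signature-based deduplication. It is not the FS-Dedup implementation or any specific Sig-Dedup algorithm from the thesis. It assumes, for illustration only, that every whitespace-separated token of a record acts as a signature and that candidate pairs are verified with Jaccard similarity; practical Sig-Dedup algorithms use more selective signatures. The names `tokens`, `jaccard`, `candidate_pairs` and `deduplicate` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record: str) -> frozenset:
    """Lowercased whitespace tokens; each token doubles as a signature (assumption)."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pairs(records):
    """Pairs of record ids that share at least one signature (inverted index)."""
    index = defaultdict(list)
    for rid, rec in enumerate(records):
        for tok in tokens(rec):
            index[tok].append(rid)
    pairs = set()
    for ids in index.values():
        # ids are appended in increasing order, so every pair is (smaller, larger)
        pairs.update(combinations(ids, 2))
    return pairs


def deduplicate(records, threshold):
    """Keep candidate pairs whose Jaccard similarity reaches the threshold."""
    token_sets = [tokens(r) for r in records]
    return sorted(
        (i, j)
        for i, j in candidate_pairs(records)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


records = [
    "John A. Smith, Porto Alegre",
    "john a. smith porto alegre",
    "Maria Souza, Recife",
]
matches_loose = deduplicate(records, threshold=0.6)
matches_strict = deduplicate(records, threshold=0.8)
```

In this toy dataset the first two records share four of six distinct tokens, so they are reported as a match at a threshold of 0.6 but not at 0.8. Choosing such a value is the kind of tuning decision that the abstract says normally requires a specialist user.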
