Automatic Selection of Training Examples for a Record Deduplication Method Based on Genetic Programming

Gabriel Silva Gonçalves, Moisés G. de Carvalho, Alberto H. F. Laender, Marcos André Gonçalves

Recently, machine learning techniques have been used to solve the record deduplication problem. However, they require training examples, which in most cases are generated manually. The cost of creating this set of examples hinders the use of such techniques. In this paper, we propose an approach based on a deterministic technique to automatically suggest training examples for a deduplication method based on genetic programming. Our experiments with synthetic datasets show that, using only 15% of the examples suggested by our approach, it is possible to achieve F1 results equivalent to those obtained when using all the examples, leading to savings in training time of up to 85%.
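
The results above are reported in terms of F1, the harmonic mean of precision and recall. As a minimal sketch of how this measure can be computed for a deduplication task, the example below assumes a pairwise evaluation, in which each pair of records flagged as duplicates is compared against the true duplicate pairs. The abstract does not state how F1 was computed in the experiments, so the pairwise setting, the function names (`duplicate_pairs`, `pairwise_f1`), and the toy data are all illustrative assumptions.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Return the set of unordered record-id pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort ids so that each pair has one canonical (smaller, larger) form.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted_pairs, true_pairs):
    """Compute precision, recall, and F1 over sets of duplicate pairs."""
    true_positives = len(predicted_pairs & true_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: records 1-3 are one entity, records 4-5 another.
true_pairs = duplicate_pairs([{1, 2, 3}, {4, 5}])

# Hypothetical output of a deduplication method: it misses the pair (1, 3)
# and wrongly flags the pair (5, 6).
predicted_pairs = {(1, 2), (2, 3), (4, 5), (5, 6)}

precision, recall, f1 = pairwise_f1(predicted_pairs, true_pairs)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case three of the four predicted pairs are correct and three of the four true pairs are found, so precision, recall, and F1 all equal 0.75.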

