André Luiz da Costa Carvalho, Allan José de Souza Bezerra, Edleno Silva de Moura, Altigran Soares da Silva, Patrícia Silva Peres.
Identifying replicated sites is an important task for search engines. It can reduce data storage costs, improve query processing time, and remove noise that might affect the quality of the final answer given to the user. This paper introduces a new approach to detecting replicated sites in search engine databases, using the websites' structure and the content of their pages as evidence of replication. It also presents the results of experiments performed with a real search engine database. Our approach found that 8.43% of the web pages stored in the database belonged to replicated websites, with 94.4% precision, a result that is more accurate than those reported in other works.
http://www.lbd.dcc.ufmg.br:8080/colecoes/sbbd/2005/002.pdf