Data and Metadata Extraction from Semi-Structured Texts Using HMMs

Roberto Oliveira dos Santos, Filipe de Sá Mesquita, Altigran Soares da Silva, Eli Cortez C. Vilarinho

The Web is abundant in pages containing implicit data items. In many cases, these data items occur in semi-structured texts without explicit delimiters and are embedded within an implicit structure. In this paper, we present a novel approach to extraction from semi-structured texts based on Hidden Markov Models (HMMs). Unlike previous HMM-based proposals in the literature, our approach emphasizes the extraction of metadata in addition to the data items themselves. Our approach consists of a nested structure of HMMs, in which a main HMM identifies implicit attributes in the text and a set of internal HMMs, one per attribute, identifies data and metadata. The HMMs are trained on a fraction of the set of texts from which data is to be extracted. Our experiments with classified ads taken from the Web show that the extraction process reaches F-measure values above 0.97, even when the fraction used for training is small.
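The nested structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the attribute names (`price`, `year`), the token features, and every probability below are hypothetical values set by hand for illustration, whereas the paper obtains its models by training. An outer HMM labels each token with an attribute, and an inner HMM for that attribute then labels each token in the segment as data or metadata, both using standard Viterbi decoding.

```python
import math
from itertools import groupby


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at t, predecessor)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0.0)), None)
             for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        col = {}
        for s in states:
            score, arg = max((prev[r][0] + lg(trans_p[r][s]), r) for r in states)
            col[s] = (score + lg(emit_p[s].get(o, 0.0)), arg)
        best.append(col)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path


def feature(token):
    """Map a token to a coarse observation symbol (hypothetical feature set)."""
    if token in ("price", "year"):
        return "kw_" + token
    if token.startswith("$"):
        return "money"
    if token.isdigit():
        return "number"
    return "word"


# Outer HMM: one state per implicit attribute (hand-set, hypothetical values).
ATTRS = ["price", "year"]
OUTER_START = {"price": 0.5, "year": 0.5}
OUTER_TRANS = {"price": {"price": 0.7, "year": 0.3},
               "year": {"price": 0.3, "year": 0.7}}
OUTER_EMIT = {
    "price": {"kw_price": 0.45, "money": 0.45, "kw_year": 0.04,
              "number": 0.04, "word": 0.02},
    "year": {"kw_year": 0.45, "number": 0.45, "kw_price": 0.04,
             "money": 0.04, "word": 0.02},
}

# Inner HMMs: one per attribute, separating metadata (labels) from data (values).
ROLES = ["metadata", "data"]
ROLE_START = {"metadata": 0.8, "data": 0.2}
ROLE_TRANS = {"metadata": {"metadata": 0.1, "data": 0.9},
              "data": {"metadata": 0.2, "data": 0.8}}
INNER_EMIT = {
    "price": {"metadata": {"kw_price": 0.9, "money": 0.1},
              "data": {"kw_price": 0.1, "money": 0.9}},
    "year": {"metadata": {"kw_year": 0.9, "number": 0.1},
             "data": {"kw_year": 0.1, "number": 0.9}},
}


def extract(tokens):
    """Label each token with (token, attribute, role) using the nested HMMs."""
    obs = [feature(t) for t in tokens]
    attrs = viterbi(obs, ATTRS, OUTER_START, OUTER_TRANS, OUTER_EMIT)
    result, pos = [], 0
    # Each run of identical attribute labels is one segment for an inner HMM.
    for attr, run in groupby(attrs):
        n = len(list(run))
        roles = viterbi(obs[pos:pos + n], ROLES, ROLE_START, ROLE_TRANS,
                        INNER_EMIT[attr])
        result.extend(zip(tokens[pos:pos + n], [attr] * n, roles))
        pos += n
    return result


if __name__ == "__main__":
    for row in extract(["price", "$5000", "year", "1998"]):
        print(row)
```

On the toy ad above, the sketch labels `price` and `year` as metadata and `$5000` and `1998` as data of their respective attributes. Decoding in two passes keeps each inner model small, at the cost of not letting inner-model evidence influence the attribute segmentation; whether the paper decodes jointly or in stages is not stated in the abstract.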

