A Model for XML Instance Level Integration

Aldo Monteiro do Nascimento, Carmem S. Hara

Merging instances from different sources to build a data warehouse raises two major problems: entity identification ambiguity and attribute value conflicts. In this paper we propose a data model that facilitates the resolution of attribute value conflicts by explicitly representing them in the integrated schema. In this model, the data warehouse is an XML tree populated with data imported from one or more XML sources, and its nodes are annotated with provenance information. The annotations serve two purposes. First, they record the origin of every element in the data warehouse, which is essential for judging the quality of the data and how much trust to place in it. Second, they allow the portion of the source XML tree used to populate the warehouse to be reconstructed. This capability is important when the original document is needed for comparison with new releases from the same source in order to incrementally update the warehouse. We present algorithms for populating the warehouse according to the proposed model and for reconstructing the source data. We also report results from an experimental study conducted to determine the impact of the annotations on the size of the warehouse.
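As a rough illustration of the idea, and not the algorithms presented in the paper, the following Python sketch merges XML sources into a single tree whose nodes carry a provenance annotation, and then recovers the portion contributed by one source. Everything in it is an assumption made for the example: the annotation is stored in a hypothetical `src` attribute, entities are matched simply by tag name, leaf elements with different text are kept side by side as an explicit conflict, and other XML attributes are ignored.

```python
import xml.etree.ElementTree as ET

PROV = "src"  # hypothetical attribute: space-separated ids of contributing sources


def _add_source(node, source_id):
    """Record that `source_id` contributed this warehouse node."""
    ids = node.get(PROV, "").split()
    if source_id not in ids:
        ids.append(source_id)
    node.set(PROV, " ".join(ids))


def _text(node):
    return (node.text or "").strip()


def merge(target, source, source_id):
    """Merge `source` into warehouse node `target` (same tag assumed)."""
    _add_source(target, source_id)
    for child in source:
        is_leaf = len(child) == 0
        match = None
        for cand in target.findall(child.tag):
            if is_leaf and len(cand) == 0 and _text(cand) == _text(child):
                match = cand  # same value: share the node across sources
                break
            if not is_leaf and len(cand) > 0:
                match = cand  # simplification: inner nodes match by tag only
                break
        if match is None:
            # New element, or a conflicting leaf value kept as a sibling.
            match = ET.SubElement(target, child.tag)
            if is_leaf:
                match.text = child.text
        merge(match, child, source_id)


def reconstruct(node, source_id):
    """Copy the subtree restricted to nodes annotated with `source_id`."""
    if source_id not in node.get(PROV, "").split():
        return None
    copy = ET.Element(node.tag)
    copy.text = node.text
    for child in node:
        sub = reconstruct(child, source_id)
        if sub is not None:
            copy.append(sub)
    return copy


# Two hypothetical sources that agree on the name but disagree on the city.
s1 = ET.fromstring("<person><name>Ana</name><city>Curitiba</city></person>")
s2 = ET.fromstring("<person><name>Ana</name><city>Sao Paulo</city></person>")

warehouse = ET.Element("person")
merge(warehouse, s1, "s1")
merge(warehouse, s2, "s2")
print(ET.tostring(warehouse, encoding="unicode"))
print(ET.tostring(reconstruct(warehouse, "s2"), encoding="unicode"))
```

In this sketch the shared `name` element is annotated with both sources, while the two `city` values remain as separate, individually annotated elements, so the conflict stays visible in the warehouse until it is resolved. Reconstruction follows the sibling order of the warehouse, so a source whose elements arrive in a different order would not be reproduced exactly by this simplified version.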

