Similarity Search and Indexing for High-Dimensional Data

Searching by similarity is a critical operation on many systems, and thus has attracted the attention of many disciplines in Computer Sciences, including Computational Geometry, Machine Learning, Multimedia Retrieval and, of course, Databases. To perform efficiently, similarity search requires the support of indexing, which suffers from the infamous "curse of the dimensionality". In this tutorial we will introduce the challenges of indexing and searching high-dimensional data, and present the most recent tools available to "tame the curse". The tutorial takes 3 hours and is divided in three main sections: an introduction, where we define the problem, its applications and its challenges; a presentation of the state of the art, where we present the most important/interesting solutions to the problem; and a study of case of the application of high dimensional indexing to problems in Content Based Information Retrieval. The second section (study of the state of the art) is the core of the tutorial. We will show how the (exceedingly prolific) literature in high-dimensional indexing can be tackled by dividing it in classes of solutions: a. Solutions based on space-partitioning: those algorithms are based on the idea of partitioning the space into mutually-exclusive hyper-rectangles; b. Solutions based on data partitioning and clustering: those algorithms partition the data into mutually exclusive sets, which may overlap in the space; c. Solutions based on data approximation and quantization: those techniques quantize the data, reducing the size of the files which have to be sequentially analyzed. They may also reorganize the data to reduce the amount of sequential read necessary; d. Solutions based on metric spaces: those techniques use the triangle inequality to select the portions of the data to analyze. The try to minimize the number of distances computed, and are useful when the distance function is expensive or when the data do not belong to a vector space; e. Solutions based on decomposition into subspaces: those techniques use multiple parallel indexes of smaller dimensionality than the original space, and then aggregate the results. For each class, we show the archetypical, easy to understand method, and use it to explain the motivation behind that class of solutions and the challenges they face. We then proceed to explore the state of the art, showing how different methods address those challenges. We end this section by considering some common trends among the recently proposed solutions. At the end, the audience will have a good grasp of the current state of the art, the most promising research trends and the challenges still faced by the technology.

http://www.lbd.dcc.ufmg.br:8080/colecoes/sbbd/2009/023.pdf

Biblioteca Digital Brasileira de Computação - Contato: bdbcomp@lbd.dcc.ufmg.br