Basic word statistics for information retrieval: thesaurus as a complex network

Adriano de Jesus Holanda, Ivan Torres Pisa, Osame Kinouchi, Alexandre Souto Martinez, Evandro Eduardo Seron Ruiz

Words are the building blocks used to construct sentences and to transmit information. Here, two distinct hard classification approaches are applied to words. First, we consider words as the nodes and their relationships as the links of a directed graph. This permits us to define, in a natural manner, the thesaurus conformation. The statistics of the outgoing and incoming links are characterized by simple fitting functions. Second, using a large collection of articles from The New York Times online newspaper, classified by thematic section, we show that the words in current use in natural language are distributed according to the same Zipf's law. A combination of both approaches seems to be a promising tool for automatic information retrieval.
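
As a rough illustration of the two measurements described above, the following minimal Python sketch counts outgoing and incoming links in a thesaurus treated as a directed graph, and builds the rank-frequency table on which a Zipf-type fit would be made. The function names, the toy thesaurus, and the sample sentence are hypothetical and chosen only for illustration; they are not the data or the code used in this work.

```python
from collections import Counter


def degree_counts(thesaurus):
    """Return (out_degree, in_degree) for a thesaurus given as a dict
    mapping each entry word to the words it points to (directed links)."""
    out_deg = {word: len(set(targets)) for word, targets in thesaurus.items()}
    in_deg = Counter()
    for targets in thesaurus.values():
        for target in set(targets):  # ignore repeated links from one entry
            in_deg[target] += 1
    return out_deg, dict(in_deg)


def rank_frequency(text):
    """Return a list of (rank, word, frequency), most frequent word first.
    Ties are broken alphabetically so the ordering is deterministic."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Hypothetical toy data, for illustration only.
toy_thesaurus = {
    "big": ["large", "huge"],
    "large": ["big"],
    "huge": ["big", "large"],
}
out_deg, in_deg = degree_counts(toy_thesaurus)
table = rank_frequency("the cat and the dog and the bird")
```

In this sketch, Zipf's law would correspond to the frequency in `table` decaying approximately as a power of the rank; fitting that exponent, or the degree distributions, requires a corpus far larger than this toy input.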

