Enhanced Full-Text Self-Indexes based on Lempel-Ziv Compression

Luís M. S. Russo

Full-text indexes provide an efficient method to search for any sub-string of a text. However, these indexes require a large amount of space. Compressed indexes, that explore the compressibility of the text, have recently been proposed to address this problem. Self-indexes, that in addition are able to reproduce any sub-string of the text without storing it explicitly, represent a further step towards saving space.

This thesis studies self-indexes based on the Lempel-Ziv data compression technique. It starts by analyzing the search algorithm of these indexes, pointing out the causes of a quadratic dependency on the pattern size. It then proposes a new search procedure that solves this problem and confirms empirically that this modification has significant practical results. The thesis also proposes a method to extend the functionality of these indexes, so that it becomes possible to find a longest common sub-string and to compute approximate matches. These results are verified empirically, demonstrating that these indexes are very efficient.

Keywords: information storage, data compression, text indexing, pattern matching, approximate pattern matching, longest common sub-string

Luis Manuel Silveira Russo 2007-12-02