A Compressed Self-Index using a Ziv-Lempel Dictionary

Luís M. S. Russo, Arlindo L. Oliveira


A compressed full-text self-index for a text $ T$ , of size $ u$ , is a data structure used to search patterns $ P$ , of size $ m$ , in $ T$ that requires reduced space, i.e. that depends on the empirical entropy ($ H_k$ , $ H_0$ ) of $ T$ , and is, furthermore, able to reproduce any substring of $ T$ . In this paper we present a new compressed self-index able to locate the occurrences of $ P$ in $ O((m+occ)\log n)$ time, where $ occ$ is the number of occurrences and $ \sigma$ the size of the alphabet of $ T$ . The fundamental improvement over previous LZ78 based indexes is the reduction of the search time dependency on $ m$ from $ O(m^2)$ to $ O(m)$ . To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the $ \mathcal{T}_{78}$ suffix tree. We show that our method is very competitive in practice by comparing it against the LZ-Index, the FM-index and a compressed suffix array.

Luis Manuel Silveira Russo 2007-12-02