September 2024
The r-index represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its space usage, O(r) where r is the number of runs in the Burrows--Wheeler Transform of the text, is however higher than that of Lempel--Ziv (LZ) and grammar-based indexes, which makes it unattractive in various real-life scenarios of milder repetitiveness. We introduce the sr-index, a variant that limits the space to O(min(r, n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by subsampling the text positions indexed by the r-index, while still supporting pattern matching with guaranteed performance. Our experiments show that the theoretical analysis falls short in describing the practical advantages of the sr-index, because it performs much better on real texts than on synthetic ones: the sr-index retains the performance of the r-index while using 1.5--4.0 times less space, sharply outperforming virtually every other compressed index on repetitive texts in both time and space. Only a particular LZ-based index uses less space than the sr-index, but it is an order of magnitude slower. Our second contribution consists of the r-csa and sr-csa indexes. Just as the r-index adapts the well-known FM-Index to repetitive texts, the r-csa adapts Sadakane's Compressed Suffix Array (CSA) to this case. We show that the principles used in the r-index fit naturally and efficiently in the CSA framework. The sr-csa is the corresponding subsampled version of the r-csa. Just as the CSA performs better than the FM-Index on classic texts with alphabets larger than DNA, we show that the sr-csa outperforms the sr-index on repetitive texts over those larger alphabets, and on some DNA texts as well.
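To make the quantity r concrete, the following minimal sketch computes the Burrows--Wheeler Transform of a small text and counts its runs. It is an illustration only, not the construction used by the r-index or the sr-index: the function names `bwt` and `count_runs`, the `$` sentinel, the toy input, and the sampling parameter value are all assumptions made for the example, and sorting all rotations is far too slow and memory-hungry for real collections.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, take the last column.

    Assumes the sentinel sorts before every character of the text
    (true for '$' and ASCII letters). Illustration only.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters in s."""
    return sum(1 for _ in groupby(s))


# Small worked case: the BWT of "banana" with sentinel is "annb$aa",
# whose runs are a | nn | b | $ | aa.
small = bwt("banana")
small_runs = count_runs(small)

# A highly repetitive toy text (hypothetical input): r is far below n.
transformed = bwt("abracadabra" * 20)
n = len(transformed)
r = count_runs(transformed)

# The space bound quoted above, O(min(r, n/s)), for an assumed s.
s = 8
space_bound_terms = min(r, n // s)
print(f"n={n}, r={r}, min(r, n/s)={space_bound_terms}")
```

On repetitive inputs r is much smaller than n, which is why O(r) space is attractive there. On mildly repetitive inputs r grows toward n, and the n/s term is what caps the space of the subsampled index.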