Dustin Cobas’s research while affiliated with University of Santiago Chile and other places


Publications (6)


Fast and Small Subsampled R-indexes
  • Preprint

September 2024

·

10 Reads

Dustin Cobas

·

Travis Gagie

·

The r-index represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its space usage, O(r), where r is the number of runs in the Burrows-Wheeler Transform of the text, is however higher than that of Lempel-Ziv (LZ) and grammar-based indexes, which makes it uninteresting in various real-life scenarios of milder repetitiveness. We introduce the sr-index, a variant that limits the space to O(min(r, n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by subsampling the text positions indexed by the r-index, while still supporting pattern matching with guaranteed performance. Our experiments show that the theoretical analysis falls short in describing the practical advantages of the sr-index, because it performs much better on real texts than on synthetic ones: the sr-index retains the performance of the r-index while using 1.5 to 4.0 times less space, sharply outperforming virtually every other compressed index on repetitive texts in both time and space. Only a particular LZ-based index uses less space than the sr-index, but it is an order of magnitude slower. Our second contribution is the r-csa and sr-csa indexes. Just as the r-index adapts the well-known FM-Index to repetitive texts, the r-csa adapts Sadakane's Compressed Suffix Array (CSA) to this case. We show that the principles used in the r-index fit naturally and efficiently in the CSA framework. The sr-csa is the corresponding subsampled version of the r-csa. While the CSA performs better than the FM-Index on classic texts with alphabets larger than DNA, we show that the sr-csa outperforms the sr-index on repetitive texts over those larger alphabets and on some DNA texts as well.
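To make the parameter r concrete, the following is a minimal Python sketch that builds the Burrows-Wheeler Transform of a tiny text by naive rotation sorting and counts its runs. It is not the construction used by the r-index or the sr-index. The function names `bwt`, `count_runs`, and `bound_term` are hypothetical, the `$` sentinel is assumed to be absent from the text and to sort before every other character, and `bound_term` only evaluates the quantity inside the O(min(r, n/s)) bound (with integer division), not any actual index size.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform by sorting all rotations (tiny inputs only).

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text`.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters in `s`."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def bound_term(r: int, n: int, s: int) -> int:
    """Evaluate min(r, n/s), the term inside the stated space bound.

    Illustrative only: uses integer division and ignores constants.
    """
    return min(r, n // s)


transformed = bwt("banana")
r = count_runs(transformed)
print(transformed, r, bound_term(r, len(transformed), 2))
```

On repetitive inputs the transformed string tends to contain long runs, so r grows much more slowly than n; the parameter s caps the bound term at n/s when r is large.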


A Fast and Small Subsampled R-index

March 2021

·

39 Reads

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, O(r), where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than that of Lempel-Ziv and grammar-based indexes, which makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits the space to O(min(r, n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove still supports pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5 to 3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.


Tailoring r-index for Document Listing Towards Metagenomics Applications

September 2020

·

8 Reads

·

3 Citations

Lecture Notes in Computer Science

A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics, the reference collection consists of a set of species, with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with k-mer hashing-based pseudoalignment: a read is assigned to species A if each of its k-mer hits to the reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an r-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as the concatenation of the set of species, as well as for each species. Given t species whose concatenation length is n, and whose Burrows-Wheeler transform contains r runs, our first solution, based on a grammar-compressed document array with precomputed queries at nonterminal symbols, reports the frequencies for the ndoc distinct documents in which the pattern of length m occurs in O(m + log(n) ndoc) time. Our second solution is also based on a grammar-compressed document array, but enhanced with bitvectors, and reports the frequencies in O(m + ((t/w) log n + log(n/r)) ndoc) time, over a machine with word size w. Our third solution, based on the interleaved LCP array, answers the same query in O(m + log(n/r) ndoc) time. We implemented our solutions and tested them on real-world and synthetic datasets. The results show that all the solutions are fast on highly repetitive data, and that the size overhead introduced by the indexes is comparable with the size of the r-index.


Tailoring r-index for metagenomics

June 2020

·

13 Reads

A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics, the reference collection consists of a set of species, with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with k-mer hashing-based pseudoalignment: a read is assigned to species A if each of its k-mer hits to the reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an r-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as the concatenation of the set of species, as well as for each species. Given t species whose concatenation length is n, and whose Burrows-Wheeler transform contains r runs, our first solution, based on a grammar-compressed document array with precomputed queries at nonterminal symbols, reports the frequencies for the ndoc distinct documents in which the pattern of length m occurs in O(m + log(n) ndoc) time. Our second solution is also based on a grammar-compressed document array, but enhanced with bitvectors, and reports the frequencies in O(m + ((t/w) log n + log(n/r)) ndoc) time, over a machine with word size w. Our third solution, based on the interleaved LCP array, answers the same query in O(m + log(n/r) ndoc) time. We implemented our solutions and tested them on real-world and synthetic datasets. The results show that all the solutions are fast on highly repetitive data, and that the size overhead introduced by the indexes is comparable with the size of the r-index.
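The assignment rule stated in the abstract (a read goes to species A only if every k-mer hit lands exclusively on strains of A) can be illustrated with a small Python sketch. This is a plain dictionary-based toy, not the r-index-based solutions the paper proposes; the names `build_kmer_index` and `pseudoalign`, the toy sequences, and the choice to ignore k-mers with no hit are all assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(references: dict, k: int) -> dict:
    """Map each k-mer to the set of species whose strains contain it.

    `references` maps a species name to a list of strain sequences.
    """
    index = defaultdict(set)
    for species, strains in references.items():
        for strain in strains:
            for i in range(len(strain) - k + 1):
                index[strain[i:i + k]].add(species)
    return index


def pseudoalign(read: str, index: dict, k: int):
    """Return the species if all k-mer hits of `read` fall on it alone.

    Returns None when the read has no hits or its hits involve more than
    one species. k-mers that hit nothing are ignored (an assumption).
    """
    hit_species = set()
    for i in range(len(read) - k + 1):
        species = index.get(read[i:i + k])
        if species:
            hit_species |= species
    # Exactly one species overall means every hit was exclusive to it.
    return next(iter(hit_species)) if len(hit_species) == 1 else None


references = {"A": ["ACGTACGT", "ACGTTCGT"], "B": ["TTTTGGGG"]}
index = build_kmer_index(references, k=4)
print(pseudoalign("ACGTAC", index, k=4))
print(pseudoalign("ACGTTTTG", index, k=4))
```

A hash table of k-mers takes space proportional to the number of distinct k-mers; the abstract's solutions instead answer the underlying "which documents, and how often" query on top of a compressed index.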


Fast, Small, and Simple Document Listing on Repetitive Text Collections

October 2019

·

28 Reads

·

4 Citations

Lecture Notes in Computer Science

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are not many analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc · lg n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
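As a rough illustration of the document array described above (the suffix array with each entry replaced by the identifier of the document it falls in), here is an uncompressed Python sketch of document listing. It does not implement the grammar compression or the precomputed nonterminal answers from the paper. The names `build_document_array` and `list_documents` are hypothetical, suffixes are sorted naively, and the separator character is assumed not to occur in any document or pattern.

```python
from bisect import bisect_left, bisect_right


def build_document_array(docs: list, sep: str = "\x00"):
    """Return (text, suffix array, document array) for a small collection.

    Naive construction: suffixes are sorted by direct string comparison,
    which is only reasonable for toy inputs.
    """
    text = "".join(doc + sep for doc in docs)
    # Document identifier of every text position (separators included).
    doc_of = []
    for doc_id, doc in enumerate(docs):
        doc_of.extend([doc_id] * (len(doc) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    da = [doc_of[i] for i in sa]  # suffix array coarsened to document ids
    return text, sa, da


def list_documents(pattern: str, text: str, sa: list, da: list) -> list:
    """List the distinct documents containing `pattern`, in sorted order."""
    m = len(pattern)
    # Length-m prefixes of the sorted suffixes are themselves sorted,
    # so the matching suffixes form one contiguous range.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(set(da[lo:hi]))


docs = ["banana", "bandana", "cabana"]
text, sa, da = build_document_array(docs)
print(list_documents("nan", text, sa, da))
print(list_documents("ana", text, sa, da))
```

Scanning the whole range `da[lo:hi]` costs time proportional to the number of occurrences rather than to ndoc; avoiding that scan on a compressed document array is what the index in the abstract is designed for.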


Fast, Small, and Simple Document Listing on Repetitive Text Collections

February 2019

·

9 Reads

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are not many analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc · log n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.

Citations (2)


... Cobas et al. [27] We note that d can be given at query time, and even modified during the query. ...

Reference:

Algorithms and Data Structures for Large-Scale Pangenomics
Tailoring r-index for Document Listing Towards Metagenomics Applications
  • Citing Chapter
  • September 2020

Lecture Notes in Computer Science

... • Finally, in Section 6 we develop a BWT-based variant of the EFG index enhanced with a set of paths P to solve a problem analogous to document listing queries [25], namely, to report which paths present a given pattern as a substring. With this formulation one can, for example, restrict pattern matching along the rows of the original multiple sequence alignment, making the index functionally equivalent to those solving document listing on repetitive genome collections [8,7]. ...

Fast, Small, and Simple Document Listing on Repetitive Text Collections
  • Citing Chapter
  • October 2019

Lecture Notes in Computer Science