Publications (11) · 4.13 Total Impact
ABSTRACT: Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate underrepresented strings that $\textit{do not occur}$ in the string. Our algorithms are practical and work directly on the BWT, so they can be applied immediately to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
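The algorithm above works directly on the Burrows-Wheeler transform. As a point of reference only (not the paper's construction, which is succinct and does not materialize rotations), a toy BWT can be computed by sorting all rotations of the string:

```python
def bwt(s, sentinel="$"):
    """Toy Burrows-Wheeler transform: the last column of the sorted
    rotation matrix. This is O(n^2 log n); practical tools derive the
    BWT from a suffix array in linear time."""
    s += sentinel  # unique terminator, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # annb$aa
```

The runs of equal characters that the transform tends to create are what makes BWT-based data structures small on repetitive inputs.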
ABSTRACT: String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substring kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data.
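For concreteness, the simplest member of this family, the $k$-mer (spectrum) kernel, is the inner product of the two $k$-mer count vectors. A hash-map sketch of the quantity being computed (the paper obtains the same value in $O(nd)$ time and $o(n)$ extra bits on the BWT, not this way):

```python
from collections import Counter

def kmer_kernel(x, y, k):
    """k-mer (spectrum) kernel: inner product of k-mer count vectors.

    Hash-map illustration only; takes O(|x| + |y|) words of space,
    versus the o(n)-bit BWT-based computation of the paper.
    """
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())

print(kmer_kernel("abab", "bab", 2))  # 3
```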
ABSTRACT: In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting the RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size again depends on the number of extensions of maximal repeats, and which is powerful enough to support matching statistics and constant-space traversal.
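The RLBWT component above stores one entry per BWT run, so its size tracks the number of runs rather than the text length. A minimal sketch of just the run-length encoding step (not the paper's full index, which adds rank/select machinery on top):

```python
from itertools import groupby

def run_length_encode(bwt):
    """Run-length encode a BWT string into (character, run_length) pairs.

    On repetitive texts the BWT has few runs, so this list is much
    shorter than the string itself.
    """
    return [(c, sum(1 for _ in group)) for c, group in groupby(bwt)]

print(run_length_encode("annb$aa"))
# [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```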
Conference Paper: Faster variance computation for patterns with gaps
ABSTRACT: Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in $O(|w|^2)$ time, improving a previous result that required $O(2^{|w|})$ time. We then consider the problem of computing the expectation and variance of the motifs of a string s in an IID text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to $o(|w|\log|w|)$-time online computation after $O(|s|^3)$ preprocessing of s. Our algorithms lend themselves to efficient implementations.
Proceedings of the First Mediterranean Conference on Design and Analysis of Algorithms; 12/2012
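As an illustration of the quantity being computed, the expectation of a rigid gapped pattern in an IID string has a simple closed form; this is a sketch of that formula only, not the paper's algorithm (the hard part, which the paper speeds up, is the variance with its overlap terms):

```python
def expected_occurrences(pattern, probs, n):
    """Expected number of occurrences of a rigid gapped pattern in an
    IID random string of length n.

    pattern: list of sets of allowed characters, one set per position
    (a set equal to the whole alphabet models a don't care);
    probs: character -> probability.
    E[X] = (n - m + 1) * prod_i sum_{c in class_i} probs[c], m = len(pattern).
    """
    p_match = 1.0
    for allowed in pattern:
        p_match *= sum(probs[c] for c in allowed)
    return (n - len(pattern) + 1) * p_match

# Uniform DNA model, pattern A?T (middle position is a don't care):
probs = {c: 0.25 for c in "ACGT"}
print(expected_occurrences([{"A"}, set("ACGT"), {"T"}], probs, 10))  # 0.5
```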
ABSTRACT: Patterns with gaps have traditionally been used as signatures of protein families or as features in binary classification. Current alignment-free algorithms construct phylogenies by comparing the repertoire and frequency of ungapped blocks in genomes and proteomes. In this article, we measure the quality of phylogenies reconstructed by comparing suitably defined sets of gapped motifs that occur in mitochondrial proteomes. We study the dependence between the quality of reconstructed phylogenies and the density, number of solid characters, and statistical significance of gapped motifs. We consider maximal motifs, as well as some of their compact generators. The average performance of suitably defined sets of gapped motifs is comparable to that of popular string-based alignment-free methods. Extremely long and sparse motifs produce phylogenies of the same or better quality than those produced by short and dense motifs. The best phylogenies are produced by motifs with 3 or 4 solid characters, while increasing the number of solid characters degrades phylogenies. Discarding motifs with low statistical significance degrades performance as well. In maximal motifs, moving from the smallest basis to bases with higher redundancy leads to better phylogenies.
Journal of Computational Biology 06/2012; 19(7):911–927. DOI:10.1089/cmb.2012.0060 · 1.74 Impact Factor
Conference Paper: Sequence Similarity by Gapped LZW.
ABSTRACT: Measures of sequence similarity based on some underlying notion of relative compressibility are becoming increasingly of interest in connection with massive tasks of text-file classification such as, notably, document classification and molecular taxonomy on a genomic scale. Sequences that are similar can be expected to share a large number of common substrings, whence some successful measures in this class have been based on the substring composition of the input sequences. Among the corresponding methods, one finds suitable extensions of bag-of-words together with more explicit resorts to data compression techniques such as LZ77. The approach presented in this paper explores the potential of LZW – the variant of LZ78 proposed by Welch – as well as of some of its lossy variants, in this context. Whereas LZW has a faster and simpler implementation than LZ77, the vocabulary underlying LZW is significantly smaller than that of LZ77. In addition, recently introduced "gapped" variants of LZW are considered that are equally straightforward to implement but allow a controlled number of don't cares to be introduced in the substrings that constitute the dictionary used in compression. This study assesses the robustness of compressibility-based measures of similarity under these faster yet inherently more dispersive paradigms built around LZW.
2011 Data Compression Conference (DCC 2011), 29–31 March 2011, Snowbird, UT, USA; 01/2011
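To make the underlying vocabulary concrete, here is a minimal greedy LZW-style parse (the classic ungapped variant, not the gapped one studied in the paper); similarity measures in this family compare the phrase dictionaries that two sequences induce:

```python
def lzw_parse(s):
    """Greedy LZW parse of a non-empty string: repeatedly take the
    longest phrase already in the dictionary, then add that phrase
    extended by one character. Returns the list of phrases."""
    dictionary = set(s)  # initial dictionary: single characters
    phrases, current = [], ""
    for c in s:
        if current + c in dictionary:
            current += c          # extend the current match
        else:
            phrases.append(current)
            dictionary.add(current + c)  # grow the dictionary
            current = c
    phrases.append(current)
    return phrases

print(lzw_parse("abababa"))  # ['a', 'b', 'ab', 'aba']
```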
ABSTRACT: The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.
Journal of Computational Biology 08/2010; 17(8):1011–1049. DOI:10.1089/cmb.2010.0073 · 1.74 Impact Factor
ABSTRACT: Words that appear as constrained subsequences in a text string are considered as possible indicators of the host string structure, hence also as a possible means of sequence comparison and classification. The constraint consists of imposing a bound on the number ω of positions in the text that may intervene between any two consecutive characters of a subsequence. A subset of such ω-sequences is then characterized that consists, in intuitive terms, of sequences that could not be enriched with more characters without losing some occurrence in the text. A compact spatial representation is then proposed for these representative sequences, within which a number of parameters can be defined and measured. In the final part of the paper, such parameters are empirically analyzed on a small collection of text strings endowed with various degrees of structure.
Theoretical Computer Science 10/2009; 410(43):4360–4371. DOI:10.1016/j.tcs.2009.07.017 · 0.66 Impact Factor
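A small sketch of the ω-constraint itself (a hypothetical helper for illustration, not taken from the paper): testing whether a word occurs in a text as a subsequence with at most ω intervening positions between consecutive matched characters can be done by propagating the set of feasible end positions, since greedy leftmost matching is not sound under a window constraint:

```python
def is_omega_subsequence(word, text, omega):
    """Does non-empty `word` occur in `text` as a subsequence with at
    most `omega` text positions between consecutive matched characters?

    Tracks every text position where the matched prefix can end;
    O(|word| * |text| * omega) time, for illustration only.
    """
    ends = {i for i, c in enumerate(text) if c == word[0]}
    for ch in word[1:]:
        ends = {j
                for p in ends
                for j in range(p + 1, min(p + omega + 2, len(text)))
                if text[j] == ch}
        if not ends:
            return False
    return bool(ends)

print(is_omega_subsequence("ab", "aab", 0))  # True (match at positions 1, 2)
print(is_omega_subsequence("ab", "acb", 0))  # False ('c' intervenes)
```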
Conference Paper: Probing the Randomness of Proteins by Their Subsequence Composition
ABSTRACT: The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Previous studies have consistently exposed a tenacious lack of compressibility on behalf of biosequences. This leaves the question open as to what distinguishes them from random strings, the latter being clearly unpalatable to the living cell. This paper assesses the randomness of biosequences in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. Results from experiments show the potential of the method in distinguishing a protein sequence from its random reshuffling, as well as in tasks of classification and clustering.
Data Compression Conference (DCC '09); 04/2009
Conference Paper: Table Compression by Record Intersections
ABSTRACT: Saturated patterns with don't cares, like those that emerged in biosequence motif discovery, have proven a valuable notion also in the design of lossless and lossy compression of sequence data. In independent endeavors, the peculiarities inherent to the compression of tables have been examined, leading to compression schemata advantageously hinged on a prudent rearrangement of columns. The present paper introduces offline table compression by textual substitution in which the patterns used in compression are chosen among models or templates that capture recurrent record subfields. A model record is to be interpreted here as a sequence of intermixed solid and don't-care characters that obeys, in addition, some conditions of saturation: most notably, it must not be possible to replace a don't care in the model by a solid character without having to forfeit some of its occurrences in the table. Saturation is expected to save on the size of the codebook at the outset, and hence to improve compression. It also induces some clustering of the records in the table, which may be of independent interest. Results from preliminary experiments show the savings and the potential for classification brought about by this method in connection with a table of specimens collected in a context of biodiversity studies.
Data Compression Conference (DCC 2008); 04/2008
ABSTRACT: We describe succinct and compact representations of the bidirectional BWT of a string s ∈ Σ* which provide increasing navigation power and a number of space-time tradeoffs. One such representation makes it possible to extend a substring of s by one character from the left and from the right in constant time, taking O(|s| log |Σ|) bits of space. We then match the functions supported by each representation to a number of algorithms that traverse the nodes of the suffix tree of s, exploiting connections between the BWT and the suffix-link tree. This results in near-linear time algorithms for many sequence analysis problems (e.g., maximal unique matches), for the first time in succinct space.
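As a reference point for the primitive being generalized, classic (unidirectional) backward search on the BWT extends a match by one character to the left per step; the representations above also support the mirror right extension. A toy sketch, with the rank computation done naively where a real index would use a succinct rank structure:

```python
from bisect import bisect_left

def backward_search(bwt, pattern):
    """Count occurrences of `pattern` via backward search on a BWT.

    Toy version: lf() rescans a prefix of the BWT on every call,
    so each step is O(n) instead of the O(1) of a real index.
    """
    first = sorted(bwt)  # first column of the sorted-rotation matrix
    def lf(i, c):
        # C[c] (start of c's block in the first column) + rank of c in bwt[:i]
        return bisect_left(first, c) + bwt[:i].count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # prepend one character per step
        lo, hi = lf(lo, c), lf(hi, c)
        if lo >= hi:
            return 0
    return hi - lo

# The BWT of "banana$" is "annb$aa"; "ana" occurs twice in "banana".
print(backward_search("annb$aa", "ana"))  # 2
```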
Publication Stats
22 Citations
4.13 Total Impact Points
Institutions

2015
University of Helsinki, Department of Computer Science
Helsinki, Uusimaa, Finland

2008–2012
Georgia Institute of Technology, College of Computing
Atlanta, Georgia, United States