ABSTRACT: Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
$O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$
is the size of the alphabet. The size of the stack is $o(n)$ except for very
large values of $\sigma$. We further improve the algorithm by removing its time
dependency on $\sigma$, by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that $\textit{do not occur}$ in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
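As a concrete reference point for what "working directly on the BWT" means, here is a minimal sketch of the Burrows-Wheeler transform itself. The sentinel `$` and the sorted-rotations construction are standard; this naive quadratic-time version is purely illustrative and is not the engineered algorithm of the paper, which would build the BWT via a suffix array in practice.

```python
def bwt(s):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Appends the sentinel '$', assumed smaller than every other character.
    Quadratic time and space; real tools derive the BWT from a suffix array.
    """
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # annb$aa
```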
ABSTRACT: String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the $k$-mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in $O(nd)$ time and
in $o(n)$ bits of space in addition to the input, using just a
$\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the
input strings, which takes $O(d)$ time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical
entropy, and for calibrating the value of $k$ using the data.
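To make the quantity being computed concrete, the following sketch evaluates the exact $k$-mer kernel as the inner product of $k$-mer count vectors. This hashing-based version is an assumption-laden illustration of the definition only: the paper's contribution is computing the same value in $o(n)$ extra bits via a $\mathtt{rangeDistinct}$ structure on the BWT, which this sketch does not attempt.

```python
from collections import Counter

def kmer_kernel(x, y, k):
    """Exact k-mer kernel: inner product of the k-mer count vectors of x and y.

    Straightforward O(|x| + |y|) hashing illustration of the definition;
    succinct BWT-based algorithms compute the same quantity in o(n) extra bits.
    """
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[w] * cy[w] for w in cx if w in cy)

# ACGTACGT has 3-mer counts {ACG:2, CGT:2, GTA:1, TAC:1};
# CGTACG has {CGT:1, GTA:1, TAC:1, ACG:1}; inner product = 6.
print(kmer_kernel("ACGTACGT", "CGTACG", 3))  # 6
```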
ABSTRACT: In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT also enables a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
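The core space measure here is the number of BWT runs. The following minimal sketch shows the run-length encoding that underlies the RLBWT: a list of (character, run length) pairs whose size is proportional to the number of runs $r$ rather than to $n$. The index structures built on top of this encoding are, of course, far more involved.

```python
from itertools import groupby

def rle(bwt_string):
    """Run-length encode a BWT string into (character, run_length) pairs.

    In highly repetitive collections the number of runs r is much smaller
    than n, which is what makes the RLBWT small.
    """
    return [(c, len(list(g))) for c, g in groupby(bwt_string)]

# "annb$aa" is the BWT of "banana$"; it has 5 runs.
print(rle("annb$aa"))  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```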
ABSTRACT: Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|^2) time, improving a previous result that required O(2^{|w|}) time. We then consider the problem of computing the expectation and variance of the motifs of a string s in an IID text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w| log |w|)-time online computation after O(|s|^3) preprocessing of s. Our algorithms lend themselves to efficient implementations.
Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms; 12/2012
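For the simplest special case of the expectation studied above — a solid word (no gaps, one character per position) in an IID text — linearity of expectation gives a closed form: E = (n − |w| + 1) · ∏ p(wᵢ). The sketch below implements only this baseline; it is not the paper's Markov-chain algorithm, and the variance, which must account for overlapping occurrences, is precisely the harder part the paper addresses.

```python
def expected_occurrences(w, n, p):
    """Expected number of occurrences of the solid word w in an IID string
    of length n, where p maps each character to its probability.

    By linearity of expectation over the n - |w| + 1 starting positions:
    E = (n - |w| + 1) * prod(p[c] for c in w).
    """
    if n < len(w):
        return 0.0
    pw = 1.0
    for c in w:
        pw *= p[c]
    return (n - len(w) + 1) * pw

# 3 starting positions, each matching with probability 0.25.
print(expected_occurrences("AB", 4, {"A": 0.5, "B": 0.5}))  # 0.75
```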
ABSTRACT: Patterns with gaps have traditionally been used as signatures of protein families or as features in binary classification. Current alignment-free algorithms construct phylogenies by comparing the repertoire and frequency of ungapped blocks in genomes and proteomes. In this article, we measure the quality of phylogenies reconstructed by comparing suitably defined sets of gapped motifs that occur in mitochondrial proteomes. We study how the quality of the reconstructed phylogenies depends on the density, number of solid characters, and statistical significance of gapped motifs. We consider maximal motifs, as well as some of their compact generators. The average performance of suitably defined sets of gapped motifs is comparable to that of popular string-based alignment-free methods. Extremely long and sparse motifs produce phylogenies of the same or better quality than those produced by short and dense motifs. The best phylogenies are produced by motifs with 3 or 4 solid characters, while further increasing the number of solid characters degrades quality. Discarding motifs with low statistical significance degrades performance as well. In maximal motifs, moving from the smallest basis to bases with higher redundancy leads to better phylogenies.
Journal of computational biology: a journal of computational molecular cell biology 06/2012; 19(7):911-27. DOI:10.1089/cmb.2012.0060
ABSTRACT: Measures of sequence similarity based on some underlying notion of relative compressibility are becoming increasingly of interest in connection with massive text-classification tasks such as, notably, document classification and molecular taxonomy on a genomic scale. Sequences that are similar can be expected to share a large number of common substrings, whence some successful measures in this class have been based on the substring composition of the input sequences. Among the corresponding methods, one finds suitable extensions of the bag-of-words approach, together with more explicit resorts to data compression techniques such as LZ77. The approach presented in this paper explores the potential, in this context, of LZW — the variant of LZ78 proposed by Welch — as well as of some of its lossy variants. LZW has a faster and simpler implementation than LZ77, but the vocabulary underlying LZW is significantly smaller than that of LZ77. In addition, recently introduced "gapped" variants of LZW are considered that are equally straightforward to implement but allow for a controlled number of don't cares to be introduced in the substrings that constitute the dictionary used in compression. This study assesses the robustness of compressibility-based measures of similarity under these faster yet inherently more dispersive paradigms built around LZW.
2011 Data Compression Conference (DCC 2011), 29-31 March 2011, Snowbird, UT, USA; 01/2011
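The vocabulary that composition-based measures compare is the set of phrases LZW builds while parsing. The following sketch implements the classic (ungapped) Welch parse and returns those phrases; the gapped and lossy variants discussed above are not shown, and this is an illustration of the standard scheme rather than the paper's exact setup.

```python
def lzw_phrases(s):
    """Return the phrases emitted by a classic LZW parse of s.

    Greedily extends the longest phrase already in the dictionary by one
    character, then adds the extension as a new dictionary entry. The phrase
    vocabulary (not the code stream) is what similarity measures compare.
    """
    dictionary = {c: i for i, c in enumerate(sorted(set(s)))}
    phrases, current = [], ""
    for c in s:
        if current + c in dictionary:
            current += c
        else:
            phrases.append(current)
            dictionary[current + c] = len(dictionary)
            current = c
    if current:
        phrases.append(current)
    return phrases

print(lzw_phrases("abababab"))  # ['a', 'b', 'ab', 'aba', 'b']
```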
ABSTRACT: The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.
Journal of computational biology: a journal of computational molecular cell biology 08/2010; 17(8):1011-49. DOI:10.1089/cmb.2010.0073
ABSTRACT: Words that appear as constrained subsequences in a text-string are considered as possible indicators of the host string structure, hence also as a possible means of sequence comparison and classification. The constraint consists of imposing a bound on the number ω of positions in the text that may intervene between any two consecutive characters of a subsequence. A subset of such ω-sequences is then characterized that consists, in intuitive terms, of sequences that could not be enriched with more characters without losing some occurrence in the text. A compact spatial representation is then proposed for these representative sequences, within which a number of parameters can be defined and measured. In the final part of the paper, such parameters are empirically analyzed on a small collection of text-strings endowed with various degrees of structure.
ABSTRACT: The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Previous studies have consistently exposed a tenacious lack of compressibility on behalf of biosequences. This leaves open the question of what distinguishes them from random strings, the latter being clearly unpalatable to the living cell. This paper assesses the randomness of biosequences in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. Results from experiments show the potential of the method in distinguishing a protein sequence from its random reshuffling, as well as in tasks of classification and clustering.
Data Compression Conference, 2009. DCC '09.; 04/2009
ABSTRACT: Saturated patterns with don't cares, like those that emerged in biosequence motif discovery, have also proven a valuable notion in the design of lossless and lossy compression of sequence data. In independent endeavors, the peculiarities inherent to the compression of tables have been examined, leading to compression schemata advantageously hinged on a prudent rearrangement of columns. The present paper introduces off-line table compression by textual substitution, in which the patterns used in compression are chosen among models or templates that capture recurrent record subfields. A model record is to be interpreted here as a sequence of intermixed solid and don't care characters that obeys, in addition, some conditions of saturation: most notably, it must not be possible to replace a don't care in the model by a solid character without having to forfeit some of its occurrences in the table. Saturation is expected to save on the size of the codebook at the outset, and hence to improve compression. It also induces some clustering of the records in the table, which may present independent interest. Results from preliminary experiments show the savings and potential for classification brought about by this method in connection with a table of specimens collected in a context of biodiversity studies.
Data Compression Conference, 2008. DCC 2008; 04/2008
ABSTRACT: We describe succinct and compact representations of the bidirectional BWT of a string s ∈ Σ* which provide increasing navigation power and a number of space-time tradeoffs. One such representation allows a substring of s to be extended by one character from the left and from the right in constant time, taking O(|s| log |Σ|) bits of space. We then match the functions supported by each representation to a number of algorithms that traverse the nodes of the suffix tree of s, exploiting connections between the BWT and the suffix-link tree. This results in near-linear time algorithms for many sequence analysis problems (e.g., maximal unique matches), for the first time in succinct space.
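The basic operation behind such traversals is extending a pattern's BWT interval one character to the left (backward search). The sketch below shows that step on a plain BWT string, with rank and the C array computed naively; the point of the representations described above is to support exactly these updates in constant time within succinct space, which this illustration makes no attempt to do.

```python
def backward_search(bwt_string, pattern):
    """Count occurrences of pattern via backward search on a plain BWT.

    Maintains the interval [lo, hi) of suffixes prefixed by the current
    (growing) suffix of the pattern, extending one character to the left
    per step: lo = C[c] + rank(c, lo), hi = C[c] + rank(c, hi).
    Naive O(n)-per-query rank; succinct structures do this in O(1).
    """
    # C[c] = number of characters in the BWT strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt_string)):
        C[c] = total
        total += bwt_string.count(c)
    rank = lambda c, i: bwt_string[:i].count(c)  # occurrences of c in bwt[:i]
    lo, hi = 0, len(bwt_string)  # interval of all suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

# "annb$aa" is the BWT of "banana$"; "ana" occurs twice in "banana".
print(backward_search("annb$aa", "ana"))  # 2
```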