Article

# Phase transitions in sequence matches and nucleic acid structure

## Abstract

Analyses of phase transitions in biopolymers have previously been restricted to studies of average behavior along macromolecules. Extremal properties, such as the longest helical region, can now be studied with a new family of probability distributions [Arratia, R., Gordon, L. & Waterman, M. S. (1986) Ann. Stat. 14, 971-993]. Not only is such extremal behavior analyzed with great precision, but new phase transitions are determined. One phase transition occurs when the behavior of the free energy of the longest helical region abruptly changes from proportional to the logarithm of sequence length to proportional to sequence length. The annealing of two single-stranded molecules and the melting of a double helix are both considered. These results, initially suggested by studies of optimal matching of random DNA sequences [Smith, T. F., Waterman, M. S. & Burks, C. (1985) Nucleic Acids Res. 13, 645-656], also have importance for significance tests in comparison of nucleic acid or protein sequences.

... The aim of the LCS problem is to find the longest of them. This problem and its variants have been widely studied in biology [7][8][9][10], computer science [11][12][13][14], probability theory [16][17][18][19][20][21] and more recently in statistical physics [15][22][23][24][25]. ...
... Starting from, say, the left ends of the chains (see figure 1), we find the first actually existing contact between the monomers $i$ (of the first chain) and $j$ (of the second chain) and sum over all possible arrangements of this first contact. The first term '1' in (9) corresponds to the arrangement with no contacts at all. The entries $\beta_{i,j}$ ($1 \le i \le m$, $1 \le j \le n$) are the statistical weights of bonds; they are encoded in a contact map {β}: ...
... The initial conditions for $F_{m,n}$ are transformed into $F_{0,n} = F_{n,0} = F_{0,0} = 0$. Note that the model of heteropolymer binding described above is an auxiliary one and bears only a vague resemblance to the formation of real polymer-polymer complexes of linear geometry (one can think of the formation of a double-stranded DNA as, probably, the most familiar example). Indeed, in the partition function described by (9), a series of important features of real-life DNAs are neglected, namely (i) the 'loop factors', i.e. the entropic penalty in the partition function due to forcing ends of side-loops (see figure 1) to meet again; (ii) the cooperativity of bond formation, meaning that it is much easier to form a bond if there is another one between two adjacent monomers; (iii) the restriction on the minimal loop size, which takes into account the finite flexibility of polymer chains; (iv) the fact that different matches (i.e. A-A versus B-B) can have different energies. ...
Article
A new statistical approach to alignment (finding the longest common subsequence) of two random RNA-type sequences is proposed. We have constructed a generalized 'dynamic programming' algorithm for finding the extreme value of the free energy of two noncoding RNAs. In our procedure, we take into account the binding free energy of two random heteropolymer chains which are capable of forming the cloverleaf-like spatial structures typical for RNA molecules. The algorithm is based on two observations: (i) the standard alignment problem can be considered as a zero-temperature limit of a more general statistical problem of binding of two associating heteropolymer chains; (ii) this last problem can be generalized naturally to consider sequences with hierarchical cloverleaf-like structures (i.e. of RNA type). The approach also permits us to perform a 'secondary structure recovery'. Namely, we can predict the optimal secondary structures of interacting RNAs in a zero-temperature limit knowing only their primary sequences.
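The abstract above treats standard alignment (finding the longest common subsequence) as the zero-temperature limit of a binding problem. As a point of reference only, the textbook LCS recursion can be sketched as follows; this is the standard dynamic program, not the authors' generalized RNA-type algorithm, and the function name `lcs_length` is a hypothetical choice for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of the prefixes a[:i] and b[:j];
    # row 0 and column 0 correspond to an empty prefix and stay 0.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # Matching letters extend the best alignment of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise skip one letter from either sequence.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]
```

The table has (m+1)(n+1) entries, so time and memory both scale as O(mn).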
... We focus on the simplest version of the Smith-Waterman algorithm with linear gap cost. Our calculation reproduces respectively the exact and conjectured result for very large and small gap costs; the latter yields directly (Waterman et al., 1987) the important Chvátal-Sankoff constant of the longest-common-subsequence problem (Chvátal and Sankoff, 1975). Our results are not exact for intermediate values of gap costs; however, a comparison with numerical estimates for random nucleotide sequences indicates that our approximate result is off by only several percent in the worst case. ...
... It is well known from the study by Waterman et al. (1987) and Arratia and Waterman (1994) that gapped local alignment of random sequences exhibits a log-linear phase transition along a critical line in the space of scoring parameters. Below the critical line, the ensemble-averaged quantity depends linearly on the sequence length N, while above it the dependence on N is logarithmic. Here the average is taken over the whole ensemble of randomly drawn sequences. ...
... conjectured first by R. Arratia (unpublished; see also Steele (1986)). Through a simple relation between a 0 and c ( 0 ) pointed out first by Waterman et al. (1987), our work gives an indirect calculation of the Chvátal-Sankoff constant. ...
Article
A detailed analytic study of the log-linear phase transition of the Smith–Waterman local alignment algorithm is presented. A rectangular alignment lattice is introduced to facilitate the statistical analysis for alignment with gaps. With a few simplifying assumptions, we obtain an analytic expression for the loci of the phase transition line. Our result reproduces the exact and conjectured values for the very large and very small gap costs; the latter corresponds to the related problem of the longest common subsequence. For intermediate values of gap costs, our result is not exact, although a comparison to numerical results yielded a difference of no more than several percent.
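For readers who want to probe the log-linear transition numerically, the following is a minimal sketch of the Smith-Waterman local alignment score with a linear gap cost. The function name and the default values of `match`, `mismatch` and `gap` are illustrative assumptions, not parameters taken from the paper.

```python
def local_alignment_score(a, b, match=1.0, mismatch=1.0, gap=2.0):
    """Best local alignment score with linear gap cost (Smith-Waterman recursion)."""
    prev = [0.0] * (len(b) + 1)   # scores for the previous row of the lattice
    best = 0.0
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else -mismatch
            cur[j] = max(
                0.0,                 # start a fresh local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] - gap,       # gap in b
                cur[j - 1] - gap,    # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Running this on pairs of random sequences of increasing length, once with small and once with large penalties, is one way to observe the two growth regimes described in the abstract.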
... The problem of finding the LCS of a pair of linear sequences drawn from the alphabet of c letters is formulated as follows. Consider two sequences α = in biology [74,75,76,77], computer science [78,79,80,81], probability theory [82,83,84,85,86,87] and more recently in statistical physics [88,89,90,42]. ...
... where the dependence of the height on the exponent in the distribution has a well-defined maximum. 76 ...
Article
Complex organization is found in many biological systems. For example, biopolymers can possess a highly hierarchical structure, which underlies their functional specificity. Understanding such complex organization allows one to describe biological phenomena and to predict molecular functions. In addition, a specific phenomenon can be characterized by probabilistic quantities (variances, means, etc.), assuming that the primary biopolymer structure is formed randomly according to some statistical distribution. Such a formulation is oriented toward evolutionary problems. Artificially constructed biological networks are another common object of statistical physics with rich functional properties. The behavior of cells is a consequence of complex interactions between their numerous components, such as DNA, RNA, proteins and small molecules. Cells use signaling pathways and regulatory mechanisms to coordinate multiple processes, allowing them to respond and adapt to a changing environment. Recent theoretical advances allow us to describe cellular network structure using graph concepts to reveal the principal organizational features shared with numerous non-biological networks. The aim of this thesis is to develop a set of methods for studying statistical and dynamic objects of complex architecture and, in particular, scale-free structures, which have no characteristic spatial and/or time scale. For such systems, the use of standard mathematical methods, relying on the average behavior of the whole system, is often incorrect or useless, while a detailed many-body description is almost hopeless because of the combinatorial complexity of the problem. Here we focus on two problems. The first part addresses the statistical analysis of random biopolymers. Apart from the evolutionary context, our studies cover more general problems of planar topology that appear in the description of various systems, ranging from gauge theory to biophysics.
We investigate analytically and numerically a phase transition of a generic planar matching problem, from the regime where almost all the vertices are paired to the situation where a finite fraction of them remains unmatched. The second part of this work focuses on statistical properties of networks. We demonstrate the possibility of defining co-expression gene clusters within a network context from their specific motif distribution signatures. We also show how a method based on the shortest path function (SPF) can be applied to gene interaction sub-networks of co-expression gene clusters to efficiently predict novel regulatory transcription factors (TFs). The biological significance of this method is demonstrated by applying it to groups of genes with a shared regulatory locus, found by genetic genomics. Finally, we discuss the formation of stable patterns of motifs in networks under selective evolution in the context of the creation of islands of "superfamilies".
... Figure 1A and C also include examples where low gap penalties were used. When gap penalties are too low, alignments shift from local to global and the extreme value statistics no longer apply (Waterman et al., 1987; Mott, 1992; Altschul & Gish, 1996). In contrast, very high gap penalties simply produce fewer alignments with gaps and move the algorithm towards the BLAST HSP model, where the extreme value distribution was first shown to apply (Karlin & Altschul, 1990; Altschul & Gish, 1996). ...
Article
The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity scores for protein/protein, DNA/DNA, and protein/translated-DNA comparisons. The accuracy of the statistical estimates is summarized for 54 protein families using FASTA and Smith-Waterman scores. Probability estimates calculated from the distribution of similarity scores are generally conservative, as are probabilities calculated using the Altschul-Gish lambda, kappa, and eta parameters. The performance of several alternative methods for correcting similarity scores for library-sequence length was evaluated using 54 protein superfamilies from the PIR39 database and 110 protein families from the Prosite/SwissProt rel. 34 database. Both regression-scaled and Altschul-Gish scaled scores perform significantly better than unscaled Smith-Waterman or FASTA similarity scores. When the Prosite/ SwissProt test set is used, regression-scaled scores perform slightly better; when the PIR database is used, Altschul-Gish scaled scores perform best. Thus, length-corrected similarity scores improve the sensitivity of database searches. Statistical parameters that are derived from the distribution of similarity scores from the thousands of unrelated sequences typically encountered in a database search provide accurate estimates of statistical significance that can be used to infer sequence homology.
... Differential weighting of residues due to position should also be considered. For example, given the constraints of the alignment, if the probability of chance occurrence of three identical, but dispersed, residues is greater than that of three or more contiguous identical residues [75], should the values assigned to the residues of the contiguous cluster be higher than those of the dispersed residues? The use of the multiple alignment program with the modification described above has suggested another correction that could aid in a better representation of relationships reflecting evolution. ...
Article
The eukaryotic retroid family is comprised of retroviruses, retrotransposons, two classes of DNA viruses (caulimoviruses of plants and hepadnaviruses of animals), retroposons, and cellular organelle group II introns and plasmids. The presence of reverse transcriptase-like sequences in these various genetic elements defines membership in this family. Retroid reverse transcriptase sequence similarities clearly indicate a shared ancestor for this gene. Generally, different retroid lineages conserve retroviral-like gene content, order and homologue similarity. The retroposons and group II introns and plasmids are an exception to such conservation, suggesting that recombination and/or independent gene assortment has played a role in the evolution of this retroid lineage. The multiple sequence analysis of the retroid proteins indicates the current limits of methods used in deduction of distant sequence relationships. Subsequent manual refinement of the data suggests several areas of future computational development that will be necessary to adequately assess the most distant of protein relationships, as the sequence database increases.
... If G is too small, then the algorithm finds many false matches and its statistical behaviour changes from a logarithmic to linear dependence on sequence length (Waterman et al., 1987). ...
Article
Motivation: Extra useful information can be extracted from a DNA chromatogram trace, over that contained in the base-called DNA sequence. Many sequencing applications can benefit from examination of these traces. Results: An algorithm, based on dynamic programming, for aligning a DNA chromatogram to a DNA sequence is described and implemented. Its applications to vector clipping, EST alignment and mutation detection are discussed.
... A sharp transition from cubic to quadratic time complexity occurs near 57% A + C. This transition is very reminiscent of the phase transition behavior of local alignment scoring systems described in Ref. 23. ...
Article
Commonly used RNA folding programs compute the minimum free energy structure of a sequence under the pseudoknot exclusion constraint. They are based on Zuker's algorithm, which runs in time $O(n^3)$. Recently, it has been claimed that RNA folding can be achieved in average time $O(n^2)$ using a sparsification technique. A proof of quadratic time complexity was based on the assumption that computational RNA folding obeys the "polymer-zeta property". Several variants of sparse RNA folding algorithms were later developed. Here, we present our own version, which is readily applicable to existing RNA folding programs, as it is extremely simple and does not require any new data structure. We applied it to the widely used Vienna RNAfold program, to create sibRNAfold, the first public sparsified version of a standard RNA folding program. To gain a better understanding of the time complexity of sparsified RNA folding in general, we carried out a thorough run time analysis with synthetic random sequences, both in the context of energy minimization and base pairing maximization. Contrary to previous claims, the asymptotic time complexity of a sparsified RNA folding algorithm using standard energy parameters remains $O(n^3)$ under a wide variety of conditions. Consistent with our run-time analysis, we found that RNA folding does not obey the "polymer-zeta property" as claimed previously. Yet, a basic version of a sparsified RNA folding algorithm provides 15- to 50-fold speed gain. Surprisingly, the same sparsification technique has a different effect when applied to base pairing optimization. There, its asymptotic running time complexity appears to be either quadratic or cubic depending on the base composition. The code used in this work is available at: .
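To make the "base pairing maximization" setting concrete, here is a minimal sketch of the classic unsparsified, Nussinov-style dynamic program, which runs in cubic time. It is not the sibRNAfold code and does not implement sparsification; the function name, the allowed pair set (Watson-Crick plus G-U wobble) and the `min_loop` parameter are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP, O(n^3))."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: position j stays unpaired
            for k in range(i, j - min_loop):    # case 2: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

The inner loop over `k` is the source of the cubic cost; sparsified variants aim to skip most of those candidates.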
... In any particular case, the adequacy of the choice of the few specified degrees of freedom must be evaluated empirically: major differences with the predictions suggest that additional degrees of freedom must be specified, i.e., for the purposes at hand they are not behaving randomly enough. These considerations have led to a number of studies of large-scale behavior of various problems based on specifications involving a small number of parameters [4, 19, 17, 37, 39] as well as applications of such results [15, 18]. However, these studies have all focussed on the superficial description of a problem and not on its deep structure. ...
Article
We introduce a technique for analyzing the behavior of sophisticated AI search programs working on realistic, large-scale problems. This approach allows us to predict where, in a space of problem instances, the hardest problems are to be found and where the fluctuations in difficulty are greatest. Our key insight is to shift emphasis from modelling sophisticated algorithms directly to modelling a search space that captures their principal effects. We compare our model's predictions with actual data on real problems obtained independently and show that the agreement is quite good. By systematically relaxing our underlying modelling assumptions we identify their relative contribution to the remaining error and then remedy it. We also discuss further applications of our model and suggest how this type of analysis can be generalized to other kinds of AI problems.
... During optimization, we sweep through the parameter space, but avoid false positive errors from overpredicting the frequency of C insertions by accepting only parameter choices in which about one in every 25 bases is edited on average. Besides, we disallow combinations of the parameters that drive the alignment with random sequences into the linear regime (Waterman et al., 1987), where the maximal alignment score does not follow the Gumbel distribution in order to be able to later assess the statistical significance of our hits. As a result, the parameter values α = 8, β = 14 and γ = 2 give the most accurate prediction of amino acids and editing sites. ...
Article
Insertional RNA editing renders gene prediction very difficult compared to organisms without such RNA editing. A case in point is the mitochondrial genome of Physarum polycephalum, in which only about one-third of the number of genes that are to be expected given its length are annotated. Thus, gene prediction methods that explicitly take into account insertional editing are needed for successful annotation of such genomes. We annotate the mitochondrial genome of P. polycephalum using several different approaches for gene prediction in organisms with insertional RNA editing. We computationally validate our annotations by comparing the results from different methods against each other and, as proof of concept, experimentally validate two of the newly predicted genes. We more than double the number of annotated putative genes in this organism and find several intriguing candidate genes that are not expected in a mitochondrial genome. The C source code of the programs described here is available upon request from the corresponding author.
... The use of statistical mechanics has been applied in a variety of situations to understand the generic behavior of combinatorial search problems [19]. These include phase transitions due to pruning in heuristic search [26], models of associative memory [29,51], automatic planning [5], optimization problems [6,64] and various cases of pattern matching [58,23,17,33]. Here we will focus on the recent studies that relate search difficulty to the structure of constraint satisfaction problems [6,9,15,32,39,47,59,60], with information on these results also available via the World Wide Web [21]. ...
Article
The statistical mechanics of combinatorial search problems is described using the example of the well-known NP-complete graph coloring problem. We focus on a recently identified phase transition from under- to overconstrained problems, near which are concentrated many hard to solve search problems. Thus, a readily computed measure of problem structure predicts the difficulty of solving the problem, on average. However, this prediction is associated with a large variance and depends on the somewhat arbitrary choice of the problem ensemble. Thus these results are of limited direct use for individual instances. To help address this limitation, additional parameters, describing problem structure as well as heuristic effectiveness, are introduced. This also highlights the distinction between the statistical mechanics of combinatorial search problems, with their exponentially large search spaces, and physical systems, whose interactions are often governed by a simple euclidean metric.
... Here we extend the analysis to local alignment algorithms [30] which find the best match between contiguous subsequences, subject to (finite) penalties for gaps and mismatches. For uncorrelated random sequences, i.e., for independent sequences with iid or Markov letters, it is well known that depending on the choice of scoring parameters, the length of the optimal subsequence alignment depends either linearly or logarithmically on the total length of the sequences [34,5]. A phase transition line separates the space of scoring parameters into two regimes: the "linear phase" for small gap and mismatch costs, and the "log phase" for large penalty costs. ...
Conference Paper
The statistical properties of local alignment algorithms with gaps are analyzed theoretically for uncorrelated and correlated DNA sequences. In the vicinity of the log-linear phase transition, the statistics of alignment with gaps is shown to be characteristically different from that of gapless alignment. The optimal scores obtained for uncorrelated sequences obey certain robust scaling laws. Deviation from these scaling laws signals sequence homology, and can be used to guide the empirical selection of scoring parameters for the optimal detection of sequence similarities. This can be accomplished in a computationally efficient way by using a novel approach focusing on the score landscape. Furthermore, by assuming a few gross features characterizing the statistics of underlying sequence-sequence correlations, quantitative criteria are obtained for the choice of optimal scoring parameters: Optimal similarity detection is most likely to occur in a region close to the log side of the log-linear phase transition.
... The study of alignment using similarity scoring has shown that for ungapped local alignments of randomly generated sequences, the parameter space for match and mismatch weights is divided into logarithmic and linear regions. In the logarithmic region, the parameters produce alignment scores proportional to the logarithm of sequence lengths whereas in the linear region, the scores are directly proportional to the sequence lengths [29]. It is generally accepted that weight combinations which fall within the logarithmic region are useful for detecting biologically related sequences, whereas those in the linear region do not distinguish between related and unrelated sequences. ...
Conference Paper
In this paper, we develop a new approach for analyzing DNA sequences in order to detect regions with similar nucleotide composition. Our algorithm, which we call composition alignment or, more whimsically, scrambled alignment, employs the mechanisms of string matching and string comparison yet avoids the overdependence of those methods on position-by-position matching. In composition alignment, we extend the matching concept to composition matching. Two strings have a composition match if their lengths are equal and they have the same nucleotide content. We define the composition alignment problem and give a dynamic programming solution. We explore several composition match weighting functions and show that composition alignment with one class of these can be computed in O(nm) time, the same as for standard alignment. We discuss statistical properties of composition alignment scores and demonstrate the ability of the algorithm to detect regions of similar composition in eukaryotic promoter sequences in the absence of detectable similarity through standard alignment.
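The definition of a composition match given in the abstract (equal lengths and the same nucleotide content) is simple enough to state in code. The sketch below checks only that definition; the function name is hypothetical, and the paper's dynamic programming solution and weighting functions are not reproduced here.

```python
from collections import Counter


def is_composition_match(s: str, t: str) -> bool:
    """True if s and t have equal length and identical letter counts."""
    # Comparing the multisets of letters ignores position entirely,
    # which is the point of composition (scrambled) matching.
    return len(s) == len(t) and Counter(s) == Counter(t)
```

For example, `ACGT` and `TGCA` form a composition match even though no position agrees, whereas `AACG` and `ACGG` do not.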
... Nimblescan and Signalmap were used to prepare the ChIP-chip data figures. The final Nbs1 peak sets (Suppl Tables S1, S2) were obtained using Tamalpais (Bieda et al., 2006;Waterman et al., 1987) modified to work for 2.1 million probe arrays. In short, the number of consecutive probes above threshold values required for peak calling was increased to account for the increased number of probes, as compared to the 384,000 probes for which the software was designed. ...
Article
After immunization or infection, activation-induced cytidine deaminase (AID) initiates diversification of immunoglobulin (Ig) genes in B cells, introducing mutations within the antigen-binding V regions (somatic hypermutation, SHM) and double-strand DNA breaks (DSBs) into switch (S) regions, leading to antibody class switch recombination (CSR). We asked if, during B cell activation, AID also induces DNA breaks at genes other than IgH genes. Using a nonbiased genome-wide approach, we have identified hundreds of reproducible, AID-dependent DSBs in mouse splenic B cells shortly after induction of CSR in culture. Most interestingly, AID induces DSBs at sites syntenic with sites of translocations, deletions, and amplifications found in human B cell lymphomas, including within the oncogene B cell lymphoma11a (bcl11a)/evi9. Unlike AID-induced DSBs in Ig genes, genome-wide AID-dependent DSBs are not restricted to transcribed regions and frequently occur within repeated sequence elements, including CA repeats, non-CA tandem repeats, and SINEs.
... Unfortunately, no good statistical theory yet exists to permit the relaxation of this restriction. Certain results suggest, however, that the general spirit of our analysis should apply as well to alignments with gaps (Smith et al. 1985; Waterman et al. 1987; Mott 1992). Even when gaps are allowed, however, it should be understood that alignment scores based on simple substitution matrices, such as those studied here, exclude many potential sources of information concerning biological relatedness. ...
Article
Protein sequence alignments generally are constructed with the aid of a "substitution matrix" that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a "log-odds" matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
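The "log-odds" form mentioned in the abstract can be illustrated with a small sketch: each score is the logarithm of the ratio between a target pair frequency and the product of the background frequencies. The function name, the dictionary-based inputs and the choice of base 2 (scores in bits) are assumptions for illustration; the frequencies in any real use would come from a model of protein evolution, which is not supplied here.

```python
import math


def log_odds_matrix(target, background, base=2.0):
    """Scores s(a, b) = log_base( q_ab / (p_a * p_b) ).

    target:     dict mapping (a, b) -> q_ab, the target pair frequencies
    background: dict mapping a -> p_a, the background letter frequencies
    """
    scores = {}
    for (a, b), q in target.items():
        # Positive when the pair is more common among related sequences
        # than expected by chance, negative when it is rarer.
        scores[(a, b)] = math.log(q / (background[a] * background[b]), base)
    return scores
```

With a two-letter toy alphabet, background frequencies of 0.5 each and target frequencies of 0.4 for identical pairs and 0.1 for mixed pairs, identical pairs receive a positive score and mixed pairs a negative one, and the expected score under the background model is negative, as local alignment statistics require.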
Article
Consider two random sequences $X_1 \cdots X_n$ and $Y_1 \cdots Y_n$ of i.i.d. letters in which the probability that two distinct letters match is $p > 0$. For each value $a$ between $p$ and 1, the length of the longest contiguous matching between the two sequences, requiring only a proportion $a$ of corresponding letters to match, satisfies a strong law analogous to the Erdős-Rényi law for coin tossing. The same law applies to matching between two nonoverlapping regions within a single sequence $X_1 \cdots X_n$, and a strong law with a smaller constant applies to matching between two overlapping regions within that single sequence. The method here also works to obtain the strong law for matching between multidimensional arrays, between two Markov chains and for the situation in which a given proportion of mismatches is required.
Article
A new statistical method of alignment of two heteropolymers which can form hierarchical cloverleaf-like secondary structures is proposed. This offers a new constructive algorithm for quantitative determination of the binding free energy of two noncoding RNAs with arbitrary primary sequences. The alignment of ncRNAs differs from the complete alignment of two RNA sequences: in the ncRNA case we align only the sequences of nucleotides which constitute pairs between two different RNAs, while the secondary structure of each RNA comes into play only through the combinatorial factors affecting the entropic contribution of each molecule to the total cost function. The proposed algorithm is based on two observations: i) the standard alignment problem is considered as a zero-temperature limit of a more general statistical problem of binding of two associating heteropolymer chains; ii) this last problem is generalized to sequences with hierarchical cloverleaf-like structures (i.e. of RNA-type). Taking the zero-temperature limit at the very end, we arrive at the desired "cost function" of the system, accounting for the entropy of side cactus-like loops. Moreover, we demonstrate in detail how our algorithm makes it possible to solve the "structure recovery" problem. Namely, we can predict in the zero-temperature limit the cloverleaf-like (i.e. secondary) structure of interacting ncRNAs by knowing only their primary sequences.
Article
We consider a sequence matching problem involving the optimal alignment score for contiguous subsequences, rewarding matches and penalizing for deletions and mismatches. This score is used by biologists comparing pairs of DNA or protein sequences. We prove that for two sequences of length $n$, as $n \rightarrow \infty$, there is a phase transition between linear growth in $n$, when the penalty parameters are small, and logarithmic growth in $n$, when the penalties are large. The results are valid for independent sequences with iid or Markov letters. The crucial step in proving this is to derive a large deviation result for matching with deletions. The longest common subsequence problem of Chvatal and Sankoff is a special case of our setup. The proof of the large deviation result exploits the Azuma-Hoeffding lemma. The phase transition is also established for more general scoring schemes allowing general letter-to-letter alignment penalties and block deletion penalties. We give a general method for applying the bounded increments martingale method to Lipschitz functionals of Markov processes. The phase transition holds for matching Markov chains and for nonoverlapping repeats in a single sequence.
Article
On the occasion of Dr. Michael Waterman's 80th birthday, we review his major contributions to the field of computational biology and bioinformatics including the famous Smith-Waterman algorithm for sequence alignment, the probability and statistics theory related to sequence alignment, algorithms for sequence assembly, the Lander-Waterman model for genome physical mapping, combinatorics and predictions of ribonucleic acid structures, word counting statistics in molecular sequences, alignment-free sequence comparison, and algorithms for haplotype block partition and tagSNP selection related to the International HapMap Project. His books Introduction to Computational Biology: Maps, Sequences and Genomes for graduate students and Computational Genome Analysis: An Introduction geared toward undergraduate students played key roles in computational biology and bioinformatics education. We also highlight his efforts of building the computational biology and bioinformatics community as the founding editor of the Journal of Computational Biology and a founding member of the International Conference on Research in Computational Molecular Biology (RECOMB).
Article
Consider a renewal process. The renewal events partition the process into i.i.d. renewal cycles. Assume that on each cycle, a rare event called 'success' can occur. Such successes lend themselves naturally to approximation by Poisson point processes. If each success occurs after a random delay, however, Poisson convergence may be relatively slow, because each success corresponds to a time interval, not a point. In 1996, Altschul and Gish proposed a finite-size correction to a particular approximation by a Poisson point process. Their correction is now used routinely (about once a second) when computers compare biological sequences, although it lacks a mathematical foundation. This paper generalizes their correction. For a single renewal process or several renewal processes operating in parallel, this paper gives an asymptotic expansion that contains in successive terms a Poisson point approximation, a generalization of the Altschul-Gish correction, and a correction term beyond that.
Article
In bioinformatics, the notion of an ‘island’ enhances the efficient simulation of gapped local alignment statistics. This paper generalizes several results relevant to gapless local alignment statistics from one to higher dimensions, with a particular eye to applications in gapped alignment statistics. For example, reversal of paths (rather than of discrete time) generalizes a distributional equality, from queueing theory, between the Lindley (local sum) and maximum processes. Systematic investigation of an ‘ownership’ relationship among vertices in ℤ² formalizes the notion of an island as a set of vertices having a common owner. Predictably, islands possess some stochastic ordering and spatial averaging properties. Moreover, however, the average number of vertices in a subcritical stationary island is 1, generalizing a theorem of Kac about stationary point processes. The generalization leads to alternative ways of simulating some island statistics.
Article
Full-text available
We consider a pair of random heteropolymer chains with quenched primary sequences. For this system we have analyzed the dependence of the average ground state energy per monomer E on the chain length n in the ensemble of chains with a uniform distribution of primary sequences of monomers. Every monomer of the first (second) chain is randomly and independently chosen with the uniform probability distribution p=1/c from a set of c different types A, B, C, D, ... (A', B', C', D', ...). Monomers of the first chain can form saturating reversible bonds with monomers of the second chain. The bonds between similar monomer types (such as A-A', B-B', C-C', etc.) have the attraction energy u, while the bonds between different monomer types (such as A-B', A-D', B-D', etc.) have the attraction energy v. The main attention is paid to the computation of the normalized free energy E(n) for intermediate chain lengths n and different ratios a=v/u at sufficiently low temperatures, when the entropic contribution of the loop formation is negligible compared to direct energetic interactions between chain monomers, and when the partition function of the chains is dominated by the ground state. The analysis allows one to derive the force f(x) that must be applied to unzip two random heteropolymers of equal lengths whose ends are separated by the distance x, averaged over all equally distributed primary structures at low temperatures for fixed values of a and c.
Article
Statistical approaches help in the determination of significant configurations in protein and nucleic acid sequence data. Three recent statistical methods are discussed: (i) score-based sequence analysis that provides a means for characterizing anomalies in local sequence text and for evaluating sequence comparisons; (ii) quantile distributions of amino acid usage that reveal general compositional biases in proteins and evolutionary relations; and (iii) r-scan statistics that can be applied to the analysis of spacings of sequence markers.
Article
DNA and protein sequence comparisons are performed by a number of computational algorithms. Most of these algorithms search for the alignment of two sequences that optimizes some alignment score. It is an important problem to assess the statistical significance of a given score. In this paper we use newly developed methods for Poisson approximation to derive estimates of the statistical significance of k-word matches on a diagonal of a sequence comparison. We require at least q of the k letters of the words to match, where $0 < q \le k$. The distribution of the number of matches on a diagonal is approximated as well as the distribution of the order statistics of the sizes of clumps of matches on the diagonal. These methods provide an easily computed approximation of the distribution of the longest exact matching word between sequences. The methods are validated using comparisons of vertebrate and E. coli protein sequences. In addition, we compare two HLA class II transplantation antigens by this method and contrast the results with a dynamic programming approach. Several open problems are outlined in the last section.
Article
The sensitivity and selectivity of the FASTA and the Smith-Waterman protein sequence comparison algorithms were evaluated using the superfamily classification provided in the National Biomedical Research Foundation/Protein Identification Resource (PIR) protein sequence database. Sequences from each of the 34 superfamilies in the PIR database with 20 or more members were compared against the protein sequence database. The similarity scores of the related and unrelated sequences were determined using either the FASTA program or the Smith-Waterman local similarity algorithm. These two sets of similarity scores were used to evaluate the ability of the two comparison algorithms to identify distantly related protein sequences. The FASTA program using the ktup = 2 sensitivity setting performed as well as the Smith-Waterman algorithm for 19 of the 34 superfamilies. Increasing the sensitivity by setting ktup = 1 allowed FASTA to perform as well as Smith-Waterman on an additional 7 superfamilies. The rigorous Smith-Waterman method performed better than FASTA with ktup = 1 on 8 superfamilies, including the globins, immunoglobulin variable regions, calmodulins, and plastocyanins. Several strategies for improving the sensitivity of FASTA were examined. The greatest improvement in sensitivity was achieved by optimizing a band around the best initial region found for every library sequence. For every superfamily except the globins and immunoglobulin variable regions, this strategy was as sensitive as a full Smith-Waterman. For some sequences, additional sensitivity was achieved by including conserved but nonidentical residues in the lookup table used to identify the initial region.
Article
Full-text available
Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a "substitution score matrix" that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a "log-odds" matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human alpha 1 B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.
Article
The algorithm of Smith & Waterman for identification of maximally similar subsequences is extended to allow identification of all non-intersecting similar subsequences with similarity score at or above some preset level. The resulting alignments are found in order of score, with the highest scoring alignment first. In the case of single gaps or multiple gaps weighted linear with gap length, the algorithm is extremely efficient, taking very little time beyond that of the initial calculation of the matrix. The algorithm is applied to comparisons of tRNA-rRNA sequences from Escherichia coli. A statistical analysis is important for proper evaluation of the results, which differ substantially from the results of an earlier analysis of the same sequences by Bloch and colleagues.
Article
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performs better than either BLASTP or FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. Ln()-scaling performed better than normalization based on other simple functions of library sequence length. Ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.
Article
Full-text available
Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention has been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important to realize the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases and network access to similarity search services.
Article
A dynamic programming algorithm to find all optimal alignments of DNA subsequences is described. The alignments use not only substitutions, insertions and deletions of nucleotides but also inversions (reversed complements) of substrings of the sequences. The inversion alignments themselves contain substitutions, insertions and deletions of nucleotides. We study the problem of alignment with non-intersecting inversions. To provide a computationally efficient algorithm we restrict candidate inversions to the K highest scoring inversions. An algorithm to find the J best non-intersecting alignments with inversions is also described. The new algorithm is applied to the regions of mitochondrial DNA of Drosophila yakuba and mouse coding for URF6 and cytochrome b and the inversion of the URF6 gene is found. The open problem of intersecting inversions is discussed.
Conference Paper
We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on b, the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet, 4X for a four-letter alphabet, and 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier cut-off algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(b-1)) on the average, when alphabet size is taken into consideration.
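The k differences problem stated in this abstract can be sketched with the plain dynamic programming table. Note that this is the basic O(mn) column-by-column tabulation, not the O(kn) algorithms or the cut-off variant the abstract analyzes; the function name is illustrative.

```python
def k_differences_matches(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions in text where pattern matches
    some substring with at most k substitutions, insertions or deletions."""
    m = len(pattern)
    # Column for the empty text prefix: matching i pattern letters costs i.
    col = list(range(m + 1))
    ends = []
    for j, t in enumerate(text, start=1):
        new = [0]  # row 0 is always 0, so a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == t else 1
            new.append(min(col[i - 1] + cost,  # match or substitution
                           col[i] + 1,         # extra text letter
                           new[i - 1] + 1))    # missing pattern letter
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, `k_differences_matches("abc", "xxabcxx", 1)` reports every text position at which a substring within one difference of `abc` ends.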
Article
Full-text available
Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties. A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that $g(k) \ge g(k-1)$. The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run-time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap. The source code is available to academic collaborators under licence.
Chapter
Full-text available
Generally, the thermodynamic formalism for dynamical systems, the study of nonlinear flows and/or their maps (the first transformed into the second by a Poincaré surface of section), involves modeling them using a simpler symbolic system and then applying measure theoretic tools from abstract ergodic theory and statistical mechanics to their analyses (Kolmogorov, 1956; Sinai, 1972; Bowen, 1975; Ruelle, 1978).
Chapter
This text is addressed to biologists working with DNA or protein data bases, but without enthusiasm for mathematical detail. It is intended to expound the principles of significance theory of homology studies and pattern search without the details of formalism. Complete avoidance is impossible: One cannot explain mathematics without any mathematics.
Article
Computer-assisted analysis is an important instrument of DNA research. This review considers the methods of analysis of nucleotide sequences, describes the main algorithms, provides electronic addresses of the relevant e-mail and WWW servers, and suggests a tentative scheme for analysis of a newly sequenced DNA fragment. The review covers the problems of functional analysis (recognition of protein-coding regions and functional sites) and similarity search in the sequence databanks.
Article
Over two billion US dollars have been budgeted for the Human Genome Project alone in the past twelve years, not to mention other similar or related projects worldwide. These investments have led to the production of an enormous amount of biological data, much of which is sequence information of biomolecules, e.g. specifying proteins/DNAs by identifying each amino acid/nucleotide in sequential order. These sequence data, presumably containing the “digital” information of life, are hard to decipher. Extracting useful and important information out of those massive biological data has developed into a new branch of science: bioinformatics. One of the most important and widely used methods in bioinformatics research is called “sequence alignment”. The basic idea is to expedite the identification of biological functions of a newly sequenced biomolecule, say a protein, by comparing the sequence content of the new molecule to the existing ones (characterized and documented in the database).
Article
We calculate the density function of $(U^*(t), \theta^*(t))$, where $U^*(t)$ is the maximum over $[0, g(t)]$ of a reflected Brownian motion $U$, $g(t)$ stands for the last zero of $U$ before $t$, $\theta^*(t) = f^*(t) - g^*(t)$, $f^*(t)$ is the hitting time of the level $U^*(t)$, and $g^*(t)$ is the left-hand point of the interval straddling $f^*(t)$. We also calculate explicitly the marginal density functions of $U^*(t)$ and $\theta^*(t)$. Let $U_n^*$ and $\theta_n^*$ be the analogs of $U^*(t)$ and $\theta^*(t)$, respectively, where the underlying process $(U_n)$ is the Lindley process, i.e. the difference between a centered real random walk and its minimum. We prove that $(U_n^*/\sqrt{n}, \theta_n^*/n)$ converges weakly to $(U^*(1), \theta^*(1))$ as $n \rightarrow \infty$.
Article
Alignment algorithms to compare DNA or amino acid sequences are widely used tools in molecular biology. The algorithms depend on the setting of various parameters, most notably gap penalties. The effect that such parameters have on the resulting alignments is still poorly understood. This paper begins by reviewing two recent advances in algorithms and probability that enable us to take a new approach to this question. The first tool we introduce is a newly developed method to delineate efficiently all optimal alignments arising under all choices of parameters. The second tool comprises insights into the statistical behavior of optimal alignment scores. From this we gain a better understanding of the dependence of alignments on parameters in general. We propose novel criteria to detect biologically good alignments and highlight some specific features about the interaction between similarity matrices and gap penalties. To illustrate our analysis we present a detailed study of the comparison of two immunoglobulin sequences.
Article
Full-text available
Searches through biological databases provide the primary motivation for studying sequence alignment statistics. Other motivations include physical models of annealing processes or mathematical similarities to, e.g., first-passage percolation and interacting particle systems. Here, we investigate sequence alignment statistics, partly to explore two general mathematical methods. First, we model the global alignment of random sequences heuristically with Markov additive processes. In sequence alignment, the heuristic suggests a numerical acceleration scheme for simulating an important asymptotic parameter (the Gumbel scale parameter λ). The heuristic might apply to similar mathematical theories. Second, we extract the asymptotic parameter λ from simulation data with the statistical technique of robust regression. Robust regression is admirably suited to 'asymptotic regression' and deserves to be better known for it.
Article
In this paper, we construct stationary sequences of random variables, indexed by $i \ge 0$, taking values $\pm 1$ with probability 1/2, and we prove an Erdős–Rényi law of large numbers for the length of the longest run of consecutive +1's in the sample with indices $0, \ldots, n$. Our model, which is called random walk in random scenery, exhibits long-range, positive dependence.
Article
A heuristic approximation to the score distribution of gapped alignments in the logarithmic domain is presented. The method applies to comparisons between random, unrelated protein sequences, using standard score matrices and arbitrary gap penalties. It is shown that gapped alignment behavior is essentially governed by a single parameter, alpha, depending on the penalty scheme and sequence composition. This treatment also predicts the position of the transition point between logarithmic and linear behavior. The approximation is tested by simulation and shown to be accurate over a range of commonly used substitution matrices and gap-penalties.
Article
A method is described for estimating the distribution and hence testing the statistical significance of sequence similarity scores obtained during a data-bank search. Maximum-likelihood is used to fit a model to the scores, avoiding any costly simulation of random sequences. The method is applied in detail to the Smith-Waterman algorithm [see T. F. Smith and M. S. Waterman, J. Molec. Biol. 147, 195-197 (1981)] when gaps are allowed, and is shown to give results very similar to those obtained by simulation.
Chapter
Full-text available
In recent years it has become evident that functional RNAs in living organisms are not just curious remnants from a primordial RNA world but a ubiquitous phenomenon complementing protein enzyme based activity. Functional RNAs, just like proteins, depend in many cases upon their well-defined and evolutionarily conserved three-dimensional structure. In contrast to protein folds, however, RNA molecules have a biophysically important coarse-grained representation: their secondary structure. At this level of resolution at least, RNA structures can be efficiently predicted given only the sequence information. As a consequence, computational studies of RNA routinely incorporate structural information explicitly. RNA secondary structure prediction has proven useful in diverse fields, ranging from theoretical models of sequence evolution and biopolymer folding, to genome analysis, and even the design of biotechnologically or pharmaceutically useful molecules. Properties such as the existence of neutral networks or shape space covering are emergent properties determined by the complex, highly nonlinear relationship between RNA sequences and their structures.
Chapter
Full-text available
We consider a string edit problem in a probabilistic framework. This problem is of considerable interest to many facets of science, most notably molecular biology and computer science. A string editing transforms one string into another by performing a series of weighted edit operations of overall maximum (minimum) cost. An edit operation can be the deletion of a symbol, the insertion of a symbol or the substitution of a symbol. We assume that these weights can be arbitrarily distributed. We reduce the problem to finding an optimal path in a weighted grid graph, and provide several results regarding the typical behavior of such a path. In particular, we observe that the optimal path (i.e., edit distance) is asymptotically almost surely (a.s.) equal to $\alpha n$, where $\alpha$ is a constant and $n$ is the sum of lengths of both strings. We also obtain some bounds on $\alpha$ in the so-called independent model, in which all weights (in the associated grid graph) are assumed to be independent. More importantly, we show that the edit distance is well concentrated around its average value. As a by-product of our results, we also present a precise estimate of the number of alignments between two strings. To prove these findings we use techniques of random walks, diffusion limiting processes, generating functions, and the method of bounded differences.
Article
The field of computational molecular biology and genetics is expanding at an enormous rate. Journals such as CABIOS and Nucleic Acids Research routinely publish articles on computational and mathematical aspects of biology. The purpose of this paper is to provide a bibliographic review of the literature in this area related to DNA mapping and sequence analysis. We have focused on computer and mathematical aspects of molecular biology and genetics (interpreted in a broad sense). Authors are solicited for their additions/corrections to this bibliography. Contact us at the above address.
Article
We describe how techniques that were originally developed in statistical mechanics can be applied to search problems that arise commonly in artificial intelligence. This approach is useful for understanding the typical behavior of classes of problems. In particular, these techniques predict that abrupt changes in computational cost, analogous to physical phase transitions, should occur universally, as heuristic effectiveness or search space topology is varied. We also present a number of open questions raised by these studies.
Article
The free energy of a single-stranded RNA can be calculated by adding the free energies of the components: basepairs, bulges, and loops. Basepairs receive negative free energy while the unpaired bases receive positive free energy. The minimum free energy of a random RNA secondary structure with one domain has value $F_n$, where the sequence length is $n$. Under simplifying assumptions, we show that for “small” values of bulge and loop penalties $F_n$ has linear growth in $n$, while for “large” values of these parameters $F_n$ has logarithmic growth in $n$. This phase transition generalizes results obtained for the local-alignment score of two random sequences. The random variable $F_n$ is conjectured to have a Poisson approximation. The multi-domain secondary structure minimum free energy $E_n$ has linear growth in $n$ for all values of the penalty functions. Nothing more is known about the distributional properties of $E_n$.
Article
Software tools have been developed to do rapid, large-scale protein sequence comparisons on databases of amino acid sequences, using a data parallel computer architecture. This software enables one to compare a protein against a database of several thousand proteins in the same time required by a conventional computer to do a single protein-protein comparison, thus enabling biologists to find relevant similarities much more quickly, and to evaluate many different comparison metrics in a reasonable period of time. We have used this software to analyze the effectiveness of various scoring metrics in determining sequence similarity, and to generate statistical information about the behavior of these scoring systems under the variation of certain parameters.
Article
Due to the rapidity of biological reactions, it is difficult to isolate intermediates or to determine the stoichiometry of participants in intermediate reactions. Instead of determining the absolute amount of each component, this study involved the use of relative parameters, such as dilution factors, percentages, probabilities, and slopes of titration curves, that can be more accurately quantified to determine the stoichiometry of components involved in bacteriophage phi29 assembly. This work takes advantage of the sensitive in vitro phage phi29 assembly system, in which $10^8$ infectious virions per ml without background can be assembled from eight purified components. It provides a convenient assay for quantification of the stoichiometry of packaging components, including the viral procapsid, genomic DNA, DNA-packaging pRNA, and other structural proteins and enzymes. The presence of a procapsid binding domain and another essential functional domain within the pRNA makes it an ideal component for constructing lethal mutants for competitive procapsid binding. Two methods were used for stoichiometry determination. Method 1 was to determine the combination probability of mutant and wild-type pRNAs bound to procapsids. The probability of procapsids that possess a certain amount of mutant and a certain amount of wild-type pRNA, both with an equal binding affinity, was predicted with the binomial equation [EQUATION IN TEXT] where Z is the total number of pRNAs per procapsid, M is the number of mutant pRNAs bound to one procapsid, and $\binom{Z}{M}$ is equal to [FORMULA IN TEXT]. With various ratios of mutant to wild-type pRNA in in vitro viral assembly, the percent mutant pRNA versus the yield of virions was plotted and compared to a series of predicted curves to find a best fit. It was determined that five or six copies of pRNA were required for one DNA-packaging event, while only one mutant pRNA per procapsid was sufficient to block packaging.
Method 2 involved the comparison of slopes of curves of dilution factors versus the yield of virions. Components with known stoichiometries served as standard controls. The larger the stoichiometry of the component, the more dramatic the influence of the dilution factor on the reaction. A slope of 1 indicates that one copy of the component is involved in the assembly of one virion. A slope larger than 1 would indicate multiple-copy involvement. By this method, the stoichiometry of gp11 in phi29 particles was determined to be approximately 12. These approaches are useful for the determination of the stoichiometry of functional units involved in viral assembly, be they single molecules or oligomers. However, these approaches are not suitable for the determination of exact copy numbers of individual molecules involved if the functional unit is composed of multiple subunits prior to assembly.
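Method 1 in this abstract can be illustrated with a small sketch. The abstract's own equation is not reproduced in this text, so the standard binomial probability is assumed here: with equal binding affinity, each of the Z bound pRNAs is taken to be mutant independently with probability equal to the mutant fraction in the mix. The function names and the `blocking` parameter are hypothetical.

```python
from math import comb


def mutant_count_distribution(z: int, mutant_fraction: float) -> list[float]:
    """P(M mutant pRNAs among z bound pRNAs) for M = 0..z (assumed binomial model)."""
    p = mutant_fraction
    return [comb(z, m) * p**m * (1 - p) ** (z - m) for m in range(z + 1)]


def predicted_yield(z: int, mutant_fraction: float, blocking: int = 1) -> float:
    """Fraction of procapsids carrying fewer than `blocking` mutant pRNAs.

    blocking=1 encodes the assumption that a single mutant pRNA
    is enough to block packaging.
    """
    return sum(mutant_count_distribution(z, mutant_fraction)[:blocking])
```

Plotting `predicted_yield(z, f)` against the mutant fraction `f` for several candidate values of `z` gives a family of curves of the kind the abstract describes comparing with the measured virion yield.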
Article
Full-text available
All pairs of a large set of known vertebrate DNA sequences were searched by computer for most similar segments. Analysis of this data shows that the computed similarity scores are distributed proportionally to the logarithm of the product of the lengths of the sequences involved. This distribution is closely related to recent results of Erdos and others on the longest run of heads in coin tossing. A simple rule is derived for determination of statistical significance of the similarity scores and to assist in relating statistical and biological significance.
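The connection to the longest run of heads can be made concrete with a short sketch. It assumes independent tosses with head probability `p` and uses only the leading-order Erdős–Rényi growth rate, $\log_{1/p} n$; the function names are illustrative.

```python
import math
import random


def longest_run(flips) -> int:
    """Length of the longest run of consecutive truthy values (heads)."""
    best = current = 0
    for f in flips:
        current = current + 1 if f else 0
        best = max(best, current)
    return best


def predicted_longest_run(n: int, p: float = 0.5) -> float:
    """Leading-order growth log_{1/p}(n) of the longest head run in n tosses."""
    return math.log(n) / math.log(1 / p)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility
    n = 100_000
    flips = [rng.random() < 0.5 for _ in range(n)]
    print(longest_run(flips), predicted_longest_run(n))
```

For exact matching between two random sequences of lengths m and n, the analogous leading-order term is $\log_{1/p}(mn)$, where p is the probability that two letters match, which is the logarithm-of-product dependence the abstract reports.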
Article
A new high-speed computer algorithm is outlined that ascertains within and between nucleic acid and protein sequences all direct repeats, dyad symmetries, and other structural relationships. Large repeats, repeats of high frequency, dyad symmetries of specified stem length and loop distance, and their distributions are determined. Significance of homologies is assessed by a hierarchy of permutation procedures. Applications are made to papovaviruses, the human papillomavirus HPV, lambda phage, the human and mouse mitochondrial genomes, and the human and mouse immunoglobulin kappa-chain genes.
Article
Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall, 1995. [WF74] R.A. Wagner and M.J. Fischer. The String to String Correction Problem. Journal of the ACM, 21(1):168-173, 1974. [WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 10(35):83-91, 1992. [KOS+00] S. Kurtz, E. Ohlebusch, J. Stoye, C. Schleiermacher, and R. Giegerich. Computation and Visualization of Degenerate Repeats in Complete Genomes. In ...
• M Zuker
• D Sankoff
Zuker, M. & Sankoff, D. (1984) Bull. Math. Biol. 46, 591-621.
• R W R Darling
• M S Waterman
Darling, R. W. R. & Waterman, M. S. (1986) SIAM J. Appl. Math. 46, 118-132.
• R Arratia
• L Gordon
• M S Waterman
Arratia, R., Gordon, L. & Waterman, M. S. (1986) Ann. Stat. 14, 971-993.
• R Arratia
• M S Waterman
Arratia, R. & Waterman, M. S. (1985) Adv. Math. 55, 13-23.
• P Erdos
• A Renyi
Erdos, P. & Renyi, A. (1970) J. Anal. Math. 22, 103-111.
• L Gordon
• M Schilling
• M S Waterman
Gordon, L., Schilling, M. & Waterman, M. S. (1986) Probab. Theor. Rel. Fields 72, 279-287.
• V Chvatal
• D Sankoff
Chvatal, V. & Sankoff, D. (1975) J. Appl. Prob. 12, 306-315.