Article

Computation of d2: A measure of sequence dissimilarity


... It is noteworthy that these word-based algorithms can also find some new functional similarities or dissimilarities that are invisible to other algorithms like FASTA (Blaisdell, 1989a; Hide, Burke, and Davison, 1994) and are useful in detection of coding regions (Fichant and Gautier, 1987) and evolutionary tree reconstruction (Blaisdell, 1989a,b). Several word-based algorithms (Blaisdell, 1986, 1989a; Cressie and Read, 1984; Hide et al., 1994; Pevzner, 1992a,b; Torney et al., 1990; Wu, Burke, and Davison, 1997; Wu et al., 2001, among others) have been developed. Vinga and Almeida (2003) review these algorithms and predict that the next few years will see some of them become widely used for functional annotation and phylogenetic study. ...
... is the simplest distance (cf., e.g., Pevzner, 1992a,b; Torney et al., 1990). It can be improved upon when some information on the variance/covariance is known. ...
... It is worthwhile to analyze the sensitivity of word-based dissimilarity measures to mutation, window size and word size. We have analyzed the sample mean and variance of scores for ED (see also Torney et al., 1990), SED, SK-LD and some other members of the SC-RD family. Since the lessons learned are the same, we present only the results for SK-LD here. ...
Article
Full-text available
Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences. Our study shows (1) for whole sequence similarity/dissimilarity identification the window size taken should be as large as possible, but probably not >3000, as restricted by CPU time in practice, (2) for each measure the optimal word size increases with window size, (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis, (4) the estimate of beta based on SK-LD can be used to filter out quickly a large number of dissimilar sequences and speed alignment-based database search for similar sequences and (5) the estimate of beta is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays. The algorithm SK-LD, the estimate of beta and simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu
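As a rough illustration of a symmetric Kullback-Leibler discrepancy between word frequencies, the sketch below compares the relative frequencies of all n-words in two whole sequences. It is not the MATLAB implementation mentioned in the abstract. The function names, the DNA alphabet, the pseudocount used to avoid zero frequencies, and the absence of windows are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product


def word_freqs(seq, n, alphabet="ACGT", pseudo=0.5):
    """Relative frequencies of all n-words over the alphabet.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency positive so that the logarithm below is always defined.
    Words containing letters outside the alphabet are ignored.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    total = sum(counts[w] for w in words) + pseudo * len(words)
    return {w: (counts[w] + pseudo) / total for w in words}


def sk_ld(x, y, n=2):
    """Symmetric Kullback-Leibler discrepancy: KL(p||q) + KL(q||p)."""
    p, q = word_freqs(x, n), word_freqs(y, n)
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)
```

Every term of the sum is non-negative, so the value is zero only when the two frequency vectors agree and grows as they diverge.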
... Furthermore, they can find some new functional similarities or dissimilarities which are invisible to other algorithms like FASTA (Hide et al 1994, Blaisdell 1989a) and are useful in detection of coding regions (Fichant and Gautier 1987) and evolutionary tree reconstruction (Blaisdell 1989a,b). So far all such word-based algorithms (Blaisdell 1986, 1989a, Torney et al 1990, Pevzner 1992, Hide et al 1994, Wu et al 1997, among others) have been developed and studied under the simplifying independent and/or uniform models of base composition. In Wu et al (1997), dissimilarity measures using Mahalanobis and standardized Euclidean distance between frequencies of words were proposed. ...
... The resulting discrepancy is ... The (modified) Kullback-Leibler discrepancies are then defined by $I^{(j)}_n = \min_W I^{(j)}_{n,W}$, $j = 1, 2$. Note that both $I^{(j)}_n$ and $d^2_n$ are very efficient computationally, since their values do not depend on the model of base composition and do not require computations of variance or covariance. Following the ideas of Torney et al (1990), we define the combined Euclidean distance $d^2 = \sum_{n=b}^{u} d^2_n$, where $1 \le b \le u$ are some integers with $b$ typically being 1 and $u$ a number from 3 to 5. The combined Mahalanobis distance $D^2$, the combined standardized Euclidean distance $S^2$, and the combined Kullback-Leibler discrepancy $I^{(j)}$ are defined similarly. See Wu et al (1997) for more details on $D^2_n$, $D^2$, $S^2_n$, and $S^2$. ...
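The combined distance described in this excerpt can be sketched as follows. This is a minimal illustration only: it takes the per-length term to be the squared Euclidean distance between raw n-word counts of the two whole sequences. It leaves out windows and any per-word weights, and the function names are hypothetical.

```python
from collections import Counter


def word_counts(seq: str, n: int) -> Counter:
    """Count every overlapping n-word (n adjacent letters) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def d2_n(x: str, y: str, n: int) -> int:
    """Squared Euclidean distance between the n-word count vectors."""
    cx, cy = word_counts(x, n), word_counts(y, n)
    # Words absent from one sequence count as 0 there.
    return sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys())


def d2(x: str, y: str, b: int = 1, u: int = 3) -> int:
    """Combined distance: sum of d2_n over word lengths n = b, ..., u."""
    return sum(d2_n(x, y, n) for n in range(b, u + 1))
```

For "AAAA" and "AAAT", each word length from 1 to 3 contributes 2 (one word gained, one count lost), so the combined value with b = 1 and u = 3 is 6.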
Article
In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431-1439) characterized a family of word-based dissimilarity measures that defined distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on Kullback-Leibler discrepancy between frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback-Leibler discrepancy gives a better performance than Euclidean distance. Moreover, under a Markov chain model of order k̂Q for base composition, where k̂Q is the estimated order based on the query sequence, standardized Euclidean distance performs very well. Under such a model, it performs as well as Mahalanobis distance and better than Kullback-Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order k̂Q of base composition is generally recommended.
However, if the user is very concerned with computational efficiency, then the use of Kullback-Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology in comparing large datasets of DNA sequences.
... Also, effective gene clustering serves as a starting point for the discovery of new gene expression variants such as alternative splicing forms. Torney et al. [1] have developed an algorithm known as d2 that is used as the basis for a program we have developed that we call d2_cluster. It is an agglomerative algorithm specifically developed for rapidly and accurately partitioning transcript databases into index classes by clustering ESTs and full-length sequences according to minimal linkage or "transitive closure" rules. ...
... In 1989, Torney et al. presented an algorithm called d2 [1] for comparing two gene sequences. Originally developed for quickly locating repetitive sequences in DNA, it has proven to have other uses as well. ...
... This variable is referred to as the Stringency. To detect the overlap criterion we use the d2 algorithm and set parameters and threshold values as described [1, 14, 15]. The initial and final states of the algorithm constitute a partition of the input sequences in which each sequence is in a cluster and no sequence appears in more than one cluster. ...
Article
A current question of considerable interest to both the medical and nonmedical communities concerns the number of human transcription units (which, for the purposes of this paper, are “genes”) and proteins. Even with the recent announcement of the completion of the draft sequence of the human genome, it is still extremely difficult to predict the number of genes present in the genome. There are several methods for gene prediction, all involving computational tools. One way to approach this question, involving both computation and experiment, is to look at copies of fragments of messenger ribonucleic acid (mRNA) called expressed sequence tags (ESTs). The mRNA comes only from a gene being expressed, or translated, into RNA; by clustering mRNA fragments, we can try to reconstruct the expressed gene. While the final result is a very rough representation of the “true expressed transcript,” it is probably within 20% of the real number. Here, we review the issues involved in EST clustering and present an estimate of the total number of human genes. Our results to date indicate that there are some 70000 transcription units, with an average of 1.2 different transcripts per transcription unit. Thus, we estimate the total number of human proteins to be at least 85 000. The total number of proteins will be higher because of post-translational modification.
... Alignment-free approaches for sequence comparison can be divided into several different groups: a) word-counts [4,5,6,7,8,9,10,11,12,13], b) average longest common substrings [14], shortest unique substrings [15,16], or a combination of both [17], c) sequence representation based on chaos theory [18,19,20], d) the moments of the positions of the nucleotides [21], e) Fourier transformation [22], f) information theory [23], and g) iterated maps [24]. Several excellent reviews on various alignment-free sequence comparison methods have been published [25,26,27,28,29]. In this review, we concentrate on methods that can be applied to the comparison of sequences based on NGS data. ...
... Many word-count-based methods for sequence comparison have been developed including the uncentered correlation of word count vectors between two sequences [9], $\chi^2$-statistics [7,8], composition vectors [13], nucleotide relative abundances [42,43], and the recently developed $d_2^*$ and $d_2^S$ statistics [10,11]. It was shown that alignment-free methods are more robust than alignment-based methods especially against genetic rearrangements and horizontal gene transfers [44,45]. ...
... For measures that do not require background word frequencies, the observed word frequency or word presence (or absence) are directly used to compute the dissimilarity measures. The measures include, but are not limited to, Euclidean distance (Eu), Manhattan distance (Ma), d2 [9], Feature Frequency Profiles (FFP) [12], Jensen-Shannon divergence (JS) [47], Hamming distance, and Jaccard index. For measures that take background word frequency into account, dissimilarity between sequences is computed using the normalized word frequencies, where the expected word frequencies estimated using a background model are subtracted from the observed word frequencies to eliminate the background noise and enhance the signal. ...
Article
Full-text available
Genome and metagenome comparisons based on large amounts of next-generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
... The d2 distance function uses word-frequency counts and was originally developed for database search [6]. It has, however, been successfully applied to EST clustering [3]. ...
... The remaining columns show how the two methods compared: if the entry in row i, column j is a/b/c, it means that method i beat method j a times, b times they were the same, and c times method j won out. (Footnote: always with 95% confidence.) ...
Article
Full-text available
The paper presents the results of an experimental study in which different string distance measures were compared and evaluated as to their applicability to EST clustering. We implemented two tools, SeqGen (Sequence Generator) and ECLEST (Evaluator for Clusterings of ESTs). These were used to generate simulated ESTs from input human cDNAs; and to run EST clustering on these ESTs and compute a score for the quality of the clustering, respectively. We advocate the use of simulated data for comparative studies of this type, because they allow evaluation w.r.t. a known ideal solution (in this case, the correct clustering), which is not possible in most cases with real-life data. The distance measures we compared include both subword-based and alignment-based measures. We ran a large number of tests and obtained statistically significant results as to the applicability of the distance measures included. For example, we show that certain subword-based measures produce output, in a significant number of cases, that is comparable to alignment-based ones, and that certain (easy-to-compute) measures are well suited for a preprocessing step. Our results have significant applications in studies of gene expression and discovery of products of alternative splicing, where there is a pressing need for fast clustering of increasingly large sets of ESTs.
... The squared Euclidean distance is not a metric as it does not satisfy the triangle inequality. D2 Score: The D2 score is defined as the count of k-mer matches between two sequences ([28] ...
... In order to solve this problem, two variants of it have been introduced, referred to as $D_2^S$ and $D_2^*$, in [6], [26], [28]. ...
Conference Paper
Sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, is a fundamental and routine task in Computational Biology and Bioinformatics. Classically, alignment methods are the de facto standard for such an assessment. In fact, considerable research effort has been devoted to the development of efficient algorithms, on both classic and parallel architectures, over the past 50 years. Due to the growing amount of sequence data being produced, a new class of methods has emerged: alignment-free methods. Research in this area has become very intense in the past few years, stimulated by the advent of Next Generation Sequencing technologies, since those new methods are very appealing in terms of computational resources needed and biological relevance. Despite such an effort, and in contrast with sequence alignment methods, no systematic investigation of how to take advantage of distributed architectures to speed up alignment-free methods has taken place. We provide a contribution of that kind by evaluating the possibility of using the Hadoop distributed framework to speed up the running times of these methods, compared to their original sequential formulation.
... Traditionally, edit distance/alignment has been used to define similarity between sequences. However, alignment-free measures are increasingly being adopted, such as q-gram distance (Ukkonen, 1992) or d2 (Torney et al., 1990). These define similarity between sequences with respect to the multiplicity of substrings (subwords) of a fixed, usually small, length. ...
... (Jain et al., 1999). Condition (i) states that C is a partition of S. Dissimilarity measures commonly used for string comparison in EST clustering include the edit distance (Levenshtein, 1966), q-gram distance (Ukkonen, 1992) and d2 (Torney et al., 1990). Usually, one decides on a threshold $\theta \in \mathbb{R}^+$, and then two sequences $s, t$ are said to be similar if $d(s, t) \le \theta$, where $d$ is the dissimilarity measure. ...
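The threshold rule in this excerpt, combined with the "transitive closure" linkage mentioned in the d2_cluster excerpts, can be sketched as follows. This is an illustrative all-versus-all version using a union-find structure, not the implementation of any tool named on this page. The q-gram distance here is the sum of absolute differences between q-gram counts, and the names `qgram_distance`, `cluster`, and `theta` are chosen for the example.

```python
from collections import Counter
from typing import Callable, List, Sequence


def qgram_distance(s: str, t: str, q: int = 2) -> int:
    """Sum of absolute differences between the q-gram counts of s and t."""
    cs = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ct = Counter(t[i:i + q] for i in range(len(t) - q + 1))
    return sum(abs(cs[w] - ct[w]) for w in cs.keys() | ct.keys())


def cluster(seqs: Sequence[str],
            d: Callable[[str, str], float],
            theta: float) -> List[List[int]]:
    """Partition the indices of seqs so that i and j share a cluster when a
    chain of pairwise-similar sequences (d <= theta) connects them."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All-versus-all comparison: quadratic in the number of sequences.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if d(seqs[i], seqs[j]) <= theta:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With `seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]` and `theta = 2`, the first two sequences differ by a q-gram distance of 2 and are merged, while the third stays alone. The quadratic loop is only suitable for small inputs.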
Article
Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets. We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time. Availability: Source code and binaries are available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X. Contact: scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de. Supplementary information: Supplementary data are available at Bioinformatics online.
... Furthermore, they can find some new functional similarities or dissimilarities which are invisible to other algorithms like FASTA (Hide et al 1994, Blaisdell 1989a) and are useful in detection of coding regions (Fichant and Gautier 1987) and evolutionary tree reconstruction (Blaisdell 1989a,b). So far all such word-based algorithms (Blaisdell 1986, 1989a, Torney et al 1990, Pevzner 1992, Hide et al 1994, Wu et al 1997) have been developed and studied under the simplifying independent and/or uniform models of base composition. ...
... Following the ideas of Torney et al (1990), we define the combined Euclidean distance $d^2 = \sum_{n=b}^{u} d^2_n$, where $1 \le b \le u$ are some integers with $b$ typically being 1 and $u$ a number from 3 to 5. The combined Mahalanobis distance $D^2$, the combined standardized Euclidean distance $S^2$, and the combined Kullback-Leibler discrepancy $I^{(j)}$ are defined similarly. See Wu et al (1997) for more details on $D^2_n$, $D^2$, $S^2_n$, and $S^2$. ...
Article
In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431–1439) characterized a family of word-based dissimilarity measures that defined distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on Kullback–Leibler discrepancy between frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback–Leibler discrepancy gives a better performance than Euclidean distance. Moreover, under a Markov chain model of order k̂Q for base composition, where k̂Q is the estimated order based on the query sequence, standardized Euclidean distance performs very well. Under such a model, it performs as well as Mahalanobis distance and better than Kullback–Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order k̂Q of base composition is generally recommended.
However, if the user is very concerned with computational efficiency, then the use of Kullback–Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology in comparing large datasets of DNA sequences.
... Our proposed metric extracts all the symbols' orders of a sequence and compares them with the symbol ordering of another sequence by utilizing a specific weighting strategy. We evaluate the effectiveness of proposed sequence similarity metric in a clustering problem compared to some of the most common similarity metrics such as d2 [5], Smith-Waterman [4], Levenshtein [6], and Needleman-Wunsch [7]. Evaluation results show the superiority of our proposed metric compared to other evaluated metrics in the clustering of users from their web usage log. ...
... N-gram is another widely used statistical approach. This approach takes the order of symbols into consideration and is exploited in a variety of methods such as d2 [5]. However, it only considers N consecutive symbols but not all of existing orders in a sequence. ...
Article
Full-text available
A variety of different metrics has been introduced to measure the similarity of two given sequences. These metrics are widely used in applications ranging from spell correctors and categorizers to new sequence mining tasks. Different metrics consider different aspects of sequences, but the essence of any sequence is extracted from the ordering of its elements. In this paper, we propose a novel sequence similarity measure that is based on all ordered pairs of one sequence and where a Hasse diagram is built in the other sequence. In contrast with existing approaches, the idea behind the proposed sequence similarity metric is to extract all ordering features to capture sequence properties. We designed a clustering problem to evaluate our sequence similarity metric. Experimental results showed the superiority of our proposed sequence similarity metric in maximizing the purity of clustering compared to metrics such as d2, Smith-Waterman, Levenshtein, and Needleman-Wunsch. The limitation of those methods originates from some neglected sequence features, which are considered in our proposed sequence similarity metric.
... Also, effective gene clustering serves as a starting point for the discovery of new gene expression variants such as alternative splicing forms. Torney et al [1] have developed an algorithm, called "d2", that is used as the basis for a program we have developed termed d2_cluster. It is an agglomerative algorithm specifically developed for rapidly and accurately partitioning transcript databases into index classes by clustering ESTs and full-length sequences according to minimal linkage or "transitive closure" rules. ...
... In 1989, Torney et al. [1] presented an algorithm called d2 for comparing two sequences. Most sequence comparison algorithms are context-dependent. ...
Article
The definitive prototype for nanotechnology is the cell. Its many machines and exquisitely controlled internal and external movements are a reference for all researchers working in the field. The complete instructions for every molecular machine in a cell are specified in its DNA. The interactions of those parts are emergent properties of the individual components (RNAs, fats, sugars, and proteins). At present, there is considerable controversy among biologists regarding the number of human genes and proteins. In part, the differences stem from differences in definition. In this presentation we will define a gene as a transcription unit. Each transcription unit may have zero to many splice forms (known as "alternative splices"). While there are several methods for gene prediction, all involving computational tools, none agree. Estimates range from 20,000 to 120,000. One way to approach this question, involving both computation and experiment, is to look at copies of fragments of messenger RNA (mRNA), called expressed sequence tags (ESTs). mRNA comes only from a gene being expressed by a cell or tissue. By clustering mRNA fragments, we can try to reconstruct the expressed gene. The final result is a very rough representation of the 'true expressed transcript'. Our results consistently demonstrate that there are some 70,000 transcription units with an average of 1.2 different transcripts per transcription unit. Thus, we estimate the total number of human genes at about 85,000. Post-translational modification will make the total number of proteins much higher.
... In this context, it is natural to count the number of k-letter words (k-mers) that a pair of sequences have in common for small values of k. This results in a well-studied and often applied statistic called D2 (Torney et al., 1990). Several investigations have analyzed the properties of this statistic and its variants (Forêt et al., 2009; Kantorovitz et al., 2007; Liua et al., 2011). ...
... Dot products are computed as usual, the norm of point $p$ is $\|p\| = \sqrt{p^T \cdot p}$, and the angle between points $p$ and $q$ is $\theta_{pq} = \arccos\left(p^T q / (\|p\| \cdot \|q\|)\right)$. The D2 similarity measure was first applied to molecular sequences by Torney et al. (1990). Neglecting to account for statistical properties of k-words inherent in various kinds of molecular sequences has proven problematic (Lippert et al., 2002), and augmented measures have been designed to improve upon D2. ...
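To make the geometric reading concrete, the sketch below computes D2 as the dot product of two k-mer count vectors, which equals the number of k-mer matches between the sequences, and then the angle between those vectors using the norm and arccos formulas quoted in the excerpt. The function names are illustrative. The sketch assumes both sequences have at least k letters, so that neither norm is zero.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def D2(a: str, b: str, k: int) -> int:
    """Number of k-mer matches: dot product of the k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())


def angle(a: str, b: str, k: int) -> float:
    """Angle (radians) between the k-mer count vectors of a and b.

    Assumes len(a) >= k and len(b) >= k, so both norms are positive.
    """
    norm_a = math.sqrt(D2(a, a, k))
    norm_b = math.sqrt(D2(b, b, k))
    cos = D2(a, b, k) / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos)))
```

Two sequences with no k-mer in common have D2 equal to 0 and an angle of pi/2, while identical sequences give an angle of 0.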
Article
Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
... where $S$ and $Q$ are two sequences, and $s_i$ and $q_i$ are the number of occurrences of the $i$th k-mer in $S$ and in $Q$, respectively. D2 Score [37]. It is defined as: ...
... For completeness, we mention that notable variants of D2 are $D_2^S$ and $D_2^*$, defined in [7] and [37], respectively. They will not be considered in this study since computationally they are as demanding as D2. ...
Article
Full-text available
Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity since, even on standard desktop machines, they are faster than methods based on alignments. However, with the advent of Next-Generation Sequencing Technologies, datasets whose size, i.e., number of sequences and their total length, is a challenge to the execution of alignment-free methods on those standard machines are quite common. Here, we propose the first paradigm for the computation of k-mer-based alignment-free methods for Apache Hadoop that extends the problem sizes that can be processed with respect to a standard sequential machine while also granting a good time performance. Technically, as opposed to a standard Hadoop implementation, its effectiveness is achieved thanks to the incremental management of a persistent hash table during the map phase, a task not contemplated by the basic Hadoop functions and that can be useful also in other contexts.
... A consensus tree combining the gene trees from all the homologous genes is used to represent genome sequence relationships. Several alignment-free sequence comparison methods have been developed, as reviewed in [4,5]. Most of the methods use the counts of word patterns within the sequences [6-12]. One important problem is the determination of the word length used for the comparison of sequences. ...
... They showed that SK-LD performed well and the optimal word length increases with the sequence length. Using a similar approach, Forêt et al. [14] studied the optimal word length for D2, which measures the number of shared words between two sequences [8]. Sims et al. [13] suggested a range for the optimal word length using alignment-free genome comparison with SK-LD. ...
Article
Full-text available
Background: Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ²-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. Results: We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r1 and r2, respectively. We show through both simulations and theoretical studies that the optimal k = max(r1, r2) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. Conclusion: Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4020-z) contains supplementary material, which is available to authorized users.
... The family of D2 statistics has shown impressive performance in revealing similarities between two sequences (Reinert et al., 2009; Torney et al., 1990). This inspired us to extend the idea of alignment-free sequence comparison to repeat detection in single sequences. ...
... $A_n$ and $B = B_1 B_2 \ldots B_m$ over the alphabet $\mathcal{A}$, the original D2 statistic is defined as (Lippert et al., 2002; Torney et al., 1990) ...
Article
Motivation: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. Results: Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. Availability and implementation: The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. Supplementary information: Supplementary data are available at Bioinformatics online.
... We used six alignment-free distance/dissimilarity measures (Manhattan, Euclid, d2 [26], CVTree [16], d2* and d2S [17][18][19]) based on the relative frequencies of k-mers to calculate pairwise distances of white oak tree samples based on DNA samples of 50, 100 and 300 Mbp. Figure 1 shows the circular plots [27] of the oak trees at sequencing quantity of 100 Mbp using the six dissimilarity measures (circular plots at sequencing quantities of 50 Mbp and 300 Mbp are shown in Additional File 1, Figure S2). In each plot, the most similar sample to each of the reference specimens is linked. ...
... We used six alignment-free distance/dissimilarity measures based on the relative frequencies of k-mers (k-grams, k-tuples, k-words) to compare any pair of samples. These are the traditional Manhattan, Euclid, and d2 [26] distances, and three recently developed background-adjusted dissimilarity measures: CVTree [16], d2* and d2S [17][18][19]. Detailed definitions of these measures are given in Additional File 6. ...
Article
Full-text available
Background The application of genomic data and bioinformatics for the identification of restricted or illegally-sourced natural products is urgently needed. The taxonomic identity and geographic provenance of raw and processed materials have implications in sustainable-use commercial practices, and relevance to the enforcement of laws that regulate or restrict illegally harvested materials, such as timber. Improvements in genomics make it possible to capture and sequence partial-to-complete genomes from challenging tissues, such as wood and wood products. Results In this paper, we report the success of an alignment-free genome comparison method, d2*, that differentiates different geographic sources of white oak (Quercus) species with a high level of accuracy with a very small amount of genomic data. The method is robust to sequencing errors, different sequencing laboratories and sequencing platforms. Conclusions This method offers an approach based on genome-scale data, rather than panels of pre-selected markers for specific taxa. The method provides a generalizable platform for the identification and sourcing of materials using a unified next generation sequencing and analysis framework. Electronic supplementary material The online version of this article (10.1186/s12864-018-5253-1) contains supplementary material, which is available to authorized users.
... The family of D2 statistics has shown impressive performance in revealing similarities between two sequences (Torney et al. 1990; Reinert et al. 2009). This inspired us to extend the idea of alignment-free sequence comparison to repeat detection in single sequences. ...
... The D2 statistic counts the number of k-mer matches between two sequences. Given two sequences A = A_1 A_2 … A_n and B = B_1 B_2 … B_m over the alphabet 𝒜, the original D2 statistic is defined as (Torney et al. 1990; Lippert et al. 2002): ...
Preprint
Full-text available
Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting all types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D 2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic that can efficiently discriminate sequences with or without repetitive regions. Using the statistic, we developed an algorithm of linear complexity in both computation time and memory usage for detecting all types of repetitive sequences in multiple scenarios, including finding candidate CRISPR regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments showed that the method works well on both assembled sequences and unassembled short reads.
... Word count methods relying on exact word matches include Composition Vector Trees (CVTrees) [41,42], Feature Frequency Profiles (FFP) [43], and D2 statistics [44][45][46]. Each of these methods relies on different properties of counting exact matches of fixed-length k-mers. ...
Article
Full-text available
Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches are whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. We developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact substring matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low complexity sequences, and for branch-length nonlinearity caused by multiple mutations at the same site. We tested SlopeTree on 495 bacteria, 73 archaea, and 72 strains of Escherichia coli and Shigella. We compared our trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. We assessed differences in tree topology over different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than sets of core genes, we observed some grouping by phenotype rather than phylogeny, for instance with a cluster of sulfur-reducing thermophilic bacteria coming together irrespective of their phyla. The source-code for SlopeTree is available at: http://prodata.swmed.edu/download/pub/slopetree_v1/slopetree.tar.gz.
... Hamming distance and edit distance are two well-known ones, but there are a host of others with both biological and computational claims made on their behalf. The d² distance function (pronounced d2) was originally developed for database search [Torney et al. 1990] and uses word-frequency counts. It has been successfully applied to biological data and to EST clustering in particular [Burke et al. 1999]. ...
Article
The d² distance function is commonly used in the clustering of DNA sequences such as expressed sequence tags (ESTs), an important biological application. The use of d² allows approximate string matching to be performed with a good balance between selectivity and sensitivity. The computational challenges of EST clustering make the efficient evaluation of the d² function an imperative. The paper presents a new incremental algorithm which requires amortised cost of O(m) per evaluation on realistic data sets (where m is the average length of an EST). In addition, two filtering heuristics are presented which improve clustering performance by estimating upper bounds on the d² scores.
... Computing an optimal local alignment score can be done with the well-known dynamic programming algorithm of Smith-Waterman. To improve performance, heuristic algorithms like BLAST [ Another dissimilarity measure using the word frequency vectors is referred to as d 2 , [33] ...
Conference Paper
Full-text available
We present a method for evaluating the suitability of different string dissimilarity measures and clustering algorithms for EST clustering, one of the main techniques used in transcriptome projects. The method comprises generating simulated ESTs with user-specified parameters, and then evaluating the quality of clusterings produced when different dissimilarity measures and different clustering algorithms are used. We implemented two tools to do this: ESTSim (EST simulator), which generates simulated EST sequences from mRNAs/cDNAs using user-specified parameters, and ECLEST (evaluator for clusterings of ESTs), which computes and evaluates a clustering of a set of input ESTs, where the dissimilarity measure, the clustering algorithm, and the clustering validity index can be specified independently. We demonstrate the method on a sample of 699 cDNAs, generating approximately 16,000 simulated ESTs. We conducted two experiments and derived statistically significant results from this study comparing subword-based dissimilarity measures to alignment-based ones.
... The simplest similarity measurement uses the Euclidean distance of the k-word frequency, which was introduced in 1986 [2]. The method was extended by Pevzner and Torney's group by applying filtration techniques to deduce several of the characteristic measures for search optimization [12], and weights of individual k-tuples were added to the Euclidean distance measurement to maximize the variance of reference sequences with regard to random sequences [4]. The Euclidean distance is further explored in biological sequence comparison with the Standard Euclidean Distance and the Mahalanobis distance, which consider the variances of k-words in the calculation [20,21]. ...
Conference Paper
Full-text available
Biological sequence comparison faces various challenges. Although dynamic-programming-based solutions are claimed to be optimal for the comparison process, computational limitations and some fundamental challenges still make them inefficient for mass sequence comparison. Statistical methods explore the statistics of sequences through the frequency of the words in the sequence; they provide a comparison solution without loss of statistical information and also address some of the fundamental problems in sequence comparison. Normalized Google Distance is a way of finding semantic similarity in web pages, with significant related characteristics; in this research, we propose an algorithm that integrates Normalized Google Similarity into protein sequence comparison.
... The fact that the frequency of different words may have a different impact on the standard Euclidean distance between specific words has been explored in the literature to derive weighted measures. The earliest work calculated the weights of individual L-tuples in order to maximize the variance of reference sequences with regard to random sequences (Torney et al., 1990). This approach maximizes the discrimination of reference sequence families. ...
Article
Full-text available
Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. The overwhelming majority of work on alignment-free sequence has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed-methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as pre-selection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available at http://bioinformatics.musc.edu/resources.html
... Several distance functions, such as character based, feature based, and conditional probability distribution based, have been proposed [14]. Edit distance is an example of character based distance measure, and d2 is a feature based one [12]. While these two measures are not proper choices in measuring the similarity of sequences, conditional probability distribution based distance gives acceptable results [14]. ...
Article
Full-text available
Sequences are one of the most important types of data. Recently, mining and analysis of sequence data has been studied in several fields. Sequence database mining and change mining is an example of data mining to study temporal data. Specific changes might be important to decision makers in different time periods to schedule future activities. Working with long sequences requires a useful method. This paper presents a study on similarity measures and the ranking of sequence data. We employed a sequence distance function based on structural features to measure the similarity, and a multi-criteria decision-making technique to rank them.
... Tools such as STACK [37] (with a clustering phase made by d2 cluster [26]) performed such tasks. d2 cluster [26] was named after the d2 distance of dissimilarity between two sequences introduced by Torney [232], based on the comparison of the multiplicity of words between them (d2 = 0 meaning that two sequences are identical). After the similarity computation, many of these tools used single-linkage transitive closure algorithms (i.e. ...
Thesis
Full-text available
The purpose of this thesis work is to allow the processing of transcriptome sequencing data, i.e. messenger RNA sequences, which reflect gene expression. More precisely, it is a question of taking advantage of the characteristics of the data produced by the new sequencing technologies, known as third generation (TGS). These technologies produce large sequences, which cover the total length of RNA molecules. This has the advantage of avoiding the sequence assembly phase, which was tricky, though necessary with the data generated by previous sequencing technologies called NGS. On the other hand, TGS data are noisy (up to 15% sequencing errors), requiring the development of new algorithms to analyze this data. The core work of this thesis consisted in the methodological development and implementation of new algorithms allowing the grouping of TGS sequences by gene, then their correction and finally the detection of the different isoforms of each gene.
... NCW statistic has many applications. It serves as a distance measure between texts, especially for the comparison of biological sequences [11, 17, 13, 6]. It is also important for the analysis of text algorithms and especially for pattern matching algorithms. ...
Conference Paper
The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q - 1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
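The two text statistics named in this abstract, and the string autocorrelation its method enumerates, can be illustrated with a small Python sketch. This is only a direct-counting illustration, not the paper's method for exact expected values. The function names are hypothetical, and it assumes NCW means the number of distinct q-grams occurring in both texts and that the text contains only letters of the given alphabet.

```python
def qgrams(text, q):
    """Return the set of distinct words of length q occurring in text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def common_words(s, t, q):
    """NCW (assumed definition): distinct q-grams present in both texts."""
    return len(qgrams(s, q) & qgrams(t, q))


def missing_words(text, q, alphabet):
    """NMW: words of length q over the alphabet that never occur in text.

    Assumes text contains only letters of the given alphabet.
    """
    return len(alphabet) ** q - len(qgrams(text, q))


def autocorrelation(word):
    """Entry i is True when the word overlaps itself after a shift by i,
    i.e. its suffix starting at i equals its prefix of the same length."""
    q = len(word)
    return [word[i:] == word[:q - i] for i in range(q)]
```

For example, `autocorrelation("ABAB")` marks shifts 0 and 2, the two ways that word can overlap itself.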
... Similar "alignment-free" methods have a long history in bioinformatics [13,14]. However, prior methods based on word counts have relied on short words of only a few nucleotides, which lack the power to differentiate between closely related sequences and produce distance measures that can be difficult to interpret [15][16][17][18]. Alternatively, methods based on string matching can produce very accurate estimates of mutation distance, but must process the entire sequence with each comparison, which is not feasible for all-pairs comparisons [19][20][21][22]. ...
Article
Full-text available
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash). Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0997-x) contains supplementary material, which is available to authorized users.
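A rough Python sketch of the sketch-and-compare idea follows. It is an illustration only and not the Mash implementation: the hash function (BLAKE2b), the helper names, and the omission of canonical (strand-independent) k-mers are simplifying assumptions. The distance formula D = -(1/k) ln(2j / (1 + j)) is the Mash distance as published for a Jaccard estimate j.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (hash choice is arbitrary)."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest k-mer hashes of the sequence."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from the s smallest hashes of the union."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance for Jaccard estimate j and k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    return -math.log(2.0 * j / (1.0 + j)) / k
```

Because only the sketches are compared, the cost of one comparison depends on the sketch size `s` rather than on the sequence lengths.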
... As referred to in the previous review, Blaisdell's paper introduced the notion of sequence similarity measure without pre-alignment using Markov models and all L-tuple counts (Blaisdell, 1986), associated with Euclidean distances $d^{E}_{L}(X, Y)$ on this space. This metric was further extended by weighting the vectors, and named d2-distance in a subsequent paper (Torney et al., 1990). Their distributional statistical study was accomplished later (Lippert et al., 2002). ...
Article
Full-text available
Biological sequence analysis is at the core of bioinformatics, bringing together several fields, from computer science to probability and statistics. Its purpose is to computationally process and decode the information stored in biological macromolecules involved in all cell mechanisms of living organisms – such as DNA, RNA and proteins – and provide prediction tools to reveal their structure, function and complex relationship networks. Within this context several methods have arisen that analyze sequences based on alignment algorithms, ubiquitously used in most bioinformatics applications. Alternatively, although less explored in the literature, the use of vector maps for the analysis of biological sequences, both DNA and proteins, represents a very elegant proposal to extract information from those types of sequences using an alignment-free approach. This work presents an overview of alignment-free methods used for sequence analysis and comparison and the new trends of these techniques, applied to DNA and proteins. The recent endeavors found in the literature along with new proposals and widening of applications fully justifies a revisit to these methodologies, partially reviewed before (Vinga and Almeida, 2003).
... Their study demonstrated that L-word frequencies are very useful for explaining evolutionary relationships between DNA sequences, and that frequencies of longer words tend to have a distribution that is more similar to an independent sequence than that of shorter words. Torney et al. (1990) pointed out that different L-words may contribute differently to the standard Euclidean distance, which led to the exploration of the weighted Euclidean distance: ...
Article
Full-text available
This dissertation presents two statistical methodologies developed on multi-order Markov models. First, we introduce an alignment-free sequence comparison method, which represents a sequence using a multi-order transition matrix (MTM). The MTM contains information of multi-order dependencies and provides a comprehensive representation of the heterogeneous composition within a sequence. Based on the MTM, a distance measure is developed for pair-wise comparison of sequences. The new method is compared with the traditional maximum likelihood (ML) method, the complete composition vector (CCV) method and the improved version of the complete composition vector (ICCV) method using simulated sequences. We further illustrate the application of the MTM method using two real data sets, influenza A virus hemagglutinin gene sequence and complete mitochondrial genome sequences. We then present a stochastic model named Multi-Order Markov Model under Hidden States (MMMHS) for representing heterogeneous sequences. MMMHS is similar to the conventional Hidden Markov Model (HMM) and Double Chain Markov Model (DCMM) in terms of using hidden states to describe the non-homogeneity of a sequence, but it provides a more flexible dependency structure by changing the order of Markov dependency under different hidden states. We extend the forward-backward procedure to MMMHS and provide the complete model estimation procedure based on Expectation-Maximization (EM) algorithm. The method is then illustrated with applications on several real data sets, and the results are compared with that of traditional methods.
... The Euclidean distance (Eu) is defined as: ... (Torney et al., 1990) The d2 distance is defined as: ...
Article
Full-text available
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms including bacteria. Alignment-free methods based on single genome compositional information have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimilarity measures to detect HGT. By testing on simulated bacterial sequences and real data sets with known horizontal transferred genomic regions, we found that more advanced alignment-free dissimilarity measures such as CVTree and d2* that take into account the background Markov sequences can solve HGT detection problems with significantly improved performance. We also studied the influence of different factors such as evolutionary distance between host and donor sequences, size of sliding window, and host genome composition on the performances of alignment-free methods to detect HGT. Our study showed that alignment-free methods can predict HGT accurately when host and donor genomes are in different order levels. Among all methods, CVTree with word length of 3, d2* with word length 3, Markov order 1 and d2* with word length 4, Markov order 1 outperform others in terms of their highest F1-score and their robustness under the influence of different factors.
... Torney et al. [7] used the number of k-tuple matches between two sequences A and B as a statistic to measure the similarity between them. Let ...
Article
Full-text available
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
... The d² distance function uses word-frequency counts and was originally developed for database search [4]. It has, however, been successfully applied to EST clustering [1]. ...
Article
This report gives a skeleton description of the d2 algorithm used for the clustering of ex- pressed sequence tags (ESTs) in the wcd program. It describes how the algorithm works and why some design decisions were made. No experimental evidence is reported here. This is subject of ongoing research.
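As a rough illustration of a word-frequency distance of this kind, the sketch below sums the squared differences of k-word counts over two whole sequences. It is not the wcd algorithm: clustering tools typically evaluate the score over windows of the sequences, which is ignored here, and the function names and the unweighted form are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_distance(a, b, k):
    """Sum of squared differences between the k-word counts of a and b."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Counter returns 0 for words that are absent from one sequence.
    return sum((counts_a[w] - counts_b[w]) ** 2
               for w in counts_a.keys() | counts_b.keys())
```

A score of 0 means the two word-count vectors are equal. Note that different sequences can share the same counts: with k = 2, `"AACA"` and `"ACAA"` both contain AA, AC and CA exactly once.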
... Normalising probabilities over all k-mers with the same (k − 1)-mer prefix gives the probabilities $P(x_t \mid x_{t-k+1} x_{t-k+2} \ldots x_{t-1})$ of a k-th order Markov chain; the variables conditioned on are referred to as the context in the following. Statistics such as the D2-statistic allow comparisons of different k-th order Markov chains [8] for applications such as clustering [9]. Clearly, k cannot be chosen arbitrarily large as many k-mer counts will be zero, even for large genomes, as the number of k-mers grows exponentially in k. ...
Article
Full-text available
Background Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has $4^k$ formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of available fast, or even parallel software tools, prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative. Results An extensive evaluation was performed on genomes ranging from 12Mbp to 22Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling with speed-ups for long sequences close to the optimum indicated by Amdahl’s law of 3 for 4 threads and about 6 for 16 threads, respectively. Conclusions Our parallel implementation released as open-source under the GPLv3 license provides a practically useful alternative to the state-of-the-art which allows the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
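The normalisation described in the citing excerpt (dividing each k-mer count by the total over all k-mers sharing its (k − 1)-mer prefix) can be sketched in Python for a fixed context length. This is a plain fixed-order estimate and not the paper's VLMC construction; the function name is hypothetical, and no pseudocounts or pruning are applied.

```python
from collections import Counter, defaultdict


def transition_probabilities(seq, k):
    """Estimate P(last letter | (k-1)-mer context) from the k-mer counts of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Total count of all k-mers that share the same (k-1)-mer prefix.
    prefix_totals = Counter()
    for word, n in counts.items():
        prefix_totals[word[:-1]] += n

    probs = defaultdict(dict)
    for word, n in counts.items():
        context, letter = word[:-1], word[-1]
        probs[context][letter] = n / prefix_totals[context]
    return dict(probs)
```

Contexts that never occur in the sequence are simply absent from the result, which reflects the sparsity problem the excerpt mentions for large k.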
... 2. D2 [16,17] is a common statistic employed for comparing k-mers; it is defined as the count of k-mers shared between two sequences. D2S [18,19] is a variant of D2, in which the D2 score for a shared k-mer is normalised based on the probability of occurrences of that k-mer in the sequences. ...
Chapter
Inferring phylogenetic relationships among hundreds or thousands of microbial genomes is an increasingly common task. The conventional phylogenetic approach adopts multiple sequence alignment to compare gene-by-gene, concatenated multigene or whole-genome sequences, from which a phylogenetic tree would be inferred. These alignments follow the implicit assumption of full-length contiguity among homologous sequences. However, common events in microbial genome evolution (e.g., structural rearrangements and genetic recombination) violate this assumption. Moreover, aligning hundreds or thousands of sequences is computationally intensive and not scalable to the rate at which genome data are generated. Therefore, alignment-free methods present an attractive alternative strategy. Here we describe a scalable alignment-free strategy to infer phylogenetic relationships using complete genome sequences of bacteria and archaea, based on short, subsequences of length k (k-mers). We describe how this strategy can be extended to infer evolutionary relationships beyond a tree-like structure, to better capture both vertical and lateral signals of microbial evolution.
Article
Full-text available
Recently a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, firstly, $${C}_{l}^{*}$$ and $${C}_{l}^{S}$$, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and secondly, $$\overline{{C}_{2}^{*}}$$, $$\overline{{C}_{2}^{S}}$$ and $$\overline{{C}_{2}^{geo}}$$, averages of sums of pairwise comparison statistics. The two tasks we consider are, firstly, to identify sequences which are similar to a set of target sequences, and, secondly, to measure the similarity within a set of sequences. Our investigation uses both simulated data as well as cis-regulatory module (CRM) data where the task is to identify CRMs with similar transcription factor binding sites. We find that while for real data all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics. Our implementation of the five statistics is available as R package named "multiAlignFree" at http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html. Contact: reinert@stats.ox.ac.uk. Supplementary information: Results on other variants of the statistics are given in the supplementary materials.
Article
Full-text available
When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. No positional information is used in the count, but it has the virtue that the comparison time is linear with sequence length. For this reason this statistic D(2) and certain transformations of D(2) are used for EST sequence database searches. In this paper we begin the rigorous study of the statistical distribution of D(2). Using an independence model of DNA sequences, we derive limiting distributions by means of the Stein and Chen-Stein methods and identify three asymptotic regimes, including compound Poisson and normal. The compound Poisson distribution arises when the word size k is large and word matches are rare. The normal distribution arises when the word size is small and matches are common. Explicit expressions for what is meant by large and small word sizes are given in the paper. However, when word size is small and the letters are uniformly distributed, the anticipated limiting normal distribution does not always occur. In this situation the uniform distribution provides the exception to other letter distributions. Therefore a naive, one distribution fits all, approach to D(2) statistics could easily create serious errors in estimating significance.
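The word-match count described in this abstract can be sketched in a few lines of Python. The sketch assumes the usual formulation in which D2 is the sum, over all k-letter words, of the product of that word's counts in the two sequences, which equals the number of matching pairs of word positions. The function names are chosen for illustration only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(a, b, k):
    """Number of k-word matches between a and b, ignoring position."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Each occurrence in a pairs with each occurrence of the same word in b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Building the two count tables takes one pass over each sequence, which is where the linear comparison time comes from. For instance, `d2_statistic("AAAA", "AAA", 2)` pairs the three occurrences of AA in the first sequence with the two in the second.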
Conference Paper
The artificial intelligence (AI) languages of logic programming and deductive databases are simple, powerful tools for genomic research. Their simplicity and power in solving the restriction mapping problem for probed partial experiments are demonstrated, and the more traditional Prolog language is compared with the newer logical data language (LDL). The comparisons are made with respect to procedural control, declarativeness, ease of code modification, and efficiency. While a Prolog program means logic plus control, an LDL program means logic plus little control because its compiler takes care of most of the control problem. While Prolog works top-down and is more efficient, LDL works bottom-up and is easier to use.
Article
Full-text available
Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.
Preprint
Full-text available
Background: Taxonomic classification of microbiomes has provided tremendous insight into the underlying genome dynamics of microbial communities but has relied on known microbial genomes contained in curated reference databases. Methods: We propose K-core graph decomposition as a novel approach for tracking metagenome dynamics that is taxonomy-oblivious. K-core performs hierarchical decomposition, partitioning the graph into shells called K-shells that contain nodes having degree at least K, with O(E + V) complexity. Results: The results of the paper are two-fold: (1) KOMB can identify homologous regions efficiently in metagenomes, (2) KOMB reveals community profiles that capture intra- and inter-genome dynamics, as supported by our results on simulated, synthetic, and real data. Software availability: KOMB is available for use on Linux systems at https://gitlab.com/treangenlab/komb.git
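To make the K-core decomposition mentioned above concrete, here is a minimal sketch of the standard peeling procedure on an undirected graph stored as an adjacency dictionary. It is not KOMB's implementation, and the name `core_numbers` is hypothetical. For brevity it picks the minimum-degree node by a linear scan, so it does not reach the O(E + V) bound quoted in the abstract; a bucket queue would be needed for that.

```python
def core_numbers(adj):
    """Return {node: core number} for an undirected graph.

    adj maps each node to the set of its neighbours. Nodes are peeled in
    order of current degree; a node's core number is the largest minimum
    degree seen up to the moment it is removed.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Simple linear scan for the node of smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


if __name__ == "__main__":
    # A triangle (a, b, c) with one pendant node d attached to a.
    graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
    print(core_numbers(graph))
```

Nodes sharing the same core number form one shell, so grouping the returned dictionary by value gives the shell partition.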
Article
A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. Also, a simple way of combining distances obtained at different n-words is considered. The goal is to obtain a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
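The difference between the Euclidean and the standardized Euclidean distance on n-word frequencies can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: both distances are returned in squared form, the per-word variances are supplied by the caller (the abstract does not say how they are obtained), and the helper names are invented. The Mahalanobis distance is omitted because it additionally needs the inverse of the full covariance matrix between word frequencies.

```python
from collections import Counter


def word_frequencies(seq, n):
    """Relative frequency of each overlapping n-word in seq."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def squared_euclidean(f, g):
    """Sum of squared frequency differences over all words seen."""
    words = set(f) | set(g)
    return sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in words)


def squared_standardized_euclidean(f, g, variance):
    """As above, but each term is divided by that word's variance.

    variance maps word -> variance of its frequency (assumed given).
    """
    words = set(f) | set(g)
    return sum(
        (f.get(w, 0.0) - g.get(w, 0.0)) ** 2 / variance[w] for w in words
    )


if __name__ == "__main__":
    f = word_frequencies("ACGT", 2)
    g = word_frequencies("ACAC", 2)
    print(squared_euclidean(f, g))
```

With all variances equal to 1 the two distances coincide; words whose frequencies vary widely are down-weighted once real variances are supplied.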
Article
Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
Article
A number of algorithms exist for searching sequence databases for biologically significant similarities based on the primary sequence similarity of aligned sequences. We have determined the biological sensitivity and selectivity of d2, a high-performance comparison algorithm that rapidly determines the relative dissimilarity of large datasets of genetic sequences. d2 uses sequence-word multiplicity as a simple measure of dissimilarity. It is not constrained by the comparison of direct sequence alignments and so can use word contexts to yield new information on relationships. It is extremely efficient, comparing a query of length 884 bases (INS1ECLAC) with 19,540,603 bases of the bacterial division of GenBank (release 76.0) in 51.77 CPU seconds on a Cray Y/MP-48 supercomputer. It is unique in that subsequences (words) of biological interest can be weighted to improve the sensitivity and selectivity of a search over existing methods. We have determined the ability of d2 to detect biologically significant matches between a query and large datasets of DNA sequences while varying parameters such as word-length and window size. We have also determined the distribution of dissimilarity scores within eukaryotic and prokaryotic divisions of GenBank. We have optimized parameters of the d2 program using Cray hardware and present an analysis of the sensitivity and selectivity of the algorithm. A theoretical analysis of the expectation for scores is presented. This work demonstrates that d2 is a unique, sensitive, and selective method of rapid sequence comparison that can detect novel sequence relationships which remain undetected by alternate methodologies.
Article
When searching for local similarities in long sequences, the need for a rapid similarity search becomes acute. The quadratic complexity of dynamic programming algorithms forces the use of filtration methods that eliminate sequences with a low similarity level. The paper is devoted to the theoretical justification of a filtration method based on the statistical distance between texts. The notion of filtration efficiency is introduced and the efficiency of several filters is estimated. It is shown that the efficiency of statistical l-tuple filtration in DNA database search is associated with a potential extension of the original four-letter alphabet and grows exponentially with increasing l. A formula that allows one to estimate the filtration parameters is presented.
Article
Full-text available
Transcription factor binding site (TFBS) motifs can be accurately represented by position frequency matrices (PFM) or other equivalent forms. We often need to compare TFBS motifs using their PFMs in order to search for similar motifs in a motif database, or cluster motifs according to their binding preference. The majority of current methods for motif comparison involve a similarity metric for column-to-column comparison and a method to find the optimal position alignment between the two compared motifs. In some applications, alignment-free methods might be preferred; however, few such methods with high accuracy have been described. Here we describe a novel alignment-free method for quantifying the similarity of motifs using their PFMs by converting PFMs into k-mer vectors. The motifs could then be compared by measuring the similarity among their corresponding k-mer vectors. We demonstrate that our method in general achieves similar performance or outperforms the existing methods for clustering motifs according to their binding preference and identifying similar motifs of transcription factors of the same family.
Article
Whole genome sequences are generally accepted as excellent tools for studying evolutionary relationships. Due to the problems caused by the uncertainty in alignment, existing tools for phylogenetic analysis based on multiple alignments could not be directly applied to whole-genome comparison and phylogenomic studies. There has been a growing interest in alignment-free methods for phylogenetic analysis using complete genome data. The “distances” used in these alignment-free methods are not proper distance metrics in the strict mathematical sense. In this study, we first review them in a more general framework, that of dissimilarity. Then we propose some new dissimilarities for phylogenetic analysis. Lastly, three genome datasets are employed to evaluate these dissimilarities from a biological point of view.
Article
The weighted Euclidean distance (D(2)) is one of the earliest dissimilarity measures used for alignment-free comparison of biological sequences. Because it is fast to compute, this distance measure has been used in numerous applications, and many variants of it have subsequently been introduced. The D(2) distance measure is based on the count of k-words in the two sequences that are compared. Traditionally, all k-words are compared when computing the distance. In this paper we show that similar accuracy in sequence comparison can be achieved by using a selected subset of k-words. We introduce a term-variance-based quality measure for identifying the important k-words. We demonstrate the application of the proposed technique in phylogeny reconstruction and show that up to 99% of the k-words can be filtered out for certain datasets, resulting in faster sequence comparison. The paper also presents an exploratory-analysis-based evaluation of optimal k-word values and discusses the impact of using subsets of k-words in such optimal instances.
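The idea of comparing only a selected subset of k-words can be sketched as below. The abstract does not define its quality measure, so this example stands in the plain population variance of each word's count across the input sequences, which is an assumption. The names `select_kwords` and `subset_distance` and the unweighted squared distance are likewise illustrative only.

```python
from collections import Counter
from statistics import pvariance


def kword_counts(seq, k):
    """Count every overlapping k-word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def select_kwords(seqs, k, keep):
    """Return the `keep` k-words whose counts vary most across seqs.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = [kword_counts(s, k) for s in seqs]
    words = set().union(*counts)
    variance = {w: pvariance([c[w] for c in counts]) for w in words}
    ranked = sorted(words, key=lambda w: (-variance[w], w))
    return ranked[:keep]


def subset_distance(x, y, k, words):
    """Squared Euclidean distance on k-word counts, restricted to `words`."""
    cx, cy = kword_counts(x, k), kword_counts(y, k)
    return sum((cx[w] - cy[w]) ** 2 for w in words)


if __name__ == "__main__":
    seqs = ["AAAA", "ACAC", "CCCC"]
    chosen = select_kwords(seqs, 2, keep=2)
    print(chosen, subset_distance(seqs[0], seqs[2], 2, chosen))
```

Once the subset is fixed, each pairwise comparison touches only the retained words, which is the source of the speed-up the abstract reports.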