Conference Paper

Lecture Notes in Computer Science


Abstract

In this paper we define a new problem, motivated by computational biology: $LCSk$, which aims at finding the maximal number of $k$-length substrings matching in both input strings while preserving their order of appearance. The traditional LCS definition is the special case of our problem where $k = 1$. We provide an algorithm solving the general case in $O(n^2)$ time, where $n$ is the length of the input strings, equaling the time required for the special case $k = 1$. The space requirement of the algorithm is $O(kn)$. We also define a complementary $EDk$ distance measure and show that $EDk(A,B)$ can be computed in $O(nm)$ time and $O(km)$ space, where $m$ and $n$ are the lengths of the input sequences $A$ and $B$, respectively.
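To make the definition concrete, the following is a minimal sketch (illustrative code, not the authors' implementation) of the quadratic dynamic program suggested by the problem statement: match lengths are maintained incrementally so each $k$-length comparison costs $O(1)$, giving $O(n^2)$ time. This naive version keeps full tables, whereas the paper reduces the space to $O(kn)$.

```python
def lcsk(A, B, k):
    """Sketch of an LCSk dynamic program; dp[i][j] = LCSk of A[:i] and B[:j]."""
    n, m = len(A), len(B)
    # match[i][j]: length of the longest common suffix of A[:i] and B[:j]
    match = [[0] * (m + 1) for _ in range(n + 1)]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                match[i][j] = match[i - 1][j - 1] + 1
            best = max(dp[i - 1][j], dp[i][j - 1])
            if match[i][j] >= k:            # A[i-k:i] == B[j-k:j]
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]
```

For example, lcsk("abcabc", "abcxabc", 3) returns 2, since the two occurrences of "abc" can be matched in order.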
Article
Multivariate data sets (MDSs), with enormous size and a certain ratio of noise/outliers, are generated routinely in various application domains. A major issue, tightly coupled with these MDSs, is how to compute their similarity indexes with available resources in the presence of noise/outliers, which is addressed with the development of both classical and non-metric based approaches. However, classical techniques are sensitive to outliers, and most of the non-classical approaches are either problem/application specific or overly complex. Therefore, the development of an efficient and reliable algorithm for MDSs, with minimum time and space complexity, is highly encouraged by the research community. In this paper, a non-metric based similarity measure algorithm for MDSs is presented that solves the aforementioned issues, particularly noise and computational time, successfully. This technique finds the similarity indexes of noisy MDSs, of both equal and variable sizes, while utilizing the minimum possible resources, i.e., space and time. Experiments were conducted with both benchmark and real-time MDSs to evaluate the proposed algorithm's performance against its rival algorithms, which are traditional dynamic-programming-based and sequential similarity measure algorithms. Experimental results show that the proposed scheme performs exceptionally well, in terms of time and space, compared with its counterpart algorithms, and effectively tolerates a considerable portion of noisy data.
Conference Paper
Alignment of sequence reads is an important step of many bioinformatics workflows. While the alignment of short reads is well investigated, the alignment of long reads produced by third-generation sequencing technologies, such as Oxford Nanopore, is more challenging because they have high error rates (10-40%). Furthermore, due to their different algorithmic approaches, different tools produce varied alignments, significantly influencing the downstream analyses. In this study, we evaluated the performance of three alignment tools (LAST, GraphMap, and NanoBLASTer) using simulated nanopore reads. Although the three alignment strategies gave similar results (e.g., all close to 100% precision), GraphMap reported the longest alignments while LAST reported the shortest. However, GraphMap showed the lowest recall (90%), indicating a high false negative rate. While GraphMap had the highest percentage of reads mapped to the correct reference regions, NanoBLASTer and especially LAST mapped the majority of the reads only partially correctly. Based on our multiple statistics, GraphMap had the best overall performance.
Article
Two space-efficient algorithms to solve the LCSk problem and the LCS≥k problem are presented in this paper. The algorithms improve the time and space complexities of the algorithms of Benson et al. The space cost of the first algorithm, solving the LCSk problem, is reduced from O(n2) to O(kn), if the sizes of the two input sequences are both n. The time and space costs of the second algorithm, solving the LCS≥k problem, are both improved. The time cost is reduced from O(kn2) to O(n2), and the space cost is reduced from O(n2) to O(kn). In the case of k=O(1), the two algorithms are both linear space algorithms.
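The O(kn) space bound follows from the observation that the LCSk recurrence looks back only one row for the skip cases and k rows for a k-length match, so k rows of the table suffice at any time. Below is a rolling-rows sketch of that idea (hypothetical code, not taken from the article).

```python
from collections import deque

def lcsk_rolling(A, B, k):
    """Rolling-rows sketch: dp row i needs only rows i-1 and i-k,
    so only k rows of the table are kept, using O(k*m) space."""
    n, m = len(A), len(B)
    dp_rows = deque([[0] * (m + 1) for _ in range(k)], maxlen=k)
    match_prev = [0] * (m + 1)       # longest-common-suffix lengths, row i-1
    for i in range(1, n + 1):
        dp_prev = dp_rows[-1]        # row i-1
        dp_back = dp_rows[0]         # row i-k (a zero row while i <= k)
        dp_cur = [0] * (m + 1)
        match_cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                match_cur[j] = match_prev[j - 1] + 1
            best = max(dp_prev[j], dp_cur[j - 1])
            if match_cur[j] >= k:    # implies i >= k and j >= k
                best = max(best, dp_back[j - k] + 1)
            dp_cur[j] = best
        dp_rows.append(dp_cur)       # automatically drops row i-k
        match_prev = match_cur
    return dp_rows[-1][m]
```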
Article
Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. Contact: cjustin@bcgsc.ca, ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Computing a similarity measure of a given set of molecular sequences is an important task in bioinformatics studies. Weighted sequences have become an interesting research area since they allow a newer and more precise encoding paradigm of molecular structures. The longest common subsequence (LCS) has been an extensively studied technique to compute similarity on sequences represented as strings, and it has been used in many applications. There is a current trend to generalize those algorithms to work on weighted sequences too. The resulting variant of the problem is called the weighted LCS. In this paper, we study the problem of finding the weighted LCS of two weighted sequences. In particular, a novel approach is presented to tackle the weighted LCS for a bounded molecular alphabet constrained by one or two parameters. Based on the dominant-match-point paradigm, we model the problem using a multiobjective optimization approach. As a result, we propose a novel, efficient and exact algorithm that finds not only the weighted LCS but also the set of all possible solutions. We perform an experimental analysis using simulated and real data to compare the performance of the proposed approach. The experiments show that the proposed algorithm performs well on small instances of both benchmarks. Furthermore, it can be used in a great number of bioinformatics applications where the computation of similarity between short sequence fragments is needed.
Article
Realizing the democratic promise of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. Here we present GraphMap, a mapping algorithm designed to analyse nanopore sequencing reads, which uses progressive refinement of candidate alignments to robustly handle potentially high error rates and a fast graph traversal to align long reads with speed and high precision (>95%). Evaluation on MinION sequencing data sets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by 10-80% and maps >95% of bases. GraphMap alignments enabled single-nucleotide variant calling on the human genome with increased sensitivity (15%) over the next best mapper, precise detection of structural variants from length 100 bp to 4 kbp, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.
Article
In this paper we present $LCSk$++: a new metric for measuring the similarity of long strings, and provide an algorithm for its efficient computation. With the ever-increasing size of strings occurring in practice, e.g. large genomes of plants and animals, classic algorithms such as Longest Common Subsequence (LCS) fail due to demanding computational complexity. Recently, Benson et al. defined a similarity metric named $LCSk$. By relaxing the requirement that the $k$-length substrings should not overlap, we extend their definition into a new metric. An efficient algorithm is presented which computes $LCSk$++ with complexity of $O((|X|+|Y|)\log(|X|+|Y|))$ for strings $X$ and $Y$ under a realistic random model. The algorithm has been designed with implementation simplicity in mind. Additionally, we describe how it can be adjusted to compute $LCSk$ as well, which gives an improvement over the $O(|X| \cdot |Y|)$ algorithm presented in the original $LCSk$ paper.
Article
Finding the longest common subsequence in $k$-length substrings (LCS$k$) is a recently proposed problem motivated by computational biology. This is a generalization of the well-known LCS problem in which matching symbols from two sequences $A$ and $B$ are replaced with matching non-overlapping substrings of length $k$ from $A$ and $B$. We propose several algorithms for LCS$k$, being non-trivial incarnations of the major concepts known from LCS research (dynamic programming, sparse dynamic programming, tabulation). Our algorithms make use of a linear-time and linear-space preprocessing finding the occurrences of all the substrings of length $k$ from one sequence in the other sequence.
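As an illustration of the kind of preprocessing referred to above, the sketch below collects, for every length-k substring of one sequence, its occurrences in the other using a hash map. This is a simplified expected-time variant (the substring slicing adds a factor of k that a rolling hash would remove); it is not the article's exact worst-case linear-time construction.

```python
from collections import defaultdict

def k_length_matches(A, B, k):
    """Hash-map sketch: list all pairs (i, j) with A[i:i+k] == B[j:j+k]."""
    positions_in_B = defaultdict(list)
    for j in range(len(B) - k + 1):
        positions_in_B[B[j:j + k]].append(j)
    matches = []
    for i in range(len(A) - k + 1):
        for j in positions_in_B.get(A[i:i + k], ()):
            matches.append((i, j))
    return matches
```

Such match pairs are exactly the input that sparse dynamic programming approaches to LCSk operate on.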
Article
Motivation: Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations composed of AND, OR, XOR, complement, shift and addition. Bit-parallelism has been successfully applied to the longest common subsequence (LCS) and edit-distance problems, producing fast algorithms in practice. Results: We have developed BitPAl, a bit-parallel algorithm for general, integer-scoring global alignment. Integer-scoring schemes assign integer weights for match, mismatch and insertion/deletion. The BitPAl method uses structural properties in the relationship between adjacent scores in the scoring matrix to construct classes of efficient algorithms, each designed for a particular set of weights. In timed tests, we show that BitPAl runs 7–25 times faster than a standard iterative algorithm. Availability and implementation: Source code is freely available for download at http://lobstah.bu.edu/BitPAl/BitPAl.html. BitPAl is implemented in C and runs on all major operating systems. Contact: jloving@bu.edu or yhernand@bu.edu or gbenson@bu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Conference Paper
The problem of aligning two sequences A and B to determine their similarity is one of the fundamental problems in pattern matching. A challenging, basic variation of the sequence similarity problem is the incremental string comparison problem, denoted Consecutive Suffix Alignment, which is, given two strings A and B, to compute the alignment solution of each suffix of A versus B. Here, we present two solutions to the Consecutive Suffix Alignment Problem under the LCS metric. The first solution is an O(nL) time and space algorithm for constant alphabets, where n is the size of the compared strings and L ≤ n denotes the size of the LCS of A and B. The second solution is an O(nL + n log|Σ|) time and O(L) space algorithm for general alphabets, where Σ denotes the alphabet of the compared strings. (Note that |Σ| ≤ n.)
Conference Paper
An arc-annotated sequence is a sequence, over a given alphabet, with additional structure described by a set of arcs, each arc joining a pair of positions in the sequence. As a natural extension of the longest common subsequence problem, Evans introduced the Longest Arc-Preserving Common Subsequence (LAPCS) problem as a framework for studying the similarity of arc-annotated sequences. This problem has been studied extensively in the literature due to its potential application for RNA structure comparison, but also because it has a compact definition. In this paper, we focus on the nested case where no two arcs are allowed to cross because it is widely considered the most important variant in practice. Our contributions are threefold: (i) we revisit the nice NP-hardness proof of Lin et al. for LAPCS(Nested, Nested), (ii) we improve the running time of the FPT algorithm of Alber et al. from \(O(3.31^{k_1 + k_2} n)\) to \(O(3^{k_1 + k_2} n)\), where \(k_1\) and \(k_2\) deletions from the first and second sequence, respectively, are needed to obtain an arc-preserving common subsequence, and (iii) we show that LAPCS(Stem, Stem) is NP-complete for constant alphabet size.
Article
The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. An algorithm is presented which will solve this problem in quadratic time and in linear space.
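The key idea behind the linear-space bound is a divide-and-conquer scheme: compute one row of LCS lengths for the first half of A forwards and for the reversed second half backwards, pick the split point of B that maximizes the combined score, and recurse. The sketch below follows that scheme in the spirit of the article; it is illustrative code, not the published algorithm verbatim.

```python
def lcs_last_row(A, B):
    """Last row of the LCS length table for A vs B, using O(len(B)) space."""
    prev = [0] * (len(B) + 1)
    for a in A:
        cur = [0] * (len(B) + 1)
        for j, b in enumerate(B, 1):
            cur[j] = prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1])
        prev = cur
    return prev

def lcs_linear_space(A, B):
    """Divide-and-conquer LCS: quadratic time, linear space (a sketch)."""
    if not A or not B:
        return ""
    if len(A) == 1:
        return A if A in B else ""
    mid = len(A) // 2
    left = lcs_last_row(A[:mid], B)
    right = lcs_last_row(A[mid:][::-1], B[::-1])
    # choose the split of B that maximizes the combined LCS length
    split = max(range(len(B) + 1), key=lambda j: left[j] + right[len(B) - j])
    return lcs_linear_space(A[:mid], B[:split]) + lcs_linear_space(A[mid:], B[split:])
```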
Conference Paper
The Longest Common Subsequence (LCS) of two or more strings is a fundamental, well-studied problem which has a wide range of applications throughout the computational sciences. When the common subsequence must contain one or more constraint strings as subsequences, the problem becomes the Constrained LCS (CLCS) problem. In this paper we consider the Restricted LCS (RLCS) problem, where our goal is finding a longest common subsequence between two or more strings that does not contain a given set of restriction strings as subsequences. First we show that in the case of two input strings and an arbitrary number of restriction strings the RLCS problem is NP-hard. Afterwards, we present a dynamic programming solution for RLCS and show that this algorithm implies that RLCS is in FPT when parameterized by the total length of the restriction strings. In the last part of this paper we present two approximation algorithms for the hard variants of the problem.
Article
A classical measure of similarity between strings is the length of the longest common subsequence (LCS) between the two given strings. The search for efficient algorithms for finding the LCS has been going on for more than three decades. To date, all known algorithms may take quadratic time (shaved by logarithmic factors) to find large LCS. In this paper, the problem of approximating LCS is studied, while focusing on the hard inputs for this problem, namely, approximating LCS of near-linear size in strings over a relatively large alphabet (of size at least n^ε for some constant ε > 0, where n is the length of the string). We show that any given string over a relatively large alphabet can be embedded into a locally non-repetitive string. This embedding has a negligible additive distortion for strings that are not too dissimilar in terms of the edit distance. We also show that LCS can be efficiently approximated in locally non-repetitive strings. Our new method (the embedding together with the approximation algorithm) gives a strictly sub-quadratic time algorithm (i.e., of complexity O(n^(2-ε)) for some constant ε) which can find common subsequences of linear (and near-linear) size that cannot be detected efficiently by the existing tools.
Article
A longest-common-subsequence algorithm is described which operates in terms of bit or bit-string operations. It offers a speedup of the order of the word-length on a conventional computer.
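A well-known bit-vector formulation of LCS length computation packs one row of the dynamic programming table into a machine word and updates it with a handful of arithmetic and logical operations per input character. The sketch below conveys the idea (using Python's arbitrary-precision integers in place of machine words); it is not necessarily the exact formulation of the article.

```python
def lcs_length_bitparallel(A, B):
    """Bit-vector LCS length: the whole DP row for B is packed into the
    integer V; the number of 0 bits in V at the end equals the LCS length."""
    n = len(B)
    full = (1 << n) - 1
    # PM[c]: bit j set iff B[j] == c
    PM = {}
    for j, c in enumerate(B):
        PM[c] = PM.get(c, 0) | (1 << j)
    V = full                             # all ones: no increments yet
    for a in A:
        U = V & PM.get(a, 0)
        V = ((V + U) | (V - U)) & full   # one row update in O(n / word size)
    return n - bin(V).count("1")
```

For example, lcs_length_bitparallel("AGCAT", "GAC") returns 2.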
Article
The Common Substring Alignment Problem is defined as follows: given a set of one or more strings S1, S2, …, Sc and a target string T, where Y is a common substring of all strings Si, that is, Si = BiYFi, the goal is to compute the similarity of all strings Si with T without computing the part corresponding to Y again and again. Using the classical dynamic programming tables, each appearance of Y in a source string would require the computation of all the values in a dynamic programming table of size O(nℓ), where ℓ is the size of Y. Here we describe an algorithm which is composed of an encoding stage and an alignment stage. During the first stage, a data structure is constructed which encodes the comparison of Y with T. Then, during the alignment stage, for each comparison of a source Si with T, the pre-compiled data structure is used to speed up the part corresponding to Y. We show how to reduce the O(nℓ) alignment work, for each appearance of the common substring Y in a source string, to O(n), at the cost of O(nℓ) encoding work, which is executed only once.