Article

Cache-Oblivious Dynamic Programming for Bioinformatics


Abstract

We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics, including the longest common subsequence, global pairwise sequence alignment and three-way sequence alignment (or median), both with affine gap costs, and RNA secondary structure prediction with simple pseudoknots. For each of these problems, we present cache-oblivious algorithms that match the best-known time complexity, match or improve the best-known space complexity, and improve significantly over the cache-efficiency of earlier algorithms. We present experimental results which show that our cache-oblivious algorithms run faster than existing software and implementations based on the best previous algorithms for these problems.


... In the case of DP programs, a majority use iterative codes. While most parallel implementations are for shared-memory systems, there exist only a few hand-tuned distributed-memory implementations [1,2,21,28,39,41,42]. Recent work has shown that efficient recursive formulations of DP algorithms outperform iterative codes on shared-memory systems, primarily because of better cache utilization [11,12]. Regarding distributing the computation, recursive DP algorithms involve irregular communication patterns and offer a specialized (or simplified) domain. ...
... Distributed-memory parallel programs, and simplifying their creation, have been extensively studied for many decades with a focus on iterative formulations [3,6,7,9,22,26,32,33]. Recent research has shown that recursive formulations have good locality properties, readily expose parallelism, and hence can be as effective as iterative codes if implemented carefully [10,11,27]. In D2P we identify additional properties that, when true for recursive formulations, simplify the creation of distributed-memory parallel programs. ...
... Galil et al. [19] categorize DP algorithms based on the dependency patterns and design SHM parallel iterative algorithms. Chowdhury et al. [10][11][12] design SHM parallel recursive DP algorithms and automate their design too. In addition, they show that SHM parallel recursive DP implementations adapt better in the presence of cache sharing, are an order of magnitude faster, and have more predictable runtimes than their tiled iterative counterparts [10]. ...
Conference Paper
Recursive formulations of programs are straightforward to reason about and write, often have good locality properties, and readily expose parallelism. We observe that it is easier to automatically generate distributed-memory codes for recursive formulations with certain properties: i) inclusive---a recursive method's parameters summarize the data access done within the method body; ii) intersection---data-set intersection tests among method invocations can be computed efficiently. In this paper we present D2P, a system that automatically generates distributed-memory codes for recursive divide-and-conquer algorithms with these properties. D2P produces MPI-based implementations starting from shared-memory specifications of the recursive algorithms. We evaluate D2P with recursive Dynamic Programming (DP) algorithms, since these algorithms have the desired properties and are well known. We show that the generated implementations are scalable and efficient: D2P-generated implementations execute faster than implementations generated by recent distributed DP frameworks, and are competitive with (and often faster than) hand-written implementations.
... Our bounds match the best bounds known for work stealing schedulers. The class of algorithms we consider includes efficient multithreaded algorithms for several fundamental problems such as matrix multiplication [18], the Gaussian Elimination Paradigm (GEP) [11], longest common subsequence (LCS) and related dynamic programming problems [11,9], FFT [18], SPMS sorting [16], list ranking [12], and graph connectivity [12]. These are all well-known multithreaded algorithms that use parallel recursive divide and conquer. ...
... If the i-th sorting problem has size n_i and incurs S_i steals, using the SPMS cache miss bound, the bound on cache misses is O(Q + S·B + Σ_i […]). Longest Common Subsequence (LCS). LCS [11,9] is a Type 3 HBP that has a constituent that is a Type 2 HBP that finds only the length of an LCS, and one recursive constituent with three parallel calls of size n/2. There are O(4^j) tasks of size n/2^j. ...
... We apply Theorem 2.2 to several well-known algorithms, to obtain the results in Table 1. The GEP and LCS algorithms are presented in [10,9], while the others are described in [14] (where their false sharing costs were analyzed). ...
Article
We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known O(Q + S · (M/B)) caching cost for the randomized work stealing (RWS) scheduler, where S is the number of steals, Q is the sequential caching cost, and M and B are the cache size and block (or cache line) size respectively.
... Ω(δn/B) additional misses are required to write the output sequence reliably. • In Section 5 we settle challenging non-local problems that fit in the Gaussian Elimination Paradigm (GEP), by exploiting a recursive framework introduced in [17,18]. The GEP class includes problems solvable by triply-nested for loops of the type that occur in the standard Gaussian elimination algorithm, most notably matrix multiplication and Floyd-Warshall all-pairs shortest paths. ...
... In this section we show that, if we can afford a private memory of logarithmic size, we can achieve both resiliency and cache-efficiency. Our approach hinges upon a recursive framework for dynamic programming, introduced in [17,18], that we briefly describe in Section 4.1 focusing on the LCS problem. We assume that both input sequences have length n and, without loss of generality, that n and δ are powers of two (we will then show how to remove the former assumption). ...
... As shown in Figure 2a, for any subtable Q of the DP table C we can naturally identify its left, right, top, and down boundaries (denoted by L, R, T, and D) and two projections of the input sequences X and Y on Q. The algorithm presented in [17,18] is implemented by two recursive functions, Boundary and Traceback-Path, that use a divide-and-conquer strategy, logically splitting table C into four quadrants: ...
Article
Full-text available
We investigate the design of dynamic programming algorithms in unreliable memories, i.e., in the presence of errors that lead the logical state of some bits to be read differently from how they were last written. Assuming that a limited number of memory faults can be inserted at run-time by an adversary with unbounded computational power, we obtain the first resilient algorithms for a broad range of dynamic programming problems, devising a general framework that can be applied to both iterative and recursive implementations. Besides all local dependency problems, where updates to table entries are determined by the contents of neighboring cells, we also settle challenging non-local problems, such as all-pairs shortest paths and matrix multiplication. All our algorithms are correct with high probability and match the running time of their standard non-resilient counterparts while tolerating a polynomial number of faults. The recursive algorithms are also cache-efficient and can tolerate faults at any level of the memory hierarchy. Our results exploit a careful combination of data replication, majority techniques, fingerprint computations, and lazy fault detection. To cope with the complex data access patterns induced by some of our algorithms, we also devise amplified fingerprints, which might be of independent interest in the design of resilient algorithms for different problems.
... Based on these observations and the resulting optimized row worker (Figure 8), we have designed three different algorithms to solve the edit distance problem, namely: naïve, strip mining, and tiling. For each of these algorithms, we primarily focus on computing the score, as prior work has shown that the edits can be reconstructed by recomputing the required subsections of the cost matrix while tracing in the backward direction [12], [13], [14], [15]. Hence, showing that our algorithms do better in computing the score should mean that they will also perform better in computing the edits. ...
... It is also possible to use a separate two-dimensional path matrix while using linear memory space for the cost matrix, which allows linear-time reconstruction of the edits. Finally, it is also possible to use only linear memory space for the cost matrix and reconstruct the edits in quadratic time without saving any path matrix [14], [15], which is discussed later in this section. ...
... Storing the edits for each cell in the cost matrix often requires quadratic memory space which is very expensive for large string inputs. Fortunately, there are algorithms that can reproduce the edits without storing the edits initially (Hirschberg [12] and Chowdhury [14]) and without the requirement for quadratic memory space. However, such algorithms require extra O(mn) work to do so. ...
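To make the score-only computation concrete, here is a minimal two-row edit distance sketch in C++; it is illustrative only (not the cited authors' code), and the function name edit_distance_score is hypothetical. The edits themselves would then be recovered separately, e.g. by the recomputation-based schemes discussed above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Score-only Levenshtein edit distance: keeps just two rows of the cost
// matrix, so memory is linear in |y|. A minimal sketch; the cited works
// additionally reconstruct the edits, which is not shown here.
int edit_distance_score(const std::string& x, const std::string& y) {
    const std::size_t n = x.size(), m = y.size();
    std::vector<int> prev(m + 1), curr(m + 1);
    for (std::size_t j = 0; j <= m; ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= n; ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= m; ++j) {
            int sub = prev[j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1);
            curr[j] = std::min({sub, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[m];  // after the final swap, prev holds the last row
}
```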
... Based on this classification criterion, four classes of DP are defined: serial monadic, serial polyadic, nonserial monadic, and nonserial polyadic. Considering cache-efficient parallel execution, Chowdhury et al. [11] provide cache-efficient algorithms for three different classes of DP: the LDDP (Local Dependency DP) problem, GEP (Gaussian Elimination Paradigm), and the Parenthesis problem, each of which embraces one class of DP applications [12]. In this paper, we consider another classification method. ...
... The application of completely iterative DP often results in inefficient cache usage. In [11], [12], the authors proposed a cache-efficient divide-and-conquer algorithm which divides the DP problem into many subproblems and solves them concurrently in one direction. In contrast, our EasyPDP automatically partitions the DP into many blocks represented by a DAG according to the argument BlockSize set by the user, and solves each block iteratively with worker threads. ...
... We explained in Section 5.2.2 that fault tolerance is crucial and necessary for parallel DP algorithms, whereas no fault tolerance and recovery mechanisms have been considered or supported in previous work other than our EasyPDP. Chowdhury et al. [11], [12], [13], [14] proposed a cache-efficient divide-and-conquer algorithm which divides the DP problem into many subproblems and solves them concurrently in one direction. It can achieve good cache efficiency and space complexity. ...
Article
Full-text available
Dynamic programming is a popular and efficient technique in many scientific applications such as computational biology. Nevertheless, its performance is limited due to the burgeoning volume of scientific data, and parallelism is necessary and crucial to keep the computation time at acceptable levels. The intrinsically strong data dependency of dynamic programming makes it difficult and error-prone for the programmer to write a correct and efficient parallel program. Therefore this paper builds a runtime system named EasyPDP aimed at parallelizing dynamic programming algorithms on multi-core and multi-processor platforms. Under the concept of software reusability and complexity reduction of parallel programming, a DAG Data Driven Model is proposed, which supports applications with strong data interdependence relationships. Based on the model, the EasyPDP runtime system is designed and implemented. It automatically handles thread creation, dynamic data task allocation and scheduling, data partitioning, and fault tolerance. Five frequently used DAG patterns from biological dynamic programming algorithms have been put into the DAG pattern library of EasyPDP, so that the programmer can choose to use any of them according to his/her specific application. Besides, an ideal computing distribution model is proposed to discuss the optimal values for the performance tuning arguments of EasyPDP. We evaluate the performance potential and fault tolerance features of EasyPDP in a multi-core system. We also compare EasyPDP with other methods such as Block-Cycle Wavefront (BCW). The experimental results illustrate that the EasyPDP system performs well and provides an efficient infrastructure for dynamic programming algorithms.
... [8], the class of problems that can be solved resiliently via dynamic programming in the presence of faults. Hinging upon a recursive framework introduced in [10,11], we design resilient algorithms for all problems that can be solved by triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, most notably all-pairs shortest paths and matrix multiplication. Similar results also apply to the Fast Fourier Transform. ...
... for completeness. We refer to [10,11] for a detailed description and analysis. Let X and Y be two sequences of length n and m, respectively (w.l.o.g., let m ≥ n). ...
... be the subtable of the dynamic programming table C ranging from row i to row j and from column h to column k. The algorithm of [10,11] is implemented by two recursive functions, Boundary and Traceback-Path, that use a divide-and-conquer strategy, logically splitting table C into four quadrants. Boundary performs a forward computation by recursively solving four subproblems: it returns the output boundaries (R and D) of a quadrant, starting from the projections of X and Y on the quadrant and the input boundaries (L and T). ...
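For intuition, the following C++ sketch shows the quadrant-recursive evaluation order that such divide-and-conquer LCS algorithms rely on. Unlike the Boundary/Traceback-Path functions of [10,11] it fills the full table (so it is not linear-space), the identifiers are hypothetical, and a real implementation would stop the recursion at a coarser base case.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Fill DP cells (ilo..ihi) x (jlo..jhi) of the LCS-length table C by
// visiting quadrants in an order that respects the recurrence:
// top-left, then top-right and bottom-left, then bottom-right.
void fill_quadrants(std::vector<std::vector<int>>& C,
                    const std::string& X, const std::string& Y,
                    int ilo, int ihi, int jlo, int jhi) {
    if (ilo > ihi || jlo > jhi) return;
    if (ihi == ilo && jhi == jlo) {          // base case: a single cell
        int i = ilo, j = jlo;
        C[i][j] = (X[i - 1] == Y[j - 1]) ? C[i - 1][j - 1] + 1
                                         : std::max(C[i - 1][j], C[i][j - 1]);
        return;
    }
    int im = (ilo + ihi) / 2, jm = (jlo + jhi) / 2;
    fill_quadrants(C, X, Y, ilo, im, jlo, jm);          // top-left
    fill_quadrants(C, X, Y, ilo, im, jm + 1, jhi);      // top-right
    fill_quadrants(C, X, Y, im + 1, ihi, jlo, jm);      // bottom-left
    fill_quadrants(C, X, Y, im + 1, ihi, jm + 1, jhi);  // bottom-right
}

int lcs_length(const std::string& X, const std::string& Y) {
    int n = static_cast<int>(X.size()), m = static_cast<int>(Y.size());
    if (n == 0 || m == 0) return 0;
    std::vector<std::vector<int>> C(n + 1, std::vector<int>(m + 1, 0));
    fill_quadrants(C, X, Y, 1, n, 1, m);
    return C[n][m];
}
```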
Conference Paper
Full-text available
Random access memories suffer from transient errors that lead the logical state of some bits to be read differently from how they were last written. Due to technological constraints, caches in the memory hierarchy of modern computer platforms appear to be particularly prone to bit flips. Since algorithms implicitly assume data to be stored in reliable memories, they might easily exhibit unpredictable behaviors even in the presence of a small number of faults. In this paper we investigate the design of dynamic programming algorithms in faulty memory hierarchies. Previous works on resilient algorithms considered a one-level faulty memory model and, with respect to dynamic programming, could address only problems with local dependencies. Our improvement upon these works is two-fold: (1) we significantly extend the class of problems that can be solved resiliently via dynamic programming in the presence of faults, settling challenging non-local problems such as all-pairs shortest paths and matrix multiplication; (2) we investigate the connection between resiliency and cache-efficiency, providing cache-oblivious implementations that incur an (almost) optimal number of cache misses. Our approach yields the first resilient algorithms that can tolerate faults at any level of the memory hierarchy, while maintaining cache-efficiency. All our algorithms are correct with high probability and match the running time and cache misses of their standard non-resilient counterparts while tolerating a large (polynomial) number of faults. Our results also extend to the Fast Fourier Transform. 1998 ACM Subject Classification: B.8 [Performance and reliability]; F.2 [Analysis of algorithms and problem complexity]; I.2.8 [Dynamic programming].
... They raised an open problem in their paper of whether there exists a better algorithm for the general edit distance problem when allowing gaps of insertions and deletions [5] (the GAP problem). On the other hand, Chowdhury, Le, and Ramachandran [2], [3], [10] developed a set of cache-oblivious parallel (COP) and cache-efficient algorithms for DP problems with O(1) or more than O(1) dependency. Their algorithms are usually work-efficient but super-linear in time. ...
... However, their algorithm for the general GAP problem is not work-efficient. The classic COP approach [2], [3], [10] usually attains optimal work, space, and cache bounds in a cache-oblivious fashion, but with a super-linear time bound due to both the updating order imposed by the DP recurrences and the excessive control dependency introduced by their approach. In this paper, we present a new framework to parallelize a DP computation based on a novel combination of the closure method and the ND method. ...
Preprint
Dynamic programming problems have wide applications in the real world and have been studied extensively in both serial and parallel settings. In 1994, Galil and Park developed work-efficient and sublinear-time algorithms for several important dynamic programming problems based on the closure method and the matrix product method. However, in the same paper, they raised an open question of whether such an algorithm exists for the general GAP problem. In this paper, we answer their question by developing the first work-efficient and sublinear-time GAP algorithm based on the closure method and the Nested Dataflow method. We also improve the time bounds of classic work-efficient, cache-oblivious and cache-efficient algorithms for the 1D problem and the GAP problem, respectively.
... Various approaches have been introduced to reduce the complexity of dynamic programming (Qingguo, 2011; Chowdhury, 2010). Dominant points (Korkin, 2001) are minimal points in a multidimensional search space. ...
... The main idea behind the dominant point approach (Chowdhury, 2010) is to identify only the dominant point values instead of identifying the values of all positions in matrix L. It consists of the following two parts: ... Many parallel (Babu, 1997; Chen, 2006). ...
Article
A biological sequence is a single, continuous molecule of nucleic acid or protein. The Multiple Longest Common Subsequence (MLCS) problem is to find the longest subsequence shared between two or more strings, and classical methods for it are based on dynamic programming. For over 30 years, significant efforts have been made to find efficient algorithms for the MLCS problem. Many of them have been proposed for the general case of any given number of strings, and they could benefit greatly from improved computation times. Qingguo et al. proposed a new algorithm for the general case of the MLCS problem, i.e., finding the LCS of any number of strings. This algorithm is based on the dominant point approach and employs a fast divide-and-conquer technique to compute the dominant points. From this existing work, it is observed that, when applied to a case of three strings, this algorithm demonstrates the same performance as the fastest existing MLCS algorithm. When applied to more than three strings, this technique is significantly faster than the existing sequential methods, reaching up to 2-3 orders of magnitude faster speed on large-size problems. However, from our experimental results, it is observed that as the size of the data set increases, its performance decreases in terms of execution time. To overcome this major issue, we have developed an efficient model called Cache Oblivious based Multiple Longest Common Subsequence (CMLCS). From our experimental results, it is observed that our proposed work performs better than the existing system in terms of execution time and memory usage.
... The dynamic programming recurrences discussed in this paper have non-local dependencies (definition given in [57]), and we point out that they are quite different from problems like edit distance or stencil computations (e.g., [36,52,59,77]) that only have local dependencies. We did not consider other types of dynamic programming approaches like rank convergence or hybrid r-way DAC algorithms [38,79,80] that cannot guarantee processor- and cache-obliviousness simultaneously. ...
Preprint
For many cache-oblivious algorithms for dynamic programming and linear algebra, we observe that the key factor that affects the cache complexity is the number of input entries involved in each basic computation cell. In this paper, we propose a level of abstraction to capture this property, and refer to it as the k-d grid computation structure. We then show the computational lower bounds for this grid structure, and propose efficient and highly-parallel algorithms to compute such grid structures that optimize the number of arithmetic operations, parallel depth, and the cache complexity in both the classic setting when reads and writes have the same cost, and the asymmetric variant that considers writes to be more expensive than reads. Using the abstraction with the proposed algorithms as the implementation, we propose cache-oblivious algorithms for many fundamental problems with improved cache complexities in both the classic and asymmetric settings. The cache bounds are optimal in most applications we consider. Meanwhile, we also reduce the parallel depths of many problems. We believe that the novelty of our framework is of interest and leads to many new questions for future work.
... Serial and parallel implementations of recursive divide-and-conquer algorithms with optimal cache complexity have been developed and evaluated for specific dynamic programming algorithms such as the Longest Common Subsequence [10,13], the global pairwise sequence alignment problem in bioinformatics [12], the Gaussian Elimination Paradigm [14], etc. Tithi et al. [63] described a way to obtain divide-and-conquer algorithms manually for a class of dynamic programming problems. The base cases of these programs are similar to matrix multiplication computations. ...
Article
Full-text available
This paper studies two variants of tiling: iteration space tiling (or loop blocking) and cache-oblivious methods that recursively split the iteration space with divide-and-conquer. The key question to answer is when we should be using one over the other. The answer to this question is complicated for modern architectures due to a number of reasons. In this paper, we present a detailed empirical study to answer this question for a range of kernels that fit the polyhedral model. Our study is based on a generalized cache-oblivious code generator that supports this class, which is a superset of those supported by existing tools. The conclusion is that cache-oblivious code is most useful when the aim is to have reduced off-chip memory accesses, e.g., lower energy, although certain situations that diminish its effectiveness exist.
... Dynamic programs are usually described through recurrence relations that specify how to decompose sub-problems, and are typically implemented using a DP table where each cell holds the computed solution for one of these sub-problems. The table can be filled by visiting each cell once in some predetermined order, but recent research has shown that it is possible to achieve order-of-magnitude performance improvements over this standard implementation approach by developing divide-and-conquer implementation strategies that recursively partition the space of subproblems into smaller subspaces [4,[8][9][10][11]32]. ...
Article
We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations. It employs deductive reasoning to generate provably correct efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types. In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and-conquer technique; a visualization interface helps users to interactively guide the process, while an SMT-based back-end verifies each step and takes care of low-level reasoning required for parallelism. We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that the performance is comparable to that of the best manually optimized code.
... See [21,30] for surveys of cache-oblivious algorithms. See [8,9,14,16,17,19,22,23,31,44,45] for discussions of implementations and performance analysis of cache-oblivious algorithms. See [6,7,13] for a discussion of the limits of cache-obliviousness. ...
Conference Paper
Full-text available
We introduce the cache-adaptive model, which generalizes the external-memory model to apply to environments in which the amount of memory available to an algorithm can fluctuate. The cache-adaptive model applies to operating systems, databases, and other systems where the allocation of memory to processes changes over time. We prove that if an optimal cache-oblivious algorithm has a particular recursive structure, then it is also an optimal cache-adaptive algorithm. Cache-oblivious algorithms having this form include Floyd-Warshall all pairs shortest paths, naïve recursive matrix multiplication, matrix transpose, and Gaussian elimination. While the cache-oblivious sorting algorithm Lazy Funnel Sort does not have this recursive structure, we prove that it is nonetheless optimally cache-adaptive. We also establish that if a cache-oblivious algorithm is optimal on "square" (well-behaved) memory profiles then, given resource augmentation, it is optimal on all memory profiles. We give paging algorithms for the case where the cache size changes dynamically. We prove that LRU with 4-memory and 4-speed augmentation is competitive with optimal. Moreover, Belady's algorithm remains optimal even when the cache size changes. Cache-obliviousness is distinct from cache-adaptivity. We exhibit a cache-oblivious algorithm that is not cache-adaptive and a cache-adaptive algorithm for a problem having no optimal cache-oblivious solution.
... y_n, a sequence of consecutive deletes corresponds to a gap in X, and a sequence of consecutive inserts corresponds to a gap in Y. An affine gap penalty function is predominantly used in bioinformatics, for which O(n²) algorithms are available [31], [6]. However, in many applications the cost of such a gap is not necessarily equal to the sum of the costs of each individual deletion (or insertion) in that gap. ...
Conference Paper
Full-text available
Dynamic Programming (DP) problems arise in a wide range of application areas spanning from logistics to computational biology. In this paper, we show how to obtain high-performing parallel implementations for a class of DP problems by reducing them to highly optimizable flexible kernels through cache-oblivious recursive divide-and-conquer (CORDAC). We implement parallel CORDAC algorithms for four non-trivial DP problems, namely the parenthesization problem, Floyd-Warshall's all-pairs shortest path (FW-APSP), sequence alignment with general gap penalty (gap problem) and protein accordion folding. To the best of our knowledge our algorithms for protein accordion folding and the gap problem are novel. All four algorithms have asymptotically optimal cache performance, and all but FW-APSP have asymptotically more parallelism than their looping counterparts. We show that the base cases of our CORDAC algorithms are predominantly matrix-multiplication-like (MM-like) flexible kernels that expose many optimization opportunities not offered by traditional looping DP codes. As a result, one can obtain highly efficient DP implementations by optimizing those flexible kernels only. Our implementations achieve 5-150× speedup over their standard loop-based DP counterparts while consuming order-of-magnitude less energy on modern multicore machines with 16-32 cores. We also compare our implementations with parallel tiled codes generated by existing polyhedral compilers: Polly, PoCC and PLuTo, and show that our implementations run significantly faster. Finally, we present results on manycores (Intel Xeon Phi) and clusters of multicores obtained using simple extensions for SIMD and shared-distributed-shared-memory architectures, respectively, demonstrating the versatility of our approach. Our optimization approach is highly systematic and suitable for automation.
... Gotoh's algorithm solves three interdependent recurrences that update three different fields, D, I, and G, on a 2D rectangular grid (see [3,10] for details). This grid cannot be directly evaluated as a stencil because of the dependence of each cell on cells in the same row/column. ...
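A minimal C++ sketch of a Gotoh-style affine-gap cost computation is shown below. Here M plays the role of the overall best score (the field called G in the excerpt), the cost-minimizing formulation, default penalties, and identifiers are assumptions for illustration, not the cited implementation.

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

// Affine-gap global alignment cost (Gotoh-style three-state DP).
// A gap of length k costs gap_open + k * gap_extend.
int affine_gap_cost(const std::string& x, const std::string& y,
                    int mismatch = 1, int gap_open = 2, int gap_extend = 1) {
    const int INF = std::numeric_limits<int>::max() / 4;  // safe sentinel
    const int n = static_cast<int>(x.size()), m = static_cast<int>(y.size());
    // M: best overall; D: alignment ends with x[i-1] opposite a gap;
    // I: alignment ends with y[j-1] opposite a gap.
    std::vector<std::vector<int>> M(n + 1, std::vector<int>(m + 1, INF)),
                                  D = M, I = M;
    M[0][0] = 0;
    for (int i = 1; i <= n; ++i) { D[i][0] = gap_open + i * gap_extend; M[i][0] = D[i][0]; }
    for (int j = 1; j <= m; ++j) { I[0][j] = gap_open + j * gap_extend; M[0][j] = I[0][j]; }
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            D[i][j] = std::min(D[i - 1][j] + gap_extend,
                               M[i - 1][j] + gap_open + gap_extend);
            I[i][j] = std::min(I[i][j - 1] + gap_extend,
                               M[i][j - 1] + gap_open + gap_extend);
            int diag = M[i - 1][j - 1] + (x[i - 1] == y[j - 1] ? 0 : mismatch);
            M[i][j] = std::min({diag, D[i][j], I[i][j]});
        }
    }
    return M[n][m];
}
```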
Article
Pochoir is a compiler for a domain-specific language embedded in C++ which produces excellent code from a simple specification of a desired stencil computation. Pochoir allows a wide variety of boundary conditions to be specified, and it automatically parallelizes and optimizes cache performance. Benchmarks of Pochoir-generated code demonstrate a performance advantage of 2–10 times over standard parallel loop implementations. This paper describes the Pochoir specification language and shows how a wide range of stencil computations can be easily specified.
... compiler with Intel Cilk Plus [23] on a 12-core Intel Core i7 (Nehalem) machine with a private 32-KB L1-data-cache, a private 256-KB L2-cache, and a shared 12-MB L3-cache. The code based on LOOPS ran in 248 seconds, whereas the Pochoir-generated code based on TRAP required about 24 seconds, more than a factor of 10 performance advantage. Figure 3 shows Pochoir's performance on a wider range of benchmarks, including the heat equation (Heat) [13] on a 2D grid, a 2D torus, and a 4D grid; Conway's game of Life (Life) [18]; the 3D finite-difference wave equation (Wave) [32]; the lattice Boltzmann method (LBM) [30]; RNA secondary structure prediction (RNA) [1, 6]; pairwise sequence alignment (PSA) [19]; longest common subsequence (LCS) [7]; and American put stock option pricing (APOP) [24]. Pochoir achieves a substantial performance improvement over a straightforward loop parallelization for typical stencil applications, such as Heat and Life. ...
Conference Paper
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++ which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
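To make the notion of a stencil concrete, here is a minimal plain-loop C++ sketch of a 5-point heat-style update; it is not Pochoir's DSL or its trapezoidal algorithm, merely the kind of computation such tools accelerate, and the parameter alpha is an assumed diffusion coefficient.

```cpp
#include <vector>

// One time step of a 5-point 2D heat-style stencil on an n x m grid.
// Each interior point is updated from itself and its four neighbors;
// boundary points are left unchanged. A driver would call this repeatedly,
// swapping the roles of u and next between time steps.
void stencil_step(const std::vector<std::vector<double>>& u,
                  std::vector<std::vector<double>>& next, double alpha) {
    const std::size_t n = u.size(), m = u.empty() ? 0 : u[0].size();
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < m; ++j)
            next[i][j] = u[i][j] + alpha * (u[i - 1][j] + u[i + 1][j] +
                                            u[i][j - 1] + u[i][j + 1] -
                                            4.0 * u[i][j]);
}
```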
Article
Molecular biologists rely very heavily on computer science algorithms as research tools. The process of finding the longest common subsequence of two DNA sequences has a wide range of applications in modern bioinformatics. Genetics databases can hold enormous amounts of raw data; for example, the human genome consists of approximately three billion DNA base pairs. The processing of this gigantic volume of data necessitates the use of extremely efficient string algorithms. This paper introduces a space- and time-efficient technique for retrieving the longest common subsequence of DNA sequences.
Article
Full-text available
As we move towards the exascale era, new architectures must be capable of running massive computational problems efficiently. Scientists and researchers are continuously investing in tuning the performance of extreme-scale computational problems. These problems arise in almost all areas of computing, ranging from big data analytics, artificial intelligence, search, machine learning, virtual/augmented reality, computer vision, and image/signal processing to computational science and bioinformatics. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. Therefore, the major challenge that we face in computing systems research is: "how to solve massive-scale computational problems in the most time/power/energy efficient manner?" The architectures are constantly evolving, making current performance-optimization strategies less applicable and requiring new strategies to be invented. The solution is for the new architectures, new programming models, and applications to go forward together. Doing this is, however, extremely hard. There are too many design choices in too many dimensions. We propose the following strategy to solve the problem: (i) Models - Develop accurate analytical models (e.g. execution time, energy, silicon area) to predict the cost of executing a given program, and (ii) Complete System Design - Simultaneously optimize all the cost models for the programs (computational problems) to obtain the most time/area/power/energy efficient solution. Such an optimization problem evokes the notion of codesign.
Article
Full-text available
Stencil computations are an important class of compute- and data-intensive programs that occur widely in scientific and engineering applications. A number of tools use sophisticated tiling, parallelization, and memory mapping strategies, and generate code that relies on vendor-supplied compilers. This code has a number of parameters, such as tile sizes, that are then tuned via empirical exploration. We develop a model that guides such a choice. Our model is a simple set of analytical functions that predict the execution time of the generated code. It is deliberately optimistic, since we are targeting highly tuned codes; the optimistic assumptions are intended to enable modeling and parameter selections that yield such codes. We experimentally validate the model on a number of 2D and 3D stencil codes, and show that the root mean square error in the execution time is less than 10% for the subset of the codes that achieve performance within 20% of the best. Furthermore, based on our model, we are able to predict tile sizes that achieve a further improvement of 9% on average.
Conference Paper
We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known O(Q+S · (M/B)) caching cost for the randomized work stealing (RWS) scheduler, where S is the number of steals, Q is the sequential caching cost, and M and B are the cache size and block (or cache line) size respectively.
Conference Paper
Full-text available
Stencil computations are an important class of compute- and data-intensive programs that occur widely in scientific and engineering applications. A number of tools use sophisticated tiling, parallelization, and memory mapping strategies, and generate code that relies on vendor-supplied compilers. This code has a number of parameters, such as tile sizes, that are then tuned via empirical exploration. We develop a model that guides such a choice. Our model is a simple set of analytical functions that predict the execution time of the generated code. It is deliberately optimistic, since we are targeting highly tuned codes; the optimistic assumptions are intended to enable modeling and parameter selections that yield such codes. We experimentally validate the model on a number of 2D and 3D stencil codes, and show that the root mean square error in the execution time is less than 10% for the subset of the codes that achieve performance within 20% of the best. Furthermore, based on our model, we are able to predict tile sizes that achieve a further improvement of 9% on average.
Conference Paper
We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations. It employs deductive reasoning to generate provably correct efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types. In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and-conquer technique; a visualization interface helps users to interactively guide the process, while an SMT-based back-end verifies each step and takes care of low-level reasoning required for parallelism. We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that the performance is comparable to that of the best manually optimized code.
Conference Paper
The Viterbi algorithm is used to find the most likely path through a hidden Markov model given an observed sequence, and has numerous applications. Due to its importance and high computational complexity, several algorithmic strategies have been developed to parallelize it on different parallel architectures. However, none of the existing Viterbi decoding algorithms designed for modern computers with cache hierarchies is simultaneously cache-efficient and cache-oblivious. Being oblivious of machine resources (e.g., caches and processors) while also being efficient promotes portability. In this paper, we present an efficient cache- and processor-oblivious Viterbi algorithm based on rank convergence. The algorithm builds upon the parallel Viterbi algorithm of Maleki et al. (PPoPP 2014). We provide empirical analysis of our algorithm by comparing it with Maleki et al.’s algorithm.
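For reference, the standard serial Viterbi recurrence (textbook formulation, in log-space) can be sketched as follows in C++; this is not the cache-oblivious, rank-convergence-based algorithm the paper describes, and the identifiers are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Standard Viterbi decoding: most likely hidden-state path for an observed
// sequence, given log-probabilities. Not the parallel/cache-oblivious variant.
std::vector<int> viterbi(const std::vector<int>& obs,
                         const std::vector<double>& log_init,               // [S]
                         const std::vector<std::vector<double>>& log_trans, // [S][S]
                         const std::vector<std::vector<double>>& log_emit) { // [S][#symbols]
    if (obs.empty()) return {};
    const int S = static_cast<int>(log_init.size());
    const int T = static_cast<int>(obs.size());
    std::vector<std::vector<double>> dp(T, std::vector<double>(S));
    std::vector<std::vector<int>> back(T, std::vector<int>(S, 0));
    for (int s = 0; s < S; ++s) dp[0][s] = log_init[s] + log_emit[s][obs[0]];
    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            double best = -INFINITY; int arg = 0;
            for (int p = 0; p < S; ++p) {
                double cand = dp[t - 1][p] + log_trans[p][s];
                if (cand > best) { best = cand; arg = p; }
            }
            dp[t][s] = best + log_emit[s][obs[t]];
            back[t][s] = arg;   // remember the best predecessor state
        }
    }
    int last = 0;
    for (int s = 1; s < S; ++s) if (dp[T - 1][s] > dp[T - 1][last]) last = s;
    std::vector<int> path(T);
    path[T - 1] = last;
    for (int t = T - 1; t > 0; --t) path[t - 1] = back[t][path[t]];
    return path;
}
```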
Conference Paper
We present AUTOGEN — an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to auto-discover efficient algorithms for several well-known problems. Our experimental results show that several auto-discovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. Also, these algorithms are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.
Conference Paper
State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependency by scheduling tasks ready for execution as soon as all of their real dependency constraints are satisfied, while preserving cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of the 2-way DAC-based Floyd-Warshall algorithm on an n-node graph from Θ(n log² n) to Θ(n), stencil computations on a d-dimensional hypercubic grid of width w for h time steps from Θ((d²h)·w^(log(d+2)−1)) to Θ(h), and LCS on two sequences of length n each from Θ(n^(log₂ 3)) to Θ(n). In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a 3-5 times improvement in absolute running time, 10-20 times improvement in burdened span by Cilkview, and approximately the same L1/L2 cache misses by PAPI.
Conference Paper
The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm called "cache-oblivious wavefront" (COW) that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, thus reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. We provide experimental measurements of absolute running times, burdened span by Cilkview, and L1/L2 cache misses by PAPI to validate our claims.
Conference Paper
In this paper, we demonstrate the ability of spatial architectures to significantly improve both runtime performance and energy efficiency on edit distance, a broadly used dynamic programming algorithm. Spatial architectures are an emerging class of application accelerators that consist of a network of many small and efficient processing elements that can be exploited by a large domain of applications. In this paper, we utilize the dataflow characteristics and inherent pipeline parallelism within the edit distance algorithm to develop efficient and scalable implementations on a previously proposed spatial accelerator. We evaluate our edit distance implementations using a cycle-accurate performance and physical design model of a previously proposed triggered instruction-based spatial architecture in order to compare against real performance and power measurements on an x86 processor. We show that when chip area is normalized between the two platforms, it is possible to get more than a 50× runtime performance improvement and over 100× reduction in energy consumption compared to an optimized and vectorized x86 implementation. This dramatic improvement comes from leveraging the massive parallelism available in spatial architectures and from the dramatic reduction of expensive memory accesses through conversion to relatively inexpensive local communication.
Conference Paper
String comparison such as sequence alignment, edit distance computation, longest common subsequence computation, and approximate string matching is a key task (and often computational bottleneck) in large-scale textual information retrieval. For instance, algorithms for sequence alignment are widely used in bioinformatics to compare DNA and protein sequences. These problems can all be solved using essentially the same dynamic programming scheme over a two-dimensional matrix, where each entry depends locally on at most 3 neighboring entries. We present a simple, fast, and cache-oblivious algorithm for this type of local dynamic programming suitable for comparing large-scale strings. Our algorithm outperforms the previous state-of-the-art solutions. Surprisingly, our new simple algorithm is competitive with a complicated, optimized, and tuned implementation of the best cache-aware algorithm. Additionally, our new algorithm generalizes the best known theoretical complexity trade-offs for the problem.
Conference Paper
Finding the Longest Common Subsequence (LCS) involves comparing two or more sequences and finding the longest subsequence that is common to all of them. This NP-hard problem is useful in various Sequence Database (SDB) applications, such as finding motifs and comparing gene/protein sequences in biological SDBs, finding the most demanded items in an inventory SDB, and finding the most traded shares in a stock trading SDB. The existing algorithms have the drawback of high computational time because they consider both matched and unmatched positions of the sequences to obtain the LCS. This is overcome by the proposed Positional_LCS, which focuses only on matched positions. Thus Positional_LCS reduces the time complexity compared with the existing algorithms.
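The idea of restricting attention to matched positions can be illustrated with the classical Hunt-Szymanski-style reduction of LCS to a longest-increasing-subsequence computation over match positions; the C++ sketch below is illustrative only and is not the Positional_LCS algorithm itself.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// LCS length computed only from matched positions (Hunt-Szymanski style):
// for each position of a, scan the positions of that character in b in
// decreasing order and do a patience-sort (LIS) update on the thresholds.
int lcs_from_matches(const std::string& a, const std::string& b) {
    std::array<std::vector<int>, 256> occ;  // positions of each character in b
    for (int j = 0; j < static_cast<int>(b.size()); ++j)
        occ[static_cast<unsigned char>(b[j])].push_back(j);
    std::vector<int> thresholds;  // thresholds[k] = smallest end position in b
                                  // of a common subsequence of length k+1
    for (char c : a) {
        const std::vector<int>& js = occ[static_cast<unsigned char>(c)];
        for (auto it = js.rbegin(); it != js.rend(); ++it) {   // decreasing j
            auto pos = std::lower_bound(thresholds.begin(), thresholds.end(), *it);
            if (pos == thresholds.end()) thresholds.push_back(*it);
            else *pos = *it;
        }
    }
    return static_cast<int>(thresholds.size());
}
```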
Article
Recently, it has been proven that evolutionary algorithms produce good results for a wide range of combinatorial optimization problems. Some of the considered problems are tackled by evolutionary algorithms that use a representation which enables them to construct solutions in a dynamic programming fashion. We take a general approach and relate the construction of such algorithms to the development of algorithms using dynamic programming techniques. Thereby, we give general guidelines on how to develop evolutionary algorithms that have the additional ability of carrying out dynamic programming steps. Finally, we show that for a wide class of the so-called DP-benevolent problems (which are known to admit an FPTAS) there exists a fully polynomial-time randomized approximation scheme based on an evolutionary algorithm.
Article
Full-text available
The input/output complexity of sorting and related problems
Article
Full-text available
Abstract: We present theoretical and experimental results on cache-efficient and parallel algorithms for some well-studied string problems in bioinformatics: 1. Pairwise alignment. Optimal pairwise global sequence alignment using affine gap penalty; 2. Median. Optimal alignment of three sequences using affine gap penalty; 3. RNA secondary structure prediction. Maximizing the number of base pairs in RNA secondary structure with simple pseudoknots. For each of these three problems we present cache-oblivious algorithms that match the best-known ...
Article
Full-text available
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multi-level caching model for multicores, and we propose a multicore-oblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicore-oblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including Floyd-Warshall's all-pairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient network-oblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these network-oblivious algorithms perform efficiently also when executed on the Decomposable-BSP.
Article
Full-text available
We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cache-oblivious methods, I-GEP and C-GEP, both of which reduce the number of cache misses incurred (or I/Os performed) by the computation over that performed by standard GEP by a factor of √M, where M is the size of the cache. Cache-oblivious I-GEP computes in-place and solves most of the known applications of GEP, including Gaussian elimination and LU-decomposition without pivoting and Floyd-Warshall all-pairs shortest paths. Cache-oblivious C-GEP uses a modest amount of additional space, but is completely general and applies to any code in GEP form. Both I-GEP and C-GEP produce system-independent cache-efficient code, and are potentially applicable to being used by optimizing compilers for loop transformation. We present parallel I-GEP and C-GEP that achieve good speed-up and match the sequential caching performance cache-obliviously for both shared and distributed caches for sufficiently large inputs. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive trade-off between efficiency and portability.
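As a reminder of what code in GEP form looks like, here is the canonical triply-nested Floyd-Warshall loop in C++, one of the computations that I-GEP/C-GEP restructure recursively; the recursive cache-oblivious version itself is not shown, and the identifiers are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Floyd-Warshall in the triply-nested "GEP" form:
// for each k, update c[i][j] using c[i][k] and c[k][j].
// Assumes the "infinity" sentinel in c is small enough that
// c[i][k] + c[k][j] cannot overflow. Frameworks like I-GEP/C-GEP
// restructure exactly this loop nest recursively to reduce cache misses
// by roughly a factor of sqrt(M), cache-obliviously.
void floyd_warshall(std::vector<std::vector<long long>>& c) {
    const std::size_t n = c.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] = std::min(c[i][j], c[i][k] + c[k][j]);
}
```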
Conference Paper
Full-text available
We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), Gaussian Elimination Paradigm (GEP), and parenthesis problem. For each class of problems, we develop a generic CMP algorithm with an associated tiling sequence. We then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm. We present experimental results on an 8-core Opteron for two sequence alignment problems that are important examples of LDDP. Our experimental results show good speed-ups for simple versions of our algorithms.
Article
Full-text available
Given two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences (D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison-Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195-197) uses a dynamic programming matrix and compares the two strings in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n²/log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn²/log n), where h ≤ 1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m′ and n′ runs, respectively, in O(m′n + n′m) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty.
Article
Full-text available
We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).
Article
Full-text available
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
Full-text available
When comparing two biological sequences, it is often desirable for a gap to be assigned a cost not directly proportional to its length. If affine gap costs are employed, in other words if opening a gap costs v and each null in the gap costs u, the algorithm of Gotoh (1982, J. molec. Biol. 162, 705) finds the minimum cost of aligning two sequences in order MN steps. Gotoh's algorithm attempts to find only one from among possibly many optimal (minimum-cost) alignments, but does not always succeed. This paper provides an example for which this part of Gotoh's algorithm fails and describes an algorithm that finds all and only the optimal alignments. This modification of Gotoh's algorithm still requires order MN steps. A more precise form of path graph than previously used is needed to represent accurately all optimal alignments for affine gap costs.
Article
Full-text available
Motivation: Sequence alignment is the problem of finding the optimal character-by-character correspondence between two sequences. It can be readily solved in O(n²) time and O(n²) space on a serial machine, or in O(n) time with O(n) space per O(n) processing elements on a parallel machine. Hirschberg's divide-and-conquer approach for finding the single best path reduces space use by a factor of n while inducing only a small constant slowdown to the serial version. Results: This paper presents a family of methods for computing sequence alignments with reduced memory that are well suited to serial or parallel implementation. Unlike the divide-and-conquer approach, they can be used in the forward-backward (Baum-Welch) training of linear hidden Markov models, and they avoid data-dependent repartitioning, making them easier to parallelize. The algorithms feature, for an arbitrary integer L, a factor proportional to L slowdown in exchange for reducing the space requirement from O(n²) to O(n√n). A single-best-path member of this algorithm family matches the quadratic time and linear space of the divide-and-conquer algorithm. Experimentally, the O(n^1.5)-space member of the family is 15-40% faster than the O(n)-space divide-and-conquer algorithm.
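Hirschberg's divide-and-conquer idea mentioned in this abstract can be sketched as follows for LCS (the alignment version is analogous); this C++ sketch is a readable illustration rather than the paper's implementation, and passing substrings by value is for clarity rather than strict linear-space bookkeeping.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Last row of the LCS-length table for x vs y, using O(|y|) working space.
static std::vector<int> lcs_last_row(const std::string& x, const std::string& y) {
    std::vector<int> prev(y.size() + 1, 0), curr(y.size() + 1, 0);
    for (std::size_t i = 1; i <= x.size(); ++i) {
        for (std::size_t j = 1; j <= y.size(); ++j)
            curr[j] = (x[i - 1] == y[j - 1]) ? prev[j - 1] + 1
                                             : std::max(prev[j], curr[j - 1]);
        std::swap(prev, curr);  // prev now holds the row just computed
    }
    return prev;
}

// Hirschberg-style recursion: recover an actual LCS in reduced space by
// splitting x in half and locating the optimal split point of y.
std::string hirschberg_lcs(const std::string& x, const std::string& y) {
    if (x.empty() || y.empty()) return "";
    if (x.size() == 1)
        return y.find(x[0]) != std::string::npos ? x : std::string();
    std::size_t mid = x.size() / 2;
    std::string xl = x.substr(0, mid), xr = x.substr(mid);
    std::string xr_rev(xr.rbegin(), xr.rend()), y_rev(y.rbegin(), y.rend());
    std::vector<int> fwd = lcs_last_row(xl, y);      // forward scores
    std::vector<int> bwd = lcs_last_row(xr_rev, y_rev);  // backward scores
    std::size_t split = 0; int best = -1;
    for (std::size_t j = 0; j <= y.size(); ++j) {
        int score = fwd[j] + bwd[y.size() - j];
        if (score > best) { best = score; split = j; }
    }
    return hirschberg_lcs(xl, y.substr(0, split)) +
           hirschberg_lcs(xr, y.substr(split));
}
```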
Article
Full-text available
RNA molecules are sequences of nucleotides that serve as more than mere intermediaries between DNA and proteins, e.g., as catalytic molecules. Computational prediction of RNA secondary structure is among the few structure prediction problems that can be solved satisfactorily in polynomial time. Most work has been done to predict structures that do not contain pseudoknots. Allowing pseudoknots introduces modeling and computational problems. In this paper we consider the problem of predicting RNA secondary structures with pseudoknots based on free energy minimization. We first give a brief comparison of energy-based methods for predicting RNA secondary structures with pseudoknots. We then prove that the general problem of predicting RNA secondary structures containing pseudoknots is NP-complete for a large class of reasonable models of pseudoknots.
Article
Full-text available
Comparative analysis of RNA sequences is the basis for the detailed and accurate predictions of RNA structure and the determination of phylogenetic relationships for organisms that span the entire phylogenetic tree. Underlying these accomplishments are very large, well-organized, and processed collections of RNA sequences. These data, starting with the sequences organized into a database management system and aligned to reveal their higher-order structure, together with patterns of conservation and variation for organisms that span the phylogenetic tree, have been collected and analyzed. This type of information can be fundamental for and have an influence on the study of phylogenetic relationships, RNA structure, and the melding of these two fields. We have prepared a large web site that disseminates our comparative sequence and structure models and data. The four major types of comparative information and systems available for the three ribosomal RNAs (5S, 16S, and 23S rRNA), transfer RNA (tRNA), and two of the catalytic intron RNAs (group I and group II) are: (1) Current Comparative Structure Models; (2) Nucleotide Frequency and Conservation Information; (3) Sequence and Structure Data; and (4) Data Access Systems. This online RNA sequence and structure information, the result of extensive analysis, interpretation, data collection, and computer program and web development, is accessible at our Comparative RNA Web (CRW) Site http://www.rna.icmb.utexas.edu. In the future, more data and information will be added to these existing categories, new categories will be developed, and additional RNAs will be studied and presented at the CRW Site.
Article
Full-text available
Motivation: Prokaryotic organisms have been identified utilizing the sequence variation of the 16S rRNA gene. Variations steer the design of DNA probes for the detection of taxonomic groups or specific organisms. The long-term goal of our project is to create probe arrays capable of identifying 16S rDNA sequences in unknown samples. This necessitated the authentication, categorization and alignment of the >75 000 publicly available '16S' sequences. Preferably, the entire process should be computationally administered so the aligned collection could periodically absorb 16S rDNA sequences from the public records. A complete multiple sequence alignment would provide a foundation for computational probe selection and facilitate microbial taxonomy and phylogeny. Results: Here we report the alignment and similarity clustering of 62 662 16S rDNA sequences and an approach for designing effective probes for each cluster. A novel alignment compression algorithm, NAST (Nearest Alignment Space Termination), was designed to produce the uniform multiple sequence alignment referred to as the prokMSA. From the prokMSA, 9020 Operational Taxonomic Units (OTUs) were found based on transitive sequence similarities. An automated approach to probe design was straightforward using the prokMSA clustered into OTUs. As a test case, multiple probes were computationally picked for each of the 27 OTUs that were identified within the Staphylococcus Group. The probes were incorporated into a customized microarray and were able to correctly categorize Staphylococcus aureus and Bacillus anthracis into their correct OTUs. Although a successful probe picking strategy is outlined, the main focus of creating the prokMSA was to provide a comprehensive, categorized, updateable 16S rDNA collection useful as a foundation for any probe selection algorithm.
Article
Full-text available
The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.
Conference Paper
Full-text available
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache of size Z and cache-line length L, where Z = Ω(L^2), the number of cache misses for an m×n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m×n matrix by an n×p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels, and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
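For illustration, here is a minimal sketch of the divide-and-conquer structure behind a cache-oblivious out-of-place matrix transpose: the longer dimension is halved recursively until the submatrix is small, with no cache parameter appearing anywhere in the code. The function names, row-major layout and the base-case threshold are our choices, not details from the paper.

#include <cstddef>
#include <vector>

// Recursively transpose the m x n submatrix of A starting at (i0, j0) into B.
// A is rowsA x colsA and B is colsA x rowsA, both row-major. Halving the larger
// dimension keeps subproblems roughly square, so at some recursion depth they
// fit in any cache level; no cache size or line length is referenced.
void transpose_rec(const std::vector<double>& A, std::vector<double>& B,
                   std::size_t rowsA, std::size_t colsA,
                   std::size_t i0, std::size_t j0,
                   std::size_t m, std::size_t n) {
    const std::size_t BASE = 16;  // arbitrary small base case, not a tuned parameter
    if (m <= BASE && n <= BASE) {
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < n; ++j)
                B[(j0 + j) * rowsA + (i0 + i)] = A[(i0 + i) * colsA + (j0 + j)];
    } else if (m >= n) {
        transpose_rec(A, B, rowsA, colsA, i0, j0, m / 2, n);
        transpose_rec(A, B, rowsA, colsA, i0 + m / 2, j0, m - m / 2, n);
    } else {
        transpose_rec(A, B, rowsA, colsA, i0, j0, m, n / 2);
        transpose_rec(A, B, rowsA, colsA, i0, j0 + n / 2, m, n - n / 2);
    }
}

// Usage: B must already be sized to colsA * rowsA elements.
void transpose(const std::vector<double>& A, std::vector<double>& B,
               std::size_t rowsA, std::size_t colsA) {
    transpose_rec(A, B, rowsA, colsA, 0, 0, rowsA, colsA);
}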
Article
Full-text available
The classical algorithm for computing the similarity between two sequences [45, 48] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations.
Conference Paper
The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms including Gaussian elimination without pivoting and Floyd-Warshall's all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presented in [6]. In this paper we establish several important properties of this cache-oblivious framework, and extend the framework to solve GEP in its full generality within the same time and I/O bounds. We then analyze a parallel implementation of the framework and its caching performance for both shared and distributed caches. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations of our algorithms, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
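For context, the triply-nested loop structure that GEP abstracts can be sketched as follows. Floyd-Warshall all-pairs shortest paths, shown here as a minimal illustration, is one instance of that loop nest; this is the plain iterative form, not the cache-oblivious recursive version the paper develops.

#include <vector>
#include <algorithm>

// GEP-style computation: a triply nested loop over k, i, j that updates
// c[i][j] from c[i][j], c[i][k] and c[k][j]. Instantiating the update with
// min/plus yields Floyd-Warshall; the cache-oblivious GEP framework replaces
// this loop nest with a recursive decomposition preserving the dependencies.
void floyd_warshall(std::vector<std::vector<double>>& c) {
    const std::size_t n = c.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] = std::min(c[i][j], c[i][k] + c[k][j]);
}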
Article
Acknowledgments I am grateful for the supervision of Dr. Vijaya Ramachandran during my undergraduate research. I have learned from her the fundamentals of theoretical Computer Science and the joy of doing research. Her tireless passion and motivation have lit my way through many difficult problems. It has been my honor to work with Dr. Ganeshkumar Ganapathy and Rezaul Chowdhury, who offered generous help and kindness in responding to my multitude of questions. I would like to thank my collaborators, Rezaul Chowdhury, Ganeshkumar Ganapathy, Vijaya Ramachandran and Tandy Warnow, for the preparation of the papers in which this work was presented. Special thanks to Dr. Tandy Warnow and Dr. Anna Gál for serving as committee members of this thesis. Finally, I am indebted to my parents for their support of my education and their endless encouragement during my stay at the University of Texas. Hai-Son Le, The University of Texas at Austin, December 2006.
Book
This talk will review a little over a decade's research on applying certain stochastic models to biological sequence analysis. The models themselves have a longer history, going back over 30 years, although many novel variants have arisen since that time. The function of the models in biological sequence analysis is to summarize the information concerning what is known as a motif or a domain in bioinformatics, and to provide a tool for discovering instances of that motif or domain in a separate sequence segment. We will introduce the motif models in stages, beginning from very simple, non-stochastic versions, progressively becoming more complex, until we reach modern profile HMMs for motifs. A second example will come from gene finding using sequence data from one or two species, where generalized HMMs or generalized pair HMMs have proved to be very effective.
Article
The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. An algorithm is presented which will solve this problem in quadratic time and in linear space.
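For illustration, a minimal sketch of the linear-space idea for the LCS length only: each row of the dynamic programming table depends only on the previous row, so two rows of length min(|a|, |b|) suffice. Recovering an actual LCS in linear space additionally needs the divide-and-conquer step of the algorithm described above; the function name and base-case handling here are ours.

#include <string>
#include <vector>
#include <algorithm>

// Length of a longest common subsequence of a and b in O(|a|*|b|) time
// and O(min(|a|,|b|)) space, keeping only two DP rows.
std::size_t lcs_length(const std::string& a, const std::string& b) {
    const std::string& s = (a.size() < b.size()) ? b : a;   // longer string
    const std::string& t = (a.size() < b.size()) ? a : b;   // shorter string
    std::vector<std::size_t> prev(t.size() + 1, 0), cur(t.size() + 1, 0);
    for (std::size_t i = 1; i <= s.size(); ++i) {
        for (std::size_t j = 1; j <= t.size(); ++j) {
            cur[j] = (s[i - 1] == t[j - 1]) ? prev[j - 1] + 1
                                            : std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur);
    }
    return prev[t.size()];
}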
Conference Paper
We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel shared-memory machine with hierarchical multi-level caching, and we introduce a multicore-oblivious (MO) approach to algorithms and schedulers for HM. An MO algorithm is specified with no mention of any machine parameters, such as the number of cores, number of cache levels, cache sizes and block lengths. However, it is equipped with a small set of instructions that can be used to provide hints to the run-time scheduler on how to schedule parallel tasks. We present efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components. The notion of an MO algorithm is complementary to that of a network-oblivious (NO) algorithm, recently introduced by Bilardi et al. for parallel distributed-memory machines where processors communicate point-to-point. We show that several of our MO algorithms translate into efficient NO algorithms, adding to the body of known efficient NO algorithms.
Article
Alignment algorithms can be used to infer a relationship between sequences when the true relationship is unknown. Simple alignment algorithms use a cost function that gives a fixed cost to each possible point mutation—mismatch, deletion, insertion. These algorithms tend to find optimal alignments that have many small gaps. It is more biologically plausible to have fewer longer gaps rather than many small gaps in an alignment. To address this issue, linear gap cost algorithms are in common use for aligning biological sequence data. More reliable inferences are obtained by aligning more than two sequences at a time. The obvious dynamic programming algorithm for optimally aligning k sequences of length n runs in O(n^k) time. This is impractical if k ≥ 3 and n is of any reasonable length. Thus, for this problem there are many heuristics for aligning k sequences; however, they are not guaranteed to find an optimal alignment. In this paper, we present a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs. This gives the same results as the dynamic programming algorithm for three sequences, but typically does so much more quickly. It is particularly fast when the (three-way) edit distance is small. Our algorithm uses a speed-up technique based on Ukkonen's greedy algorithm (Ukkonen, 1983) which he presented for two sequences and simple costs.
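The cubic dynamic program that such a speed-up accelerates can be written roughly as below; the sequence names x, y, z, the table D, and the per-column cost function c (with "-" denoting a null) are illustrative notation under a linear gap cost, not symbols from the paper.
% D(i,j,k) = optimal cost of aligning the prefixes x[1..i], y[1..j], z[1..k]
% c(.,.,.) = cost of one alignment column; all symbols are assumed notation
\[
D(i,j,k) = \min\left\{
\begin{aligned}
&D(i-1,j-1,k-1) + c(x_i, y_j, z_k),\quad D(i-1,j-1,k) + c(x_i, y_j, -),\\
&D(i-1,j,k-1) + c(x_i, -, z_k),\quad D(i,j-1,k-1) + c(-, y_j, z_k),\\
&D(i-1,j,k) + c(x_i, -, -),\quad D(i,j-1,k) + c(-, y_j, -),\quad D(i,j,k-1) + c(-, -, z_k)
\end{aligned}
\right\}
\]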
Article
The edit distance between two character strings can be defined as the minimum cost of a sequence of editing operations which transforms one string into the other. The operations we admit are deleting, inserting and replacing one symbol at a time, with possibly different costs for each of these operations. The problem of finding the longest common subsequence of two strings is a special case of the problem of computing edit distances. We describe an algorithm for computing the edit distance between two strings of length n and m, n ≥ m, which requires O(nm/log n) steps whenever the costs of edit operations are integral multiples of a single positive real number and the alphabet for the strings is finite. These conditions are necessary for the algorithm to achieve the time bound.
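The dynamic program behind this notion of edit distance, with per-operation costs for deletion, insertion and replacement as described above, is the following; the names del, ins, sub and the strings x, y are illustrative notation.
% x, y = input strings; D(i,j) = minimum cost of editing x[1..i] into y[1..j]
% del, ins, sub = per-operation costs (assumed names for the three operations in the abstract)
\[
D(i,j) = \min\{\, D(i-1,j) + \mathrm{del}(x_i),\ \ D(i,j-1) + \mathrm{ins}(y_j),\ \ D(i-1,j-1) + \mathrm{sub}(x_i, y_j) \,\}
\]
with D(0,0) = 0 and the first column and row accumulating pure deletion and insertion costs, respectively.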
Article
Space-saving techniques in computations of a longest common subsequence (LCS) of two strings are crucial in many applications, notably in molecular sequence comparisons. For about ten years, however, the only linear-space LCS algorithm known required time quadratic in the length of the input, for all inputs. This paper reviews linear-space LCS computations in connection with two classical paradigms originally designed to take less than quadratic time in favorable circumstances. The objective is to achieve the space reduction without alteration of the asymptotic time complexity of the original algorithm. The first of the resulting constructions takes time O(n(m−l)), and is thus suitable for cases where the LCS is expected to be close to the shortest input string. The second takes time O(ml log(min[s, m, ...])) and suits cases where one of the inputs is much shorter than the other. Here m and n (m ≤ n) are the lengths of the two input strings, l is the length of the longest common subsequence and s is the size of the alphabet. Along the way, a very simple O(m(m−l))-time algorithm is also derived for the case of strings of equal length.
Conference Paper
Many methods in bioinformatics rely on evolutionary relationships between protein, DNA, or RNA sequences. Alignment is a crucial first step in most analyses, since it yields information about which regions of the sequences are related to each other. Here, a new method for multiple parsimony alignment over a tree is presented. The novelty is that an affine gap cost is used rather than a simple linear gap cost. Affine gap costs have been used with great success for pairwise alignments and should prove useful in the multiple alignment scenario. The algorithmic challenge of using an affine gap cost in multiple alignment is the introduction of dependence between different columns in the alignment. The utility of the new method is illustrated by a number of protein sequences where increased alignment accuracy is obtained by using multiple sequences.
Conference Paper
We consider a model of recommendation systems, where each member from a given set of players has a binary preference to each element in a given set of objects: intuitively, each player either likes or dislikes each object. However, the ...
Conference Paper
We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least weight subsequence. We present a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and 'simple DP', among other problems. Our cache-oblivious LCS algorithm performs … block transfers. Experimental results show that this algorithm runs two to six times faster than the widely used linear-space LCS algorithm by Hirschberg (13). We show that our algorithm is I/O-optimal in that it performs the minimum number of block transfers (to within a constant factor) of any implementation of the dynamic programming algorithm for LCS. This algorithm can be adapted to solve the edit distance problem (17, 7) within the same bounds; this latter problem asks for the minimum cost of an edit sequence that transforms a given sequence into another one, with the allowable edit operations being insertion, deletion and substitution of symbols, each having a cost based on the symbol(s) on which it is to be applied.
Book
Probablistic models are becoming increasingly important in analyzing the huge amount of data being produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analyzing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally to probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, and at the same time presents the state of the art in this new and important field.
Article
This paper shows simple dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. For a basic version of the problem (i.e., maximizing the number of base pairs), this paper presents an O(n^4) time exact algorithm and an O(n^(4−δ)) time approximation algorithm. The latter one outputs, for most RNA sequences, a secondary structure in which the number of base pairs is at least 1−ε of the optimal, where ε, δ are any constants satisfying 0 < ε, δ < 1. Several related results are shown too.
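As background for the base-pair maximization objective, here is a minimal sketch of the standard pseudoknot-free dynamic program (Nussinov-style). It is a reference point only, not the cited O(n^4) pseudoknot algorithm, which is substantially more involved. The pairing rules (Watson-Crick plus G-U wobble), the minimum loop length, and the function names are illustrative assumptions.

#include <string>
#include <vector>
#include <algorithm>

// True if bases a and b may pair (Watson-Crick plus G-U wobble; an assumed rule set).
bool can_pair(char a, char b) {
    auto p = [](char x, char y, char u, char v) { return (x == u && y == v) || (x == v && y == u); };
    return p(a, b, 'A', 'U') || p(a, b, 'G', 'C') || p(a, b, 'G', 'U');
}

// Maximum number of base pairs in a pseudoknot-free secondary structure of rna,
// requiring at least min_loop unpaired bases enclosed by every pair. O(n^3) time.
int max_base_pairs(const std::string& rna, int min_loop = 3) {
    const int n = static_cast<int>(rna.size());
    if (n == 0) return 0;
    std::vector<std::vector<int>> dp(n, std::vector<int>(n, 0));
    for (int len = 2; len <= n; ++len) {          // interval length, shortest first
        for (int i = 0; i + len - 1 < n; ++i) {
            int j = i + len - 1;
            int best = dp[i][j - 1];              // base j left unpaired
            for (int k = i; k < j; ++k) {         // base j paired with base k
                if (j - k > min_loop && can_pair(rna[k], rna[j])) {
                    int left = (k > i) ? dp[i][k - 1] : 0;
                    best = std::max(best, left + dp[k + 1][j - 1] + 1);
                }
            }
            dp[i][j] = best;
        }
    }
    return dp[0][n - 1];
}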
Article
This issue's expert guest column is by Eric Allender, who has just taken over the Structural Complexity Column in the Bulletin of the EATCS.Regarding "Journals to Die For" (SIGACT News Complexity Theory Column 16), Joachim von zur Gathen, ...
Article
The problem of finding a longest common subsequence of two strings is discussed. This problem arises in data processing applications such as comparing two files and in genetic applications such as studying molecular evolution. The difficulty of computing a longest common subsequence of two strings is examined using the decision tree model of computation, in which vertices represent “equal - unequal” comparisons. It is shown that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings. A general lower bound as a function of the ratio of alphabet size to string length is derived. The case where comparisons between symbols of the same string are forbidden is also considered and it is shown that this problem is of linear complexity for a two-symbol alphabet and quadratic for an alphabet of three or more symbols.
Article
The complexity of finding the Longest Common Subsequence (LCS) and the Shortest Common Supersequence (SCS) of an arbitrary number of sequences is considered. We show that the yes/no version of the LCS problem is NP-complete for sequences over an alphabet of size 2, and that the yes/no SCS problem is NP-complete for sequences over an alphabet of size 5.
Article
The LCS problem is to determine the longest common subsequence (LCS) of two strings. A new linear-space algorithm to solve the LCS problem is presented. The only other algorithm with linear-space complexity is by Hirschberg and has runtime complexity O(mn). Our algorithm, based on the divide and conquer technique, has runtime complexity O(n(m-p)), where p is the length of the LCS.
Conference Paper
We present a cache-oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^(1/n)) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Article
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the new proposals, both in theory and in practice. The goal of this paper is to give Hirschberg's idea the visibility it deserves by developing a linear-space version of Gotoh's algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
Article
The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M^2N steps necessary in the original algorithm. The limitations do not seriously reduce the generality of the original method, and the present method is available for most practical uses. The algorithm can be executed on a small computer with a limited capacity of core memory.
Article
We describe a dynamic programming algorithm for predicting optimal RNA secondary structure, including pseudoknots. The algorithm has a worst-case complexity of O(N^6) in time and O(N^4) in storage. The description of the algorithm is complex, which led us to adopt a useful graphical representation (Feynman diagrams) borrowed from quantum field theory. We present an implementation of the algorithm that generates the optimal minimum energy structure for a single RNA sequence, using standard RNA folding thermodynamic parameters augmented by a few parameters describing the thermodynamic stability of pseudoknots. We demonstrate the properties of the algorithm by using it to predict structures for several small pseudoknotted and non-pseudoknotted RNAs. Although the time and memory demands of the algorithm are steep, we believe this is the first algorithm to be able to fold optimal (minimum energy) pseudoknotted RNAs with the accepted RNA thermodynamic model.
Conference Paper
The aim of this paper is to give a comprehensive comparison of well-known longest common subsequence algorithms (for two input strings) and study their behaviour in various application environments. The performance of the methods depends heavily on the properties of the problem instance as well as the supporting data structures used in the implementation. We want to make also a clear distinction between methods that determine the actual lcs and those calculating only its length, since the execution time and more importantly, the space demand depends crucially on the type of the task. To our knowledge, this is the first time this kind of survey has been done. Due to the page limits, the paper gives only a coarse overview of the performance of the algorithms; more detailed studies are reported elsewhere