Article

Retrieving Smith-Waterman Alignments with Optimizations for Megabase Biological Sequences Using GPU


Abstract

In Genome Projects, biological sequences are aligned thousands of times on a daily basis. The Smith-Waterman algorithm is able to retrieve the optimal local alignment with quadratic time and space complexity. So far, aligning huge sequences, such as whole chromosomes, with the Smith-Waterman algorithm has been regarded as infeasible, due to huge computing and memory requirements. However, high-performance computing platforms such as GPUs are making it possible to obtain the optimal result for huge sequences in reasonable time. In this paper, we propose and evaluate CUDAlign 2.1, a parallel algorithm that uses a GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations which are able to significantly reduce the amount of data processed, while enforcing full parallelism most of the time. Using the NVIDIA GTX 560 Ti board and comparing real DNA sequences that range from 162 KBP (Thousand Base Pairs) to 59 MBP (Million Base Pairs), we show that CUDAlign 2.1 is scalable. Also, we show that CUDAlign 2.1 is able to produce the optimal alignment between the chimpanzee chromosome 22 (33 MBP) and the human chromosome 21 (47 MBP) in 8.4 hours and the optimal alignment between the chimpanzee chromosome Y (24 MBP) and the human chromosome Y (59 MBP) in 13.1 hours.


... Web applications tend to have more accessibility to hardware of mobile devices. According to World Wide Web Consortium (W3C), mobile web applications can access the hardware data such as battery status [25], GPS [26], vibration [27], ambient light sensor [28], multimedia (camera, microphone) [29], and motion sensor [30]. ...
... We followed the common approach of having participants act as shoulder surfers [26][27][28][29]. Participants acted as shoulder surfers and the experimenter acted as the victim. ...
... There are also several linear-time and linear-space sub-optimal algorithms [11], which make local sequence alignment even more practical. Furthermore, there are several recent attempts to leverage GPU to accelerate the Smith-Waterman algorithm [26,28]. Fig. 1. ...
Chapter
The most common issue of alphanumeric passwords is users normally create weak passwords for the reason that strong passwords are difficult to recognise and memorise. Graphical password authentication system is one of the approaches to address the issues of alphanumeric passwords memorability. Wiedenbeck et al. propose PassPoints in which a password is a sequence of any 5 to 8 user-selected click points on a system-assigned image. Nevertheless, PassPoints still faces the problem of predictable click points and shoulder surfing attack. In this paper, we propose an alternative graphical password system on smartphones called HapticPoints. By adding haptic feedback to PassPoints as additional decoy click points, the aforementioned problems can be prevented without needing users to do any additional memory task. We also conduct a user study to evaluate and compare the usability of HapticPoints and PassPoints.
... Both scenarios have been parallelized in the literature [26,27], but fine-grained parallelism applies better to the first scenario due to the amount of data and computation involved, and therefore fits better into many-core platforms. Among them, we find Intel Xeon Phis [28], Nvidia GPUs using CUDA [29], and even multi-GPU using CUDAlign 4.0 [30], which is our departure point to analyze cost, performance and power efficiency along this work. ...
... CUDAlign [29] obtains the alignment of long sequences with variants of SW and Myers-Miller (see stages and phases summarized in Table 1). The GPU calculates a single SW matrix using all many-cores in a fine grained way, and data dependencies force neighbour cores to communicate in order to exchange border elements. ...
... We compare homologous chromosomes from human and chimpanzee genomes, as high similarity has been observed in evolutionary studies on the human species [37], in particular for chromosomes 16 [38], 22 [39] and Y [40]. Our selection is summarized in Table 4, where comparisons are named chr22, chr21, 47M, chrY, following names found in [29,30]. DNA sequences from all those chromosomes are compared in the results presented in Table 5. ...
Article
Background: We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal pairwise alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method. Results: Our study includes acceleration factors, performance, scalability, power efficiency and energy costs. We also quantify the influence of the contents of the compared sequences, identify potential scenarios for energy savings on speculative executions, and calculate performance and energy usage differences among distinct GPU generations and models. For a sequence alignment on chromosome-wide scale (around 2 Petacells), we are able to reduce execution times from 9.5 h on a Kepler GPU to just 2.5 h on a Pascal counterpart, with energy costs cut by 60%. Conclusions: We find GPUs to be an order of magnitude ahead in performance per watt compared to Xeon Phis. Finally, versus typical low-power devices like FPGAs, GPUs keep similar GFLOPS/W ratios in 2017 on a five times faster execution.
... Block pruning is a pruning strategy proposed in CUDAlign 2.1 [14] and further used in SW# [15] and [16]. It calculates optimal local sequence alignments with the affine gap model using GPU (Graphics Processing Unit). ...
... For instance, MASA-CUDAlign [17] uses GPUs to accelerate the alignment computation and the GPU parallelism is based on blocks. The performance of MASA-CUDAlign is highly dependent on the size of the blocks and a study of this effect was already made in [14], where it was shown that using few large blocks reduces the parallelism whereas a great number of small blocks increases the synchronization overhead. ...
... In this paper we explored Block Pruning, which is a pruning optimization able to reduce significantly the execution time of algorithms that compute the optimal alignment of similar sequences. This optimization was originally developed for the CUDAlign 2.1 tool [14] using diagonal processing and, at that moment, it was already clear that it was able to attain high pruning efficiency. In a tool called MASA (Multi-Platform Architecture for Sequence Aligners) [17], Block Pruning was extended to the dataflow (generic) processing, attaining in this case higher efficiency than the diagonal processing. ...
Article
Biological sequence comparison algorithms that compute the optimal local and global alignments calculate a dynamic programming (DP) matrix with quadratic time complexity. The DP matrix $H$ is calculated with a recurrence relation in which the value of each cell $H_{i,j}$ is the result of a maximum operation on the values of cells $H_{i-1,j-1}$, $H_{i-1,j}$ and $H_{i,j-1}$, each increased or decreased by a constant value. Therefore, the difference between the value of the cell $H_{i,j}$ being calculated and the values of its previously computed direct neighbor cells respects well-defined upper and lower bounds. Using these bounds, we can show that it is possible to determine the maximum and the minimum value of every cell in $H$, for a given reference cell. We use this result to define a generic pruning method which determines the cells that can be pruned (i.e. cells that need not be computed since they will not contribute to the final solution), accelerating the computation while keeping the guarantee that the optimal result will be produced. The goal of this paper is thus to investigate and formalize properties of the DP matrix in order to estimate and increase the efficiency of the pruning method. We also show that the pruning efficiency depends mainly on three characteristics: (a) the order in which the cells of $H$ are calculated, (b) the values of the parameters used in the recurrence relation and (c) the contents of the sequences compared.
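The bound argument above can be illustrated with a small sketch. It assumes a simplified recurrence with a linear gap penalty (the tools cited on this page use the affine gap model), and the function name and scoring values are hypothetical. For each cell, it checks whether the cell value plus the largest gain still obtainable from the remaining diagonal steps could exceed the best score found so far. Cells that fail the check are only counted, not skipped, so the returned score is still the full Smith-Waterman score.

```python
def sw_score_with_pruning(a, b, match=1, mismatch=-3, gap=-2):
    """Score-only Smith-Waterman (linear gap) that counts prunable cells.

    Returns (best_score, prunable_cells). A cell (i, j) is counted as
    prunable when value + match * min(m - i, n - j) cannot exceed the
    best score seen so far, i.e. no path through it can improve the result.
    """
    m, n = len(a), len(b)
    prev = [0] * (n + 1)          # row i-1 of the DP matrix
    best = 0
    prunable = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)       # row i of the DP matrix
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur[j] = h
            if h > best:
                best = h
            # Upper bound: every remaining diagonal step adds at most `match`.
            remaining = min(m - i, n - j)
            if h + match * remaining <= best:
                prunable += 1
        prev = cur
    return best, prunable


if __name__ == "__main__":
    score, count = sw_score_with_pruning("TTACGTAA", "GGACGTCC")
    print(score, count)
```

A real pruning implementation would skip whole blocks of such cells instead of counting them one by one; this sketch only shows why the bound is safe to apply.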
... SW has become very popular over the last decade to compute (1) the exact pairwise comparison of DNA/RNA sequences or (2) a protein sequence (query) to a genomic database involving a bunch of them. Both scenarios have been parallelized in the literature [8], but fine-grained parallelism applies better to the first scenario, and therefore fits better into many-core platforms like Intel Xeon Phis [13], Nvidia GPUs using CUDA [26], and even multi-GPU using CUDAlign 4.0 [27], which is our departure point to analyze performance, power, energy and cost along this work. ...
... GPUs calculate a single SW matrix using all many-cores, but data dependencies force neighbour cores to communicate in order to exchange border elements. For Megabase DNA sequences, the SW matrix is several Petabytes long, and so, very few GPU strategies [11,26] allow the comparison of Megabase sequences longer than 10 Million Base Pairs (MBP). SW# [11] is able to use 2 GPUs in a single Megabase comparison to calculate the Myers-Miller [15] linear space variant of SW. ...
... SW# [11] is able to use 2 GPUs in a single Megabase comparison to calculate the Myers-Miller [15] linear space variant of SW. CUDAlign [26] obtains the alignment of Megabase sequences with a combined SW and Myers-Miller strategy. When compared to SW#, CUDAlign presents shorter execution times for huge sequences on a single GPU [11]. ...
Conference Paper
We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored on different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration and minimize power consumption. Experimental results using CUDA on a set of GeForce GTX 980 GPUs illustrate their capabilities as high-performance and low-power devices, with an energy cost that becomes more attractive as the number of GPUs increases. Overall, our results demonstrate a good correlation between the performance attained and the extra energy required, even in scenarios where multi-GPUs do not show great scalability.
... Multiple HW/SW Platforms; [Aldinucci et al. 2010], 10^4, local score, CPU (OpenMP or Cilk or TBB or FastFlow); [Benkrid et al. 2012], 10^3, local score, FPGA (Handel-C), CellBE (Cell SDK), GPU (CUDA); [Liu et al. 2013], 10^3, local score, GPU (CUDA), CPU (SSE); [Hamidouche et al. 2013], 10^3, local score, CellBE (BSP++), 10^7, CPU (BSP++); [Rajko and Aluru 2004], 10^6, local score, align, CPU (MPI); [Sandes and Melo 2013b], 10^7, local score, align, GPU (CUDA); [Korpar and Sikic 2013], 10^7, local score, align, GPU (CUDA), 10^7, local score, Phi (IMCI). A parallel exact algorithm that generates global alignments for Megabase sequences was proposed in [Rajko and Aluru 2004] for a cluster of CPUs. The key idea of this algorithm is to use Hirschberg's algorithm (Section 2.3) combined with Parallel Prefix (PP) computations to find a partial balanced partition between subsequences of S0 and S1. ...
... This partition allows the subdivision of the original problem in independent subproblems that can be solved in parallel. CUDAlign 2.1 was proposed by [Sandes and Melo 2013b] and it is a combination of the Gotoh (Section 2.1) and MM algorithms (Section 2.3), retrieving optimal local alignments in linear space for Megabase sequences in GPU. It computes optimal local alignments in 5 stages, where stage 1 obtains the optimal score with antidiagonal parallelism (Figure 2(c)) and the block pruning optimization, saving some rows to disk. ...
... Stages 2 to 5 implement the traceback, executing a modified version of MM, retrieving the coordinates of the points that belong to the optimal local alignment in a divide-and-conquer way. [Korpar and Sikic 2013] proposed SW#, which is an approach that implements the MM algorithm (Section 2.3) with the parallelization strategy and the block pruning optimization proposed in CUDAlign 2.1 [Sandes and Melo 2013b] for retrieving the local alignment between Megabase DNA sequences in GPU. proposed SWAPHI-LS, a strategy to execute local comparisons of Megabase sequences with one or more Intel Phis. ...
Article
Biological sequence alignment is a very popular application in Bioinformatics, used routinely worldwide. Many implementations of biological sequence alignment algorithms have been proposed for multicores, GPUs, FPGAs and CellBEs. These implementations are platform-specific; porting them to other systems requires considerable programming effort. This article proposes and evaluates MASA, a flexible and customizable software architecture that enables the execution of biological sequence alignment applications with three variants (local, global, and semiglobal) in multiple hardware/software platforms with block pruning, which is able to reduce significantly the amount of data processed. To attain our flexibility goals, we also propose a generic version of block pruning and developed multiple parallelization strategies as building blocks, including a new asynchronous dataflow-based parallelization, which may be combined to implement efficient aligners in different platforms. We provide four MASA aligner implementations for multicores (OmpSs and OpenMP), GPU (CUDA), and Intel Phi (OpenMP), showing that MASA is very flexible. The evaluation of our generic block pruning strategy shows that it significantly outperforms the previously proposed block pruning, being able to prune up to 66.5% of the cells when using the new dataflow-based parallelization strategy.
... In the last decades, SW approaches for both cases have been parallelized in the literature, using multiprocessor/multicores [3,4], CellBEs (Cell Broadband Engines) [5], FPGAs (Field Programmable Gate Arrays) [6], ASICs (Application Specific Integrated Circuits) [7], Intel Xeon Phis [8] and GPUs (Graphics Processing Units) [9,10,11,12]. The SW algorithm is widely used by biologists to compare sequences in many practical applications, such as identification of orthologs [13], and virus integration detection [14]. ...
... A number of works have already examined the use of GPUs to accelerate SW computation. Some of them use only one GPU [19,20,11,21,22], whereas several approaches have been recently proposed to execute SW in multiple GPUs [12,9,10]. ...
... Some strategies [3,20,11,10] are able to obtain the optimal local alignment of Megabase sequences longer than 1 Million Base Pairs (MBP). These strategies use linear space techniques to obtain the alignment with a reasonable amount of memory. ...
Article
This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the application scalability. In order to obtain a multi-GPU highly parallel traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all Human and Chimpanzee homologous chromosomes, whose sizes range from 26 Millions of Base Pairs (MBP) up to 249 MBP. As far as we know, this is the first time such comparison was made with the SW exact method. We also show that the IST algorithm is able to reduce the traceback time from 2.15× up to 21.03×, when compared with the baseline traceback algorithm. The human×chimpanzee chromosome 5 comparison (180 MBP×183 MBP) attained 10,370.00 GCUPS (Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2 percent.
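The GCUPS figure quoted above relates matrix size to run time: the number of DP cells is the product of the two sequence lengths, and GCUPS is that count divided by the elapsed seconds and by 10^9. The sketch below is only an arithmetic illustration. The helper names are hypothetical, the sequence sizes are the rounded values from the abstract, and the result is a nominal matrix-filling time that ignores traceback and any pruning.

```python
def gcups(m, n, seconds):
    """Billions of DP cells updated per second for an m x n matrix."""
    return (m * n) / seconds / 1e9


def seconds_at(m, n, rate_gcups):
    """Nominal time to fill an m x n matrix at a given GCUPS rate."""
    return (m * n) / (rate_gcups * 1e9)


if __name__ == "__main__":
    # Rounded sizes of the human x chimpanzee chromosome 5 comparison
    # and the rate reported in the abstract (assumed sustained throughout).
    t = seconds_at(180e6, 183e6, 10370.0)
    print(f"{t / 60:.1f} minutes of nominal matrix filling")
```

With these rounded inputs the nominal filling time comes out at roughly 53 minutes, which gives a sense of scale rather than a reproduction of the reported experiment.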
... General–Purpose Computing on Graphics Processing Units (GPGPU) provides a powerful platform to implement the sequence alignment using Smith–Waterman algorithm [7], [8] in parallel. In our parallel SPADE (P–SPADE) implementation, we have applied the wavefront propagation method [9] of parallel sequence alignment to accelerate the signature generation of packers on a single GPU. To construct packer signatures in this approach, we have used CUDA but testing of signatures is performed solely on CPU. ...
... In SPADE, Smith–Waterman algorithm is incorporated for each pairwise alignment of two sequences S1 and S2. This algorithm constructs a matrix of size (m + 1) × (n + 1) where m and n are the lengths of the source and target sequences respectively. The values in the matrix can be computed independently using wavefront method [9] which is described in Section V. ...
... All cells along anti–diagonal R can be computed in parallel using anti–diagonals R1 and R2. 1) Wavefront Method: Fig. 5 depicts the wavefront propagation [9] in the score matrix. The cells along the i-th anti–diagonal R can be computed in parallel from the (i − 1)-th and (i − 2)-th anti–diagonals (R1 and R2). ...
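The wavefront order described in the excerpt above can be sketched on the CPU as follows. This is a sequential illustration under assumed scoring values and a linear gap penalty, with a hypothetical function name. It is not the GPU kernel of any cited tool. The inner loop visits the cells of one anti-diagonal, and each of them reads only cells from the two previous anti-diagonals, which is what makes that loop safe to run in parallel.

```python
def sw_wavefront(a, b, match=1, mismatch=-3, gap=-2):
    """Smith-Waterman score computed anti-diagonal by anti-diagonal."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    # Anti-diagonal d holds the cells (i, j) with i + j == d.
    for d in range(2, m + n + 1):
        # Cells on the same anti-diagonal are independent of each other.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,   # anti-diagonal d - 2
                H[i - 1][j] + gap,     # anti-diagonal d - 1
                H[i][j - 1] + gap,     # anti-diagonal d - 1
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(sw_wavefront("TTACGTAA", "GGACGTCC"))
```

The full matrix is kept here for readability; only the last two anti-diagonals are needed to compute the score.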
... To obtain complete alignment results, the tool was further extended to integrate the score-only Smith-Waterman algorithm with the Myers-Miller algorithm [15], which computes optimal global alignments in linear space. In addition, the tool realized efficient matrix filling with intrapair block pruning, after which it achieved a further acceleration of up to 51 % [16]. Meanwhile, SW# [5] implemented a parallel algorithm that can achieve further acceleration on two GPUs; that dual-GPU implementation aligned the human chromosome 21 with the chimpanzee chromosome 22 in 6.5 h on a GeForce GTX 690. ...
... Sandes et al. [16] developed CUDAlign 2.1, which computes optimal local alignments in three phases, as Chao et al. [19] did: (1) the forward matrix-filling phase, which computes the highest alignment score and the ending alignment position, (2) the backward matrix-filling phase, which obtains the starting alignment position from the computed ending position, and (3) the reconstruction phase, which obtains the full alignment by applying the Myers-Miller algorithm [15] to subsequences between the starting and ending alignment positions. This tool also realized intrapair block pruning for efficient SW alignment. ...
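The first two phases described in this excerpt can be sketched in a few lines. The sketch assumes a linear gap penalty and hypothetical function names and scoring values, so it is a simplified illustration rather than the CUDAlign 2.1 procedure. The forward pass keeps two rows and records where the best local score ends. The backward pass runs an anchored (no reset to zero) alignment over the reversed prefixes until it reaches the same score, which yields a start position. The reconstruction phase with Myers-Miller is not shown.

```python
def forward_end(a, b, match=1, mismatch=-3, gap=-2):
    """Phase 1: best local score and the 1-based cell (i, j) where it ends."""
    prev = [0] * (len(b) + 1)
    best, end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best:
                best, end = cur[j], (i, j)
        prev = cur
    return best, end


def backward_start(a, b, end, best, match=1, mismatch=-3, gap=-2):
    """Phase 2: 0-based start offsets of an optimal alignment ending at `end`."""
    if best == 0:
        return end
    ra = a[:end[0]][::-1]   # prefix of a up to the end cell, reversed
    rb = b[:end[1]][::-1]   # prefix of b up to the end cell, reversed
    # Anchored alignment: it must begin at the end cell, so no zero reset.
    prev = [j * gap for j in range(len(rb) + 1)]
    for i in range(1, len(ra) + 1):
        cur = [i * gap] + [0] * len(rb)
        for j in range(1, len(rb) + 1):
            s = match if ra[i - 1] == rb[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] == best:
                return (end[0] - i, end[1] - j)
        prev = cur
    return None


if __name__ == "__main__":
    a, b = "TTACGTAA", "GGACGTCC"
    best, end = forward_end(a, b)
    start = backward_start(a, b, end, best)
    print(best, start, end, a[start[0]:end[0]], b[start[1]:end[1]])
```

Both passes use memory proportional to one row, so the start and end coordinates are obtained without storing the quadratic matrix.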
... However, the lower bound is obtained from the ongoing pair to be aligned. Consequently, there is a limitation on the maximum number of matrix cells that can be pruned, for which the researchers provide a proof [16]. ...
Article
The Smith-Waterman algorithm is known to be a more sensitive approach than heuristic algorithms for local sequence alignment algorithms. Despite its sensitivity, a greater time complexity associated with the Smith-Waterman algorithm prevents its application to the all-pairs comparisons of base sequences, which aids in the construction of accurate phylogenetic trees. The aim of this study is to achieve greater acceleration using the Smith-Waterman algorithm (by realizing interpair block pruning and band optimization) compared with that achieved using a previous method that performs intrapair block pruning on graphics processing units (GPUs). We present an interpair optimization method for the Smith-Waterman algorithm with the aim of accelerating the all-pairs comparison of base sequences. Given the results of the pairs of sequences, our method realizes efficient block pruning by computing a lower bound for other pairs that have not yet been processed. This lower bound is further used for band optimization. We integrated our interpair optimization method into SW#, a previous GPU-based implementation that employs variants of a banded Smith-Waterman algorithm and a banded Myers-Miller algorithm. Evaluation using the six genomes of Bacillus anthracis shows that our method pruned 88 % of the matrix cells on a single GPU and 73 % of the matrix cells on two GPUs. For the genomes of the human chromosome 21, the alignment performance reached 202 giga-cell updates per second (GCUPS) on two Tesla K40 GPUs. Efficient interpair pruning and band optimization makes it possible to complete the all-pairs comparisons of the sequences of the same species 1.2 times faster than the intrapair pruning method. This acceleration was achieved at the first phase of SW#, where our method significantly improved the initial lower bound. However, our interpair optimization was not effective for the comparison of the sequences of different species such as comparing human, chimpanzee, and gorilla. 
Consequently, our method is useful in accelerating the applications that require optimal local alignments scores for the same species. The source code is available for download from http://www-hagi.ist.osaka-u.ac.jp/research/code/.
... nucleotides. 9,10,14 These approaches are platform-specific because CUDA is only available for NVidia GPU cards. Another alternative is the use of a framework for heterogeneous platforms like Open Computing Language (OpenCL) 15 to implement this task, 16,17 allowing the execution of the same code in different architectures. ...
... Our MASA-OpenCL solution is focused on the first stage of CUDAlign, 9 which provides the optimal score and its coordinates in the DP matrix. The BP procedure executes on this stage, discarding unnecessary calculations of cells that cannot produce the optimal result. ...
Article
Biological sequence comparison is often used as an auxiliary task in the analysis of genetic material. Pairwise comparison algorithms like Smith‐Waterman evaluate two strings representing sequences of proteins, DNA or RNA to obtain optimal alignment between them. Many applications have been proposed to address the sequence comparison problem, prioritizing the use of graphics cards and proprietary languages such as CUDA. In this paper, we propose and evaluate MASA‐OpenCL, an OpenCL solution for comparing long DNA sequences that is based on the MASA sequence alignment framework, with pruning capability proportional to the similarity of the sequences compared. The results of MASA‐OpenCL were compared to its CUDA counterpart (MASA‐CUDAlign) and, in most cases, MASA‐OpenCL achieved better performance. In order to better understand the behavior of MASA‐OpenCL, we performed a statistical analysis considering 11 comparisons of sequences with high, medium and low similarity in 4 GPUs. As a result, we obtained a multiple linear regression model that considers (a) the sizes of the sequences, (b) the similarity between them, (c) the computational power of the GPU, and (d) the GPU memory bandwidth. We used this model to predict the performance in two other GPUs, with low error rates.
... However, the Smith-Waterman algorithm is still widely used because of its high sensitivity in sequence alignment, even though it has a higher time complexity. To enable the Smith-Waterman algorithm to produce exact results in a reasonably shorter time, much research has been focusing on using various high-performance architectures to accelerate the processing speed of the algorithm [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]. In particular, it has become a recent trend to use emerging accelerators and many-core architectures, such as field-programmable gate arrays (FPGAs) [10][11][12], Cell/BEs [13][14][15][16][17], and general-purpose graphics processing units (GPUs), to run the Smith-Waterman algorithm [18][19][20][21][22][23][24][25][26]. ...
Article
Sequence alignment lies at the heart of bioinformatics. The Smith-Waterman algorithm is one of the key sequence search algorithms and has gained popularity due to improved implementations and rapidly increasing compute power. Recently, the Smith-Waterman algorithm has been successfully mapped onto the emerging general-purpose graphics processing units (GPUs). In this paper, we focus on how to improve the mapping, especially for short query sequences, by better usage of shared memory. We implemented and evaluated the proposed method on two different platforms (Tesla C1060 and Tesla K20) and compared it with two classic methods in CUDASW++. Further, the performance for different numbers of threads and blocks has been analyzed. The results show that the proposed method significantly improves the Smith-Waterman algorithm on CUDA-enabled GPUs, given a proper allocation of block and thread numbers.
... Some of the most relevant results on accelerating SW for long strings with GPUs have been achieved by the several versions of the CUDAlign library [26,27,28,29]. Their goal is to recover the full alignment. ...
Article
Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings having hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity.
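The wavefront idea behind this abstract can be sketched for unit-cost edit distance. This is a plain CPU illustration with a hypothetical function name. It is not the GPU implementation and does not include the halving optimization mentioned above. For each edit count s, it stores per diagonal k = j - i the furthest row i reachable with s edits, then extends along matching characters for free, so the work grows with the edit distance rather than with the full matrix.

```python
def wfa_edit_distance(a, b):
    """Unit-cost edit distance computed with furthest-reaching wavefronts."""
    n, m = len(a), len(b)
    final_k = m - n                      # diagonal of the bottom-right cell

    def extend(i, k):
        """Slide down diagonal k while characters match (costs no edits)."""
        j = i + k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    wavefront = {0: extend(0, 0)}        # diagonal -> furthest row reached
    s = 0
    while True:
        if wavefront.get(final_k, -1) >= n:
            return s
        s += 1
        nxt = {}
        for k in range(-s, s + 1):
            candidates = []
            if k in wavefront:           # substitution: stay on diagonal k
                candidates.append(wavefront[k] + 1)
            if k - 1 in wavefront:       # insertion: consume one char of b
                candidates.append(wavefront[k - 1])
            if k + 1 in wavefront:       # deletion: consume one char of a
                candidates.append(wavefront[k + 1] + 1)
            # Keep only positions that lie inside the matrix.
            valid = [i for i in candidates if i <= n and 0 <= i + k <= m]
            if valid:
                nxt[k] = extend(max(valid), k)
        wavefront = nxt


if __name__ == "__main__":
    print(wfa_edit_distance("kitten", "sitting"))
```

Highly similar strings finish after a few wavefronts, which matches the abstract's remark that performance depends on the similarity between the strings.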
... Smith-Waterman finds the best match between the conserved domains of two sequences using a local alignment. Smith-Waterman is useful in contrasting sequences that are similar to co-patterns or similarities that are within the larger sequence [15]. ...
Article
Text similarity is critical in a variety of applications, including word processing, signal processing, imagery, data mining, wireless sensor networks, etc., where text similarity measurements can detect whether texts are lexically or semantically similar. Semantic text similarity is the term used to describe similarity based on meaning. Although this task is very challenging, it remains an active subject of study due to the complexities of natural language. The second type is lexical similarity, which can be used to eliminate repetition by grouping together texts that are very similar. It is important to remember that traditional text similarity approaches only look at the actual words in a phrase to compare two texts. Depending on the use case, it's easier to build and manage and offers a better trade-off. This paper examines current work on text similarity and divides it into four categories: techniques based on strings, corpus, knowledge, or hybrid similarities; these categories are all comparable. There are also examples of different combinations of these techniques for matching text and finding similarities between two texts. A smart method called fuzzy data similarity (FDS) is proposed to find the similarity between two texts; to demonstrate its efficiency, it was compared with the most well-known methods, and the results showed an FDS accuracy of about 93%.
... The first is the usage of CPUs in combination with, e.g., SIMD-instructions (Single instruction, multiple data), many-core CPUs or distributed computing [18,19]. The second is the usage of either high-performance compute or consumer-grade GPUs as accelerators [20,21]. The last approach is the usage of FPGA-accelerators which, although they are not as fast as systems based on high-performance GPUs, tend to be more energy-efficient [22][23][24]. ...
Article
This paper is concerned with Field Programmable Gate Arrays (FPGA)-based systems for energy-efficient high-throughput string comparison. Modern applications which involve comparisons across large data sets—such as large sequence sets in molecular biology—are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range 16 to 2048-bit, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH) one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O(m∗n) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high-throughput using hundreds of computing elements, arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second—of 512-bit signatures—was achieved, using a design with 384 parallel processing elements and a clock frequency of 200 MHz. This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design, executed on an NVIDIA GTX1060, it performs nearly five times faster.
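The comparison step described above reduces to an XOR followed by a population count. The sketch below is a software illustration only. It represents each signature as a Python integer, uses hypothetical function names, and does not model the systolic FPGA architecture.

```python
def hamming(x, y):
    """Hamming distance between two bit signatures stored as integers."""
    return bin(x ^ y).count("1")


def all_pairs(queries, database):
    """The O(m*n) distance table: one row per query, one column per signature."""
    return [[hamming(q, s) for s in database] for q in queries]


if __name__ == "__main__":
    width = 512                                   # signature width from the abstract
    sig_a = (1 << width) - 1                      # all ones (made-up test value)
    sig_b = sig_a ^ 0b1011                        # differs from sig_a in three bits
    print(all_pairs([sig_a], [sig_a, sig_b]))
```

Each table entry is independent of the others, which is why the workload maps well onto many parallel processing elements.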
... For the problem of whole genome sequence alignment the Darwin platform (Turakhia et al., 2017) proposes new algorithms for whole genome alignment with constant memory on an FPGA and algorithms for short read mapping. The SW# (Korpar and Šikić, 2013) and CUDAlign (Sandes and de Melo, 2013) programs propose GPU implementations of Smith-Waterman for whole genome sequence alignment. However, their accuracy is unclear and their method is likely to miss inversions and transpositions since they perform Smith-Waterman on whole genomes instead of fragments. ...
... For the problem of whole genome sequence alignment the Darwin platform [16] proposes new algorithms for whole genome alignment with constant memory on an FPGA and algorithms for short read mapping. The SW# [17] and CUDAlign [18] programs propose GPU implementations of Smith-Waterman for whole genome sequence alignment. However, their accuracy is unclear and their method is likely to miss inversions and transpositions since they perform Smith-Waterman on whole genomes instead of fragments. ...
... There are also several linear-time and linear-space sub-optimal algorithms [11], which make local sequence alignment even more practical. Furthermore, there are several recent attempts to leverage GPU to accelerate the Smith-Waterman algorithm [26,28]. ...
Chapter
Security is not just a technical problem; it is also a business problem. Companies face highly sophisticated and targeted cyber attacks every day, losing a huge amount of money as well as private data. Threat intelligence (TI) helps in predicting and reacting to such problems, but extracting well-organized threat intelligence from an enormous amount of information is significantly challenging. In this paper, we propose a novel technique for visualizing security alerts, and implement it in a system that we call AlertVision, which provides an analyst with a visual summary of the correlation between security alerts. The visualization helps in understanding various threats in the wild in an intuitive manner, and eventually helps the analyst build TI. We applied our technique to real-world data obtained from the networks of 85 organizations, which include 5,801,619 security events in total, and summarized the lessons learned.
... In [15], the authors partitioned the database sequences into two sections based on the sequence length by running the short sequences on the CPU and the long sequences on the GPU. In [16], the authors combined a sequence alignment algorithm with linear space complexity using a GPU. The authors in [17] have suggested a measurement of similarities across two web pages, as well as a clustering method of the web sessions via a Fast Optimal Global Sequence Alignment algorithm (FOGSAA). ...
Article
Full-text available
Database sequencing applications, including sequence comparison, searching, and analysis, are considered among the most demanding in computation power and time. Heuristic algorithms suffer from limited sensitivity, while traditional sequencing methods require searching the whole database to find the best-matching sequences, which requires high computation power and time. This paper introduces a dynamic programming technique based on a measure of similarity between two sequential objects in the database using two components, namely frequency and mean. Additionally, database sequences that have the lowest scores in the comparison process are excluded, so that the similarity algorithm between a query sequence and other database sequences is applied only to meaningful parts of the database. The proposed technique was implemented and validated using a heterogeneous HW/SW FPGA-based embedded system platform. The implementation was partitioned into (1) a hardware part (running on the logic gates of the FPGA) and (2) a software part (running on the ARM processor of the FPGA). The validation study showed a significant reduction in computation time, accelerating the database sequencing processes by 60% compared to traditional known methods.
... Most of the work focussing on vector-level and thread-level parallelization for alignment algorithms was done in the context of accelerating the Smith-Waterman kernel for either protein database searches or pairwise sequence alignments of long input sequences on either CPUs with SIMD vectorization (Farrar, 2007; Rognes, 2011; Rognes and Seeberg, 2000), GPUs (Khajeh-Saeed et al., 2010; Korpar and Šikić, 2013; Li et al., 2012a; Liu et al., 2013; Sandes and de Melo, 2013), cell broadband engines (Sarje and Aluru, 2008; Szalkowski et al., 2008) or on more recent architectures like the Xeon Phi™ from Intel® (Liu and Schmidt, 2014; Liu et al., 2014; Rucci et al., 2017). However, the algorithmic components are hard to reuse since they are hidden within these applications and many tools work with outdated instruction sets. ...
Article
Motivation: Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (single instruction multiple data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we (a) distribute many independent alignments on multiple threads and (b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. Results: We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. Availability and implementation: The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. Supplementary information: Supplementary data are available at Bioinformatics online.
... On the other hand, many parallel sequence alignment methods have been developed using cluster and Grids [5][6][7], FPGAs [8,9], GPUs [10][11][12], and CellBEs [13][14][15]. However, only some of these approaches are capable of obtaining the alignment results in linear space, most of them only retrieve the score. ...
Conference Paper
Full-text available
The Myers-Miller algorithm is a widely used global alignment tool in computational biology that runs in quadratic time and linear space. Because of the huge time consumption, it is infeasible to align megabase sequences using the Myers-Miller tool. However, cloud computing is a promising platform for obtaining alignment results for megabase sequences in feasible time. In this paper, we present Cloud Myers-Miller, a parallel algorithm that constructs huge sequence alignments in the cloud. Cloud Myers-Miller is divided into three stages: preparation, parallel processing, and collection. Our results on a five-machine cluster show high speed-up for long real DNA sequences. It is possible to align DNA sequences of more than 543 KBP (kilo base-pairs) using Cloud Myers-Miller.
... This algorithm is used to align a specific region of a sequence. Many studies have been conducted to accelerate this algorithm, since it is important in finding similarities between sequences [18]-[22]. ...
Conference Paper
Full-text available
Bioinformatics is a growing field that attracts many researchers and continues to prove its value and significance. Since the early days of discovering genomic material and using it to identify new life forms, sequence alignment applications have become important in enabling discoveries with important biological or medical benefits. Finding similarities, or even relations, between sequences is a demanding process that requires considerable time and cost. Nowadays there are plenty of algorithms that are used to find similarities and/or differences between sequences. Many of these algorithms still suffer from performance issues, such as slow execution and poor scalability. Therefore, parallelization is widely used to address these issues. In this paper, we utilize a multi-threading parallelism technique coupled with a block alignment idea in order to improve sequence alignment performance. The experiments show that the proposed implementation outperforms the sequential implementation by 4.9 times for sequences of lengths ranging between 1024 and 8192.
... Among them we can cite two important ones: the Needleman-Wunsch (NW) algorithm and the Smith-Waterman (SW) algorithm. The work [Sandes and de Melo 2013] introduced CUDAlign, a real application for DNA sequence alignment that implements parallel optimizations for GPUs in the NW and SW algorithms. ...
Conference Paper
Full-text available
When a biological sequence is obtained, it is common to align it with another, already studied sequence to determine its characteristics. The challenge is to process this alignment in a timely manner. In this work we explore parallelism in a DNA sequence alignment application using the FastFlow and Intel TBB libraries. The experiments show that the TBB version achieved up to 4% better execution time compared to the original OpenMP version.
... There are several parallel pairwise alignment algorithms proposed based on the developments of high performance computing platforms, including GPU (Graphical Processing Unit) [8], FPGA (Field-Programmable Gate Array), and multiprocessor and cluster environments [9]. CUDAlign 2.1 is a parallel algorithm that uses GPU to align huge sequences, executing the SW algorithm combined with Myers-Miller, with linear space complexity [10]. CUDASW++ 2.0 is an enhanced SW protein alignment on GPUs based on the single instruction, multiple thread (SIMT) and the virtualized single instruction, multiple data (SIMD) abstraction [11]. ...
Conference Paper
Full-text available
Pairwise sequence alignment is a common and fundamental task in Computational Biology, which constitutes the basis for many Bioinformatics applications. In the post-genomic era, there is an increasing demand to align long DNA sequences to discover their functions. In this paper, we propose a parallel pairwise alignment algorithm for large genomic sequences by recursively dividing the whole genomic sequences into small pieces, with an effective pruning strategy to reduce search and computation space. We implemented rigorous tests on a 4-core computer using real genomic sequences and artificially generated sequences. The results show that our implementation can achieve speedup 10.64 with 99.75% accuracy compared to the sequential algorithm. As far as we know, this is the first time that MBP (mega base-pairs) sequences are globally aligned with an affine gap penalty.
... The benchmarks used are DTW (Dynamic Time Warping) [22], HEAT2D [25], HIST (Histogram) [23], INT_IMG (Integral Image) [9], SOR (Successive Over-Relaxation) [12] and SW (Smith Waterman) [26]. DTW is a common algorithm in time series analysis for measuring similarity between two time series with varying speeds. ...
Conference Paper
Full-text available
GPUs lack fundamental support for data-dependent parallelism and synchronization. While CUDA Dynamic Parallelism signals progress in this direction, many limitations and challenges still remain. This paper introduces Wireframe, a hardware-software solution that enables generalized support for data-dependent parallelism and synchronization. Wireframe enables applications to naturally express execution dependencies across different thread blocks through a dependency graph abstraction at run-time, which is sent to the GPU hardware at kernel launch. At run-time, the hardware enforces the dependencies specified in the dependency graph through a dependency-aware thread block scheduler. Overall, Wireframe is able to improve total execution time up to 65.20% with an average of 45.07%.
... Similarly, GPU solutions (e.g. [8,15,26]) accelerate the process at the cost of limiting the input sequence length (to 10^8 bp in this case). This limitation is the result of the quadratic space such algorithms require to report the alignment. ...
Article
Full-text available
Genome comparison poses important computational challenges, especially in CPU time, memory allocation and I/O operations. Although parallel approaches to multiple sequence comparison algorithms already exist, they face a significant limitation on the input sequence length. GECKO appeared as a computationally and memory-efficient method to overcome this limitation. However, its performance could be greatly increased by applying parallel strategies and I/O optimisations. We have applied two different strategies to accelerate GECKO while producing the same results. The first is a two-level parallel approach, parallelising each independent internal pairwise comparison in the first level and the GECKO modules in the second level. The second approach consists of a complete rewrite of the original code to reduce I/O. Both strategies outperform the original code, which was already faster than equivalent software. Thus, much faster pairwise and multiple genome comparisons can be performed, which is important given the ever-growing list of available genomes.
... For example, the alignment of biological sequences is a common procedure in biological analysis, while the huge volume of sequence data creates an urgent need for computing power and storage space. Sandes and Melo implement the Smith-Waterman algorithm, which can give the optimal alignment, on GPU with linear space complexity [4]. Moreover, Zhu et al. reduce the memory-access bandwidth bottleneck caused by sequence data to fully exploit the GPU's capability for accelerating MAFFT, which is one of the well-known software packages for multiple sequence alignment [5]. ...
Article
Full-text available
The technology of mathematical expression identification and recognition extracts mathematical expressions in document images, and it has been studied for over a decade. Based on previous works, we develop an automatic recognition tool, named EqnEye, which leverages the OpenCV library to perform image processing and Tesseract tool to recognize mathematical symbols. We also apply correction methods before the recognition stage to improve the recognition accuracy. To improve the efficiency for processing images of high resolution, the parallel implementation of thresholding method on GPU is integrated into this work. Experimental results exhibit the success of our correction methods to enhance the accuracy and the slight improvement to the performance. In addition, porting the recognition tool to handy devices can produce more value-added applications.
... Compared with the Needleman-Wunsch algorithm, the Smith-Waterman algorithm is more practical in finding similarities among DNA sequences. In addition, both methods are very useful and are often used as building blocks for complex problems [8], because both methods break a complex problem down into simple subproblems. Therefore, both algorithms are known as dynamic programming algorithms. ...
Conference Paper
Full-text available
This paper presents the potential of the one-dimensional systolic processing element of the Viterbi algorithm in optimizing the DNA sequence alignment system processing engine. The objective of this paper was to optimize the sensitive DNA sequence alignment algorithm toward improving the performance and design complexity. In addition, theoretical study, design, and simulation were conducted using the Altera Quartus II version 9.1 software. The proposed architecture has been tested and is capable of accelerating more than 32 bits of input. As a conclusion, the proposed systolic design has been proven and is able to optimize the performance and design complexity of the most sensitive DNA sequence alignment algorithm on hardware-based accelerator platform.
... They do not produce alignment profiles, which makes them fast and memory efficient but prevents visual inspection of the results. Other implementations [21,22] are capable of producing the alignment profiles, but show only one profile per alignment. These parallel SW implementations focus on two distinct properties of how the alignment matrix is filled (Fig 1B). ...
Article
Full-text available
To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
Article
Full-text available
Protein sequence alignment is important for determining the similarity between one protein and another. However, existing sequence alignment algorithms have high complexity, so programs that search for protein similarity take a long time to run when the number of sequences to be aligned is very large. This research aims to obtain a protein similarity matrix with a transfer learning approach using the transformer architecture. The data used as input to the transformer are protein sequence data in text format. The result of the research is a protein similarity matrix that other researchers can use in their respective fields as material for further analysis.
Article
Full-text available
Sequence alignment is a critical computational problem in various domains, including genomics, proteomics, and natural language processing. The Needleman‐Wunsch (NW) algorithm is a classical dynamic programming approach for finding the optimal global alignment between two sequences. However, its quadratic time and space complexity make it impractical for aligning large‐scale sequences, which are increasingly common in modern applications. In this article, we propose a parallel variation of the NW algorithm that enables scalable global sequence alignment with customizable scoring schemes. Our approach re‐formulates the dependencies in the NW algorithm to enable parallel execution, thereby leveraging the computational power of modern parallel architectures, such as graphics processing unit (GPU). Furthermore, our algorithm supports arbitrary linear scoring schemes, which allows us to use domain‐specific knowledge to improve alignment accuracy. We establish the correctness of our algorithm and evaluate its performance using real DNA and user trajectory sequences on GPUs. Our parallel algorithm has shown impressive results in our experiments, with a peak performance of 27.99 GCUPS (giga cell updates per second) and a maximum speedup of 48.18 times compared to the traditional sequential implementation. Additionally, our algorithm demonstrates remarkable scalability, enabling the alignment of sequences of any length while ensuring balanced work distribution and optimal utilization of resources. Our primary objective is to harness the computational capabilities of a single GPU and fully utilize the processing power of multi‐core CPUs.
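For reference, the classical sequential recurrence that this abstract takes as its starting point can be sketched as follows. This is only the textbook Needleman-Wunsch score computation with a linear gap penalty, kept to two rows so that memory is linear in the shorter dimension; it is not the parallel reformulation the authors propose, and the function name and default scores are assumptions chosen for illustration.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch, linear gap penalty), two-row DP."""
    # Row 0: aligning the empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of `b` costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


identical = nw_score("GATTACA", "GATTACA")
one_gap = nw_score("ACGT", "AGT")
```

Every cell depends on its left, upper and upper-left neighbours, which is the dependency pattern that parallel variants have to reorganize.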
Article
In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 papers published in the last 14 years. We analyze the optimizations from different perspectives which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.
Chapter
Pairwise sequence alignment is an important application to identify regions of similarity that may indicate the relationship between two biological sequences. This is a computationally intensive task that usually requires parallel processing to provide realistic execution times. This work introduces a new framework for a deadline constrained application of sequence alignment, called MASA-CUDAlign, that exploits cloud computing with Spot GPU instances. Although much cheaper than On-Demand instances, Spot GPUs can be revoked at any time, so the framework is also able to restart MASA-CUDAlign from a checkpoint in a new instance when a revocation occurs. We evaluate the proposed framework considering five pairs of DNA sequences and different AWS instances. Our results show that the framework reduces financial costs when compared to On-Demand GPU instances while meeting the deadlines even in scenarios with several instances revocations.
Article
The parallelization of Smith-Waterman sequence comparison tools for long DNA sequences has been a big challenge over the years, requiring the use of several devices and sophisticated optimizations. Pruning is one of these optimizations, which can considerably reduce the amount of computation. This paper proposes MultiBP, a sequence comparison solution in multiple GPUs with block pruning. Two MultiBP strategies are proposed. In static score-sharing, workload is statically distributed to the GPUs, and the best score is sent to neighbor GPUs to simulate a global view. In the dynamic strategy, execution is divided into cycles and workload is dynamically assigned, according to the GPUs' processing rate. MultiBP was integrated into MASA-CUDAlign and tested on homogeneous and heterogeneous platforms, with different NVidia GPU architectures. The best results on our homogeneous and heterogeneous platforms were mostly obtained by the static and dynamic approaches, respectively. We also show that our decision module is able to select the best strategy in most cases. Finally, the comparison of the human and chimpanzee chromosomes 1 in a cluster with 512 V100 NVidia GPUs took 11 minutes and obtained the impressive rate of 82,822 GCUPS which is, to our knowledge, the best performance for SW tools in GPUs.
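To make the GCUPS figure concrete, the sketch below shows how the metric (giga cell updates per second) relates matrix size and run time, and inverts it for the numbers quoted above. The helper names and the toy sequence lengths are illustrative assumptions. Because the abstract rounds the run time to 11 minutes, the implied cell count is only an approximation, and pruning may mean that not every cell was actually computed.

```python
def gcups(len_a: int, len_b: int, seconds: float) -> float:
    """Giga cell updates per second for an len_a x len_b DP matrix."""
    return (len_a * len_b) / (seconds * 1e9)


def implied_cells(rate_gcups: float, seconds: float) -> float:
    """Number of DP cells implied by a reported GCUPS rate and run time."""
    return rate_gcups * 1e9 * seconds


# Toy example (hypothetical lengths): a 2 Mbp x 3 Mbp comparison in one minute.
toy_rate = gcups(2_000_000, 3_000_000, 60.0)

# Figures quoted in the abstract: 82,822 GCUPS sustained for about 11 minutes.
cells = implied_cells(82_822, 11 * 60)
```

Under these figures the run corresponds to roughly 5.5 × 10^16 cell updates.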
Article
Purpose Over the past decade, the cost of product development has increased drastically, and this is due to the inability of most enterprises to locate suitable and optimal collaborators for knowledge sharing. Nevertheless, knowledge sharing is a mechanism that helps people find the best collaborators with relevant knowledge. Hence, a new approach for locating optimal collaborators with relevant knowledge is needed, which could help enterprise in reducing cost and time in a knowledge-sharing environment. The paper aims to discuss these issues. Design/methodology/approach One unique challenge in the domain of knowledge sharing is that collaborators do not possess the same number of events resident in the knowledge available for sharing. In this paper, the authors present a new approach for locating optimal collaborators in knowledge-sharing environment using the combinatorial algorithm (CA-KSE). Findings The proposed pattern-matching approach implemented in Java is considered efficient for solving the issue peculiar to collaboration in knowledge-sharing domain. The authors benchmarked the proposed approach with its semi-global pairwise alignment and global alignment counterparts through scores comparison and the receiver operating characteristic curve. The results obtained from the comparisons showed that CA-KSE is a perfect test having an area under curve of 0.9659, compared to the other approaches. Research limitations/implications The paper has proposed an efficient algorithm, which is considered better than related methods, for matching several collaborators (more than two) in KS environment. The method could be deployed in medical field for gene analysis, software organizations for distributed development and academics for knowledge sharing. Originality/value One sign of strength of this approach, compared to most sequence alignment approaches that can only match two collaborators at a time, is that it can match several collaborators at a faster rate.
Article
Full-text available
Although dynamic programming (DP) is an optimization approach used to solve a complex problem fast, the time required to solve it is still not efficient and grows polynomially with the size of the input. In this contribution, we improve the computation time of the dynamic programming based algorithms by proposing a novel technique, which is called “SDP: Segmented Dynamic programming”. SDP finds the best way of splitting the compared sequences into segments and then applies the dynamic programming algorithm to each segment individually. This will reduce the computation time dramatically. SDP may be applied to any dynamic programming based algorithm to improve its computation time. As case studies, we apply the SDP technique on two different dynamic programming based algorithms; “Needleman–Wunsch (NW)”, the widely used program for optimal sequence alignment, and the LCS algorithm, which finds the “Longest Common Subsequence” between two input strings. The results show that applying the SDP technique in conjunction with the DP based algorithms improves the computation time by up to 80% in comparison to the sole DP algorithms, but with small or ignorable degradation in comparing results. This degradation is controllable and it is based on the number of split segments as an input parameter. However, we compare our results with the well-known heuristic FASTA sequence alignment algorithm, “GGSEARCH”. We show that our results are much closer to the optimal results than the “GGSEARCH” algorithm. The results are valid independent from the sequences length and their level of similarity. To show the functionality of our technique on the hardware and to verify the results, we implement it on the Xilinx Zynq-7000 FPGA.
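The accuracy trade-off described in this abstract can be illustrated with the LCS case study. The sketch below computes the exact LCS length and then a segmented variant that cuts both inputs at proportional positions and sums the per-segment results. The proportional cut is a deliberately naive assumption made for illustration: the abstract states that SDP searches for the best split, and that selection step is not reproduced here.

```python
def lcs_length(a: str, b: str) -> int:
    """Exact length of the longest common subsequence (two-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch == bj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def segmented_lcs_length(a: str, b: str, segments: int) -> int:
    """Sum of per-segment LCS lengths after cutting both inputs proportionally.

    Hypothetical splitting rule; the result is a lower bound on lcs_length(a, b).
    """
    cuts_a = [len(a) * k // segments for k in range(segments + 1)]
    cuts_b = [len(b) * k // segments for k in range(segments + 1)]
    return sum(
        lcs_length(a[cuts_a[k]:cuts_a[k + 1]], b[cuts_b[k]:cuts_b[k + 1]])
        for k in range(segments)
    )


exact = lcs_length("AGGTAB", "GXTXAYB")                  # 4, e.g. "GTAB"
approx = segmented_lcs_length("AGGTAB", "GXTXAYB", 2)    # 3 with this naive split
```

With k equal segments each sub-problem fills about 1/k² of the full matrix, so the total work drops by roughly a factor of k, while any match that would cross a cut is lost. That is the kind of degradation the abstract reports as controllable through the number of segments.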
Chapter
High-throughput techniques for DNA sequencing have led to an exponential growth of biological databases. These biological sequences have to be analyzed and interpreted in order to determine their function and structure. Therefore biological sequence analysis tools need to deal with an ever-growing amount of data. Unfortunately, these databases grow faster than the core performance of single processor. Biological sequence analysis tools can take advantage of parallel computing and the emergence of multicore architectures, such as graphics processing units (GPUs), which provide the opportunity to significantly reduce the execution time of these tools. This chapter discusses the recent advances in GPU solutions for two main sequence comparison problems, pairwise sequence comparison and sequence-profile comparison.
Conference Paper
Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely used on GPUs to benefit from their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between the processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing SMs to wait for the completion of the SM with the longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles. In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a run-time and a CUDA-based API to allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane transformation-based GPU implementation.
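The baseline wavefront schedule that this abstract improves upon can be sketched as an ordering of tiles by anti-diagonal. The sketch below is a sequential Python illustration of that ordering only, assuming each tile depends on its left, upper and upper-left neighbours; it models neither GPU execution nor PeerWave's peer-wise synchronization, and the function name is hypothetical.

```python
def wavefronts(rows: int, cols: int):
    """Yield lists of (i, j) tiles; tiles in one list share an anti-diagonal.

    Tiles within a list are mutually independent and could run concurrently,
    with a barrier between consecutive lists.
    """
    for d in range(rows + cols - 1):
        i_min = max(0, d - cols + 1)
        i_max = min(rows - 1, d)
        yield [(i, d - i) for i in range(i_min, i_max + 1)]


schedule = list(wavefronts(2, 3))
# [[(0, 0)], [(0, 1), (1, 0)], [(0, 2), (1, 1)], [(1, 2)]]
```

The first and last wavefronts contain a single tile, which is one source of the underutilization and load imbalance that barrier-based schedules suffer from.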
Article
Full-text available
Many bioinformatics applications, such as the optimal pairwise biological sequence comparison, demand a great quantity of computing resources and are thus excellent candidates to run on high-performance computing (HPC) platforms. In the last two decades, a large number of HPC-based solutions were proposed for this problem that run on different platforms, targeting different types of comparisons with slightly different algorithms and making the comparative analysis of these approaches very difficult. This article proposes a classification of parallel optimal pairwise sequence comparison solutions, in order to highlight their main characteristics in a unified way. We then discuss several HPC-based solutions, including clusters of multicores and accelerators such as Cell Broadband Engines (CellBEs), Field-Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs) and Intel Xeon Phi, as well as hybrid solutions, which combine two or more platforms, providing the actual landscape of the main proposals in this area. Finally, we present open questions and perspectives in this research field.
Article
The Smith-Waterman algorithm is used in bioinformatics to perform pairwise local alignment between a query sequence and a subject sequence. We present a GPU-based parallel version of this algorithm that is able to perform pairwise alignment faster than previous algorithms. In particular, it parallelizes each alignment, rather than relying on parallelism across multiple pair alignments, which many other proposed GPU algorithms do. As a result, it scales better. We further extend our algorithm to work efficiently on a cluster of GPUs. At a high level, our approach subdivides the iterative computation of elements of a matrix among blocks of processors such that each block can simply recompute the data it needs instead of waiting for other processors to compute them. Sometimes this may lead to excessive recomputation, however. We evaluate these cases and employ a hybrid approach, recomputing only limited data and communicating the rest. Our algorithm is also extended to produce not only the best but all 'best K' alignments. Our results on the SSCA#1 benchmark show that our method is 5-24 times faster than the previous method.
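As background for the local alignment problem this abstract parallelizes, here is a minimal sequential sketch of the Smith-Waterman score recurrence with a linear gap penalty. It returns only the best local score and keeps two rows in memory; it does not reflect the authors' GPU decomposition or recomputation scheme, and the function name and scoring defaults are assumptions for illustration.

```python
def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets an alignment restart anywhere in the matrix.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


shared_core = sw_score("TTACGTTT", "GGACGTGG")  # the common "ACGT" scores 4 * 2
```

Only the zero clamp and the running maximum distinguish this from a global alignment recurrence; the cell dependencies are the same.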
Article
Full-text available
Dynamic programming seeks to solve complex problems by breaking them down into multiple smaller problems. The solutions of these smaller problems are then combined to reach the overall solution. Deterministic algorithms have the advantage of accuracy, but they require large computational power. Heuristic algorithms have the advantage of speed, but they provide less accuracy. This paper presents a hybrid design of a dynamic programming technique that is used for sequence alignment. Our technique combines the advantages of deterministic and heuristic algorithms by delivering the optimal solution in suitable time. We implement our design on a Xilinx Zynq-7000 Artix-7 FPGA and show that our implementation improves the performance of sequence alignment by 63% in comparison to traditional known methods.
Article
Modern weather satellites provide more detailed observations of cloud and precipitation processes. To harness these observations for better satellite data assimilations, a cloud-resolving model, known as the Goddard Cumulus Ensemble (GCE) model, was developed and used by the Goddard Satellite Data Simulator Unit (G-SDSU). The GCE model has also been incorporated as part of the widely used weather research and forecasting (WRF) model. The computation of the cloud-resolving GCE model is time-consuming. This paper details our massively parallel design of the GPU-based WRF GCE scheme. With one NVIDIA Tesla K40 GPU, the GPU-based GCE scheme achieves a speedup of 361× compared to its original Fortran counterpart running on one CPU core, whereas the speedup for one CPU socket (four cores) with respect to one CPU core is only 3.9×.
Article
Genomic science is now facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic data storage and transfer. It is desirable to compress data to reduce storage and transfer costs, and thus to boost data distribution and utilization efficiency. Up to now, a number of algorithms and tools have been developed for compressing genomic sequences. Unlike the existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (the abbreviation of Compressing Genomes as an Image) for genome compression, which transforms the genomic sequences to a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based compressor or a reference-free compressor. For the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors, GReEn and RLZ-opt, in both compression ratio and compression efficiency. It also achieves comparable compression ratio but two orders of magnitude higher compression efficiency in comparison with XM, one state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip, a general-purpose and widely used compressor, in both compression speed and compression ratio. So, CoGI can serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://admis.fudan.edu.cn/projects/cogi.htm.
Article
Full-text available
Human–chimpanzee comparative genome research is essential for narrowing down genetic changes involved in the acquisition of unique human features, such as highly developed cognitive functions, bipedalism or the use of complex language. Here, we report the high-quality DNA sequence of 33.3 megabases of chimpanzee chromosome 22. By comparing the whole sequence with the human counterpart, chromosome 21, we found that 1.44% of the chromosome consists of single-base substitutions in addition to nearly 68,000 insertions or deletions. These differences are sufficient to generate changes in most of the proteins. Indeed, 83% of the 231 coding sequences, including functionally important genes, show differences at the amino acid sequence level. Furthermore, we demonstrate different expansion of particular subfamilies of retrotransposons between the lineages, suggesting different impacts of retrotranspositions on human and chimpanzee evolution. The genomic changes after speciation and their biological consequences seem more complex than originally hypothesized.
Conference Paper
Full-text available
Biological sequence comparison is one of the most important tasks in Bioinformatics. Due to the growth of biological databases, sequence comparison is becoming an important challenge for high performance computing, especially when very long sequences are compared. The Smith-Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). In this work, we show how the comparison of long sequences takes advantage of current and future multicore architectures. We analyze two different SW implementations on the CellBE and use simulation tools to study the performance scalability in a multicore architecture. We study the memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture is a valid alternative to execute challenging bioinformatic workloads.
Article
Full-text available
DNA sequence alignment is a very important problem in bioinformatics. The algorithm proposed by Smith-Waterman (SW) is an exact method that obtains optimal local alignments in quadratic space and time. For long sequences, quadratic complexity makes the use of this algorithm impractical. In this scenario, the use of a reconfigurable architecture is a very attractive alternative. This article presents the design and evaluation of an FPGA-based architecture that obtains the similarity score between DNA sequences, as well as its coordinates. The results obtained in a Xilinx xc2vp70 FPGA prototype presented a speedup of 246.9 over the software solution to compare sequences of size 100MBP and 100BP, respectively. Unlike other hardware solutions that just calculate alignment scores, our design was able to avoid architecture bottlenecks and accelerate the most compute-intensive part of a sequence alignment software algorithm.
Article
Full-text available
Sequence alignment is a fundamental operation for homology search in bioinformatics. For two DNA or protein sequences of length m and n, full-matrix (FM), dynamic programming alignment algorithms such as Needleman-Wunsch and Smith-Waterman take O(m × n) time and use a possibly prohibitive O(m × n) space. Hirschberg's algorithm reduces the space requirements to O(min(m, n)), but requires approximately twice the number of operations required by the FM algorithms. The Fast Linear-Space Alignment (FastLSA) algorithm adapts to the amount of space available by trading space for operations. FastLSA can effectively adapt to use either linear or quadratic space, depending on the specific machine. Our experiments show that, in practice, due to memory caching effects, FastLSA is always as fast or faster than the Hirschberg and FM algorithms. To improve the performance of FastLSA further, we have parallelized it using a simple but effective form of wavefront parallelism. Our experimental results show that Parallel FastLSA exhibits good speedups, almost linear for eight processors or less, and also that the efficiency of Parallel FastLSA increases with the size of the sequences that are aligned. Consequently, parallel and sequential FastLSA can be flexibly and effectively used with high performance in situations where space and the number of parallel processors can vary greatly.
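The space argument in the abstract above can be made concrete with a small sketch. The following is a minimal, illustrative Python version of the Smith-Waterman score computation that keeps only two rows of the dynamic programming matrix, so memory is O(min(m, n)) while time stays O(m × n). The function name `sw_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, not parameters taken from FastLSA or any of the cited tools. Note that it returns only the best score; recovering the alignment itself in linear space requires a divide-and-conquer scheme such as Hirschberg's.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, keeping only two DP rows.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    # Iterate over the longer sequence so each stored row has min(m, n) + 1 cells.
    if len(b) > len(a):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            h = max(0, diag, up, left)
            curr.append(h)
            if h > best:
                best = h
        prev = curr
    return best
```

For example, `sw_score("TTACGTTT", "GGACGGG")` scores the shared substring `ACG` and returns 3 under these assumed parameters.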
Conference Paper
Full-text available
Protein sequences with unknown functionality are often compared to a set of known sequences to detect functional similarities. Efficient dynamic programming algorithms exist for this problem; however, current solutions still require significant scan times. These scan time requirements are likely to become even more severe due to the rapid growth in the size of these databases. In this paper, we present a new approach to bio-sequence database scanning using computer graphics hardware to gain high performance at low cost. To derive an efficient mapping onto this type of architecture, we have reformulated the Smith-Waterman dynamic programming algorithm in terms of computer graphics primitives. Our OpenGL implementation achieves a speedup of approximately sixteen on a high-end graphics card over available straightforward and optimized CPU Smith-Waterman implementations.
Article
Full-text available
Genomic alignments, as a means to uncover evolutionary relationships among organisms, are a fundamental tool in computational biology. There is considerable recent interest in using the Cell Broadband Engine, a heterogeneous multicore chip that provides high performance, for biological applications. However, work in genomic alignments so far has been limited to computing optimal alignment scores using quadratic space for the basic global/local alignment problem. In this paper, we present a comprehensive study of developing alignment algorithms on the Cell, exploiting its thread and data level parallelism features. First, we develop a parallel implementation on the Cell that computes optimal alignments and adopts Hirschberg's linear space technique. The former is essential, as merely computing optimal alignment scores is not useful, while the latter is needed to permit alignments of longer sequences. We then present Cell implementations of two advanced alignment techniques: spliced alignments and syntenic alignments. Spliced alignments are useful in aligning mRNA sequences with corresponding genomic sequences to uncover the gene structure. Syntenic alignments are used to discover conserved exons and other sequences between long genomic sequences from different organisms. We present experimental results for these three types of alignments on 16 Synergistic Processing Elements of the IBM QS20 dual-Cell blade system.
Conference Paper
Full-text available
CUDASW++ is a parallelization of the Smith-Waterman algorithm for CUDA graphical processing units that computes the similarity scores of a query sequence paired with each sequence in a database. The algorithm uses one of two kernel functions to compute the score between a given pair of sequences: the inter-task kernel or the intra-task kernel. We have identified the intra-task kernel as a major bottleneck in the CUDASW++ algorithm. We have developed a new intra-task kernel that is faster than the original intra-task kernel used in CUDASW++. We describe the development of our kernel as a series of incremental changes that provide insight into a number of issues that must be considered when developing any algorithm for the CUDA architecture. We analyze the performance of our kernel compared to the original and show that the use of our intra-task kernel substantially improves the overall performance of CUDASW++ on the order of three to four giga-cell updates per second on various benchmark databases.
Article
Full-text available
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith–Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith–Waterman algorithm is constrained by the memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith–Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed up on up to 4 GPUs.
Conference Paper
Full-text available
Molecular biologists frequently align DNA sequences of entire genomes to detect important matched and mismatched regions. Even though efficient dynamic programming algorithms exist for this problem, the required computing time is still very high due to the size of these sequences (usually a few million base pairs in length). Because the number of sequenced organisms is increasing rapidly, fast and accurate solutions are of highest importance to research in this area. In this paper we present an algorithm to compute the optimal and near-optimal alignments of two sequences in linear space and quadratic time. We demonstrate how this algorithm can be parallelized efficiently on a PC cluster and on a computational grid in order to reduce its runtime significantly. The grid implementation uses a hierarchical approach combining inter-cluster and intra-cluster parallelism.
Conference Paper
Full-text available
Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. This paper presents MSA-CUDA, a parallel MSA program, which parallelizes all three stages of the ClustalW processing pipeline using CUDA and achieves significant speedups compared to the sequential ClustalW for a variety of large protein sequence datasets. Our tests on a GeForce GTX 280 GPU demonstrate average speedups of 36.91 (for long protein sequences), 18.74 (for average-length protein sequences), and 11.27 (for short protein sequences) compared to the sequential ClustalW running on a Pentium 4 3.0 GHz processor. Our MSA-CUDA outperforms ClustalW-MPI running on 32 cores of a high performance workstation cluster.
Conference Paper
Full-text available
The Smith-Waterman algorithm for sequence alignment is one of the main tools of bioinformatics. It is used for sequence similarity searches and alignment of similar sequences. The high-end Graphical Processing Unit (GPU), used for processing graphics on desktop computers, delivers computational capabilities exceeding those of CPUs by an order of magnitude. Recently these capabilities became accessible for general-purpose computations thanks to the CUDA programming environment on Nvidia GPUs and the ATI Stream Computing environment on ATI GPUs. Here we present an efficient implementation of the Smith-Waterman algorithm on the Nvidia GPU. The algorithm achieves more than 3.5 times higher per-core performance than the previously published implementation of the Smith-Waterman algorithm on GPU, reaching more than 70% of theoretical hardware performance. The differences between the current and earlier approaches are described, providing an example of how to write efficient code on GPU.
Conference Paper
Full-text available
We present a novel hardware implementation of the double affine Smith-Waterman (DASW) algorithm, which uses dynamic programming to compare and align genomic sequences such as DNA and proteins. We implement DASW on a commodity graphics card, taking advantage of the general purpose programmability of the graphics processing unit to leverage its cheap parallel processing power. The results demonstrate that our system’s performance is competitive with current optimized software packages.
Article
Full-text available
Due to its high sensitivity, the Smith-Waterman algorithm is widely used for biological database searches. Unfortunately, the quadratic time complexity of this algorithm makes it highly time-consuming. The exponential growth of biological databases further deteriorates the situation. To accelerate this algorithm, many efforts have been made to develop techniques in high performance architectures, especially the recently emerging many-core architectures and their associated programming models. This paper describes the latest release of the CUDASW++ software, CUDASW++ 2.0, which makes new contributions to Smith-Waterman protein database searches using compute unified device architecture (CUDA). A parallel Smith-Waterman algorithm is proposed to further optimize the performance of CUDASW++ 1.0 based on the single instruction, multiple thread (SIMT) abstraction. For the first time, we have investigated a partitioned vectorized Smith-Waterman algorithm using CUDA based on the virtualized single instruction, multiple data (SIMD) abstraction. The optimized SIMT and the partitioned vectorized algorithms were benchmarked, and remarkably, have similar performance characteristics. CUDASW++ 2.0 achieves performance improvement over CUDASW++ 1.0 as much as 1.74 (1.72) times using the optimized SIMT algorithm and up to 1.77 (1.66) times using the partitioned vectorized algorithm, with a performance of up to 17 (30) billion cells update per second (GCUPS) on a single-GPU GeForce GTX 280 (dual-GPU GeForce GTX 295) graphics card. CUDASW++ 2.0 is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant performance improvement over CUDASW++ 1.0 using either the optimized SIMT algorithm or the partitioned vectorized algorithm for Smith-Waterman protein database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.
Article
Full-text available
The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS. CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.
Article
Full-text available
The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes. Two new graphical viewing tools provide alternative ways to analyze genome alignments. The new system is the first version of MUMmer to be released as open-source software. This allows other developers to contribute to the code base and freely redistribute the code. The MUMmer sources are available at http://www.tigr.org/software/mummer.
Article
Full-text available
Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. 
Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches.
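The GCUPS (Giga Cell Updates Per Second) metric quoted in this and several other abstracts is simple arithmetic: the number of dynamic programming cells (the product of the two sequence lengths) divided by the run time in seconds, scaled by 10^9. The sketch below shows that calculation. The helper name `gcups` and the example numbers are illustrative assumptions, not measurements from any of the cited papers.

```python
def gcups(len_a, len_b, seconds):
    """Giga cell updates per second for one len_a x len_b DP matrix."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cells = len_a * len_b  # one cell update per (i, j) pair
    return cells / seconds / 1e9


# Hypothetical run: two 1 MBP sequences compared in 100 seconds.
print(gcups(1_000_000, 1_000_000, 100.0))  # 10.0
```

Because the cell count grows with the product of the lengths, a roughly constant GCUPS across sequence sizes is what the abstracts mean when they describe an implementation as scalable.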
Conference Paper
Full-text available
Pairwise sequence alignment is a fundamental operation for homology search in bioinformatics. For two DNA or protein sequences of length m and n, full-matrix (FM), dynamic programming alignment algorithms such as Needleman-Wunsch and Smith-Waterman take O(m × n) time and use a possibly prohibitive O(m × n) space. Hirschberg's algorithm reduces the space requirements to O(min(m, n)), but requires approximately twice the number of operations required by the FM algorithms. The fast linear space alignment (FastLSA) algorithm adapts to the amount of space available by trading space for operations. FastLSA can effectively adapt to use either linear or quadratic space, depending on the amount of available memory. Our experiments show that, in practice, due to memory caching effects, FastLSA is always as fast or faster than Hirschberg and the FM algorithms. We have also parallelized FastLSA using a simple but effective form of wavefront parallelism. Our experimental results show that Parallel FastLSA exhibits good speedups
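Wavefront parallelism, mentioned in the abstract above, relies on the fact that every cell on one anti-diagonal of the dynamic programming matrix depends only on cells from the two previous anti-diagonals. The sketch below walks a Smith-Waterman matrix in anti-diagonal order to make that dependency structure visible. It is a sequential illustration only, written under assumed scoring values (match +1, mismatch -1, linear gap -2) and with a hypothetical function name. It stores the full matrix for clarity and is not the Parallel FastLSA algorithm.

```python
def sw_score_wavefront(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score computed one anti-diagonal at a time.

    All cells with the same i + j are mutually independent, so the inner
    loop is the part a parallel implementation could spread over workers.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # full matrix, for clarity only
    best = 0
    for d in range(2, m + n + 1):              # anti-diagonal index d = i + j
        i_lo = max(1, d - n)
        i_hi = min(m, d - 1)
        for i in range(i_lo, i_hi + 1):        # independent cells on this diagonal
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # from diagonal d - 2
                          H[i - 1][j] + gap,    # from diagonal d - 1
                          H[i][j - 1] + gap)    # from diagonal d - 1
            best = max(best, H[i][j])
    return best
```

The number of independent cells is small on the first and last anti-diagonals and largest in the middle, which is one reason speedups tend to improve as the sequences get longer.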
Article
Dynamic programming algorithms are often used to find the similarities of sequences as well as to deliver the actual alignment of two sequences. Two kinds of alignments are used to compare sequences: local alignments and global alignments. The local alignments attempt to locate conserved regions, while the global alignments identify overall relationship between two sequences. While dynamic programming algorithms are relatively time consuming, the space required is often the limiting factor when aligning long sequences. A linear space algorithm for computing maximal common subsequences, proposed by Hirschberg, was applied by Myers and Miller to deliver optimal alignments in linear space. The authors have improved the Myers and Miller algorithm by introducing a multiple divide and conquer technique that reduces the algorithm's running time while maintaining its linear space property. Efficient sequence alignment algorithms have been an active topic in computational biology.
Article
The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. An algorithm is presented which will solve this problem in quadratic time and in linear space.
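The divide-and-conquer idea summarized in the abstract above can be sketched briefly. The following Python code is an illustrative rendering of Hirschberg's scheme for the longest common subsequence: a forward pass over the first half of `a` and a backward pass over the second half each keep a single row of lengths, the best split point of `b` is where the two rows sum to a maximum, and the two halves are solved recursively. The function names are assumptions made for this sketch, and string slicing is used for readability even though it copies data.

```python
def lcs_lengths(a, b):
    """Last row of the LCS-length table: LCS(a, b[:j]) for every j."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev


def hirschberg_lcs(a, b):
    """One longest common subsequence of a and b using linear working space."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    n = len(b)
    left = lcs_lengths(a[:mid], b)               # prefixes of b
    right = lcs_lengths(a[mid:][::-1], b[::-1])  # suffixes of b, reversed
    # Split b at the k that maximizes LCS(a[:mid], b[:k]) + LCS(a[mid:], b[k:]).
    k = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    return hirschberg_lcs(a[:mid], b[:k]) + hirschberg_lcs(a[mid:], b[k:])
```

For instance, `hirschberg_lcs("AGGTAB", "GXTXAYB")` returns `"GTAB"`. Each recursion level repeats roughly half of the previous level's work, which is where the "approximately twice the number of operations" cost of linear-space alignment comes from.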
Article
Biological Sequence Comparison is one of the most important operations in Computational Biology since it is used to determine how similar two sequences are. Smith and Waterman proposed an exact algorithm (SW), based on dynamic programming, that is able to obtain the best local alignment between two sequences in quadratic time and space. In order to compare long biological sequences, SW is rarely used since the computation time and the amount of memory required become prohibitive. For this reason, heuristic methods like BLAST are widely used. Although faster, these heuristic methods do not guarantee that the best result will be produced. In this paper, we propose an exact parallel variant of the SW algorithm that obtains the best local alignments in quadratic time and reduced space. The speedups obtained in two clusters (8-machine and 16-machine) for DNA sequences longer than 32 KBP (kilo base-pairs) were very close to linear and, in some cases, superlinear. For very long DNA sequences (1.6 MBP), we were able to reduce execution time from 12.25 hours to 1.54 hours, in our 8-machine cluster. As far as we know, this is the first time 1.6 MBP sequences are compared with an exact SW variant. In this case, 30240 best local alignments were obtained.
Conference Paper
Pairwise sequence alignment is a basic operation in bioinformatics that is performed thousands of times on a daily basis. The exact methods proposed in the literature have quadratic time complexity. For this reason, heuristic methods such as BLAST are widely used. Nevertheless, it is known that exact methods present better sensitivity, leading to better results. To obtain exact results faster, many parallel strategies have been proposed but most of them fail to align huge biological sequences. This happens because not only must the quadratic time be considered, but the space must also be reduced. In this paper, we evaluate the performance and sensitivity of z-align, a parallel exact strategy that runs in user-restricted memory space. The results obtained in a 64-processor cluster show that two sequences of size 23MBP (Mega Base Pairs) and 24MBP, respectively, were successfully aligned with z-align. Also, in order to align two 3MBP sequences, a speedup of 34.35 was achieved. Finally, when comparing z-align with BLAST, we can see that the z-align alignments are longer and have a higher score.
Conference Paper
Cross-species chromosome alignments can reveal ancestral relationships and may be used to identify the peculiarities of the species. It is thus an important problem in Bioinformatics. So far, aligning huge sequences, such as whole chromosomes, with exact methods has been regarded as unfeasible, due to huge computing and memory requirements. However, high performance computing platforms such as GPUs are changing this scenario, making it possible to obtain the exact result for huge sequences in reasonable time. In this paper, we propose and evaluate a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations that are able to reduce significantly the amount of data processed and that enforce full parallelism most of the time. Using the GTX 285 board, our algorithm was able to produce the optimal alignment between sequences composed of 33 million base pairs (MBP) and 47 MBP in 18.5 hours.
Article
Recently, many organisms had their DNA entirely sequenced, and this reality presents the need for aligning long DNA sequences, which is a challenging task due to its high demands for computational power and memory. The algorithm proposed by Smith–Waterman (SW) is an exact method that obtains optimal local alignments in quadratic space and time. For long sequences, quadratic complexity makes the use of this algorithm impractical. In this scenario, parallel computing is a very attractive alternative. In this paper, we propose and evaluate z-align, a parallel exact strategy based on the divergence concept to locally align long biological sequences using an affine gap function. Z-align runs in limited memory space, where the amount of memory used can be defined by the user. The results collected in a cluster with 16 processors presented very good speedups for long real DNA sequences. With z-align, we were able to compare up to 3 MBP (mega base-pairs) DNA sequences. As far as we know, this is the first time 3 MBP sequences are compared with an affine gap exact variation of the SW algorithm. Also, by comparing the results obtained with z-align and the popular BLAST tool, it is clear that z-align is able to produce longer and more significant alignments.
Article
In this paper we describe a dynamic programming algorithm to compute k non-intersecting near-optimal alignments in linear space. In order to reduce its runtime significantly, we use a hierarchical grid system as the computing platform. Static and dynamic load balancing approaches are investigated in order to achieve an efficient mapping onto this type of architecture, which has characteristics such as: (1) the resources in the grid systems have different computational power; (2) the resources usually are connected by networks with widely varying performance characteristics. Finally, a new dynamic load balancing approach named the scheduler–worker parallel paradigm is proposed and evaluated.
Article
We present practical parallel algorithms using prefix computations for various problems that arise in pairwise comparison of biological sequences. We consider both constant and affine gap penalty functions, full-sequence and subsequence matching, and space-saving algorithms. Commonly used sequential algorithms solve the sequence comparison problems in O(mn) time and O(m+n) space, where m and n are the lengths of the sequences being compared. All the algorithms presented in this paper are time optimal with respect to the sequential algorithms and can use processors where n is the length of the larger sequence. While optimal parallel algorithms for many of these problems are known, we use a simple framework and demonstrate how these problems can be solved systematically using repeated parallel prefix operations. We also present a space-saving algorithm that uses space and runs in optimal time where p is the number of the processors used. We implemented the parallel space-saving algorithm and provide experimental results on an IBM SP-2 and a Pentium cluster.
Conference Paper
Biological sequence comparison is a very important operation in Bioinformatics. Even though exact methods to compare biological sequences exist, these methods are often neglected due to their quadratic time and space complexity. In order to accelerate these methods, many GPU algorithms were proposed in the literature. Nevertheless, all of them restrict the size of the smallest sequence in such a way that Megabase genome comparison is prevented. In this paper, we propose and evaluate CUDAlign, a GPU algorithm that is able to compare Megabase biological sequences with an exact Smith-Waterman affine gap variant. CUDAlign was implemented in CUDA and tested on two GPU boards, separately. For real sequences whose sizes range from 1MBP (Megabase Pairs) to 47MBP, a close to uniform GCUPS (Giga Cells Updates per Second) was obtained, showing the potential scalability of our approach. Also, CUDAlign was able to compare the human chromosome 21 and the chimpanzee chromosome 22. This operation took 21 hours on GeForce GTX 280, resulting in a peak performance of 20.375 GCUPS. As far as we know, this is the first time such huge chromosomes are compared with an exact method.
Conference Paper
We present the first space and time optimal parallel algorithm for the pairwise sequence alignment problem, a fundamental problem in computational biology. This problem can be solved sequentially in O(mn) time and O(m + n) space, where m and n are the lengths of the sequences to be aligned. The fastest known parallel space-optimal algorithm for pairwise sequence alignment takes optimal O((m + n)/p) space but suboptimal O((m + n)²/p) time, where p is the number of processors. On the other hand, the most space economical time-optimal parallel algorithm takes O(mn/p) time but O(m + n/p) space. We close this gap by presenting an algorithm that achieves both time and space optimality, i.e. requires only O((m + n)/p) space and O(mn/p) time. We also present an experimental evaluation of the proposed algorithm on an IBM xSeries cluster.
Conference Paper
An innovative reconfigurable supercomputing platform - XD1000 is being developed by XtremeData to exploit the rapid progress of FPGA technology and the high-performance of Hyper-Transport interconnection. In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the platform. The main features include: (1) we bring forward a multistage PE (processing element) design which significantly reduces the FPGA resource usage and hence allows more parallelism to be exploited; (2) our design features a pipelined control mechanism with uneven stage latencies - a key to minimize the overall PE pipeline cycle time; (3) we also present a compressed substitution matrix storage structure, resulting in substantial decrease of the on-chip SRAM usage. Finally, we implement a 384-PE systolic array running at 66.7MHz, which can achieve 25.6GCUPS peak performance. Compared with the 2.2GHz AMD Opteron host processor, the FPGA coprocessor results in speedup of 185 and 250 respectively.
Article
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the new proposals, both in theory and in practice. The goal of this paper is to give Hirschberg's idea the visibility it deserves by developing a linear-space version of Gotoh's algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
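The divide-and-conquer idea credited to Hirschberg in this abstract can be sketched as follows. This is a simplified illustration, not the published method: it uses a single linear gap penalty for global alignment rather than the affine penalties of Gotoh's algorithm that the paper handles, and the function names and scoring values are assumptions chosen for the example.

```python
def nw_last_row(a, b, match, mismatch, gap):
    """Last row of the global-alignment score matrix, kept in O(len(b)) memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev


def hirschberg(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment (linear gap cost) using linear working space."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: pair the single character with one position of b,
        # or leave it unpaired if that scores better.
        j = b.find(a)
        sub = match if j >= 0 else mismatch
        j = max(j, 0)
        if sub >= 2 * gap:
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    # Scores of aligning the top half of a with every prefix of b ...
    left = nw_last_row(a[:mid], b, match, mismatch, gap)
    # ... and the bottom half of a with every suffix of b (via reversal).
    right = nw_last_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    a1, b1 = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    a2, b2 = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return a1 + a2, b1 + b2


x, y = hirschberg("AGTACGCA", "TATGC")
print(x)
print(y)
```

Only two score rows are held at any time, so working memory grows with the length of the shorter dimension, while the recursion roughly doubles the number of cells computed compared with a single full-matrix pass.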
Article
The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M^2N steps necessary in the original algorithm. The limitations do not seriously reduce the generality of the original method, and the present method is available for most practical uses. The algorithm can be executed on a small computer with a limited capacity of core memory.
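One common way to reach roughly MN steps with affine gap penalties is to carry two extra values per cell that track alignments ending in a gap, so that no inner loop over gap lengths is needed. The sketch below applies that idea to local (Smith-Waterman style) scoring and returns only the best score. The function name and the scoring parameters are assumptions for illustration, not values taken from the paper.

```python
NEG_INF = float("-inf")


def sw_affine_score(a, b, match=1, mismatch=-3, gap_first=5, gap_ext=2):
    """Best local alignment score; a gap of length k costs gap_first + (k-1)*gap_ext."""
    n = len(b)
    h_prev = [0] * (n + 1)        # H[i-1][j]: best score ending at (i-1, j)
    f_prev = [NEG_INF] * (n + 1)  # F[i-1][j]: best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [NEG_INF] * (n + 1)
        e = NEG_INF               # E[i][j]: best score ending in a horizontal gap
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one: O(1) per cell.
            e = max(e - gap_ext, h_cur[j - 1] - gap_first)
            f_cur[j] = max(f_prev[j] - gap_ext, h_prev[j] - gap_first)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(sw_affine_score("AAAATTTT", "AAAACCTTTT", match=2, mismatch=-1, gap_first=2, gap_ext=1))
```

Each cell is computed from a constant number of neighbors, and only the previous row is retained, which matches the abstract's point about running within a limited amount of memory.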
Article
Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall, 1995.
[WF74] R.A. Wagner and M.J. Fischer. The String to String Correction Problem. Journal of the ACM, 21(1):168--173, 1974.
[WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 10(35):83--91, 1992.
[KOS+00] S. Kurtz, E. Ohlebusch, J. Stoye, C. Schleiermacher, and R. Giegerich. Computation and Visualization of Degenerate Repeats in Complete Genomes. In ...
Article
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.
Article
Scanning bio-sequence databases and finding similarities among DNA and protein sequences is basic and important work in the bioinformatics field. To solve this problem, the Needleman-Wunsch (NW) algorithm is a classical and precise tool, and the Smith-Waterman (SW) algorithm is more practical for its capability to find similarities between subsequences. Such algorithms have computational complexity proportional to the product of the lengths of the two sequences involved, hence processing time becomes prohibitive given the exponential growth and sheer size of bio-sequence databases. To alleviate this serious problem, a reconfigurable accelerator for the SW algorithm is presented. In the accelerator, a modified equation is proposed to improve the mapping efficiency of a processing element (PE), and a special floor plan is applied to a fine-grain parallel PE array and interface components to cut down their routing delay. Based on the two techniques, the proposed accelerator can reach a frequency of 82 MHz in an Altera EP1S30 device. Experiments demonstrate that the accelerator provides more than a 330-fold speedup compared to a standard desktop platform with a 2.8-GHz Xeon processor and 4 GB of memory, and a 50% improvement in peak performance over a transferred traditional implementation without the two special techniques. Our implementation is also about 9% faster than the fastest implementation in a most recent family of SW algorithm accelerators.
Article
We present the first space and time optimal parallel algorithm for the pairwise sequence alignment problem, a fundamental problem in computational biology. This problem can be solved sequentially in O(mn) time and O(m+n) space, where m and n are the lengths of the sequences to be aligned. The fastest known parallel space-optimal algorithm for pairwise sequence alignment takes optimal O((m+n)/p) space, but suboptimal O((m+n)^2/p) time, where p is the number of processors. On the other hand, the most space economical time-optimal parallel algorithm takes O(mn/p) time, but O(m + n/p) space. We close this gap by presenting an algorithm that achieves both time and space optimality, i.e. requires only O((m+n)/p) space and O(mn/p) time. We also present an experimental evaluation of the proposed algorithm on an IBM xSeries cluster. Although presented in the context of full sequence alignments, our algorithm is applicable to other alignment problems in computational biology including local alignments and syntenic alignments. It is also a useful addition to the range of techniques available for parallel dynamic programming.
Miller, "Optimal Alignments in Linear Space,"
  • E W Myers
Sun and X. Jiang, "A Reconfigurable Accelerator for Smith-Waterman Algorithm,"
  • X Liu
  • L Xu
  • P Zhang
A. de Melo, "CUDAlign: Using GPU to Accelerate the Comparison of Megabase Genomic Sequences,"
  • E F De
  • O Sandes
Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space
  • de