Using GPUs to Speed-Up Levenshtein Edit
Distance Computation
Khaled Balhaf, Mohammed A. Shehab, Wala’a T. Al-Sarayrah,
Mahmoud Al-Ayyoub, Mohammed Al-Saleh and Yaser Jararweh
Jordan University of Science and Technology, Irbid, Jordan
Emails: {khbalhaf14, mashehab12, wtalsarayrah14}@cit.just.edu.jo, {maalshbool, misaleh, yijararweh}@just.edu.jo
Abstract—Sequence comparison problems such as sequence alignment and approximate string matching are among the fundamental problems in many fields such as natural language processing, data mining and bioinformatics. However, the algorithms proposed to address these problems suffer from high computational complexities, prohibiting them from being widely used in practical large-scale settings. Many researchers have used
parallel programming to reduce the execution time of these
algorithms. In this paper, we follow this approach and use
the parallelism capabilities of the Graphics Processing Unit
(GPU) to accelerate one of the most common algorithms to
compute the edit distance between two strings, which is known
as the Levenshtein distance. To take full advantage of the large
number of cores in a GPU, we employ a diagonal-based tracing
technique which results in even greater improvements in terms
of the running time. In fact, our CUDA implementation of the
Levenshtein algorithm is about 11X faster than the sequential
implementation. This is achieved without affecting the accuracy.
I. INTRODUCTION
The big revolution in this information age has given rise
to many interdisciplinary fields. One major example is the
field of bioinformatics which benefits from the computational
methods of computer science to address the complex problems
in biology, especially the ones related to genomics and pro-
teomics. Coupled with the rapid improvement in the hardware
and experimental instruments in the past few decades, this field
witnessed jumps in its development and maturity. One of the
feats of this field is the success of the Human Genome Project (HGP),1 which aims at identifying the structure and contents of the human DNA from both a physical and functional standpoint [1].
The HGP as well as many other problems in bioinformatics
are basically dealing with very long sequences and strings (of
DNA nucleotides, amino acids, etc.). One of the objectives
of these problems is to perform different kinds of sequence
comparisons such as sequence alignment and approximate
string matching. To address these problems, bioinformatics researchers have benefited from the rich literature on sequence comparison algorithms in computer science, covering tasks such as fast name searching and string similarity [2]. For example,
one measure of computing string similarity is the Levenshtein
distance which has been extensively studied in the field of
Natural Language Processing (NLP) for spelling correction
1https://en.wikipedia.org/wiki/Human_Genome_Project
[3]. The same measure can be used to get an indication of
the functional similarity between different DNA fragments
depending on their content similarity [4].
The standard algorithm for computing the Levenshtein distance is a very neat one. However, it suffers from high computational complexity, prohibiting it from being widely used in certain large-scale settings such as that of DNA
fragments which are millions of nucleotides (characters) in
length. This is where the computational methods and the
advanced optimization techniques in computer science can
come in handy [5], [6].
Many researchers proposed to use parallel programming
to reduce the execution time of similar algorithms. One
example is the Smith-Waterman (SW) algorithm for the local
alignment problem [7], [8]. Parallel programming is a useful
technique in High Performance Computing (HPC) and it helps
to reduce the execution time of many algorithms [9]. One of
the hardware supporting highly parallel execution of programs
is the Graphics Processing Unit (GPU) [10], [11]. Due to its
large number of cores, the GPU supports a number of threads
far beyond those supported by the Central Processing Unit
(CPU). It can run more than 512 threads at the same time, while the
CPU can run at most 8 threads at the same time [12]. Created
by NVIDIA, Compute Unified Device Architecture (CUDA)
is a programming framework or a toolkit that allows C/C++
developers to have a parallel implementation that utilizes the
GPU capabilities for general purpose processing [13].
In this paper, we follow this approach and use the paral-
lelism capabilities of the GPU to accelerate the computation
of the Levenshtein distance between two strings. To take full
advantage of the large number of cores in a GPU, we employ a
diagonal-based tracing technique which results in even greater
improvements in terms of the running time. To the best of our
knowledge, few previous papers have presented any attempts
to accelerate the computation of the Levenshtein distance
using a GPU. The closest to ours is the work of Siriwardena and Ranasinghe [14], in which the authors implemented the Needleman-Wunsch (NW) algorithm on CUDA. Using the slow-to-access GPU global memory, they achieved a 2X speed-up over the fill step on the CPU. They then used faster memory access and achieved up to a 4.2X speed-up compared to the CPU implementation.
The rest of this paper is structured as follows. The next section reviews related work on accelerating string matching algorithms. The third section illustrates our methodology, after which we present and discuss the results. Finally, we conclude the paper and suggest future work.
2016 7th International Conference on Information and Communication Systems (ICICS)
978-1-4673-8613-5/16/$31.00 ©2016 IEEE
II. RELATED WORK
The Levenshtein edit distance is one of the most widely used methods to calculate the similarity/distance between two strings, A and B, of lengths n and m, respectively [15]. It is defined as the minimum number of single-character edit operations (insertions, deletions or substitutions) needed to convert A into B. For example, converting “kitten” into “sitting” requires three such operations. The basic dynamic programming algorithm to compute the Levenshtein edit distance runs in O(mn) time, which is inefficient considering the large-scale settings in which it is commonly used. It should be noted that several extensions of the Levenshtein edit distance exist, such as the NW algorithm, which deals with the affine gap penalty issue, and the Smith-Waterman (SW) algorithm, which deals with the local alignment problem [16], [17].
There have been several attempts to improve these algorithms. One approach is to employ bit-parallelism for fast approximate string matching, such as the work of Myers [18], who achieved an optimal algorithm with O(nm/w) complexity, where w is the machine word size. Other parallel approaches are discussed in the following paragraphs.
In [19], Xu et al. accelerated a string matching algorithm by porting the extension by Prasad et al. [20] of the BPR algorithm [21] for multiple patterns to a CUDA implementation. This algorithm detects similar patterns between two strings and is also used for DNA string matching in bioinformatics. They compared the sequential version with the parallel implementation and obtained the same outputs with faster execution times. For the experiments, they used a machine with an Intel Core i3 CPU, 2GB of RAM and a GeForce 310M GPU card with 512MB of GPU memory, running 32-bit Windows 7 with Microsoft Visual Studio 2008. The speed-up gained in this work is about 28X over the sequential version.
Parallel programming using CUDA is very efficient in accelerating string matching algorithms, as shown in many recent studies [9]. One example of work that exploited parallel programming to speed up string alignment is that of Ligowski and Rudnicki [7]. Specifically, the authors showed significant speed-ups of the SW algorithm using the parallel capabilities of the GPU. Compared with an earlier GPU implementation of the SW algorithm, their implementation achieved 3.5 times higher performance per core. They also implemented the SW algorithm on the Sony PlayStation 3 (PS3). Both the PS3 and CPU implementations use one byte for representing integers, whereas the GPU uses a full integer representation. As a result, the PS3 and CPU implementations are better when non-similar sequences need to be found in large databases, while the GPU implementation is better for similar sequences [7].
In [14], the authors implemented the NW algorithm using CUDA to accelerate the computations. Using the slow-to-access GPU global memory, they achieved a 2X speed-up over the fill step on the CPU. They then used faster memory access and achieved up to a 4.2X speed-up compared to the CPU implementation.
For the SW algorithm, the authors of [22] investigated two
parallel techniques in order to speed-up the computations on
a GPU. These techniques are: wave-front and streaming. The
wave-front technique suffers from a particular limitation be-
cause it cannot deal with long sequences due to GPU physical
memory limit. In the streaming technique, long sequences can be processed, but with an overhead caused by the data transmission between the CPU and the GPU. The authors implemented a new algorithm (a tile-based parallel Viterbi algorithm) that improves the GPU performance: long sequences are divided into small pieces, and every pair of pieces can be handled within the GPU memory. Based on the experiments, their algorithm outperforms both the wave-front and the streaming techniques.
In [23], the authors implemented the ClustalW tool on a GPU in order to reduce the run time of the computations. They calculated the number of matches between two sequences by using
a new recurrence relation which gives the required results
ten times faster. Moreover, their design is scalable since it
partitions pairs of sequence comparisons to several GPUs
located on the same PC or connected via a network.
III. METHODOLOGY
In this section, we present our main approach to accelerating the Levenshtein edit distance computation. The improvement consists of two main steps: first, reducing the dependence between matrix elements; second, running the resulting implementation in parallel using CUDA.
A. Reduce data dependency
This section illustrates how to reduce the dependency between matrix cells in the Levenshtein distance algorithm. As shown in Algorithm 1, the sequential implementation of the Levenshtein distance starts by setting the first row and first column of the matrix with initial values from the interval [1, N], where N is the length of the strings under consideration. After that, the algorithm fills up all matrix cells to compute the distance between the two DNA sequences.
The value of each cell [i, j] in the matrix H is computed based on the values of three adjacent cells (the upper, left and upper-left cells) as follows:

H[i, j] = min( H[i−1, j−1] + Score,  H[i, j−1] + 1,  H[i−1, j] + 1 )    (1)

where the Score value in our implementation is zero if the character of the first sequence matches the character of the second sequence; otherwise, the Score value is one. This way of computation introduces dependencies such as the ones shown in Figure 1.
Based on the dependency problem mentioned in the previ-
ous paragraph, the algorithm cannot calculate the value of a
Algorithm 1 Sequential Implementation
1: procedure LEVENSHTEIN(Str1, Str2, N)
2: Initialize first row and first column from 1 to N
3: for <Row = 1 to N> do
4: for <Column = 1 to N> do
5: if Str1[Row − 1] == Str2[Column − 1] then
6: Score = 0
7: else
8: Score = 1
9: end if
10: Calculate distance H[Row, Column] as in Equation 1
11: end for
12: end for
13: end procedure
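As an illustration, Algorithm 1 translates into the following self-contained C function. This is our own sketch, not the authors' code: the function name is ours, and the matrix H is stored as a flattened 1D array, in line with the 2D-to-1D conversion described later in the paper.

```c
#include <stdlib.h>
#include <string.h>

/* Sequential Levenshtein distance following Algorithm 1: fill an
 * (n+1) x (m+1) matrix H row by row using the recurrence of Equation 1. */
int levenshtein(const char *str1, const char *str2) {
    int n = (int)strlen(str1), m = (int)strlen(str2);
    int *H = malloc((size_t)(n + 1) * (m + 1) * sizeof(int));
    /* Initialize the first column and first row. */
    for (int i = 0; i <= n; i++) H[i * (m + 1)] = i;
    for (int j = 0; j <= m; j++) H[j] = j;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int score = (str1[i - 1] == str2[j - 1]) ? 0 : 1;
            int diag = H[(i - 1) * (m + 1) + (j - 1)] + score;
            int left = H[i * (m + 1) + (j - 1)] + 1;
            int up   = H[(i - 1) * (m + 1) + j] + 1;
            int best = diag < left ? diag : left;
            H[i * (m + 1) + j] = best < up ? best : up;
        }
    }
    int d = H[n * (m + 1) + m];
    free(H);
    return d;
}
```

For example, levenshtein("kitten", "sitting") returns 3.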
Fig. 1. Data Dependency Problem [14]
certain cell before computing the values of all cells above and
to the left of it. This limits the ability to exploit parallelism
to speed-up the computation of the edit distance.
As stated by Siriwardena and Ranasinghe [14], one way to circumvent this problem is to compute the cell values in a “diagonal way.” As shown in Figure 2, if we compute the
values of cells in a diagonal way, the dependencies between
cells can be decreased. For example, after calculating cell
with index [0, 0], the values of the cells with indexes [1, 0]
and [0, 1] can be computed in parallel without any problem.
This method gives us a chance to compute more than one
cell in each iteration in parallel. Figure 2 shows the diagonal
technique and the number of cells that will run in parallel in
each iteration. Algorithm 2 shows all steps for this diagonal
approach.
B. CUDA implementation
This section presents the parallel implementation in CUDA code. Parallel programming has two main implementation styles. The first is pure parallel code, where all parts of the code run on the GPU side. The second is hybrid parallel code, where some parts of the code (i.e., some functions) run on the GPU side and the rest run on the CPU side.
Our proposed implementation is a pure parallel one because
the Levenshtein distance algorithm has only one function
which calculates the distance between the two strings. Nev-
ertheless, we still need to perform some preliminary steps on
Algorithm 2 Diagonal Implementation
1: procedure LEVENSHTEIN(Str1, Str2, N)
2: Initialize first row and first column from 1 to N
3: for <slice = 0 to N*2 − 1> do
4: if slice < N then
5: Z = 0
6: else
7: Z = slice − N + 1
8: end if
9: for <j = Z to slice − Z> do
10: Row = j
11: Column = slice − j
12: if Str1[Row − 1] == Str2[Column − 1] then
13: Score = 0
14: else
15: Score = 1
16: end if
17: Calculate distance H[Row, Column] as in Equation 1
18: end for
19: end for
20: end procedure
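For clarity, the diagonal schedule can be sketched sequentially in C. This is our own illustrative sketch, assuming two strings of equal length N as in our experiments; on the GPU, the independent iterations of the inner loop are exactly what the threads execute in parallel, since all cells on one anti-diagonal depend only on the two previous anti-diagonals.

```c
#include <stdlib.h>
#include <string.h>

/* Fill H by anti-diagonals ("slices"). The inner loop carries no
 * dependency: every cell on a slice reads only cells from earlier
 * slices, so these iterations may run concurrently. */
int levenshtein_diagonal(const char *str1, const char *str2) {
    int n = (int)strlen(str1);      /* assumes strlen(str2) == n */
    int w = n + 1;                  /* matrix width */
    int *H = malloc((size_t)w * w * sizeof(int));
    for (int i = 0; i < w; i++) { H[i * w] = i; H[i] = i; }
    for (int slice = 2; slice <= 2 * n; slice++) {
        /* Interior cells on this anti-diagonal: row + col == slice. */
        int lo = (slice - n > 1) ? slice - n : 1;
        int hi = (slice - 1 < n) ? slice - 1 : n;
        for (int row = lo; row <= hi; row++) {
            int col = slice - row;
            int score = (str1[row - 1] == str2[col - 1]) ? 0 : 1;
            int diag = H[(row - 1) * w + (col - 1)] + score;
            int left = H[row * w + (col - 1)] + 1;
            int up   = H[(row - 1) * w + col] + 1;
            int best = diag < left ? diag : left;
            H[row * w + col] = best < up ? best : up;
        }
    }
    int d = H[n * w + n];
    free(H);
    return d;
}
```

The result is identical to the sequential row-by-row fill; only the order of evaluation changes.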
Fig. 2. Diagonal Technique in matrix
the CPU side before starting the computation on the GPU
side. Specifically, the CPU side controls the dimensions of the array, which correspond to the number of elements that will be run in parallel; the GPU side then calculates each data element in parallel. This makes the CPU the controller because of its capability to change the direction of instructions.2 The number of threads per block we use is 256. As shown in Table I, this number of threads leads to the best utilization of the GPU.
2https://en.wikipedia.org/wiki/Central_processing_unit
TABLE I
GPU UTILIZATION (%) FOR DIFFERENT THREADS-PER-BLOCK VALUES AND COMPUTE CAPABILITIES
Threads per block \ Compute capability: 1.0 1.1 1.2 1.3 2.0 2.1 3.0
64 67 67 50 50 33 33 50
96 100 100 75 75 50 50 75
128 100 100 100 100 67 67 100
192 100 100 94 94 100 100 94
256 100 100 100 100 100 100 100
384 100 100 75 75 100 100 94
512 67 67 100 100 100 100 100
768 N/A N/A N/A N/A 100 100 75
1024 N/A N/A N/A N/A 67 67 100
Furthermore, we use one-direction data transfer as an optimization technique to reduce the effect of bus delay. The CPU just calculates the number of elements that will run in parallel in the current iteration and sets the size of the memory blocks to be appropriate for the GPU session. The block size (bs) is computed from the slice ID (sID) using the following equation:

bs = sID − 2 × Z + 1

where Z is calculated as follows:

Z = 0 if sID < N, and Z = sID − N + 1 otherwise.    (2)
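The block-size equation can be checked with a small helper (our own sketch; the function name `block_size` is hypothetical). Summing bs over all 2N − 1 slices of an N × N matrix yields exactly N², confirming that every cell is scheduled on exactly one slice.

```c
/* Number of matrix cells on anti-diagonal sID of an N x N matrix,
 * per the block-size equation bs = sID - 2*Z + 1. */
int block_size(int sID, int N) {
    int Z = (sID < N) ? 0 : sID - N + 1;   /* Equation (2) */
    return sID - 2 * Z + 1;
}
```

For N = 4 the slice sizes are 1, 2, 3, 4, 3, 2, 1, which sum to 16 = N².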
One of the limitations we faced in our experimental setting is the lack of support for large-scale 2D arrays. So, as a preprocessing step of our parallel diagonal implementation (shown in Algorithm 3), we convert the 2D array into a 1D array, filling the 1D array in row-major order. We use the following equation to do this conversion:

index = row × width + column

On the GPU side, we use the following two equations to convert an index back from 1D to 2D in order to obtain the correct location in the matrix, and then access the same index as in the sequential version, as shown in the pseudo-code of the CUDA kernel in Algorithm 4:

row = ⌊index / width⌋    (3)

column = index % width    (4)
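These conversions are the usual row-major flattening; a minimal sketch (the helper names are ours, not the paper's):

```c
/* 2D -> 1D flattening. */
int to_1d(int row, int column, int width) { return row * width + column; }

/* Equations (3) and (4): recover the 2D location from a 1D index.
 * C integer division truncates, which matches the floor in Equation (3)
 * for non-negative operands. */
int row_of(int index, int width)    { return index / width; }
int column_of(int index, int width) { return index % width; }
```

The two directions are inverses of each other, so flattening an index and converting it back always recovers the original (row, column) pair.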
IV. RESULTS AND DISCUSSION
This section displays the results for our new implementa-
tion. The main objective is to compute the amount of speed-
up for scalable data sizes starting from strings of length 250
characters and growing up to strings of length 8000 characters.
For each data size, 30 different tests are conducted and the averages are reported. Finally, the improvement (speed-up) is calculated by dividing the CPU time by the GPU time:

improvement = timeCPU / timeGPU
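As a quick sanity check of this metric (our own illustrative helper, not part of the paper's code):

```c
/* Speed-up of the GPU implementation: CPU time divided by GPU time. */
double improvement(double time_cpu, double time_gpu) {
    return time_cpu / time_gpu;
}
```

For the 8000-character case, improvement(2.018066667, 0.181333333) evaluates to roughly 11.13, matching the last row of Table II.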
The experimental setup we use is as follows.
Algorithm 3 Parallel Diagonal Implementation
1: procedure LEVENSHTEIN(Str1, Str2, N)
2: Initialize first row and first column from 1 to N
3: for <slice = 0 to N*2 − 1> do
4: if slice < N then
5: Z = 0
6: else
7: Z = slice − N + 1
8: end if
9: Size = slice − 2 × Z + 1
10: CUDA KERNEL<<<Size, 256>>>
11: end for
12: end procedure
Algorithm 4 CUDA KERNEL for Diagonal Implementation
1: procedure CUDA KERNEL(Str1, Str2, N, Z, slice, Increment)
2: Calculate thread ID
3: if Z <= 0 then
4: StartIndex = slice
5: else
6: StartIndex = Increment × Z + slice
7: end if
8: j = StartIndex + (ID × Increment)
9: Calculate Row using Equation 3
10: Calculate Column using Equation 4
11: Index = Row × Width + Column
12: H[Index] = distance calculated as in Equation 1
13: end procedure
Hardware: NVIDIA GT 740M GPU with 2GB of memory and 14.40 GB/s memory bandwidth, and an Intel Core i7 CPU with 6GB of RAM.
Software: Windows 10 operating system, Microsoft Visual Studio 2013, CUDA Toolkit v7.5 and NVIDIA drivers.
As shown in Table II, the large size of the array and the length of the two sequences do not noticeably affect the GPU performance. On the other hand, the CPU performance degrades for bigger arrays and longer sequences. Due to the parallelism of the GPU, the performance is increased by about 11X compared to the CPU.
TABLE II
THE PERFORMANCE OF THE CPU AND GPU IMPLEMENTATIONS FOR DIFFERENT INPUT SIZES
Data Size CPU(sec) GPU(sec) Improvement
250 0.000733333 0.000566667 1.294117647
500 0.004366667 0.001766667 2.471698113
1000 0.023933333 0.004933333 4.851351351
2000 0.0878 0.0155 5.664516129
3000 0.242166667 0.023933333 10.1183844
4000 0.4593 0.0459 10.00653595
5000 0.722333333 0.0658 10.97771023
6000 1.052433333 0.091833333 11.46025408
7000 1.477566667 0.129366667 11.42154084
8000 2.018066667 0.181333333 11.12904412
V. CONCLUSION
The Levenshtein distance algorithm computes the edit distance between two strings. It is widely used in NLP for spell checking and other tasks. Moreover, it is very useful in bioinformatics, as it is used for pattern matching between two DNA or protein sequences. However, the high computational cost of this algorithm prohibits it from being very useful in practical large-scale scenarios. In this work, we decreased the execution time for two long sequences using two main techniques. First, we traverse the matrix diagonally, which mitigates the dependency problem between matrix cells. Second, we parallelized the computation, improving the performance of the Levenshtein algorithm to about 11X faster than the sequential implementation. This improvement is achieved through several techniques in the parallel implementation, such as converting the array from 2D to 1D and using one-direction data transfer. As for memory management, we avoided using the shared and global memory of the GPU, because they are slower than registers.
REFERENCES
[1] J. Serra, E. Gómez, P. Herrera, and X. Serra, “Chroma binary similarity and local alignment applied to cover song identification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 6, pp. 1138–1151, 2008.
[2] V. Pais and D. Ciobanu, “Fast name searching on gpu,” p. 217, 2013.
[3] K. Kukich, “Techniques for automatically correcting words in text,”
ACM Computing Surveys, vol. 24, p. 64, 1992.
[4] S. Soroushnia, M. Daneshtalab, T. Pahikkala, and J. Plosila, “Parallel
implementation of fuzzified pattern matching algorithm on gpu,” pp.
341–344, 2015.
[5] V. Saikrishna, A. Rasool, and N. Khare, “String matching and its applications in diversified fields,” International Journal of Computer Science Issues, vol. 9, no. 1, pp. 219–226, 2012.
[6] D. Cantone, S. Cristofaro, and S. Faro, “Efficient string-matching
allowing for non-overlapping inversions,” Theoretical Computer Science,
vol. 483, pp. 85–95, 2013.
[7] Ł. Ligowski and W. Rudnicki, “An efficient implementation of smith
waterman algorithm on gpu using cuda, for massively parallel scanning
of sequence databases,” in Parallel & Distributed Processing, 2009.
IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–8.
[8] Y. Liu, W. Huang, J. Johnson, and S. Vaidya, “Gpu accelerated smith-
waterman,” pp. 188–195, 2006.
[9] S. Cook, CUDA programming: a developer’s guide to parallel comput-
ing with GPUs. Newnes, 2012.
[10] Y. Jararweh, S. Hariri, and T. Moukabary, “Simulating of cardiac electrical activity with autonomic run time adjustments,” AHSC Frontiers in Biomedical Research, 2009.
[11] M. Al-Ayyoub, A. M. Abu-Dalo, Y. Jararweh, M. Jarrah, and M. Al Sa’d,
“A gpu-based implementations of the fuzzy c-means algorithms for
medical image segmentation,” The Journal of Supercomputing, vol. 71,
no. 8, pp. 3149–3162, 2015.
[12] M. A. Shehab, M. Al-Ayyoub, and Y. Jararweh, “Improving fcm and
t2fcm algorithms performance using gpus for medical images segmen-
tation,” in Information and Communication Systems (ICICS), 2015 6th
International Conference on. IEEE, 2015, pp. 130–135.
[13] C. El Amrani, “A learning approach to introducing gpu computing in
undergraduate engineering program,” International Journal of Computer
Applications, vol. 107, no. 20, 2014.
[14] T. Siriwardena and D. Ranasinghe, “Accelerating global sequence align-
ment using cuda compatible multi-core gpu,” in Information and Au-
tomation for Sustainability (ICIAFs), 2010 5th International Conference
on. IEEE, 2010, pp. 201–206.
[15] V. I. Levenshtein, “Binary codes capable of correcting deletions, inser-
tions, and reversals,” in Soviet physics doklady, vol. 10, no. 8, 1966, pp.
707–710.
[16] L. Chao and W. Fan, “Pair-wise sequence alignment algorithm in
bioinformatics,” in Electrical & Electronics Engineering (EEESYM),
2012 IEEE Symposium on. IEEE, 2012, pp. 36–38.
[17] M. Refat, M. Shehab, Y. Jararweh, and M. Al-Ayyoub, “Accelerating
needleman-wunsch global alignment algorithm with gpus,” in The 12th
ACS/IEEE International Conference on Computer Systems and Applica-
tions (AICCSA 2015). IEEE, 2015.
[18] G. Myers, “A fast bit-vector algorithm for approximate string matching
based on dynamic programming,” Journal of the ACM (JACM), vol. 46,
no. 3, pp. 395–415, 1999.
[19] K. Xu, W. Cui, Y. Hu, and L. Guo, “Bit-parallel multiple approximate
string matching based on gpu,” Procedia Computer Science, vol. 17, pp.
523–529, 2013.
[20] R. Prasad, A. K. Sharma, A. Singh, S. Agarwal, and S. Misra, “Efficient
bit-parallel multi-patterns approximate string matching algorithms,” Sci-
entific Research and Essays, vol. 6, no. 4, pp. 876–881, 2011.
[21] S. Wu and U. Manber, “Fast text searching: allowing errors,” Communications of the ACM, vol. 35, no. 10, pp. 83–91, 1992.
[22] Z. Du, Z. Yin, and D. A. Bader, “A tile-based parallel viterbi algorithm
for biological sequence alignment on gpu with cuda,” in Parallel &
Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010
IEEE International Symposium on. IEEE, 2010, pp. 1–8.
[23] W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig, “GPU-ClustalW: using graphics hardware to accelerate multiple sequence alignment,” in High Performance Computing - HiPC 2006. Springer, 2006, pp. 363–374.
2016 7th International Conference on Information and Communication Systems (ICICS)
... As can be seen in Equations 1 and 3 and clarified in Figure 2, to calculate an element in the matrix, the elements before it in the left to right diagonal and the elements in the row above it and the column to the left of it should be calculated first. To solve this dependency problem, we calculate the matrix elements in the right to left diagonal order as shown in Figure 3 as was done in [13], [14]. By doing this, all the element in the right to left diagonal can be calculated in parallel without any data dependencies. ...
... The other preprocessing step that is done on the data (the edit distance matrix) is to convert it from a two dimensional structure (2D) to a one dimensional structure (1D). This is done to make the data transfer from the CPU side to the GPU side, in the case of the parallel implementation without unified memory, simpler and easier [14]. Equation 4 is used to convert the data form 2D to 1D and Equations 5 and 6 are used to convert the matrix from 1D to 2D [14]. ...
... This is done to make the data transfer from the CPU side to the GPU side, in the case of the parallel implementation without unified memory, simpler and easier [14]. Equation 4 is used to convert the data form 2D to 1D and Equations 5 and 6 are used to convert the matrix from 1D to 2D [14]. ...
Conference Paper
String matching problems such as sequence alignment is one of the fundamental problems in many computer since fields such as natural language processing (NLP) and bioinformatics. Many algorithms have been proposed in the literature to address this problem. Some of these algorithms compute the edit distance between the two strings to perform the matching. However, these algorithms usually require long execution time. Many researches use high performance computing to reduce the execution time of many string matching algorithms. In this paper, we use the CUDA based Graphics Processing Unit (GPU) and the newly introduced Unified Memory(UM) to speed up the most common algorithms to compute the edit distance between two string. These algorithms are the Levenshtein and Damerau distance algorithms. Our results show that using GPU to implement the Levenshtein and Damerau distance algorithms improvements their execution times of about 11X and 12X respectively when compared to the sequential implementation. And an improvement of about 61X and 71X respectively can be achieved when GPU is used with unified memory.
... In recent papers, the authors of [2,3] presented their effort to improve the performance of global sequence alignment algorithms. Instead of focusing on the NW algorithms, the authors of [2,3] focused on the Levenshtein and Damerau Edit Distance Algorithms. ...
... In recent papers, the authors of [2,3] presented their effort to improve the performance of global sequence alignment algorithms. Instead of focusing on the NW algorithms, the authors of [2,3] focused on the Levenshtein and Damerau Edit Distance Algorithms. The improvement gains achieved were 11X and 12X, respectively. ...
Article
Full-text available
The Needleman-Wunsch (NW) is a dynamic programming algorithm used in the pairwise global alignment of two biological sequences. In this paper, three sets of parallel implementations of the NW algorithm are presented using a mixture of specialized software and hardware solutions: POSIX Threads-based, SIMD Extensions-based and a GPU-based implementations. The three implementations aim at improving the performance of the NW algorithm on large scale input without affecting its accuracy. Our experiments show that the GPU-based implementation is the best implementation as it achieves performance 72.5X faster than the sequential implementation, whereas the best performance achieved by the POSIX threads and the SIMD techniques are 2X and 18.2X faster than the sequential implementation, respectively.
... Balhaf et al. [14] tried to enhance the performance of the Levenshtein edit distance calculations. They first reduced the dependency between data using diagonal traversing. ...
Conference Paper
Full-text available
Bioinformatics is a growing field that attracts many researchers and continues to prove its value and significance. Since the early days of discovering genomic martial and using it to identify new life forms, sequence alignment applications have become important in enabling discoveries of important biological or medical benefits. Finding similarities, or even relations between sequences, is a demanding process that requires time and high cost. However, nowadays there are plenty of algorithms that are used to find similarity and/or differences between sequences. Many of these algorithms still suffer from performance issues, such as slow performance and poor scalability. Therefore, parallelization is widely used to address these issues. In this paper, we utilize a multi-threading parallelism technique coupled with a block alignment idea in order to improve the sequence alignment performance. The experiments show that the proposed implementation outperforms the sequential implementation by 4.9 times for sequences of lengths ranging between 1024 and 8192.
... A GPU allows massive parallelism as it can handle thousands of threads running concurrently. Moreover, GPUs are now affordable and can easily fit into laptops, which makes them appealing to speedup the performance of algorithms that might be used by individuals [3]- [6]. This fits with the problem at hand. ...
Conference Paper
Social Network Analysis (SNA) has been gaining a lot of attention lately. One of the common steps in SNA is community detection. SNA literature has many interesting algorithms for community detection. One of the popular ones was proposed by Newman and it is mainly revolved around using a clustering algorithm. Three phases are iteratively applied in this algorithm in order to find the " best " community structure. These phases are: spectral mapping, clustering and modularity computation. Despite its effectiveness, this method suffers greatly in terms of running time when dealing with large-scale networks. A parallel implementation using GPUs is one of the feasible solutions to address this problem. Moreover, due to the iterative nature of this algorithm, dynamic parallelism lends itself as a very appealing solution. Dynamic parallelism is a novel parallel programming technique that refers to the ability to launch new grids from the GPU. In this work, we present three implementation of the clustering-based community detection algorithm. In addition to the sequential implementation, we present two implementations: a Hybrid CPU-GPU (HCG) one and a Dynamic Parallel (DP) one. We test our parallel implementations on benchmark datasets to show the speed-up of each parallel implementation compared with the sequential one. The results show that the DP implementation achieves good speed-ups reaching up to 4.45X; however, the speed-ups achieved by HCG are almost twice as much.
Article
Pattern Recognition tasks in the structural domain generally exhibit high accuracy results, but their time efficiency is quite low. Furthermore, this low performance is more pronounced when dealing with instance-based classifiers, since, for each query, the entire corpus must be evaluated to find the closest prototype. In this work we address this efficiency issue for the Nearest Neighbor classifier when data are encoded as two-dimensional code sequences, and more precisely strings and sequences of vectors. For this, a set of bounds is proposed in the distance metric that avoid the calculation of unnecessary distances. Results obtained prove the effectiveness of the proposal as it reduces the classification time in percentages between 80% and 90% for string representations and between 60% and 80% for data codified as sequences of vectors with respect to their corresponding non-optimized version of the classifier.
Conference Paper
Full-text available
Image segmentation gained popularity recently due to numerous applications in many fields such as computer vision, medical imaging. From its name, segmentation is interested in partitioning the image into separate regions where one of them is of special interest. Such region is called the Region of Interest (RoI) and it is very important for many medical imaging problems. Clustering is one of the segmentation approaches typically used on medical images despite its long running time. In this work, we propose to leverage the power of the Graphics Processing Unit (GPU)to improve the performance of such approaches. Specifically, we focus on the Fuzzy C-Means (FCM) algorithm and its more recent variation, the Type-2 Fuzzy C-Means (T2FCM) algorithm. We propose a hybrid CPU-GPU implementation to speed up the execution time without affecting the algorithm's accuracy. The experiments show that such an approach reduces the execution time by up to 80% for FCM and 74% for T2FCM.
Article
The multi-pattern approximate string matching (MASM) problem is to find all occurrences of a set of patterns P0, P1, ..., Pr-1, r ≥ 1, in a given text T[0…n-1], allowing a limited number of errors in the matches. This problem has many applications in computational biology, e.g., finding DNA subsequences after possible mutations and locating the positions of diseases in a genome. The MASM problem was previously solved by Baeza-Yates and Navarro by extending the bit-parallel automaton (BPA) of approximate matching and using the concept of classes of characters. The drawbacks of this approach are: (a) it requires verification of the potential matches, and (b) it can only handle patterns of length less than or equal to the word length (w) of the computer used. In this paper, we propose two new bit-parallel algorithms to solve the same problem. These new algorithms require no verification and can handle patterns of length greater than w. The two techniques also use the same BPA of approximate matching, together with concatenation to form a single pattern from the set of r patterns. We compare the performance of the new algorithms with existing ones and find that our algorithms have better running times than the previous algorithms.
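A minimal single-pattern version of the bit-parallel automaton underlying this line of work can be sketched as follows. The snippet implements the shift-and simulation of the approximate-matching NFA with k errors in the style of Wu and Manber, using Python's arbitrary-precision integers as bit vectors; the paper's multi-pattern concatenation and length-beyond-w techniques are omitted:

```python
def bitap_search(text, pattern, k):
    """Bit-parallel (shift-and) approximate matching: return the end indices
    in `text` of substrings within edit distance <= k of `pattern`."""
    m = len(pattern)
    B = {}                                  # per-character pattern bitmasks
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << i)
    # R[d] bit i set <=> pattern[:i+1] matches a suffix of the scanned
    # text with at most d errors; initially d deletions are "free".
    R = [(1 << d) - 1 for d in range(k + 1)]
    out = []
    for pos, ch in enumerate(text):
        cm = B.get(ch, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & cm       # exact row: match transitions only
        for d in range(1, k + 1):
            old = R[d]
            R[d] = (((R[d] << 1) & cm)      # match
                    | prev_old              # insertion into the text
                    | (prev_old << 1)       # substitution
                    | (R[d - 1] << 1)       # deletion (uses updated row d-1)
                    | 1)
            prev_old = old
        if R[k] & (1 << (m - 1)):           # final NFA state active
            out.append(pos)
    return out
```

Each text character costs O(k) word operations for patterns up to the machine word length, which is exactly the restriction (pattern length ≤ w) that the paper's algorithms lift.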
Article
Approximate string matching is one of the most important technologies in computer science, and it is widely used in network security, computational biology, information retrieval, spelling correction, and other areas. These areas involve huge amounts of data that must be processed in a timely fashion, which calls for a high-performance string matching algorithm. At the same time, the introduction of the Compute Unified Device Architecture (CUDA) has extended general-purpose computing on the Graphics Processing Unit (GPU) to many more applications. This paper proposes a bit-parallel multiple approximate string matching algorithm and develops a GPU implementation that achieves speed-ups of about 28X relative to single-threaded CPU code.
Code
This code is an Rcpp integration of the Needleman-Wunsch algorithm for global alignment of two string sequences. A simple example test is provided in R.
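For reference, the Needleman-Wunsch recurrence that such code implements can be sketched in plain Python. The scoring scheme (+1 match, −1 mismatch, −1 gap) is an illustrative default, not taken from the code described above:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b via the classic DP:
    H[i][j] = best score aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap               # align a[:i] against gaps
    for j in range(1, m + 1):
        H[0][j] = j * gap               # align b[:j] against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag,                  # match / mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
    return H[n][m]
```

The table filling is identical in shape to the Levenshtein DP of the main paper (min of three neighbors becomes max), which is why the same GPU diagonal-wavefront tricks apply to both.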
Article
Fuzzy clustering is one of the most popular techniques in medical image segmentation. The fuzzy C-means (FCM) algorithm has been widely used as it provides better performance and more information than other algorithms. As the data set becomes large, the serial implementation of the FCM algorithm becomes too slow to accomplish the clustering task within acceptable time. Hence, a parallel implementation [for example, using today’s fast graphics processing unit (GPU)] is needed. In this paper, we implement the brFCM algorithm, a faster variant of the FCM algorithm, on two different GPU cards, the Tesla M2070 and the Tesla K20m. We compare our GPU-based brFCM implementation with its sequential CPU-based implementation. Moreover, we compare brFCM with the traditional version of the FCM algorithm. The experiments used lung CT and knee MRI images for clustering. The results show that our implementation achieves a significant improvement over the traditional sequential CPU implementation: GPU-parallel brFCM is 2.24 times faster than its CPU implementation, and 23.43 times faster than a GPU-parallel implementation of the traditional FCM.
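One iteration of the FCM update that such implementations parallelize can be sketched as follows. This is the textbook Bezdek formulation (memberships from inverse distances, then weighted centroids), not the brFCM variant described above:

```python
import numpy as np

def fcm_step(X, centers, m=2.0):
    """One Fuzzy C-Means iteration. X: (n, f) data points,
    centers: (c, f) cluster centers, m: fuzzifier (> 1).
    Returns (memberships U of shape (n, c), updated centers)."""
    # distance of every point to every center; epsilon avoids division by zero
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)   # each row sums to 1
    W = U ** m                                 # fuzzified weights
    new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, new_centers
```

Both steps are embarrassingly parallel over points (memberships) and over clusters (centers), which is what makes FCM a natural fit for the GPU.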
Conference Paper
This paper presents an algorithm for performing name searches using various metrics. Parallelism in the algorithm is exploited in order to obtain an implementation running on GPUs.
Conference Paper
Bioinformatics is at the core of biotechnology, and sequence alignment is its most basic and important operation. Information about the function, structure, and evolution of biological sequences can be uncovered through sequence comparison, which is the basic operation of sequence alignment. This article describes the common kinds of pairwise sequence alignment algorithms and explains them in detail with examples. It concludes by pointing out the current open problems of sequence alignment algorithms.