GPU-Accelerated DNA Distance Matrix Computation
Zhi Ying1, Xinhua Lin1, Simon Chong-Wee See1,2,3 and Minglu Li1,4
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Institute of High Performance Computing, A*Star, Singapore
3 NVIDIA Corporation
4 Shanghai Key Laboratory of Scalable Computing and Systems
Abstract: Distance matrix calculation, used in phylogeny analysis,
is computationally intensive, and growing sequence data sets call
for faster computation methods. This paper accelerates
Felsenstein's DNADIST program by using OpenCL to exploit the
computational capability of graphics cards. The GPU-accelerated
DNADIST program achieves more than 12-fold speedup over the serial
CPU program on a personal workstation with a 2.66 GHz quad-core
Intel CPU and an AMD HD5850 graphics card. Dual HD5850 cards on the
same platform scale almost linearly to a 24-fold speedup. The
program also shows good performance portability, achieving a
16-fold speedup with an NVIDIA Tesla C2050 card.
Keywords: GPU, OpenCL, distance matrix, phylogeny
I. INTRODUCTION
Distance matrices are used in phylogeny to construct
phylogenetic trees, which are the basis of evolutionary analysis
and have been used in many research areas, including drug
discovery. The time complexity of computing a distance
matrix of N species is O(N^2). The number of available sequences
has increased dramatically in recent years due to progress in
sequencing technology, which greatly increases the
computational load of distance calculation. The distance matrix
computation for a modern homologous sequence data set on
CPU-based workstations can take hours or days.
On the other hand, the rapid development of GPU
(Graphics Processing Unit) hardware provides computational
capability at a high performance-to-cost ratio. The peak
double-precision performance of a single present-day GPU card
reaches several hundred GFLOPS, which is almost
equivalent to a small CPU cluster. Meanwhile, GPU
programming techniques such as CUDA [1] and OpenCL [2]
make it possible for programmers to take advantage of the
computational capability of GPUs for general-purpose use
without sophisticated graphics programming skills.
The DNADIST program from PHYLIP [3] uses nucleotide
sequences to compute a distance matrix under four different
models of nucleotide substitution. Each distance is a maximum
likelihood estimate of the divergence time between the two
given species under a specific substitution model. In this paper
we implement the default F84 model [4][5] used in the DNADIST
program on GPU using OpenCL. Taking the features of the
GPU architecture into account, we optimize the original
sequential algorithm, and our program achieves a 12- to 24-fold
speedup over the CPU version of DNADIST.
The rest of this paper is organized as follows. Section II
provides background. The original distance calculation method
and the GPU implementation are described in Sections III and IV
respectively. Section V presents GPU computational results on
different platforms, and Section VI concludes the paper.
II. BACKGROUND
A. Distance-Based Phylogenetic Tree Reconstruction
A phylogenetic tree represents the genealogical relationships
among organisms (species or populations). Tree reconstruction
methods are generally divided into distance-based and
character-based methods [6]. Character-based methods
such as maximum parsimony [7], maximum likelihood [8] and
Bayesian methods [9] use the nucleotides or amino acids of all
species to fit a tree. The maximum likelihood method is
statistically the most accurate method of reconstructing the
phylogenetic tree. However, the weakness of such methods is
that the computation becomes almost infeasible
when the data set contains more than several dozen
sequences. Distance-based methods use clustering algorithms to
convert the matrix of pairwise distances between species
into a tree. The pairwise distance can be calculated in
a variety of ways [10][11]. The advantages of distance-based
methods are that they can handle large sequence data sets and
are usually much faster than character-based methods.
The DNADIST program is used in distance-based
phylogenetic tree reconstruction, and it uses model-based
maximum likelihood estimation in the distance computation
to gain the accuracy advantage of character-based methods.
B. GPGPU Programming
General-purpose computing on graphics processing units
(GPGPU) is the technique of using a GPU to handle
computation traditionally performed on the CPU. The
early use of GPGPU can be traced back to the early 2000s, when
programmers started to use graphics programming interfaces
such as DirectX and OpenGL to accelerate their applications.
This style of programming was quite difficult because
programmers had to carefully map general arithmetic
operations to graphics operations. The situation did not
improve fundamentally until the main vendors and
communities noticed the demand and started making efforts
to simplify the process. NVIDIA launched CUDA (Compute
Unified Device Architecture) [1] in 2006 to support a C-based
programming language on GPU. AMD released its Stream
SDK [12] in 2007, which also supports a variant of ANSI C.
To provide a standardized open framework for writing
programs that execute across heterogeneous platforms
including CPUs, GPUs and other processors such as DSPs and
the Cell/B.E. processor, the Khronos Group published OpenCL [2]
in 2008. OpenCL consists of APIs that are used to handle
different platforms and a C-based programming language used
for writing kernel functions that execute on OpenCL devices.
The conceptual OpenCL device architecture is given in Fig. 1.
NVIDIA Academic Partnership Program
Figure 1. Conceptual OpenCL device architecture [2]
C. GPU-Accelerated Bio-Computing and Related Work
Though the concept of GPGPU does not have a long history,
much work has been done to exploit the computational
capability of the GPU many-core architecture for bio-computing.
Sequence analysis, especially sequence alignment, which is the
basis of many biological studies, is time consuming and an
active area of hardware acceleration research. Schatz et al. [13]
proposed MUMmerGPU, a high-throughput sequence alignment
program using GPU which achieves more than 10-fold speedup
over the CPU program. Ligowski and Rudnicki [14] implemented an
efficient Smith Waterman algorithm on GPU which
reaches 70% of the theoretical hardware performance. Many
other works can be found in Sarkar's survey [15].
Evolutionary tree reconstruction in phylogenetic research is
another time-consuming process in bio-computing which has
also been accelerated by GPU in different ways. Charalambous
[16] ported the maximum likelihood based phylogenetic tree
inference program RAxML to GPU and achieved 1.2x to 3x
speedup. Suchard [17] implemented a Bayesian
phylogenetic reconstruction algorithm on GPU and achieved
90x to 250x speedup over the CPU program. There are also some
efforts on distance-based methods, which can handle larger
data sets. Chang implemented simplified distance algorithms
such as Euclidean and Manhattan distance on GPU and achieved
quite good performance [18][19]. However, statistically more
accurate distance computation algorithms have not yet been
implemented on GPU, which motivates the work in this paper.
III. SERIAL METHOD
A. Statistical Model
Felsenstein [5] explained the theory and computational
methods used in the DNAML and DNAMLK programs in
PHYLIP [3] for computing the maximum likelihood of a
phylogenetic tree with evolutionary rates that follow a Hidden
Markov Model. The DNADIST program uses similar methods
to calculate the distance between two sequences, which is the
branch length that maximizes the likelihood of a tree
containing only two nodes. The difference is that DNADIST
calculates only one optimized branch length for each pair of
sequences, rather than calculating all branches of all
candidate trees as DNAML and DNAMLK do.
If $c_i$ denotes the category of site $i$, the rate of site $i$ is
$r_{c_i}$. Given data $D$ that contains $n$ sites, the likelihood of a
given phylogeny $T$ is given below:
$$L = \sum_{c_1}\sum_{c_2}\cdots\sum_{c_n} \mathrm{Prob}(c_1, c_2, \ldots, c_n) \times \prod_{i=1}^{n} \mathrm{Prob}\left(D_i \mid T, r_{c_i}\right) \tag{1}$$
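In the Hidden Markov Model the transition probability
from state $x$ to state $y$ depends on both the branch length $v$ and
the evolutionary rate $r$, and is written $P_{xy}(v, r)$. The
likelihood of site $i$ is then
$$\mathrm{Prob}\left(D_i \mid T, r_{c_i}\right) = \sum_{x} \pi_x\, l^{(i)}_{c_i}(x) \sum_{y} P_{xy}(v, r_{c_i})\, l^{(i)}_{c_i}(y) \tag{2}$$
where $\pi_x$ is the equilibrium frequency of base $x$ and the two
$l^{(i)}_{c_i}(\cdot)$ factors are the conditional likelihoods of site $i$
at the two ends of the branch.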
Given a model that specifies the probability that
sequence S1 changes into sequence S2 along a branch of a given
length, the maximum likelihood distance can be obtained by
evaluating the likelihood repeatedly for many branch lengths.
This process requires a large amount of computation.
A much faster way is to use the Newton-Raphson method
to reduce the number of branch lengths that are evaluated. The
derivative needed by the Newton-Raphson method is
$$\frac{\partial\, \mathrm{Prob}\left(D_i \mid T, r_{c_i}\right)}{\partial v} = -r_{c_i}(\alpha+\beta)(K_1 - K_2)\,e^{-(\alpha+\beta)\, r_{c_i} v} - r_{c_i}\,\beta\,(K_2 - K_3)\,e^{-\beta\, r_{c_i} v} \tag{3}$$
where $\alpha$ and $\beta$ represent the rates of the two kinds of
change events in the model [5], and
$$K_1 = \sum_{x} \pi_x\, l^{(i)}_{c_i}(x)\, l^{(i)}_{c_i}(x) \tag{4}$$
$$K_2 = \sum_{x} \pi_x\, l^{(i)}_{c_i}(x) \sum_{y} \left(\frac{\pi_y}{\sum_{z} \pi_z \varepsilon_{zy}}\right) \varepsilon_{xy}\, l^{(i)}_{c_i}(y) \tag{5}$$
$$K_3 = \left(\sum_{x} \pi_x\, l^{(i)}_{c_i}(x)\right)\left(\sum_{y} \pi_y\, l^{(i)}_{c_i}(y)\right) \tag{6}$$
In each product the first and second $l^{(i)}_{c_i}$ factors belong to
the two ends of the branch, that is, to the two sequences being compared.
The detailed derivations of these equations can be
found in Felsenstein's papers [5][8][20].
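As an illustration of equations (3) to (6), the following minimal Python sketch evaluates K1, K2 and K3 for a single site and checks the analytic slope against a finite difference. Several things here are assumptions of the sketch rather than details taken from DNADIST: the function names, the use of one-hot conditional likelihoods for observed bases, the reading of epsilon_xy as 1 when x and y are both purines or both pyrimidines, and the closed-form site likelihood, which is simply the expression whose derivative is equation (3).

```python
import math

BASES = "ACGT"
PURINES = {"A", "G"}

def same_class(x, y):
    """epsilon_xy (assumed): 1 if x and y are both purines or both pyrimidines."""
    return 1.0 if (x in PURINES) == (y in PURINES) else 0.0

def k_terms(pi, la, lb):
    """K1, K2, K3 of equations (4)-(6) for one site.

    pi     : base frequencies, keyed by base
    la, lb : conditional likelihoods at the two ends of the branch
    """
    k1 = sum(pi[x] * la[x] * lb[x] for x in BASES)
    k2 = sum(
        pi[x] * la[x] * sum(
            pi[y] / sum(pi[z] * same_class(z, y) for z in BASES)
            * same_class(x, y) * lb[y]
            for y in BASES)
        for x in BASES)
    k3 = sum(pi[x] * la[x] for x in BASES) * sum(pi[y] * lb[y] for y in BASES)
    return k1, k2, k3

def site_likelihood(k, alpha, beta, r, v):
    """Assumed closed form of the site likelihood as a function of v."""
    k1, k2, k3 = k
    e1 = math.exp(-(alpha + beta) * r * v)
    e2 = math.exp(-beta * r * v)
    return k1 * e1 + k2 * (e2 - e1) + k3 * (1.0 - e2)

def site_slope(k, alpha, beta, r, v):
    """Equation (3): derivative of the site likelihood with respect to v."""
    k1, k2, k3 = k
    return (-r * (alpha + beta) * (k1 - k2) * math.exp(-(alpha + beta) * r * v)
            - r * beta * (k2 - k3) * math.exp(-beta * r * v))

# Hypothetical example: one site where the two sequences show A and G.
pi = {"A": 0.3, "C": 0.2, "G": 0.25, "T": 0.25}
la = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
lb = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
k = k_terms(pi, la, lb)
slope = site_slope(k, alpha=0.5, beta=1.0, r=1.0, v=0.3)
```

Because K1, K2 and K3 do not depend on v, they can be computed once per site and reused for every branch length tried during the search.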
B. Program Flow
Figure 2. Overview of the DNADIST program flow
The overall flow of the serial DNADIST program is
summarized in Fig. 2. The hotspot profiling result of the original
program is shown in Fig. 3. The main hotspot of
the program is the makev function, which calculates the best
distance value for each pair of sequences. These computations
are independent of each other, which makes the problem
embarrassingly parallel. The main procedure in makev is an
iterative process that implements the Newton-Raphson method
mentioned above. The pseudocode of the iteration is
shown in Fig. 4; it is optimized in Section IV.C.
Figure 3. Hotspot of the program.
Figure 4. Inner iteration of the distance computation: a Newton-Raphson
method
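The inner iteration of Fig. 4 can be sketched in Python as a signed step-halving search on the slope. This is an illustration only: the function names and the toy slope function are hypothetical, and the sketch assumes that the halved step is taken in the negative direction whenever the slope is negative, so that the search turns back after overshooting the maximum.

```python
def step_halving_search(slope_at, max_iter=100, tol=2e-7):
    """Search for the branch length where the slope changes sign.

    slope_at : function returning the likelihood slope at a branch length
    max_iter : iteration limit, as in the pseudocode of Fig. 4
    tol      : stop once the step size falls to this value or below
    """
    tt, delta, iteration = 0.1, 0.1, 0
    while iteration < max_iter and abs(delta) > tol:
        slope = slope_at(tt) if tt > 0.0 else 0.0
        if slope < 0.0:
            # Past the maximum: turn around and halve the step (assumed sign).
            delta = -abs(delta) / 2.0
        else:
            # Still climbing: keep moving forward with the current step size.
            delta = abs(delta)
        tt += delta
        iteration += 1
    return tt

def toy_slope(t):
    """Hypothetical slope of a concave likelihood with its maximum at 0.37."""
    return -2.0 * (t - 0.37)

best = step_halving_search(toy_slope)
```

The number of iterations depends on where the maximum lies, which is the property that later causes work items of one work group to diverge on the GPU.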
IV. GPU IMPLEMENTATION
A. Task Mapping
1) Mapping style consideration
OpenCL supports a work-group style of mapping tasks
onto a device, as Fig. 5 shows. A work item is the basic execution
unit, and several work items can be organized into a work group,
within which they execute efficiently in a vector-like fashion
and share a fast but limited local memory space.
Figure 5. Overview of OpenCL parallel execution model [2]
To compute the distance matrix of an input of N
DNA sequences, the program needs to perform N*(N-1)/2 distance
computations. For occupancy, it is natural to treat
the entire task as a list and divide it equally into many pieces
(work groups). For a sample input of eight DNA sequences, Fig.
6 shows this linear style of task mapping, where d(x,y) denotes
the distance computation task for sequences x and y, and a big
block containing nine small squares represents an OpenCL
work group.
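The linear mapping described above can be sketched as follows. The function name, the 0-based sequence indices and the row-by-row ordering of the pairs are assumptions made for illustration; the last work group is simply left partially filled when the number of pairs is not a multiple of the group size.

```python
from itertools import combinations

def linear_work_groups(n_seq, group_size):
    """Cut the list of N*(N-1)/2 pair tasks into consecutive work groups."""
    pairs = list(combinations(range(n_seq), 2))  # tasks d(x, y) with x < y
    return [pairs[i:i + group_size] for i in range(0, len(pairs), group_size)]

# Eight sequences and work groups of nine work items, as in the example above.
groups = linear_work_groups(8, 9)
task_count = sum(len(g) for g in groups)
```

With this ordering a single group holds pairs whose second index keeps changing, which is why the sequences touched by one group are spread widely over the input.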
Figure 6. Linear style mapping
Figure 7. Matrix style mapping
Listing for Figure 4 (inner iteration of makev):
    tt = 0.1, delta = 0.1
    while iteration < 100 && fabs(delta) > 0.0000002
        slope = 0.0
        if tt > 0.0 then
            slope = slope_calculation()
        endif
        if slope < 0.0 then
            delta = fabs(delta) / -2.0
        else
            delta = fabs(delta)
        endif
        tt += delta
        iteration++
    endwhile
    value = tt
Listing for Figure 2 (DNADIST program flow):
    1. Read options and data file
    2. foreach pair of sequences x and y:
           makev(x, y);  // compute the distance iteratively to
                         // obtain the maximum likelihood
    3. Save distances to the distance matrix and output
On the other hand, if memory locality is considered, a matrix
style of task mapping arises, as Fig. 7 shows. When the
work-group size is (BLOCK_SIZE * BLOCK_SIZE), each work group
in the matrix-style method accesses data of at most
2*BLOCK_SIZE sequences, while a linear-style work group may
need up to BLOCK_SIZE * BLOCK_SIZE sequences.
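The locality argument for the two mapping styles can be checked with a small Python sketch that counts how many distinct sequences one work group touches. The function names, the row-by-row ordering of the linear task list and the treatment of the matrix as square tiles over the upper triangle are assumptions of this sketch.

```python
from itertools import combinations

def linear_footprint(n_seq, bs):
    """Largest number of distinct sequences touched by one linear work group."""
    pairs = list(combinations(range(n_seq), 2))
    size = bs * bs
    return max(
        len({s for pair in pairs[i:i + size] for s in pair})
        for i in range(0, len(pairs), size))

def matrix_footprint(n_seq, bs):
    """Largest number of distinct sequences touched by one bs x bs tile."""
    worst = 0
    for row0 in range(0, n_seq, bs):
        for col0 in range(row0, n_seq, bs):  # tiles on or above the diagonal
            seqs = set()
            for x in range(row0, min(row0 + bs, n_seq)):
                for y in range(col0, min(col0 + bs, n_seq)):
                    if x < y:                # only the upper triangle is needed
                        seqs.update((x, y))
            worst = max(worst, len(seqs))
    return worst

linear_worst = linear_footprint(64, 4)
matrix_worst = matrix_footprint(64, 4)
```

For these hypothetical sizes the tile never touches more than 2*BLOCK_SIZE sequences, while the linear group touches on the order of BLOCK_SIZE * BLOCK_SIZE, so far less input data has to be fetched per work group in the matrix style.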
2) Multi-Card Consideration
The advantage of GPU computing lies not only in the
fact that a single present-day GPU card has a higher
double-precision peak than a CPU, but also in the fact that it is
possible and convenient to extend the computational capability
by using multiple GPU devices. Performance scalability is
therefore also an important consideration for GPU programs. The
GPU program described in this paper uses a simple task
distribution strategy to achieve both load balance and memory
efficiency: the matrix of tasks is divided by area into N
rectangles, one for each of the N devices, as shown in
Fig. 8.
Figure 8. Multi-card matrix-style task distribution
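One way to realize such an area-based split is sketched below in Python. It assumes that each device receives a horizontal band of rows of the upper-triangular task matrix and that the bands are chosen so that their pair counts are as equal as possible; the exact shape of the rectangles in Fig. 8 is not reproduced, and the function names are hypothetical.

```python
def split_rows(n_seq, n_dev):
    """Return row boundaries so that each band holds about the same number of pairs.

    Row r of the upper triangle contains n_seq - 1 - r pair tasks.
    Assumes n_seq is much larger than n_dev.
    """
    total = n_seq * (n_seq - 1) // 2
    bounds = [0]
    acc = 0
    for row in range(n_seq):
        acc += n_seq - 1 - row
        # Close the current band once the running count reaches the next
        # equal-area target, total * k / n_dev for band k.
        if len(bounds) < n_dev and acc * n_dev >= total * len(bounds):
            bounds.append(row + 1)
    bounds.append(n_seq)
    return bounds

def pairs_in_band(n_seq, lo, hi):
    """Number of pair tasks in rows lo .. hi-1."""
    return sum(n_seq - 1 - r for r in range(lo, hi))

# Hypothetical example: 762 sequences on two devices.
bounds = split_rows(762, 2)
counts = [pairs_in_band(762, bounds[i], bounds[i + 1]) for i in range(2)]
```

The upper band is much shorter than the lower one because rows near the top of the triangle contain more pairs, which is the reason the split has to follow area rather than row count.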
B. Use of Local Memory
Figure 9. Slope calculation in NR method
Though the matrix mapping method reduces the number of
memory accesses inside a work group, more can be
done to exploit the fast local memory and maximize program
performance. The calculation of the slope in the Newton-Raphson
method, shown in Fig. 9, accesses global memory very
frequently, because the variable sitevalues is stored
in global memory and the variable endsite, the length
of the DNA sequences, can be several thousand or more.
Since the latency of a global memory read can
be hundreds of clock cycles while a normal arithmetic operation
costs only a few, this memory behavior must be considered
seriously.
The basic idea of optimizing global memory access is to use
the low-latency on-chip local memory to cache part of the global
data and lower the overall cost. Fig. 10 shows the local memory
version of the slope calculation. Per work item, the memory
access cost of Fig. 9 is 2 * endsite * Global Read, while that of
Fig. 10 is 2 * (endsite/BS) * Global Read + 2 * endsite * Local Read.
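The effect of staging data through local memory can be emulated on the CPU with the Python sketch below, which counts global reads per work item for both versions. The tile indexing, the stand-in pair function f and all names are assumptions made for illustration; endsite is assumed to be a multiple of BS, and the barriers of the real kernel are replaced by sequential loops.

```python
def naive_slope(sitevalues, a, b, f):
    """Direct version: every operand is fetched from 'global' memory."""
    slope, global_reads = 0, 0
    for i in range(len(sitevalues[a])):
        slope += f(sitevalues[a][i], sitevalues[b][i])
        global_reads += 2
    return slope, global_reads

def tiled_group(sitevalues, bx, by, bs, f):
    """Emulate one bs x bs work group that caches bs sites at a time."""
    endsite = len(sitevalues[0])
    slope = [[0] * bs for _ in range(bs)]   # slope[ty][tx], one per work item
    global_reads = 0                        # counted per work item
    for s in range(0, endsite, bs):
        # Each work item (tx, ty) loads one element of each tile.
        A = [[sitevalues[bx * bs + ty][s + tx] for tx in range(bs)]
             for ty in range(bs)]
        B = [[sitevalues[by * bs + ty][s + tx] for tx in range(bs)]
             for ty in range(bs)]
        global_reads += 2
        # After the barrier, every work item accumulates from local tiles.
        for ty in range(bs):
            for tx in range(bs):
                for k in range(bs):
                    slope[ty][tx] += f(A[tx][k], B[ty][k])
    return slope, global_reads

# Hypothetical data: 8 sequences of 8 integer site values, BS = 4.
sitevalues = [[(3 * i + 7 * j) % 5 for j in range(8)] for i in range(8)]
f = lambda x, y: x * y
tile_slopes, tile_reads = tiled_group(sitevalues, 0, 1, 4, f)
```

Work item (tx, ty) of group (bx, by) ends up with the same value as the direct loop for sequences bx*BS + tx and by*BS + ty, but issues only 2 * (endsite/BS) global reads instead of 2 * endsite.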
Figure 10. Slope calculation using local memory
C. Optimizing Newton-Raphson Method
As Fig. 4 shows, the original CPU program uses a
Newton-Raphson method to find the best distance value. With the
original code, however, different work items
in the same work group may loop a different number of times.
This is inefficient on a GPU, which is a typical SIMD or SIMT
architecture. To deal with this problem we optimized the
Newton-Raphson method for GPU (Fig. 11). In this method the
best value is found by a binary search. Every work item
loops the same number of times, so all work items execute the
same instructions at the same time. The precision of the
optimized version is 1.49E-7 (10/2^26), which is better than
that of the original code (2.0E-7).
Figure 11. Optimized inside iteration
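The fixed-length binary search can be sketched in Python as follows. The function name and the toy slope function are hypothetical; the bounds 0 and 10 and the 26 iterations follow the description above.

```python
def bisection_search(slope_at, lo=0.0, hi=10.0, steps=26):
    """Binary search for the sign change of the slope with a fixed trip count."""
    mid = (lo + hi) / 2.0
    for _ in range(steps):
        if slope_at(mid) >= 0.0:
            lo = mid      # maximum lies to the right of mid
        else:
            hi = mid      # maximum lies to the left of mid
        mid = (lo + hi) / 2.0
    return mid

def toy_slope(t):
    """Hypothetical slope of a concave likelihood with its maximum at 0.37."""
    return -2.0 * (t - 0.37)

best = bisection_search(toy_slope)
precision = 10.0 / 2 ** 26   # width of the final search interval
```

Every call runs exactly 26 iterations regardless of the input, so all work items of a work group follow the same instruction stream, and the final interval width of 10/2^26 is about 1.49E-7.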
V. RESULTS
Tests are performed on an input data set that contains 762 HBV
DNA sequences. A desktop computer with a quad-core CPU
(Intel i7 920), 6 GB RAM and multiple PCI-E slots is used in the
experiment. In the single-card test, an AMD HD5850 with 1440
stream processing units is installed in the machine, while in
the multi-card test a second HD5850 is added. To verify the
performance portability of the program, a single NVIDIA Tesla
C2050 with 448 cores is installed alone.
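Listing for Figure 11 (optimized inner iteration):
    min = 0.0, max = 10.0, mid = (min + max) / 2
    for j = 0 to 25
        slope = slope_calculation()
        if slope >= 0.0 then
            min = mid
        else
            max = mid
        endif
        mid = (min + max) / 2
    endfor
    value = mid
Listing for Figure 10 (slope calculation using local memory):
    bx = get_group_id(0);
    by = get_group_id(1);
    tx = get_local_id(0);
    ty = get_local_id(1);
    spa = bx*BS + tx; spb = by*BS + ty; // Species A and B of this work item
    __local TYPE A[BS][BS];             // BS sites of the BS species in group bx
    __local TYPE B[BS][BS];             // BS sites of the BS species in group by
    …
    for (s = 0; s < endsite; s += BS) {
        // Row ty of each tile is one species; column tx is one site.
        A[ty][tx] = sitevalues[bx*BS + ty][s + tx];
        B[ty][tx] = sitevalues[by*BS + ty][s + tx];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (k = 0; k < BS; k++) {
            slope += f(A[tx][k], B[ty][k]); // species spa and spb, site s + k
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
Listing for Figure 9 (slope calculation in NR method):
    for (i = 0; i < endsite; i++) {
        slope += evaluate(sitevalues[A][i], sitevalues[B][i]);
    }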
A. Locality Is the Key
Table I shows the performance of the original
CPU code, the GPU code using the linear mapping style, and the
GPU code using the matrix-style mapping strategy with local
memory. The GPU code achieves a speedup of 3.5 even without
complex tuning, and the speedup over the original serial CPU
code improves markedly, to 8.8, when locality is taken into
account.
TABLE I. PERFORMANCE OF TWO MAPPING STYLES OF THE NAÏVE GPU PROGRAM
Method                       Execution Time   Speedup
CPU                          310 s            -
Linear Naive, GPU            88 s             3.5
Matrix+Local Memory, GPU     35 s             8.8
B. Instruction Alignment Matters
Considering the SIMD or SIMT nature of the GPU, Section
IV.C proposed a modification of the original Newton-Raphson
method used in the serial code. Table II shows that
performance is significantly better when instruction branches
are aligned across work items.
TABLE II. PERFORMANCE IMPROVEMENT OF OPTIMIZED NEWTON-RAPHSON METHOD
Method                       Execution Time   Speedup
CPU                          310 s            -
Matrix+Local Memory, GPU     35 s             8.8
+NR Optimized                24 s             12.9
The overall performance of the different versions of the program
is shown in Fig. 12.
Figure 12. Speedup of different versions of programs
C. Performance Scalability and Portability
Table III shows that when we use the multi-card mapping
method described in Section IV.A and run the optimized GPU
code on dual HD5850 cards, the speedup improves
almost linearly. This is mainly because the
distance computations do not depend on each other and
each card can hold the same read-only copy of the input data.
TABLE III. MULTI-CARD PERFORMANCE
Devices      Execution Time   Speedup
CPU          310 s            -
HD5850x1     24 s             12.9
HD5850x2     12.5 s           24.8
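The scaling figures in Table III follow directly from the measured execution times, as the short Python calculation below shows. The variable names and the definition of parallel efficiency (dual-card speedup divided by twice the single-card speedup) are choices made for this illustration.

```python
def speedup(t_reference, t_new):
    """Speedup of a run taking t_new seconds over one taking t_reference."""
    return t_reference / t_new

cpu_time, one_card_time, two_card_time = 310.0, 24.0, 12.5  # seconds, Table III

s_one = speedup(cpu_time, one_card_time)
s_two = speedup(cpu_time, two_card_time)
# Fraction of ideal two-card scaling that is actually achieved.
efficiency = s_two / (2 * s_one)
```

An efficiency close to 1 is what "nearly linear" means here: the second card almost halves the execution time.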
The main reason we chose OpenCL as the GPU
parallel language is that it is an open standard supported on
many devices by multiple device vendors. Our experiment shows
that the code can be executed on the Tesla C2050 without any
change, and with little tuning the OpenCL program achieves
good performance, as shown in Table IV.
TABLE IV. PERFORMANCE ON DIFFERENT PLATFORMS
Devices       Execution Time   Speedup
CPU           310 s            -
HD5850        24 s             12.9
Tesla C2050   19 s             16
The performance of the program on different platforms is
shown in Fig. 13.
Figure 13. Speedup of different platforms
VI. CONCLUSION
We have presented a GPU-accelerated DNA sequence
distance computation program in this paper. Our results show
that a significant speedup can be achieved by mapping the task
properly, making use of locality and paying attention to
instruction branch alignment. The results also show that our
work has good performance portability and scalability.
Future work includes integrating other nucleotide substitution
models into the program and implementing clustering
programs on GPU to accelerate the whole tree reconstruction
process.
ACKNOWLEDGMENT
This work is based on the program DNADIST from PHYLIP,
developed by J. Felsenstein [3].
REFERENCES
[1] NVIDIA Corporation, "NVIDIA CUDA Programming Guide," 2009,
p. 179.
[2] Khronos OpenCL Working Group, "OpenCL Specification," 2010,
pp. 1-377.
[3] J. Felsenstein, "PHYLIP (Phylogeny Inference Package) version
3.69," 2005.
[4] H. Kishino and M. Hasegawa, "Evaluation of the maximum
likelihood estimate of the evolutionary tree topologies from DNA
sequence data, and the branching order in hominoidea," Journal of
Molecular Evolution, vol. 29, Aug. 1989, pp. 170-179.
[5] J. Felsenstein and G.A. Churchill, "A Hidden Markov Model
approach to variation among sites in rate of evolution,"
Molecular Biology and Evolution, vol. 13, 1996, pp. 93-104.
[6] Z. Yang, Computational Molecular Evolution, 2006.
[7] W.M. Fitch, "Toward Defining the Course of Evolution: Minimum
Change for a Specific Tree Topology," Systematic Zoology, vol. 20,
1971, pp. 406-416.
[8] J. Felsenstein, "Evolutionary trees from DNA sequences: a
maximum likelihood approach," Journal of Molecular Evolution,
vol. 17, Jan. 1981, pp. 368-76.
[9] B. Rannala and Z. Yang, "Probability distribution of molecular
evolutionary trees: A new method of phylogenetic inference,"
Journal of Molecular Evolution, vol. 43, Sep. 1996, pp. 304-311.
[10] W.M. Fitch and E. Margoliash, "Construction of Phylogenetic
Trees," Science, vol. 155, Jan. 1967, pp. 279-284.
[11] L.L. Cavalli-Sforza and A.W. Edwards, "Phylogenetic analysis.
Models and estimation procedures," American Journal of Human
Genetics, vol. 19, May 1967, pp. 233-57.
[12] "ATI Stream Technology,"
http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM
TECHNOLOGY/Pages/streamtechnology.aspx.
[13] M.C. Schatz, C. Trapnell, A.L. Delcher, and A. Varshney,
"High-throughput sequence alignment using Graphics Processing
Units," BMC Bioinformatics, vol. 8, Jan. 2007, p. 474.
[14] L. Ligowski and W. Rudnicki, "An efficient implementation of
Smith Waterman algorithm on GPU using CUDA, for massively
parallel scanning of sequence databases," 2009 IEEE International
Symposium on Parallel & Distributed Processing, May 2009,
pp. 1-8.
[15] S. Sarkar, T. Majumder, A. Kalyanaraman, and P.P. Pande,
"Hardware accelerators for biocomputing: A survey," Proceedings
of 2010 IEEE International Symposium on Circuits and Systems,
May 2010, pp. 3789-3792.
[16] M. Charalambous and P. Trancoso, "Initial experiences porting a
bioinformatics application to a graphics processor," Advances in
Informatics, 2005.
[17] M.A. Suchard and A. Rambaut, "Many-core algorithms for
statistical phylogenetics," Bioinformatics (Oxford, England),
vol. 25, Jun. 2009, pp. 1370-6.
[18] D. Chang, N. Jones, and D. Li, "Compute pairwise Euclidean
distances of data points with GPUs," Proceedings of the IASTED
International Symposium Computational Biology and
Bioinformatics, 2008, pp. 278-283.
[19] D.J. Chang, A.H. Desoky, M. Ouyang, and E.C. Rouchka,
"Compute Pairwise Manhattan Distance and Pearson Correlation
Coefficient of Data Points with GPU," 2009 10th ACIS
International Conference on Software Engineering, Artificial
Intelligences, Networking and Parallel/Distributed Computing,
2009, pp. 501-506.
[20] J. Felsenstein, "Maximum Likelihood and Minimum-Steps Methods
for Estimating Evolutionary Trees from Data on Discrete
Characters," Systematic Zoology, vol. 22, 1973, pp. 240-249.