AxPcoords & parallel AxParafit: statistical co-phylogenetic analyses on thousands of taxa.

Alexandros Stamatakis, Alexander F Auch, Jan Meier-Kolthoff, Markus Göker

Ecole Polytechnique Fédérale de Lausanne, School of Computer & Communication Sciences, Laboratory for Computational Biology and Bioinformatics STATION 14, CH-1015 Lausanne, Switzerland.

Journal Article: BMC Bioinformatics (impact factor: 3.43). 02/2007; 8:405. DOI: 10.1186/1471-2105-8-405

Abstract

BACKGROUND: Current tools for co-phylogenetic analyses are not able to cope with the continuous accumulation of phylogenetic data. The sophisticated statistical test for host-parasite co-phylogenetic analyses implemented in Parafit does not allow it to handle large datasets in reasonable times. The Parafit and DistPCoA programs are by far the most compute-intensive components of the Parafit analysis pipeline. We present AxParafit and AxPcoords (Ax stands for Accelerated), which are highly optimized versions of Parafit and DistPCoA, respectively. RESULTS: Both programs have been entirely re-written in C. Via optimization of the algorithm and the C code, as well as integration of highly tuned BLAS and LAPACK methods, AxParafit runs 5-61 times faster than Parafit with a lower memory footprint (up to 35% reduction), and the performance benefit increases with growing dataset size. The MPI-based parallel implementation of AxParafit shows good scalability on up to 128 processors, even on medium-sized datasets. The parallel analysis with AxParafit on 128 CPUs for a medium-sized dataset with a 512 by 512 association matrix is more than 1,200 times faster than the sequential Parafit run, i.e., more than 1,200/128 ≈ 9.4 times faster per processor. AxPcoords is 8-26 times faster than DistPCoA and numerically stable on large datasets. We outline the substantial benefits of using parallel AxParafit by means of a large-scale empirical study on smut fungi and their host plants. To the best of our knowledge, this study represents the largest co-phylogenetic analysis to date. CONCLUSION: The highly efficient AxPcoords and AxParafit programs allow for large-scale co-phylogenetic analyses on several thousand taxa for the first time. In addition, AxParafit and AxPcoords have been integrated into the easy-to-use CopyCat tool.

Alexandros Stamatakis*1,2, Alexander F Auch3, Jan Meier-Kolthoff3 and
Markus Göker4
Address: (1) École Polytechnique Fédérale de Lausanne, School of Computer & Communication Sciences, Laboratory for Computational Biology and Bioinformatics STATION 14, CH-1015 Lausanne, Switzerland, (2) Swiss Institute of Bioinformatics, (3) Center for Bioinformatics (ZBIT), Sand 14, Tübingen, University of Tübingen, Germany and (4) Organismic Botany/Mycology, Auf der Morgenstelle 1, Tübingen, University of Tübingen, Germany
Email: Alexandros Stamatakis* - Alexandros.Stamatakis@epfl.ch; Alexander F Auch - auch@informatik.uni-tuebingen.de; Jan Meier-
Kolthoff - jan.mk@gmx.de; Markus Göker - markus.goeker@uni-tuebingen.de
* Corresponding author
Published: 22 October 2007
BMC Bioinformatics 2007, 8:405 doi:10.1186/1471-2105-8-405
Received: 26 June 2007
Accepted: 22 October 2007
This article is available from: http://www.biomedcentral.com/1471-2105/8/405
© 2007 Stamatakis et al.; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
One of the basic questions in evolutionary analyses [1] is whether parasites (e.g., lice or Papillomaviruses) or mutualists have co-speciated with their respective hosts (e.g., mammals). The constant accumulation of DNA and amino acid (AA) sequence data, coupled with recent advances in tree-building software such as TNT [2], MrBayes [3], GARLI [4], or RAxML [5], allows for large-scale phylogenetic analyses with several hundred or thousand taxa [6-12]. Thus, large-scale co-phylogenetic studies have potentially also become feasible. However, most common co-phylogenetic tools or methods such as BPA, TreeMap, or TreeFitter (see review in [13]) are either unable to handle datasets with a large number of taxa or have not been tested in this regard with respect to their statistical properties. Therefore, there is a performance and scalability gap between tools for phylogenetic analysis and tools for meta-analysis. The capability to analyze large datasets is important for inferring "deep co-phylogenetic" relationships that could not otherwise be assessed [14].
Parafit [15] implements statistical tests both for overall phylogenetic congruence and for the significance of individual associations. Extensive simulations have shown that the Parafit tests are statistically well-behaved and yield acceptable error rates. The method has been successfully applied in a number of biological studies [16-19]. In addition, the Type-II statistical error of Parafit decreases with the size of the dataset (see [15]), i.e., this approach scales well on large phylogenies of hosts and associates. Due to these desirable properties, recent work on CopyCat [14] focused on improving the usability of Parafit via a Graphical User Interface (GUI) and automation of the analysis pipeline, which transforms phylogenetic trees into patristic (tree-based) distance matrices, converts distance matrices into matrices of eigenvectors using DistPCoA [20], invokes Parafit, and parses input, intermediate, and output files. However, co-phylogenetic analyses with CopyCat cannot be conducted on large datasets due to the excessive run-time requirements of Parafit and DistPCoA, which represent by far the most compute-intensive part of the CopyCat analysis pipeline.
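To make the global test more concrete, here is a minimal pure-Python sketch of a ParaFitGlobal-style permutation test. It is based on our reading of the original Parafit description [15]. The statistic is taken to be the sum of squared entries of D = Cᵀ Aᵀ B. Here A is a binary hosts × parasites association matrix, B holds host principal coordinates (hosts × axes), and C holds parasite principal coordinates (parasites × axes). The permutation model is an assumption: it shuffles the entries within each row of A independently. The function names are hypothetical and do not reflect the Parafit or AxParafit source code.

```python
import random

def matmul(x, y):
    """Naive dense product of two list-of-lists matrices."""
    cols = list(zip(*y))
    return [[sum(u * v for u, v in zip(row, col)) for col in cols] for row in x]

def transpose(m):
    return [list(r) for r in zip(*m)]

def parafit_global_stat(a, b, c):
    """Sum of squared entries of D = C^T A^T B, i.e. trace(D^T D) (assumed definition)."""
    d = matmul(matmul(transpose(c), transpose(a)), b)
    return sum(v * v for row in d for v in row)

def parafit_global_test(a, b, c, permutations=99, seed=1):
    """Return (statistic, permutation p-value); the unpermuted data count once."""
    rng = random.Random(seed)
    observed = parafit_global_stat(a, b, c)
    at_least_as_large = 1  # the observed (unpermuted) association matrix
    for _ in range(permutations):
        perm_a = []
        for row in a:
            shuffled = row[:]
            rng.shuffle(shuffled)  # assumed model: shuffle within each row of A
            perm_a.append(shuffled)
        if parafit_global_stat(perm_a, b, c) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)
```

Each permutation repeats the dense matrix products. The total cost therefore grows linearly with the number of permutations p, which is why the products dominate the run time.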
Here we present AxParafit and AxPcoords, which are highly optimized (and, in the case of AxParafit, parallelized) versions of Parafit and DistPCoA, respectively. As shown by the case study on smut fungi in the Results and Discussion section, these accelerated programs allow for more thorough large-scale co-phylogenetic analyses and extend the applicability of the approach by 1–2 orders of magnitude. They thereby close the aforementioned performance gap of current phylogenetic meta-analysis tools. Coupled with the easy-to-use CopyCat tool, AxParafit and AxPcoords facilitate statistical co-phylogenetic analyses on the largest trees that can currently be computed.
Implementation
For programming convenience and portability, and because of the structure of the original Fortran code, we re-implemented Parafit and DistPCoA from scratch in C.
Sequential Optimization
The sequential C code was optimized by avoiding unnecessary memory allocations for matrices in AxPcoords and AxParafit and by using a faster method to permute matrices in AxParafit.
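The text does not say which faster permutation method AxParafit uses. The sketch below shows one common technique purely as an assumed illustration. It draws a random permutation of row indices with a Fisher-Yates shuffle and reads the matrix through that index vector, instead of physically copying rows for every permutation. The class and function names are hypothetical.

```python
import random

def fisher_yates(n, rng):
    """Return a uniformly random permutation of range(n) (Fisher-Yates / Knuth shuffle)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint's upper bound is inclusive
        perm[i], perm[j] = perm[j], perm[i]
    return perm

class RowPermutedView:
    """Read-only view of `matrix` whose rows are reordered by `perm`; no data is copied."""

    def __init__(self, matrix, perm):
        self.matrix = matrix
        self.perm = perm

    def __getitem__(self, index):
        i, j = index
        return self.matrix[self.perm[i]][j]

    def row(self, i):
        return self.matrix[self.perm[i]]

# Usage: each new permutation costs O(n) index swaps instead of an O(n*m) matrix copy.
rng = random.Random(42)
a = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
view = RowPermutedView(a, fisher_yates(len(a), rng))
```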
Thereafter, the compute-intensive for-loops in AxParafit and AxPcoords were tuned manually. After these initial optimizations we profiled both programs. Their run-times were now largely dominated (over 90% of total execution time) by a dense matrix-matrix multiplication in AxParafit and by the computation of eigenvectors/eigenvalues in AxPcoords, respectively. To accelerate the programs further, we integrated calls to two libraries. Matrix multiplication uses the highly optimized routine of BLAS (Basic Linear Algebra Subprograms [21]). Eigenvector/eigenvalue decomposition uses LAPACK (Linear Algebra PACKage [22]).
For BLAS we assessed the ATLAS BLAS (Automatically Tuned Linear Algebra Software, math-atlas.sourceforge.net) and ACML BLAS (AMD Core Math Library [23]) libraries on a 2.4 GHz AMD Opteron CPU. The ACML package was slightly faster (by ≈ 7–9%). Nonetheless, AxParafit also provides interfaces to the INTEL MKL (Math Kernel Library) and ATLAS BLAS implementations. AMD ACML, INTEL MKL, and ATLAS are all freely available for academic use. AxParafit can also be compiled without BLAS, in which case it relies on a manually tuned matrix multiplication that is approximately 4 times slower.
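Dense matrix-matrix multiplication dominates AxParafit's run time. Tuned BLAS libraries obtain much of their speed from cache blocking (tiling), which keeps small tiles of the operands in cache while they are reused. The pure-Python sketch below only illustrates the loop structure of a blocked product C = A·B. It is not AxParafit's manually tuned fallback kernel, and in Python it is not faster than a naive loop. The block size of 32 is an arbitrary assumption.

```python
def blocked_matmul(a, b, block=32):
    """Multiply a (n x k) by b (k x m), iterating over block x block tiles."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for jj in range(0, m, block):
                # Accumulate the product of one tile of a and one tile of b into c.
                for i in range(ii, min(ii + block, n)):
                    a_row, c_row = a[i], c[i]
                    for p in range(kk, min(kk + block, k)):
                        a_ip, b_row = a_row[p], b[p]
                        for j in range(jj, min(jj + block, m)):
                            c_row[j] += a_ip * b_row[j]
    return c
```

In compiled code, the block size is chosen so that the active tiles fit in cache. This is also consistent with the observation in the Results that the benefit grows with matrix size.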
AxPcoords can use the LAPACK functions implemented in either the AMD ACML or the INTEL MKL library. In addition, AxPcoords can make use of the GNU Scientific Library [24] for eigenvector/eigenvalue computations.
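For orientation, DistPCoA and AxPcoords turn a distance matrix into principal coordinates, i.e., eigenvectors scaled by the square roots of their eigenvalues. The sketch below uses classical Gower double-centering followed by a cyclic Jacobi eigenvalue iteration, in pure Python. Jacobi is swapped in here for readability. It is neither the LAPACK routine used by AxPcoords nor the algorithm of [28] used by DistPCoA. Zero and negative eigenvalues, which can arise for non-Euclidean distances, are simply dropped; this is an assumption of the sketch.

```python
import math

def gower_center(d):
    """Gower's centred matrix G = -1/2 * J D^(2) J for a symmetric distance matrix d."""
    n = len(d)
    a = [[-0.5 * d[i][j] ** 2 for j in range(n)] for i in range(n)]
    mean = [sum(row) / n for row in a]  # row means equal column means (symmetry)
    grand = sum(mean) / n
    return [[a[i][j] - mean[i] - mean[j] + grand for j in range(n)] for i in range(n)]

def jacobi_eigh(s, sweeps=100, tol=1e-24):
    """Cyclic Jacobi eigen-decomposition of a symmetric matrix.

    Returns (eigenvalues, eigenvectors), with eigenvector k stored in column k.
    """
    n = len(s)
    a = [row[:] for row in s]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    scale = sum(x * x for row in a for x in row) or 1.0
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off <= tol * scale:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-300:
                    continue
                # Rotation that zeroes a[p][q]; the smaller root of t is used for stability.
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                sn = t * c
                for k in range(n):  # a <- a P (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - sn * akq, sn * akp + c * akq
                for k in range(n):  # a <- P^T a (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - sn * aqk, sn * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: v <- v P
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - sn * vkq, sn * vkp + c * vkq
    return [a[i][i] for i in range(n)], v

def pcoa(d):
    """Principal coordinates (objects x axes), largest eigenvalue first."""
    values, vectors = jacobi_eigh(gower_center(d))
    n = len(d)
    cutoff = 1e-9 * max(abs(x) for x in values)
    axes = []
    for k in sorted(range(n), key=values.__getitem__, reverse=True):
        if values[k] <= cutoff:  # drop null and negative axes (sketch assumption)
            continue
        root = math.sqrt(values[k])
        axes.append([vectors[i][k] * root for i in range(n)])
    return [list(row) for row in zip(*axes)]
```

For Euclidean input, the returned coordinates reproduce the original distances. This makes a simple sanity check for any eigen-solver that is plugged in.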
The tuned programs were designed to yield exactly the same results as Parafit and DistPCoA. Note, however, that in contrast to AxPcoords, DistPCoA yielded numerically unstable results on datasets with large association matrices containing more than 4,096 entries. This is due to well-known stability problems of eigenvector/eigenvalue decomposition [25-27] on large datasets and to the fact that the original Parafit code uses the algorithm from [28]. Therefore, apart from speed benefits, the integration of the thoroughly tested LAPACK routines also yields increased numerical stability. We integrated AxPcoords and AxParafit into CopyCat [14]. Figure 1 provides a screen-shot of CopyCat with a drop-down menu that allows the user to select AxParafit/AxPcoords for executing the analyses.
Parallelization
AxPcoords requires less than 24 hours of run-time on a single CPU, even for distance matrices with several thousand taxa. Therefore, we focused exclusively on the parallelization of AxParafit, which requires run-times of several days or weeks on large datasets.
The execution time of Parafit depends on the sizes of the input matrices A, B, and C, with dimensions n1 × n2, n4 × n1, and n3 × n2, respectively (for details see [15]). The complexity is roughly O(nonZero(A) · n3 · n4 · n1 · p). The term n3 · n4 · n1 is the complexity of the dense matrix multiplication in AxParafit. The variable p is the user-specified number of permutations to be executed (typically 99–9,999, not counting the original permutation). nonZero(A) is the number of non-zero elements in the binary association matrix A. The program executes two main steps: the global test of co-speciation with complexity O(n3 · n4 · n1 · p) and the individual tests with complexity O(nonZero(A) · n3 · n4 · n1 · p). Since nonZero(A) ≫ 1 in real-world analyses, we only parallelized the individual tests of co-speciation, which typically generate over 99% of the total computational load. This approach represents a trade-off between the programming effort required for the parallelization and the expected performance gains. As a consequence, the global test of co-speciation must first be executed with the sequential version of AxParafit. The sequential program provides an option to conduct the global test, write a binary output file that can be used to start the parallel computation of individual host-parasite links, and then exit.
The statistical test of individual associations has been parallelized with MPI (Message Passing Interface) via a master-worker scheme. The parallelization is straightforward, since all tests of individual associations are independent of one another and can thus be computed on individual workers. Moreover, each individual test has approximately the same execution time, so load imbalance is not an issue. Since a single individual test is the smallest unit of work assigned to a worker, the maximum number of CPUs that can be used by our parallelization is nonZero(A). However, this limit can be overcome by using the ACML or MKL BLAS implementations, which exploit fine-grained loop-level parallelism on SMP (Symmetric Multi-Processing) architectures. This allows for a more efficient utilization of hybrid supercomputer architectures. Moreover, it might help to improve performance on huge datasets, where SMP implementations can profit from super-linear speedups due to increased cache efficiency.
Results and Discussion
This section is split into two parts. Part 1 describes the computational results, while Part 2 outlines the substantial benefits of using AxParafit for large-scale empirical co-phylogenetic studies.
Computational Performance
Here we provide performance data regarding the purely computational aspects of AxParafit.
Experimental Setup
To conduct computational experiments we used an unloaded system of 36 4-way AMD 2.4 GHz Opteron nodes with 8 GB of main memory per node, interconnected by an Infiniband switch. Parafit and DistPCoA were compiled using g77 -ffixed-line-length-0 -ff90-intrinsics-delete -O3. AxParafit and AxPcoords were compiled with -O3 -fomit-frame-pointer -funroll-loops and linked with the AMD ACML library. We also assessed additional compiler optimizations (-fomit-frame-pointer, -funroll-loops, -m64, -march=k8) with g77 for Fortran. These actually led to a performance decrease for Parafit and DistPCoA (data not shown).
Figure 1: Screen-shot of AxParafit/AxPcoords Option in CopyCat. This screen-shot shows the CopyCat drop-down menu that allows the user to select AxParafit/AxPcoords for executing the analyses and to switch between the U and W modes of branch length computation.
In order to assess the performance of AxParafit, we extracted subsets from a large empirical dataset with more than 30,000 host-associate links (collected from entries in the EMBL database [29]), which we are currently analyzing with our tools. We sampled square association matrices A, i.e., n1 = n2, of dimensions 128, 256, 512, 1,024, and 2,048. The number nonZero(A) was 128, 256, 512, 1,024, and 2,048, respectively. The number of permutations p was set to 99, 99, 9, 2, and 2, respectively. A complete test on the dataset of size 4,096 was not conducted with Parafit because of the extremely long run-time already observed for n1 = n2 = 2,048. That run took 19.9 days, compared to 7.7 hours required by AxParafit.
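As a rough plausibility check, the complexity term from the Parallelization section can be specialized to these square test matrices; this is our own back-of-the-envelope model, not one fitted by the authors. Assume nonZero(A) = n and n1 = n3 = n4 = n. In fact, the number of principal coordinates is at most n − 1, so this slightly overestimates. Under these assumptions, the cost grows with the fourth power of n. The same arithmetic also converts the reported Parafit and AxParafit run-times for n = 2,048 into a speedup factor:

```latex
T(n, p) \in O\bigl(\mathrm{nonZero}(A)\, n_3\, n_4\, n_1\, p\bigr) = O(n^4 p),
\qquad
\frac{T(2048,\,2)}{T(1024,\,2)} \approx \Bigl(\tfrac{2048}{1024}\Bigr)^{4} = 16

\frac{19.9 \times 24\ \mathrm{h}}{7.7\ \mathrm{h}} = \frac{477.6\ \mathrm{h}}{7.7\ \mathrm{h}} \approx 62.0
```

The second ratio is close to the factor of 61.86 reported in the Results below. The small difference is consistent with rounding of the published run-times.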
To test AxPcoords we used the same compiler switches as
indicated above and a subset of the square association
matrices with nonZero(A) amounting to 512, 1,024,
2,048, and 4,096 respectively.
Results
In Figure 2 we provide the sequential run-time improvement of AxParafit over Parafit. The acceleration obtained by AxParafit increases with growing dataset size and attains a factor of 61.86 on the association matrix of size 2,048. This growing improvement is mainly due to the increasing efficiency, on larger matrices, of both our own optimizations and the cache-blocking strategies used in the BLAS implementations.
Figure 3 provides the memory use of AxParafit and Parafit in MB for quadratic A-matrices of sizes 128, 256, 512, 1,024, 2,048, and 4,096 (note that the dataset of size 4,096 was not run to completion). To test AxPcoords we used distance matrices of sizes 512, 1,024, 2,048, and 4,096. Figure 4 shows the run-time improvement of AxPcoords over DistPCoA for these quadratic distance matrices; the improvements range from 8.8 to 25.74. The run on 4,096 with DistPCoA apparently terminated but did not write a results file, most probably due to numerical instability (Pierre Legendre, personal communication). Tests on smaller distance matrices, e.g., of sizes 128 and 256, were omitted because their execution times were below 10 seconds. On the largest matrix AxPcoords terminated within only 399 seconds, as opposed to 10,268 seconds required by DistPCoA.
Figure 2: Run Time Improvement Sequential AxParafit versus Parafit. Run-time improvement of AxParafit versus Parafit for quadratic association matrices of dimensions 128, 256, 512, 1,024, and 2,048. [Plot: Run Time Improvement against Dimension of Association Matrix.]
Figure 3: Memory Consumption AxParafit versus Parafit. Memory consumption of Parafit and AxParafit for quadratic association matrices of size 128, 256, 512, 1,024, 2,048, and 4,096. [Plot: Memory Consumption in MB against Dimension of Association Matrix; series: Parafit, AxParafit.]
Figure 4: Run Time Improvement Sequential AxPcoords versus DistPCoA. Run-time improvement of AxPcoords versus DistPCoA for quadratic distance matrices of dimensions 512, 1,024, 2,048, and 4,096. [Plot: Run Time Improvement against Size of Quadratic Distance Matrix.]
We assessed the scalability of parallel AxParafit using the association
matrix A of size 512 on 4, 8, 16, 32, 64, and 128
processors with p = 99. Figure 5 provides the speedup with
respect to the number of worker processes. We indicate
speedup values for the parallel part (SpeedupIndividual,
computation of individual host-parasite links) as well as
for the sequential plus the parallel part of the program
(SpeedupWhole), i.e., we added the sequential computa-
tion time for the global test to the parallel execution time.
On 128 processors the computation took only 50 sec-
onds. An analysis of this dataset with the sequential ver-
sion of Parafit would take approximately 20 hours.
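Both speedup curves can be computed from measured run-times, as sketched below. The exact formulas are an assumption inferred from the description above. SpeedupIndividual divides the sequential time of the individual tests by their parallel time. SpeedupWhole adds the sequential global-test time to both the numerator and the denominator. The timings in the usage line are hypothetical, not measurements from this study.

```python
def speedups(t_global, t_individual_seq, t_individual_par):
    """Return (SpeedupIndividual, SpeedupWhole) from run-times in seconds."""
    individual = t_individual_seq / t_individual_par
    whole = (t_global + t_individual_seq) / (t_global + t_individual_par)
    return individual, whole

# Hypothetical timings: 10 s global test, 1,000 s of individual tests run sequentially,
# and 10 s for the same individual tests run in parallel.
print(speedups(10.0, 1000.0, 10.0))  # (100.0, 50.5)
```

Because the global test stays sequential, SpeedupWhole can never exceed (t_global + t_individual_seq) / t_global, which is 101 in this hypothetical case. This follows Amdahl's law, and it is why the whole-program curve flattens before the individual-test curve does.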
A Real-World Example
To illustrate the substantial benefits of performing a large-scale co-phylogenetic analysis with AxParafit, we present a real-world study on smut fungi and their host plants.
Experimental Data
We collected a large sample of associations of smut fungi and their host plants. Smut fungi comprise more than 1,500 species of obligate phytoparasites and are arranged in the taxa Entorrhizomycetes, Microbotryales, and Ustilaginomycotina. These parasites cause syndromes such as the dark, powdery appearance of the mature spore masses and, in some cases, even plant deformation [30,31]. The Ustilaginomycotina also comprise obligate plant parasites with distinct morphology [30].
With a few exceptions, hosts of smut fungi belong to the
Angiosperms [30]. For economically important hosts,
such as barley and other cereals, smut fungi may cause
considerable yield losses (see, e.g., [32]). The phylogeny and taxonomy of genera and higher ranks have been derived from sound molecular and ultrastructural data in recent years (see [30] and references therein). However, apart from the work presented in [14], co-phylogenetic analyses of smut fungi have so far been restricted to single genera with comparatively few species [33,34].
In addition to the host plant index for European smut
fungi [31,35] that has been used in [14], information on
smut fungus-host plant associations was extracted from
the following publications: Bauer et al. [36-38], Begerow
et al. [33,39], De Beer et al. [40], Hendrichs et al. [41],
Nannfeldt [42], Piepenbring [43], Scholz and Scholz [44],
an unpublished manuscript by K. Vanky (Smut fungi of
the Indian subcontinent; Vanky, personal communica-
tion), and Vanky and McKenzie [45]. Moreover, we
included information contained in the "specific host"
entries of the complete collection of core nucleotide
sequences for Entorrhizomycetes, Microbotryales, and Usti-
laginomycotina downloaded from GenBank [46] on Sep-
tember 01, 2007 (12,815 sequences). Parasite taxon
names were corrected using Vanky's synonym-list [35].
Synonyms for host taxon names were obtained from
Palese and Moser [47].
Including synonyms, our data set contained 3,912 differ-
ent fungus-plant associations. In order to retrieve taxon
IDs and to construct taxonomy trees for hosts and para-
sites [14], we used the NCBI taxonomy release of Septem-
ber 01, 2007. For host and parasite species names that
were not found in the NCBI taxonomy, the search was
repeated after reducing the taxon name to the respective
genus. In this way, a total of 2,362 different associations covering 413 smut fungi and 1,400 host plants could be identified. Thus, the assembled dataset was more than three times larger than the one recently analyzed in [14], which contained 645 associations corresponding to 140 smut fungi and 437 host plants. The Parafit analysis of this comparatively small dataset already took more than a week. For both hosts and parasites, two trees were constructed. One had branch lengths corresponding to the "true" taxonomic distance [14] (denoted as W for Weighted). The other had all branch lengths set to 1 (denoted as U for Un-weighted/Uniform). As outlined in the Parallelization section, the computational complexity of AxParafit is O(nonZero(A) · n3 · n4 · n1 · p), and thus the execution time requirements for this larger dataset increase significantly.
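The W and U trees feed into the patristic (tree-path) distance matrices that CopyCat passes on to DistPCoA/AxPcoords. The sketch below computes patristic distances for a small rooted tree stored as a hypothetical child → (parent, branch length) mapping. It then derives the U variant by setting every branch length to 1. The tree format, the example branch lengths standing in for W, and all function names are assumptions for illustration, not CopyCat's actual input format.

```python
def ancestor_depths(node, parent):
    """Map `node` and each of its ancestors to the path length from `node`."""
    depths, total = {node: 0.0}, 0.0
    while node in parent:
        node, length = parent[node]  # step one branch towards the root
        total += length
        depths[node] = total
    return depths

def patristic(a, b, parent):
    """Sum of branch lengths on the path between nodes a and b."""
    depths_a = ancestor_depths(a, parent)
    total = 0.0
    while b not in depths_a:  # climb from b until we hit an ancestor of a
        b, length = parent[b]
        total += length
    return total + depths_a[b]

def uniform(parent):
    """U mode: same topology with every branch length set to 1."""
    return {child: (up, 1.0) for child, (up, _) in parent.items()}

def distance_matrix(taxa, parent):
    return [[patristic(x, y, parent) for y in taxa] for x in taxa]

# Hypothetical tree ((h1:1, h2:2)x:1, h3:4)root; the lengths play the role of W.
parent = {"h1": ("x", 1.0), "h2": ("x", 2.0), "x": ("root", 1.0), "h3": ("root", 4.0)}
w = distance_matrix(["h1", "h2", "h3"], parent)
u = distance_matrix(["h1", "h2", "h3"], uniform(parent))
```

In the U variant, the distance counts only the branches between two taxa. This is why the U and W analyses can emphasize different aspects of the same taxonomy.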
Inference with AxParafit
Production runs with Parafit and AxParafit on an initial version of our dataset were started on August 29, 2007.
Figure 5: Speedup of Parallel AxParafit. Speedup of the parallelized part and speedup of the sequential plus parallel part of AxParafit for a quadratic association matrix of size 512 on 4, 8, 16, 32, 64, and 128 CPUs. [Plot: Speedup against Number of Processors; curves: OptimalSpeedup, SpeedupIndividual, SpeedupWhole.]