Phylogenetic Tree Construction for Y-DNA
Haplogroups
Esra Ruzgar and Kayhan Erciyes
Computer Eng. Dept., Izmir University
Gursel Aksel Bulvari, 14, Uckuyular, 35350, Izmir, Turkey
{esra.ruzgar,kayhan.erciyes}@izmir.edu.tr
http://www.izmir.edu.tr
Abstract. The male Y chromosome is currently used to estimate the paternal ancestry and migratory patterns of humans. Y-chromosomal Short Tandem Repeat (STR) segments provide important data for reconstructing phylogenetic trees. However, STR data is not widely used for phylogeny because appropriate methodology is lacking. We propose a three-step method for analyzing large numbers of STR samples and constructing phylogenetic trees, and we apply it to 145 samples from the Y-DNA Haplogroup G. Since we use distance-matrix-based phylogeny, we first compute the genetic distance between each pair of samples. The samples are then partitioned into a number of clusters. Finally, we construct a phylogenetic tree for each cluster using the Neighbor-Joining (NJ) method. We also propose a new partitioning-based clustering algorithm and compare several partitioning-based (e.g., FCM) and density-based (e.g., FN-DBSCAN) clustering algorithms with it. We use Multi-Dimensional Scaling (MDS) to visualize the samples and compare the results of MDS with the results of the NJ method.
Key words: Y-DNA Haplogroups, clustering, phylogenetic tree
1 Introduction
Genetic genealogy is becoming very popular because of the availability of reasonably priced Y-chromosome STR testing. The results of such a test give information about the genealogical relationships between two or more males. The set of STR values obtained for the Y-chromosome markers is called a haplotype. It is also common to determine the Y-chromosome haplogroup, a group or family that shares a common ancestor. Phylogenetic trees are used to visualize the evolutionary relationships between biological species; we can also use them to visualize genealogical relationships between males. Many methods have been presented for constructing phylogenetic trees and networks from amino acid sequences, nucleotide sequences, or gene frequencies; however, STR data is not widely used for phylogeny.
Phylogenetic tree construction methods can be classified into two groups: matrix methods and sequence methods. The former group, which uses a distance matrix of a genetic measure, includes the unweighted pair-group method using arithmetic averages (UPGMA) [1], the minimum evolution method of Cavalli-Sforza and Edwards [2], the distance Wagner method of Farris [3], the neighbor-joining (NJ) method of Saitou and Nei [4], the median-joining (MJ) networks method of Bandelt et al. [5], and others. The latter group, which utilizes amino acid or nucleotide sequences directly, contains the maximum parsimony method of Eck and Dayhoff [6], the maximum likelihood method of Felsenstein [7], Tateno's method [8], and the Reduced Median (RM) networks method of Bandelt et al. [9]. In this study, we propose a three-step method to construct phylogenetic trees for samples from Y-DNA Haplogroup G. First, we calculate pairwise distances between all samples and generate a distance matrix. Then we divide the samples into several clusters and finally construct a tree for each cluster using the neighbor-joining (NJ) method. We also compare several clustering algorithms to find the one that fits the biological data best. The rest of the paper is organized as follows. Section 2 gives brief background information about the human DNA structure and the Y chromosome. In Section 3, all steps and details of the phylogenetic tree construction are explained. Section 4 concludes with a summary and discussion.
2 Background
2.1 Human DNA Structure
Deoxyribonucleic acid (DNA) is the genetic material that we inherit from our parents. The total collection of DNA for a single person or organism is referred to as its genome. DNA is a long string of nucleotide units attached to one another. A single nucleotide consists of three components: a sugar molecule, a phosphate group, and a nitrogenous base. The nitrogenous bases are what make DNA variable. There are four different types of bases: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T). In a single nucleotide, the sugar is attached at one end to a phosphate group. Because the sugar of that nucleotide can attach to another phosphate at its other end, many nucleotides can be strung together in a long chain.
DNA has two sides, or strands, and these strands are twisted together into the ladder-like shape known as the double helix. The nitrogenous bases point inward on the ladder and form pairs with bases on the other strand. Each base pair is formed from two complementary nucleotides (a purine with a pyrimidine) bound together by hydrogen bonds. The base pairs in DNA are Adenine with Thymine and Cytosine with Guanine.
2.2 Chromosomes
A chromosome is an organized structure of DNA located in cells. The human genome is composed of 23 kinds of chromosomes. Every child receives two sets of 23 chromosomes, one from the mother and one from the father. As a result, every individual has 23 pairs of chromosomes, for a total of 46. One pair, the sex chromosomes, is responsible for determining sex; the remaining 22 pairs are called autosomes. In the male genome, the sex chromosome pair is composed of one X chromosome and one Y chromosome; in the female genome, it is composed of two X chromosomes. The genome of a male is shown in Fig. 1.
Fig. 1. The Human Genome (Male)
2.3 Y Chromosome Haplogroups
The Y chromosome is found only in the male genome and is passed from father to son nearly unchanged; only small changes occur over the generations. Y-DNA is therefore used in genetic genealogy, which tries to find genetic relationships between individuals. Genealogical Y-DNA testing involves looking at STR segments of DNA on the Y chromosome. STRs are sequences of repeating nucleotides. The number of repetitions of a sequence differs from one person to another; a particular number of repetitions is known as an allele of the marker. The variations in STR segments are caused by mutations that increase or decrease the number of repeats. An STR on the Y chromosome is designated by a DYS number (DNA Y-chromosome Segment number). For example, one allele of the DYS393 marker is 12, also called the marker's value. The value 12 means the DYS393 sequence of nucleotides, AGAT, is repeated 12 times. The combination of DNA marker values for an individual is his haplotype. A haplogroup is a group of similar haplotypes that share a common ancestor with a Single Nucleotide Polymorphism (SNP) mutation. An SNP is a DNA sequence variation occurring when a single nucleotide in the genome differs between members of a species. For instance, the two DNA sequences AAGCCTA and AAGCTTA differ in a single nucleotide. Haplogroups are used to define the genetic populations of the world. Fig. 2 shows all major haplogroups.
Fig. 2. Y-DNA Haplogroup Tree
3 Phylogenetic Tree Construction
We propose three steps for constructing phylogenetic trees for Y-DNA haplogroups. The first step consists of selecting N samples and generating a distance matrix of size N × N. The distance matrix gives the genetic distance in Y-DNA between each pair of individuals. Using the distance matrix, we then divide the samples into clusters. Finally, we construct a phylogenetic tree for each cluster.
3.1 Calculating Distances
The first step of phylogenetic tree construction is selecting samples and generating their distance matrix. We selected 145 samples belonging to haplogroup G from the Y-DNA Haplogroup G Project [10]. Our samples are members of the G2a3a and G2a3b subgroups. Genetic distances are calculated using 12 DYS markers for each sample. We keep records that contain the surname, origin, and 12 markers of each sample. Two sample records, restricted to six DYS markers, are shown in Table 1.
Table 1. DYS markers of two sample records

Surname   Origin   DYS393   DYS390   DYS394   DYS391   DYS385a   DYS385b
Sample1   Italy    14       21       15       10       13        16
Sample2   Turkey   14       22       16       10       13        16
The genetic distance between two samples is calculated from the differences between corresponding DYS markers. However, each DYS marker has a different mutation rate, so mutation rates should be taken into account when calculating distances. The mutation rates are estimated in [11]. Let d(j, k) denote the weighted distance between samples j and k, and let w_i denote the mutation rate of marker i. Then d(j, k) is a mutation-rate-weighted sum of the per-marker differences.
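One minimal form of such a distance, shown here as a sketch in which each marker's absolute repeat difference is weighted by the inverse of its mutation rate (so that a difference at a fast-mutating marker counts for less; the exact weighting scheme is an assumption of this sketch):

\[
d(j,k) = \sum_{i=1}^{12} \frac{\lvert a_{j,i} - a_{k,i} \rvert}{w_i},
\]

where \(a_{j,i}\) denotes the repeat count (allele value) of marker i in sample j.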
We calculate the distance between every pair of samples and keep the results in a matrix, so for N samples we obtain a matrix of size N × N.
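As an illustration, the Java sketch below computes such a matrix; the marker values and mutation rates are hypothetical, and the inverse-rate weighting follows the sketch above rather than an exact formula.

/** Sketch: builds a symmetric N-by-N matrix of weighted Y-STR distances. */
public final class DistanceMatrix {

    /** Weighted distance between two haplotypes with equal marker counts. */
    static double distance(int[] a, int[] b, double[] rate) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            // Scale each repeat difference by the inverse mutation rate,
            // so a difference at a fast-mutating marker contributes less.
            d += Math.abs(a[i] - b[i]) / rate[i];
        }
        return d;
    }

    /** Pairwise distances for all samples; symmetric with a zero diagonal. */
    static double[][] build(int[][] samples, double[] rate) {
        int n = samples.length;
        double[][] m = new double[n][n];
        for (int j = 0; j < n; j++)
            for (int k = j + 1; k < n; k++)
                m[j][k] = m[k][j] = distance(samples[j], samples[k], rate);
        return m;
    }

    public static void main(String[] args) {
        // Two hypothetical 6-marker haplotypes (cf. Table 1) and illustrative rates.
        int[][] samples = { { 14, 21, 15, 10, 13, 16 }, { 14, 22, 16, 10, 13, 16 } };
        double[] rate = { 0.0008, 0.0022, 0.0015, 0.0027, 0.0023, 0.0023 };
        System.out.println(build(samples, rate)[0][1]);
    }
}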
3.2 Dividing Samples into Clusters
The second step is dividing the samples into clusters. Clustering is the process of partitioning a set of data into a set of meaningful subclasses, called clusters. Clustering helps us to understand the genetic relationships between samples more easily. In the clustering process, intra-cluster distances should be minimized and inter-cluster distances should be maximized.
Clustering methods can be divided into different groups, such as hierarchical, partitioning-based, and density/neighborhood-based methods [12]. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters. A partitioning algorithm typically starts with an initial partition of D and then uses an iterative control strategy to optimize an objective function. Consequently, partitioning algorithms use a two-step procedure: first, determine k representatives minimizing the objective function; second, assign each object to the cluster whose representative is closest to it. The second step implies that a partition is equivalent to a Voronoi diagram and each cluster is contained in one of the Voronoi cells [13]. Some examples of this method are k-means, k-medoids [14, 15], and fuzzy c-means (FCM) [16, 17].
Hierarchical algorithms create a hierarchical decomposition of D. The decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D. The dendrogram can either be created from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step. In contrast to partitioning algorithms, hierarchical algorithms do not need k as an input. However, a termination condition has to be defined, indicating when the merge or division process should be terminated [13].
In density-based algorithms, clusters are regarded as regions of the data space in which the objects are dense, separated by regions of low object density (noise). The general idea of density-based clustering approaches is to search for regions of high density in the data space. These regions may have an arbitrary shape, and the points inside a region may be arbitrarily distributed. Some examples of this approach are DBSCAN, GDBSCAN, OPTICS [13, 18, 19], and FN-DBSCAN [20].
3.3 Our Clustering Algorithm
Our clustering algorithm is a partitioning-based method. It selects k samples as cluster centers and assigns each sample to the nearest of these centers. Center selection is done using a modified farthest-first traversal algorithm. In the farthest-first traversal algorithm, the initial cluster center is chosen randomly; instead, we choose the most distant pair of samples as the two initial centers. We then repeatedly pick the next center to be as far as possible from the centers chosen so far. Since we do not know the optimal k in advance, we need a termination criterion for selecting centers. If the maximum distance in the distance matrix is dm, we keep picking centers while the farthest distance is greater than a × dm, where a is a real number between 0 and 1. After the k centers are chosen, each sample assigns itself to the nearest center. The clustering algorithm is given below, followed by a short implementation sketch.
Procedure Center_Selection(X, a, N)
Input: X: distance matrix of samples
       a: a real number between 0 and 1
       N: number of samples
Output: C: cluster centers
        k: number of clusters
1. Pick the most distant pair of samples p1, p2 ∈ X; C = {p1, p2}; k = 2
2. Let p be the sample in X that is farthest from the centers in C
3. while d(p, C) > a × dm
       add p to C; k = k + 1
       let p be the sample in X that is farthest from the centers in C
4. Execute Cluster_Formation(X, C, k, N)

Procedure Cluster_Formation(X, C, k, N)
Input: X: distance matrix of samples
       C: cluster centers
       k: number of clusters
       N: number of samples
Output: clusters: samples divided into k partitions
1. unclassifiedSamples = N
2. Create k empty clusters
3. Assign each center ci ∈ C to its own cluster, clusteri ∈ clusters, decrementing unclassifiedSamples accordingly
4. while unclassifiedSamples > 0
       find the nearest center cn ∈ C for the next sample pi ∈ X
       assign pi to clustern; unclassifiedSamples = unclassifiedSamples − 1
5. Return clusters
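A compact Java sketch of this procedure is given below, operating on the distance matrix of Section 3.1; the class name and list-based bookkeeping are illustrative choices, not the exact implementation.

import java.util.ArrayList;
import java.util.List;

/** Farthest-first center selection followed by nearest-center assignment. */
public final class FarthestFirstClustering {

    /** Returns, for each sample, the index of its cluster. */
    static int[] cluster(double[][] d, double a) {
        int n = d.length;
        List<Integer> centers = new ArrayList<>();

        // Initial centers: the most distant pair of samples; dm is that distance.
        double dm = 0.0;
        int c1 = 0, c2 = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (d[i][j] > dm) { dm = d[i][j]; c1 = i; c2 = j; }
        centers.add(c1);
        centers.add(c2);

        // Keep adding the sample farthest from all chosen centers
        // while that distance exceeds the threshold a * dm.
        while (true) {
            int far = -1;
            double farDist = -1.0;
            for (int i = 0; i < n; i++) {
                double toCenters = Double.MAX_VALUE;
                for (int c : centers) toCenters = Math.min(toCenters, d[i][c]);
                if (toCenters > farDist) { farDist = toCenters; far = i; }
            }
            if (farDist <= a * dm) break;
            centers.add(far);
        }

        // Cluster formation: assign each sample to its nearest center.
        int[] label = new int[n];
        for (int i = 0; i < n; i++) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < centers.size(); c++)
                if (d[i][centers.get(c)] < best) { best = d[i][centers.get(c)]; label[i] = c; }
        }
        return label;
    }
}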
3.4 Visualizing Samples and Clustering Results
Visualizing the samples in a low-dimensional space is required to compare clustering results, but we only have the symmetric distance matrix of the samples. We therefore used Multi-Dimensional Scaling (MDS) to find positions for the samples in 2-dimensional space. MDS is a statistical technique for visualizing dissimilarity in data. In MDS, objects are represented as points in a usually 2-dimensional space such that the distances between the points match the observed dissimilarities as closely as possible [21]. The degree of correspondence between the distances among the points produced by MDS and the input matrix is measured by a stress function. We used the MDSJ Java package [22] to compute the positions. The positions of the 145 samples are shown in Fig. 3.
Fig. 3. Visualization of 145 samples in 2-dimensional space by using MDS.
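For reference, a minimal MDSJ call for this step looks roughly as follows; the 3 × 3 matrix is a toy stand-in for the 145 × 145 distance matrix, and classicalScaling is the package's basic MDS routine.

import mdsj.MDSJ;

public final class MdsDemo {
    public static void main(String[] args) {
        // Toy symmetric dissimilarity matrix standing in for the real one.
        double[][] dist = {
            { 0.0, 1.0, 4.0 },
            { 1.0, 0.0, 3.0 },
            { 4.0, 3.0, 0.0 }
        };
        double[][] xy = MDSJ.classicalScaling(dist); // returns 2 x N coordinates
        for (int i = 0; i < dist.length; i++)
            System.out.printf("sample %d: (%.3f, %.3f)%n", i, xy[0][i], xy[1][i]);
    }
}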
In order to find the clustering algorithm that gives the most effective clusters for our biological data, we executed several algorithms on the 145 samples: two partitioning-based algorithms, our algorithm and FCM, and one density-based algorithm, FN-DBSCAN. Their results are shown in Fig. 4, Fig. 5, and Fig. 6, respectively; Fig. 4 shows the results of our clustering algorithm. Fuzzy c-Means (FCM) and FN-DBSCAN are fuzzy clustering algorithms. The main difference between traditional hard clustering and fuzzy clustering can be stated as follows: while in hard clustering an entity belongs to exactly one cluster, in fuzzy clustering entities are allowed to belong to many clusters with different degrees of membership [23]. The results of the FCM algorithm are shown in Fig. 5. The FCM algorithm needs the number of clusters k as an input, so we determined the optimal value of k to be 3 with the Partition Coefficient method. We used the k-nearest-neighbors method for forming the initial clusters.
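In its standard formulation, the Partition Coefficient scores a fuzzy k-partition with membership degrees \(u_{ij}\) by

\[
V_{PC} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} u_{ij}^{2},
\]

and k is chosen where this index is largest; values close to 1 indicate a nearly crisp, well-separated partition.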
FN-DBSCAN is based on the DBSCAN algorithm, but it uses a fuzzy neighborhood relation. It has been observed that FN-DBSCAN is more robust than DBSCAN on data sets with various shapes and densities [20]. The results of the FN-DBSCAN algorithm, using an exponential membership function with parameters ε1 = 0.91 and ε2 = 0.1, are shown in Fig. 6.
It is easily seen that the FN-DBSCAN algorithm performs much better than our partitioning-based algorithm. It also performs better than FCM, even though both are fuzzy methods.
Fig. 4. Results of our clustering algorithm with parameter a = 0.75
Fig. 5. Results of the FCM algorithm with parameter k = 3
Fig. 6. Results of the FN-DBSCAN algorithm using an exponential membership function with parameters ε1 = 0.91 and ε2 = 0.1
3.5 Constructing Phylogenetic Tree with Neighbor-Joining
Algorithm
The Neighbor-Joining (NJ) algorithm [4], presented by Saitou and Nei, is a widely used method for constructing phylogenetic trees from a symmetric distance matrix. The method is based on the minimum evolution principle and provides trees with a near-minimal sum of branch-length estimates [24]. An alternative formulation of the NJ method [25], which reduces the computational complexity from O(n^5) to O(n^3), was given by Studier and Keppler.
The NJ method proceeds in a heuristic manner and guarantees that a short tree is found, but not necessarily the shortest. At each stage of clustering, NJ considers the data to be star-like, as shown in Fig. 7. It then extracts the closest pair, here nodes 1 and 2, that minimizes the length of the tree, as shown in Fig. 8. The closest pair is clustered into a new internal node, and the distances of this node to the rest of the nodes in the tree are computed and used in later iterations. The algorithm terminates when N − 2 internal nodes have been inserted into the tree. In Fig. 9, a final phylogenetic tree of 8 samples is shown; the internal nodes are labeled from A to F, and the branch-length estimates are also shown.
The neighbor-joining method and the formulation of Studier and Keppler differ in how they combine the elements of the selected pair. It is proven in [24] that both methods always obtain the same tree shape, and simple considerations show that they also provide identical branch lengths.
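Concretely, in the Studier-Keppler formulation, at a stage with n active nodes the pair (i, j) to be joined is the one minimizing

\[
Q(i,j) = (n-2)\, d(i,j) - \sum_{m=1}^{n} d(i,m) - \sum_{m=1}^{n} d(j,m),
\]

and after i and j are merged into a new node u, the distance from u to every remaining node m is updated as

\[
d(u,m) = \frac{1}{2}\bigl( d(i,m) + d(j,m) - d(i,j) \bigr).
\]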
3.6 Results of Phylogenetic Tree Construction
At the final step of our method, we construct a phylogenetic tree for each cluster of samples using the NJ algorithm. We implemented the NJ method in Java and used the formulation of Studier and Keppler because of its reduced time complexity.
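A sketch of the pair-selection and distance-update steps in this formulation is shown below; the method names and dense-array layout are illustrative, not the exact implementation.

/** Sketch of one Studier-Keppler NJ iteration over a dense distance matrix. */
final class NjStep {

    /** Returns the pair (i, j) minimizing Q(i, j) among the first n active nodes. */
    static int[] selectPair(double[][] d, int n) {
        double[] r = new double[n]; // row sums of the distance matrix
        for (int i = 0; i < n; i++)
            for (int m = 0; m < n; m++) r[i] += d[i][m];
        int bi = 0, bj = 1;
        double bestQ = Double.MAX_VALUE;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double q = (n - 2) * d[i][j] - r[i] - r[j];
                if (q < bestQ) { bestQ = q; bi = i; bj = j; }
            }
        return new int[] { bi, bj };
    }

    /** Distance from the new internal node u = (i, j) to a remaining node m. */
    static double updatedDistance(double[][] d, int i, int j, int m) {
        return 0.5 * (d[i][m] + d[j][m] - d[i][j]);
    }
}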
Fig. 7. A star-like tree with no hierarchical structure
Fig. 8. A tree in which 1 and 2 are clustered
Fig. 9. An unrooted tree of 8 samples, 1-8. A-F are internal nodes
FN-DBSCAN is the most robust clustering algorithm for our data. It forms three clusters: cluster 1 of size 20, cluster 2 of size 34, and cluster 3 of size 90.
The phylogenetic trees show the genetic similarity of our samples from Haplogroup G. Each sample is labeled with a number starting from 0. The phylogenetic trees of the three clusters are shown in Fig. 10, Fig. 12, and Fig. 13.
Fig. 10. Phylogenetic tree obtained by NJ method for cluster 1
The NJ method uses a symmetric distance matrix to construct unrooted trees. MDS also uses a symmetric distance matrix to visualize the samples in a low-dimensional space, so the two methods should produce similar results. The MDS representation of cluster 1 is shown in Fig. 11.
4 Conclusion
In this paper, we proposed a three-step method for constructing phylogenetic trees for samples of Y-DNA haplogroups. We considered the mutation rates of the STR markers in calculating the genetic distances between samples. We divided the samples into clusters so that large amounts of data can be handled easily.
Fig. 11. Representation of cluster 1 by MDS method
Fig. 12. Phylogenetic tree obtained by NJ method for cluster 2
Fig. 13. Phylogenetic tree obtained by NJ method for cluster 3
We also proposed a new partitioning-based clustering algorithm. Several partitioning-based and density-based clustering algorithms were executed on the samples; the results show that density-based clustering gives more robust clusters. Finally, we constructed a phylogenetic tree for each cluster using the NJ method and compared the results with the MDS representations.
References
1. Sokal, R.R., Michener, C.D.: A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull., (1958), vol. 38, pp. 1409-1438.
2. Cavalli-Sforza, L.L., Edwards, A.W.F.: Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet., (1967), vol. 19, pp. 233-257.
3. Farris, J.S.: Estimating phylogenetic trees from distance matrices. Am. Nat., (1972), vol. 106, pp. 645-668.
4. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., (1987), vol. 4, pp. 406-425.
5. Bandelt, H.-J., Forster, P., Rohl, A.: Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol., (1999), vol. 16, pp. 37-48.
6. Eck, R.V., Dayhoff, M.O.: Algorithm for constructing ancestral sequences and a phylogenetic tree. In: Eck, R.V., Dayhoff, M.O. (eds.) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, MD, (1966), pp. 164-169.
7. Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., (1981), vol. 17, pp. 368-376.
8. Tateno, Y.: A method for molecular phylogeny construction by direct use of nucleotide sequence data. J. Mol. Evol., (1990), vol. 30, pp. 85-93.
9. Bandelt, H.-J., Forster, P., Sykes, B.C., Richards, M.B.: Mitochondrial portraits of human populations using median networks. Genetics, (1995), vol. 141, pp. 743-753.
10. Y-DNA Haplogroup G Project, http://www.members.cox.net/morebanks/Diagram.html
11. Chandler, J.F.: Estimating per-locus mutation rates. Journal of Genetic Genealogy, (2006).
12. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, (2001), pp. 335-391.
13. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. Second Internat. Conf. on Knowledge Discovery and Data Mining, (1996), pp. 226-231.
14. MacQueen, J.: Some methods for classification and analysis of multivariate observations. 5th Berkeley Symp. Math. Statist. Prob., (1967), vol. 1, pp. 281-297.
15. Vinod, H.: Integer programming and the theory of grouping. J. Amer. Statist. Assoc., (1969), vol. 64, pp. 506-517.
16. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, (1973), vol. 3, pp. 32-57.
17. Bezdek, J.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, (1981).
18. Sander, J., Ester, M., Kriegel, H.-P., Xu, X.: Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, (1998), vol. 2, pp. 169-194.
19. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: ordering points to identify the clustering structure. Proc. ACM SIGMOD Internat. Conf. on Management of Data, Philadelphia, PA, (1999), pp. 49-60.
20. Nasibov, E.N., Ulutagay, G.: Robustness of density-based clustering methods with various neighborhood relations. Fuzzy Sets and Systems, (2009), vol. 160, pp. 3601-3615.
21. Groenen, P.J.F., van de Velden, M.: Multidimensional Scaling. Econometric Institute Report EI 2004-15, (2004).
22. Algorithmics Group: MDSJ: Java Library for Multidimensional Scaling (Version 0.2). University of Konstanz, (2009). Available at http://www.inf.uni-konstanz.de/algo/software/mdsj/
23. Nascimento, S., Mirkin, B., Moura-Pires, F.: A fuzzy clustering model of data and fuzzy c-means. Proc. of the Ninth IEEE International Conference on Fuzzy Systems, (2000), vol. 1, pp. 302-307.
24. Gascuel, O.: A note on Sattath and Tversky's, Saitou and Nei's, and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol. Biol. Evol., (1994), vol. 11(6), pp. 961-963.
25. Studier, J.A., Keppler, K.J.: A note on the neighbor-joining method of Saitou and Nei. Mol. Biol. Evol., (1988), vol. 5, pp. 729-731.