Fig. 5. Results of the FCM algorithm with k = 3.

Source publication
The male Y-chromosome is currently used to estimate the paternal ancestry and migratory patterns of humans. Y-chromosomal Short Tandem Repeat (STR) segments provide important data for reconstructing phylogenetic trees. However, STR data is not widely used for phylogeny because appropriate methodology is lacking. We propose a three-step metho...

Contexts in source publication

Context 1
... of samples in multidimensional space is required to compare clustering results. We only have a symmetric distance matrix of the samples, so we used Multi-Dimensional Scaling (MDS) to find positions of the samples in 2-dimensional space. MDS is a statistical technique for visualizing dissimilarity in data. In MDS, objects are represented as points in a (usually 2-dimensional) space such that the distances between the points match the observed dissimilarities as closely as possible [21]. The degree of correspondence between the distances among points implied by MDS and the input matrix is measured by a stress function. We used the MDSJ Java package [22] to compute these positions (a usage sketch is given after this excerpt). The positions of the 145 samples are shown in Fig. 3.

In order to find the clustering algorithm that gives the most effective clusters for our biological data, we executed several algorithms on the 145 samples: two partitioning-based algorithms (our algorithm and FCM) and one density-based algorithm (FN-DBSCAN). Their results are shown in Fig. 4, Fig. 5, and Fig. 6, respectively. The results of our clustering algorithm are shown in Fig. 4. Fuzzy c-Means (FCM) and FN-DBSCAN are fuzzy clustering algorithms. The main difference between traditional hard clustering and fuzzy clustering can be stated as follows: while in hard clustering an entity belongs to only one cluster, in fuzzy clustering entities are allowed to belong to many clusters with different degrees of membership [23]. The results of the FCM algorithm are shown in Fig. 5. The FCM algorithm needs the number of clusters k as an input; we found the optimal value k = 3 with the Partition Coefficient method (a sketch of this computation also follows the excerpt). We used the k-nearest neighbors method to form the initial clusters. FN-DBSCAN is based on the DBSCAN algorithm, but it uses a fuzzy neighborhood relation. It has been observed that the FN-DBSCAN algorithm is more robust than the DBSCAN algorithm on data sets with various shapes and densities [20]. The results of the FN-DBSCAN algorithm with an exponential membership function and parameters ε1 = 0.91 and ε2 = 0.1 are shown in Fig. 6. It is easily seen that the FN-DBSCAN algorithm performs much better than our partitioning-based algorithm. It also performs better than FCM, even though both are fuzzy clustering algorithms.

The Neighbor-Joining (NJ) algorithm [4], presented by Saitou and Nei, is a widely used method for constructing phylogenetic trees from a symmetric distance matrix. The method is based on the minimum evolution principle and provides trees with a near-minimal sum of branch-length estimates [24]. An alternative formulation of the NJ method [25], which reduces the computational complexity from O(n^5) to O(n^3), was given by Studier and Keppler. The NJ method proceeds in a heuristic manner and guarantees that a short tree is found, but not necessarily the shortest. At each stage of clustering, NJ assumes that the data are star-like, as shown in Fig. 7. It then extracts the closest pair (1, 2), which minimizes the length of the tree, as shown in Fig. 8. The closest pair is clustered into a new internal node, and the distances of this node to the rest of the nodes in the tree are computed and used in later iterations. The algorithm terminates when N − 2 internal nodes have been inserted into the tree. A final phylogenetic tree of 8 samples is shown in Fig. 9; the internal nodes are labeled from A to F, and the branch-length estimates of the tree are also shown. The neighbor-joining method and the formulation of Studier and Keppler differ in how they combine the elements of the selected pair.
It is proven in [24] that both methods always obtain the same tree shape, and simple considerations show that both algorithms also provide identical branch lengths. At the final step of our method, we construct a phylogenetic tree for each cluster of samples by using the NJ algorithm. We implemented the NJ method in Java and used the formulation of Studier and Keppler because of its reduced time complexity (a sketch of this formulation follows the excerpt). FN-DBSCAN is the most robust clustering algorithm for our data. It forms three clusters: cluster 1 with 20 samples, cluster 2 with 34 samples, and cluster 3 with 90 samples. The phylogenetic trees show the genetic similarity of our samples from Haplogroup G. Each sample is labeled with a number starting from 0. The phylogenetic trees of the three clusters are shown in Fig. 10, Fig. 12, and Fig. 13. The NJ method uses a symmetric distance matrix to construct unrooted trees, and MDS also uses a symmetric distance matrix to visualize samples in multidimensional space, so the two methods should produce similar results. The MDS representation of cluster 1 is shown in Fig. 11.

In this paper, we proposed a three-step method for constructing phylogenetic trees for samples of Y-DNA haplogroups. We considered the mutation rates of STR markers when calculating genetic distances between samples. We divided the samples into clusters so that large amounts of data can be handled easily. We also proposed a new partitioning-based clustering algorithm. Several partitioning-based and density-based clustering algorithms were executed on the samples, and the results show that the density-based algorithm gives more robust clusters. We finally constructed phylogenetic trees for each cluster by using the NJ method and compared the results with MDS ...
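The MDS step described in the excerpt can be reproduced in a few lines of Java. The sketch below is illustrative only: it assumes MDSJ's classicalScaling entry point and uses a tiny made-up distance matrix in place of the paper's 145 x 145 STR-based distance matrix.

```java
import mdsj.MDSJ;

public class MdsLayout {

    public static void main(String[] args) {
        // Toy symmetric distance matrix for 4 samples (illustrative values only;
        // the paper uses a 145 x 145 matrix of STR-based genetic distances).
        double[][] dist = {
            {0.0, 1.0, 2.0, 3.0},
            {1.0, 0.0, 1.5, 2.5},
            {2.0, 1.5, 0.0, 1.0},
            {3.0, 2.5, 1.0, 0.0}
        };

        // Classical MDS. The result is assumed to be a 2 x n array with
        // row 0 = x coordinates and row 1 = y coordinates; consult the MDSJ
        // documentation for the exact layout and for stress-based variants.
        double[][] coords = MDSJ.classicalScaling(dist);

        for (int i = 0; i < dist.length; i++) {
            System.out.printf("sample %d -> (%.3f, %.3f)%n", i, coords[0][i], coords[1][i]);
        }
    }
}
```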
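The Partition Coefficient used to select k = 3 is computed directly from FCM's membership matrix: PC(U) = (1/n) Σ_i Σ_j u_ij^2, and the k giving the largest value is preferred. The helper below is a minimal, self-contained sketch of that computation (our own illustration, not the paper's code); the membership values are made up.

```java
public class PartitionCoefficient {

    /**
     * Partition Coefficient of a fuzzy partition:
     * PC(U) = (1/n) * sum over samples i and clusters j of u[i][j]^2.
     * Values closer to 1 indicate a crisper (better separated) partition.
     *
     * @param u membership matrix, u[i][j] = degree of sample i in cluster j
     */
    public static double pc(double[][] u) {
        int n = u.length;
        double sum = 0.0;
        for (double[] row : u) {
            for (double m : row) {
                sum += m * m;
            }
        }
        return sum / n;
    }

    public static void main(String[] args) {
        // Illustrative membership matrix for 3 samples and 2 clusters.
        double[][] u = {
            {0.9, 0.1},
            {0.2, 0.8},
            {0.6, 0.4}
        };
        System.out.println("PC = " + pc(u));
        // In the paper's setting one would run FCM for several candidate k,
        // compute PC for each resulting membership matrix, and keep the k
        // with the highest PC (k = 3 for the 145 samples).
    }
}
```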
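The Studier-Keppler formulation referred to above selects, at each step, the pair (i, j) minimizing Q(i, j) = (n - 2)·d(i, j) - r_i - r_j with r_i = Σ_k d(i, k), assigns branch lengths d(i, u) = d(i, j)/2 + (r_i - r_j)/(2(n - 2)) and d(j, u) = d(i, j) - d(i, u), and replaces the pair by a new internal node u with d(u, k) = (d(i, k) + d(j, k) - d(i, j))/2. The Java sketch below illustrates these updates on a toy 4-taxon matrix; it is not the authors' implementation and prints the joins instead of building an explicit tree structure.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of neighbor joining in the Studier-Keppler O(n^3) formulation. */
public class NeighborJoiningSketch {

    public static void run(double[][] d, String[] names) {
        List<String> labels = new ArrayList<>(List.of(names));
        // Work on a mutable copy of the distance matrix.
        List<List<Double>> dist = new ArrayList<>();
        for (double[] row : d) {
            List<Double> rowCopy = new ArrayList<>();
            for (double v : row) rowCopy.add(v);
            dist.add(rowCopy);
        }

        int internal = 0;
        while (labels.size() > 2) {
            int n = labels.size();

            // r[i] = sum of distances from node i to all other nodes.
            double[] r = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) r[i] += dist.get(i).get(j);

            // Q(i,j) = (n - 2) * d(i,j) - r[i] - r[j]; pick the minimizing pair.
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++) {
                    double q = (n - 2) * dist.get(i).get(j) - r[i] - r[j];
                    if (q < best) { best = q; bi = i; bj = j; }
                }

            // Branch lengths from the joined pair to the new internal node u.
            double dij = dist.get(bi).get(bj);
            double liu = 0.5 * dij + (r[bi] - r[bj]) / (2.0 * (n - 2));
            double lju = dij - liu;
            String u = "U" + (++internal);
            System.out.printf("join %s (%.3f) and %s (%.3f) -> %s%n",
                    labels.get(bi), liu, labels.get(bj), lju, u);

            // Distances from u to every remaining node k:
            // d(u,k) = (d(i,k) + d(j,k) - d(i,j)) / 2.
            List<Double> newRow = new ArrayList<>();
            for (int k = 0; k < n; k++) {
                if (k == bi || k == bj) continue;
                newRow.add(0.5 * (dist.get(bi).get(k) + dist.get(bj).get(k) - dij));
            }

            // Remove rows/columns of i and j (larger index first), then append u.
            int hi = Math.max(bi, bj), lo = Math.min(bi, bj);
            for (int idx : new int[]{hi, lo}) {
                labels.remove(idx);
                dist.remove(idx);
                for (List<Double> row : dist) row.remove(idx);
            }
            labels.add(u);
            for (int k = 0; k < dist.size(); k++) dist.get(k).add(newRow.get(k));
            newRow.add(0.0);
            dist.add(newRow);
        }

        // Final edge between the last two remaining nodes.
        System.out.printf("final edge %s - %s, length %.3f%n",
                labels.get(0), labels.get(1), dist.get(0).get(1));
    }

    public static void main(String[] args) {
        // Tiny illustrative additive distance matrix for 4 taxa.
        double[][] d = {
            {0, 5, 9, 9},
            {5, 0, 10, 10},
            {9, 10, 0, 8},
            {9, 10, 8, 0}
        };
        run(d, new String[]{"A", "B", "C", "D"});
    }
}
```

Each iteration scans the full Q matrix, which is what gives the O(n^3) total cost noted in the excerpt.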

Citations

We provide a detailed review of basic algorithm techniques as applied to bioinformatic problems. Dynamic programming and graph algorithms are of particular concern due to their wide range of applications in bioinformatics. Some bioinformatic problems do not have solutions in polynomial time and are called NP-Complete. For these problems, approximation algorithms may be used. We show several examples where approximation algorithms may be used to provide sub-optimal solutions to these problems. We finally provide sample results of our ongoing work on building phylogenetic trees for Y-haplogroup data.