Evaluating distance functions for clustering tandem repeats

Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA.
Genome informatics. International Conference on Genome Informatics 02/2005; 16(1):3-12.
Source: PubMed


Tandem repeats are an important class of DNA repeats, and much research has focused on their efficient identification, their use in DNA typing and fingerprinting, and their causative role in trinucleotide repeat diseases such as Huntington's disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups or families based on sequence similarity so that their biological importance may be further explored. Clustering tandem repeats requires a notion of pairwise distance, which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments: Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. It is important to analyze and compare these functions because the choice of distance function forms the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions using two cluster validation techniques: Average Cluster Density and Average Silhouette Width. Finally, we propose a multi-phase clustering method which produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10, and X, and C. elegans Chromosome III.
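
As a rough illustration of two of the distance functions and one of the validation measures named in the abstract, the sketch below uses the textbook definitions of Euclidean distance, Jensen-Shannon divergence, and average silhouette width. It assumes each input is a nucleotide frequency vector over (A, C, G, T) and that pairwise distances are already available as a matrix. The function names are illustrative, and the paper's own formulations (including Consensus, Entropy-Surface, Entropy-weighted, and Average Cluster Density) are not given in the abstract and are not reproduced here.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def average_silhouette_width(dist, labels):
    """Mean silhouette value over all items.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label for each item; at least two clusters are required
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton's silhouette is conventionally 0
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

Under these definitions, two columns with no shared nucleotide have a Jensen-Shannon divergence of 1 bit, while identical columns have a divergence of 0. An average silhouette width near 1 indicates compact, well-separated clusters.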

  • Source
    • "Thus while the Euclidean kernel might be appropriate for image intensities, it might not be appropriate for all feature spaces (e.g. time series spectra or gene expression vectors) [43]"
    ABSTRACT: Computer-aided prognosis (CAP) is a new and exciting complement to the field of computer-aided diagnosis (CAD) and involves developing and applying computerized image analysis and multi-modal data fusion algorithms to digitized patient data (e.g. imaging, tissue, genomic) for helping physicians predict disease outcome and patient survival. While a number of data channels, ranging from the macro (e.g. MRI) to the nano-scales (proteins, genes) are now being routinely acquired for disease characterization, one of the challenges in predicting patient outcome and treatment response has been in our inability to quantitatively fuse these disparate, heterogeneous data sources. At the Laboratory for Computational Imaging and Bioinformatics (LCIB) at Rutgers University, our team has been developing computerized algorithms for high dimensional data and image analysis for predicting disease outcome from multiple modalities including MRI, digital pathology, and protein expression. Additionally, we have been developing novel data fusion algorithms based on non-linear dimensionality reduction methods (such as Graph Embedding) to quantitatively integrate information from multiple data sources and modalities with the overarching goal of optimizing meta-classifiers for making prognostic predictions. In this paper, we briefly describe 4 representative and ongoing CAP projects at LCIB. These projects include (1) an Image-based Risk Score (IbRiS) algorithm for predicting outcome of Estrogen receptor positive breast cancer patients based on quantitative image analysis of digitized breast cancer biopsy specimens alone, (2) segmenting and determining extent of lymphocytic infiltration (identified as a possible prognostic marker for outcome in human epidermal growth factor amplified breast cancers) from digitized histopathology, (3) distinguishing patients with different Gleason grades of prostate cancer (grade being known to be correlated to outcome) from digitized needle biopsy specimens, and (4) integrating protein expression measurements obtained from mass spectrometry with quantitative image features derived from digitized histopathology for distinguishing between prostate cancer patients at low and high risk of disease recurrence following radical prostatectomy.
    Full-text · Article · Feb 2011 · Computerized medical imaging and graphics: the official journal of the Computerized Medical Imaging Society
  • Source
    • "Connected components clustering is used to produce initial clusters with a percent similarity cut-off value (default = 85%). Clusters may be refined with the slower Partition Around Medoids (PAM) algorithm (26,27) which is a k-means approach. Figure 3 shows an example of related repeats detected by clustering. "
    ABSTRACT: Tandem repeats in DNA have been under intensive study for many years, first, as a consequence of their usefulness as genomic markers and DNA fingerprints and more recently as their role in human disease and regulatory processes has become apparent. The Tandem Repeats Database (TRDB) is a public repository of information on tandem repeats in genomic DNA. It contains a variety of tools for repeat analysis, including the Tandem Repeats Finder program, query and filtering capabilities, repeat clustering, polymorphism prediction, PCR primer selection, data visualization and data download in a variety of formats. In addition, TRDB serves as a centralized research workbench. It provides user storage space and permits collaborators to privately share their data and analysis. TRDB is available at
    Full-text · Article · Feb 2007 · Nucleic Acids Research
  • ABSTRACT: The lack of a unified medical language is blamed for the inability of computers to handle medical information effectively. As a solution to at least one of the problems embodied in this lack, the ASTM Subcommittee E31.12 on Medical Informatics has constructed an infrastructure detailing nosologic standards and guides for a biomedical nomenclature that lends itself to computer usage. The thrust of this effort and the principles guiding it are discussed.
    No preview · Conference Paper · Dec 1989