Clustering methods for the analysis of DNA microarray data

Source: CiteSeer

ABSTRACT: It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response, through the use of DNA microarrays. A major statistical task is to understand the structure in the data that arise from this technology. In this paper we review various methods of clustering, and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques including hierarchical, K-means, and block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that are potentially obscured by the global clustering approaches.

1 Introduction

DNA microarrays and other high-throughput methods for analyzing complex nucleic acid samples make it now possible to measure rapidly, efficiently and accurately the levels of virtually all genes expressed in a biologi...
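As a rough illustration of one of the global clustering techniques named in the abstract, the sketch below applies plain K-means (Lloyd's algorithm) first to the rows (genes) and then to the columns (samples or cell lines) of a small expression matrix. It is not the authors' implementation: the function name `kmeans`, the toy matrix `expr`, its values, and the deterministic farthest-point initialisation are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Assumption: deterministic farthest-point initialisation, chosen only
    # so that this toy example is reproducible.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Hypothetical expression matrix: rows are genes, columns are samples.
expr = [
    [2.1, 1.9, 2.0, -1.0, -1.2],   # gene A
    [2.0, 2.2, 1.8, -0.9, -1.1],   # gene B
    [-1.5, -1.4, -1.6, 1.0, 1.2],  # gene C
    [-1.6, -1.3, -1.5, 1.1, 0.9],  # gene D
]

gene_labels = kmeans(expr, k=2)                                # cluster rows
sample_labels = kmeans([list(c) for c in zip(*expr)], k=2)     # cluster columns
print(gene_labels, sample_labels)
```

Clustering the transposed matrix is what lets the same routine arrange samples as well as genes. Each pass still partitions the whole matrix, which is the "global" behaviour the abstract contrasts with its proposed subset-based method.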

  • ABSTRACT: Background: While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross-validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. A total of approximately 5043 different genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (or 37%) of these genes are un-annotated.
    Results: 39 Gene Ontology Biological Process (GO-BP) categories were found with a precision of 0.75 or higher when recall was fixed at 0.4. For two of those categories, we have provided additional support for assigning given genes to the category by showing that the majority of transcripts for the genes belonging to a given category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we have been able to provide a putative GO-BP term for 1422 previously un-annotated genes, or about 77% of the un-annotated genes represented on the microarray and about 19% of all of the un-annotated genes in the D. melanogaster genome.
    Conclusions: Our study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate predictions for new annotations of D. melanogaster genes. The probabilistic analysis applied to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.
    BioData Mining 04/2013; 6(1):8.
  • ABSTRACT: Using microarrays, we can build a table of gene expression profiles that characterize the dynamic functioning of each gene in the genome by measuring gene transcription levels at different developmental stages, in different tissues, and under various conditions. Rows in this table represent genes, columns represent samples, such as different tissues, developmental stages and treatments, and each position in the table contains values characterizing the expression level of that particular gene in a particular sample. We call this table a gene expression matrix.
    07/2011: pages 105-129;
  • ABSTRACT: Cluster analysis is an important tool for data exploration and has been applied in a wide variety of fields, such as engineering, economics, computer science, life and medical sciences, earth sciences and social sciences. A typical cluster analysis consists of four steps (feature selection or extraction, clustering algorithm design or selection, cluster validation, and results interpretation) connected by a feedback pathway. These steps are closely related to each other and affect the derived clusters. In this paper, a new metaheuristic algorithm is proposed for cluster analysis. The algorithm uses Ant Colony Optimization for the feature selection step and a Greedy Randomized Adaptive Search Procedure for the clustering algorithm design step. The proposed algorithm has been applied with very good results to many data sets.
    Annals of Operations Research 01/2011; 188:343-358. · 1.03 Impact Factor
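
The SVM study in the list above reports precision at a fixed recall of 0.4. The minimal sketch below shows one way such a figure can be computed from classifier scores: rank the examples by score, lower the cutoff until recall reaches the target, and report precision at that cutoff. The function name `precision_at_recall` and the scores and labels are made up for illustration, and the study's own procedure (which includes sigmoid calibration and stratified cross-validation) may differ.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first score cutoff whose recall reaches target_recall.

    scores: classifier outputs, higher meaning more likely positive.
    labels: 1 for a true member of the category, 0 otherwise.
    Tied scores are not treated specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = 0
    for n_called, (_, y) in enumerate(ranked, start=1):
        tp += y  # true positives among the top n_called examples
        if tp / n_pos >= target_recall:
            return tp / n_called
    return tp / len(ranked)  # only reached if target_recall exceeds 1


# Hypothetical scores and labels for ten genes in one category.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
p = precision_at_recall(scores, labels, 0.4)
print(round(p, 3))
```

With five positives, a recall of 0.4 is reached once two of them have been recovered. In this toy ranking that happens at the third example, so two of the three calls are correct.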