Clustering of gene expression data and end-point measurements by simulated annealing

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.
Journal of Bioinformatics and Computational Biology (Impact Factor: 0.78). 03/2009; 7(1):193-215. DOI: 10.1142/S021972000900400X
Source: PubMed


Most clustering techniques do not incorporate phenotypic data. Limited biological interpretation is garnered from the informal process of clustering biological samples and then labeling groups with the phenotypes of the samples. A more formal approach of clustering samples is presented. The method utilizes simulated annealing of the Modk-prototypes objective function. Separate weighting terms are used for microarray, clinical chemistry, and histopathology measurements to control the influence of each data domain on the clustering of the samples. The weights are adapted during the clustering process. A cluster's prototype is representative of the phenotype of the cluster members. Genes are extracted from phenotypic prototypes obtained from the livers of rats exposed to acetaminophen (an analgesic and antipyretic agent) that differed in the extent of centrilobular necrosis. Map kinase signaling and linoleic acid metabolism were significant biological processes influenced by the exposures of acetaminophen that manifested centrilobular necrosis.

Download full-text


Available from: Pierre Bushel, Jul 25, 2014
20 Reads
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: High throughput biological data need to be processed, analyzed, and interpreted to address problems in life sciences. Bioinformatics, computational biology, and systems biology deal with biological problems using computational methods. Clustering is one of the methods used to gain insight into biological processes, particularly at the genomics level. Clearly, clustering can be used in many areas of biological data analysis. However, this paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data. It is also intended to introduce one of the main problems in bioinformatics - clustering gene expression data - to the operations research community.
    12/2012; 39(12):3046-3061. DOI:10.1016/j.cor.2012.03.008
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Complex diseases are often difficult to diagnose, treat and study due to the multi-factorial nature of the underlying etiology. Large data sets are now widely available that can be used to define novel, mechanistically distinct disease subtypes (endotypes) in a completely data-driven manner. However, significant challenges exist with regard to how to segregate individuals into suitable subtypes of the disease and understand the distinct biological mechanisms of each when the goal is to maximize the discovery potential of these data sets. A multi-step decision tree-based method is described for defining endotypes based on gene expression, clinical covariates, and disease indicators using childhood asthma as a case study. We attempted to use alternative approaches such as the Student's t-test, single data domain clustering and the Modk-prototypes algorithm, which incorporates multiple data domains into a single analysis and none performed as well as the novel multi-step decision tree method. This new method gave the best segregation of asthmatics and non-asthmatics, and it provides easy access to all genes and clinical covariates that distinguish the groups. The multi-step decision tree method described here will lead to better understanding of complex disease in general by allowing purely data-driven disease endotypes to facilitate the discovery of new mechanisms underlying these diseases. This application should be considered a complement to ongoing efforts to better define and diagnose known endotypes. When coupled with existing methods developed to determine the genetics of gene expression, these methods provide a mechanism for linking genetics and exposomics data and thereby accounting for both major determinants of disease.
    BMC Systems Biology 11/2013; 7(1):119. DOI:10.1186/1752-0509-7-119 · 2.44 Impact Factor