Data mining of RNA expression and DNA genotype data: Presentation Group 5 contributions to Genetic Analysis Workshop 15

Genetic Epidemiology (Impact Factor: 2.6). 02/2007; 31 Suppl 1(S1):S43-50. DOI: 10.1002/gepi.20279
Source: PubMed


The complexity of data available in human genetics continues to grow at an explosive rate. With that growth, the challenges to understanding the meaning of the underlying information also grow. A currently popular approach to dissecting such information falls under the broad category of data mining. This can apply to any approach that tries to extract relevant information from large amounts of data, but often refers to methods that deal, in a non-linear fashion, with very large numbers of variables that cannot be simultaneously handled by more conventional statistical methods. To explore the usefulness of some of these approaches, 13 groups applied a variety of strategies to the first dataset provided to GAW 15 participants. With the extensive microarray and SNP data provided for 14 CEPH families, these groups explored multistage analyses, machine learning methods, network construction, and other techniques to try to answer questions about gene-gene interaction, functional similarities, co-regulated gene expression and the mapping of gene expression determinants, among others. In general, the methods offered strategies to provide a better understanding of the complex pathways involved in gene expression and function. These are still "works in progress," often exploratory in nature, but they provide insights into ways in which the data might be interpreted. Despite the still preliminary nature of some of these methods and the diversity of the approaches, some common themes emerged. The collection of papers and methods offers a starting point for further exploration of complex interactions in human genetic data now readily available.

  • ABSTRACT: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. EMMIX-GENE is available at
    Bioinformatics 04/2002; 18(3):413-22. DOI:10.1093/bioinformatics/18.3.413 · 4.98 Impact Factor
  • ABSTRACT: Genome-wide association studies using thousands to hundreds of thousands of single nucleotide polymorphism (SNP) markers and region-wide association studies using a dense panel of SNPs are already in use to identify disease susceptibility genes and to predict disease risk in individuals. Because these tasks become increasingly important, three different data sets were provided for the Genetic Analysis Workshop 15, thus allowing examination of various novel and existing data mining methods for both classification and identification of disease susceptibility genes, gene by gene or gene by environment interaction. The approach most often applied in this presentation group was random forests because of its simplicity, elegance, and robustness. It was used for prediction and for screening for interesting SNPs in a first step. The logistic tree with unbiased selection approach appeared to be an interesting alternative to efficiently select interesting SNPs. Machine learning, specifically ensemble methods, might be useful as pre-screening tools for large-scale association studies because they can be less prone to overfitting, can be less computer processor time intensive, can easily include pair-wise and higher-order interactions compared with standard statistical approaches and can also have a high capability for classification. However, improved implementations that are able to deal with hundreds of thousands of SNPs at a time are required.
    Genetic Epidemiology 02/2007; 31 Suppl 1(S1):S51-60. DOI:10.1002/gepi.20280 · 2.60 Impact Factor
  • ABSTRACT: Some genes that affect development and behavior in mammals are known to be imprinted; and ≥1% of all mammalian genes are imprinted. Hence, incorporating an imprinting parameter into linkage analysis may increase the power to detect linkage for these traits. Here we propose theoretical justifications for a recently developed model for testing of linkage, in the presence of genetic imprinting, between a quantitative-trait locus and a polymorphic marker; this is achieved in the variance-components framework. We also incorporate sex-specific recombination fractions into this model. We discuss the effects that imprinting and nonimprinting have on the power of the usual variance-components method and on the variance-components method that incorporates an imprinting parameter. We provide noncentrality parameters that can be used to determine the sample size necessary to attain a specified power for a given significance level, which is useful in the planning of a linkage study. Optimal strategies for a genome scan of potentially imprinted traits are discussed.
    The American Journal of Human Genetics 04/2002; 70(3):751-7. DOI:10.1086/338931 · 10.93 Impact Factor