Automatic Annotation of Spatial Expression Patterns via Sparse Bayesian Factor Models

Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, United States of America.
PLoS Computational Biology (Impact Factor: 4.62). 07/2011; 7(7):e1002098. DOI: 10.1371/journal.pcbi.1002098
Source: PubMed


Advances in reporters for gene expression have made it possible to document and quantify expression patterns in 2D-4D. In contrast to microarrays, which provide data for many genes but averaged and/or at low resolution, images reveal the high spatial dynamics of gene expression. Developing computational methods to compare, annotate, and model gene expression based on images is imperative, considering that available data are rapidly increasing. We have developed a sparse Bayesian factor analysis model in which the observed expression diversity among a large set of high-dimensional images is modeled by a small number of hidden common factors. We apply this approach to embryonic expression patterns from a Drosophila RNA in situ image database, and show that the automatically inferred factors provide a meaningful decomposition and represent common co-regulation or biological functions. The low-dimensional set of factor mixing weights is further used as features by a classifier to annotate expression patterns with functional categories. On human-curated annotations, our sparse approach reaches similar or better classification of expression patterns at different developmental stages when compared to other automatic image annotation methods that use thousands of hard-to-interpret features. Our study therefore outlines a general framework for large microscopy data sets, in which both the generative model itself and its application to analysis tasks such as automated annotation can provide insight into biological questions.
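The decomposition described above can be illustrated with a toy sketch: images are modeled as sparse mixtures of a few hidden factors, and the low-dimensional mixing weights recovered per image are what the paper feeds to a classifier. The sketch below uses alternating least squares with soft-thresholding as a simple stand-in for the paper's Bayesian inference; all dimensions and the threshold constant are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n images, p pixels each, k hidden common factors.
n, p, k = 60, 200, 5

# Generative model (a simplified stand-in for the sparse Bayesian factor
# analysis): each image is a sparse mixture of k shared spatial patterns.
factors = rng.normal(size=(k, p))                               # hidden patterns
weights = rng.normal(size=(n, k)) * (rng.random((n, k)) < 0.3)  # sparse mixing
images = weights @ factors + 0.1 * rng.normal(size=(n, p))      # noisy images

# Recover the factors by alternating least squares, with soft-thresholding
# on the weights (an ISTA-style sparsity step, not the paper's MCMC).
W = rng.normal(size=(n, k))
for _ in range(50):
    F, *_ = np.linalg.lstsq(W, images, rcond=None)      # update factors
    W = images @ np.linalg.pinv(F)                      # update weights
    W = np.sign(W) * np.maximum(np.abs(W) - 0.05, 0.0)  # sparsify weights

# Each row of W is now a low-dimensional feature vector for one image,
# usable as classifier input in place of thousands of raw pixel features.
recon = W @ F
err = np.linalg.norm(images - recon) / np.linalg.norm(images)
print(f"relative reconstruction error: {err:.3f}")
```

The point of the sketch is the shape of the pipeline, not the inference method: a 200-dimensional image is summarized by 5 mixing weights.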

  • Source
    • "As with the majority of described approaches, this study involved a high level of human intervention in selecting ‘good’ images for training/testing purposes—a potential drawback, considering the rapid increase in the size of ISH image collections. In contrast, Pruteanu-Malinici et al. (2011) proposed a new approach for automatic annotation of spatial expression patterns using a ‘vocabulary’ of basic patterns that involved little to no human intervention. This work provided a flexible unsupervised framework in competitively predicting gene annotation terms, while using only a small set of features. "
    ABSTRACT: Motivation: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible when images across multiple time points are jointly accounted for, such that spatial and temporal dependencies are captured simultaneously. Methods: We describe a discriminative undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a non-parametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy incomplete samples, i.e. it can tolerate data missing from individual time points. Results: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared with previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages.
    Bioinformatics 07/2013; 29(13):i27-i35. DOI: 10.1093/bioinformatics/btt206 (Impact Factor: 4.98)
  • Source
    ABSTRACT: High-spatial resolution imaging datasets of mammalian brains have recently become available in unprecedented amounts. Images now reveal highly complex patterns of gene expression varying on multiple scales. The challenge in analyzing these images is both in extracting the patterns that are most relevant functionally and in providing a meaningful representation that allows neuroscientists to interpret the extracted patterns. Here, we present FuncISH-a method to learn functional representations of neural in situ hybridization (ISH) images. We represent images using a histogram of local descriptors in several scales, and we use this representation to learn detectors of functional (GO) categories for every image. As a result, each image is represented as a point in a low-dimensional space whose axes correspond to meaningful functional annotations. The resulting representations define similarities between ISH images that can be easily explained by functional categories. We applied our method to the genomic set of mouse neural ISH images available at the Allen Brain Atlas, finding that most neural biological processes can be inferred from spatial expression patterns with high accuracy. Using functional representations, we predict several gene interaction properties, such as protein-protein interactions and cell-type specificity, more accurately than competing methods based on global correlations. We used FuncISH to identify similar expression patterns of GABAergic neuronal markers that were not previously identified and to infer new gene function based on image-image similarities. Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2013; 29(13):i36-i43. DOI: 10.1093/bioinformatics/btt207 (Impact Factor: 4.98)
  • Source
    ABSTRACT: One important problem in genome science is to determine sets of co-regulated genes based on measurements of gene expression levels across samples, where the quantification of expression levels includes substantial technical and biological noise. To address this problem, we developed a Bayesian sparse latent factor model that uses a three parameter beta prior to flexibly model shrinkage in the loading matrix. By applying three layers of shrinkage to the loading matrix (global, factor-specific, and element-wise), this model has non-parametric properties in that it estimates the appropriate number of factors from the data. We added a two-component mixture to model each factor loading as being generated from either a sparse or a dense mixture component; this allows dense factors that capture confounding noise, and sparse factors that capture local gene interactions. We developed two statistics to quantify the stability of the recovered matrices for both sparse and dense matrices. We tested our model on simulated data and found that we successfully recovered the true latent structure as compared to related models. We applied our model to a large gene expression study and found that we recovered known covariates and small groups of co-regulated genes. We validated these gene subsets by testing for associations between genotype data and these latent factors, and we found a substantial number of biologically important genetic regulators for the recovered gene subsets.
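The joint labeling across developmental stages described in the first citing abstract (Bioinformatics btt206) rests on decoding a chain-structured graphical model. A minimal stand-in for that idea is Viterbi decoding over a short chain; the emission and transition scores below are made up for illustration, and show how temporal smoothness lets neighbouring stages correct a noisy middle stage.

```python
import numpy as np

def viterbi(emission, transition):
    """Decode the best label path for a chain.

    emission: (T, S) per-stage label scores; transition: (S, S) pair scores.
    """
    T, S = emission.shape
    score = emission[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition   # score of (prev label, cur label)
        back[t] = cand.argmax(axis=0)        # best predecessor per label
        score = cand.max(axis=0) + emission[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two labels (0 = "absent", 1 = "present") over three stages; the middle
# stage weakly prefers label 1, but transitions reward temporal smoothness,
# so joint decoding flips it to agree with its neighbours.
emission = np.array([[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]])
transition = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(viterbi(emission, transition))  # → [0, 0, 0]
```

Decoding each stage in isolation would label the middle stage 1; the chain model's joint decoding is what recovers the smooth sequence, which is the gain the abstract reports.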
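The "histogram of local descriptors in several scales" representation from the FuncISH abstract can be approximated in a few lines. The descriptor choice (gradient magnitude), the scales, and the bin count below are invented for illustration and are not the published pipeline; the point is how an image becomes one fixed-length vector.

```python
import numpy as np

def multiscale_histogram(image, scales=(1, 2, 4), bins=8):
    """Summarize an image as concatenated per-scale descriptor histograms."""
    feats = []
    for s in scales:
        # Crude downscaling by block averaging to the given scale.
        h, w = image.shape[0] // s * s, image.shape[1] // s * s
        small = image[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))
        # Local descriptor: gradient magnitude at each pixel.
        gy, gx = np.gradient(small)
        mag = np.hypot(gx, gy)
        hist, _ = np.histogram(mag, bins=bins, range=(0, 1))
        feats.append(hist / max(hist.sum(), 1))  # normalized histogram
    return np.concatenate(feats)  # one fixed-length feature vector

rng = np.random.default_rng(1)
image = rng.random((64, 64))  # stand-in for an ISH image
vec = multiscale_histogram(image)
print(vec.shape)  # → (24,)
```

Vectors like `vec` are what per-category detectors (for GO terms, in FuncISH) would then be trained on, and image-image similarity reduces to distance between such vectors.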
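The three layers of shrinkage in the last abstract (global, factor-specific, element-wise) can be pictured by composing three scale terms into one loading matrix. Half-Cauchy scales below stand in for the paper's three-parameter beta prior, and the global scale is fixed as a constant for simplicity; nothing here reproduces the published inference.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 100, 10  # toy sizes: genes x factors

global_scale = 0.05                                # overall shrinkage (fixed here)
factor_scale = np.abs(rng.standard_cauchy(k))      # per-factor shrinkage
local_scale = np.abs(rng.standard_cauchy((p, k)))  # element-wise shrinkage

# Each loading is a normal draw scaled by all three layers at once.
loadings = rng.normal(size=(p, k)) * local_scale * factor_scale * global_scale

# Heavy-tailed element-wise scales combined with a small global scale yield
# a matrix that is mostly near zero with a few large entries: effective
# sparsity, without fixing the number of active loadings in advance.
frac_small = float(np.mean(np.abs(loadings) < 0.05))
print(f"fraction of near-zero loadings: {frac_small:.2f}")
```

The factor-specific layer is what lets some factors stay dense (absorbing confounding noise) while others are shrunk to a handful of co-regulated genes, as the abstract describes.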