Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: A case study
Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently by the guided versions.
We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
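The marker-guided idea in the abstract above can be illustrated with a minimal sketch. It assumes one plausible way of using markers: a gene that marks cell type k is forced to zero in the signatures of all other cell types, and standard multiplicative updates for the Frobenius objective do the rest. The function name `marker_guided_nmf`, the masking scheme, and the simulated data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def marker_guided_nmf(X, n_types, markers, n_iter=500, seed=0, eps=1e-9):
    """Factor X (genes x samples, nonnegative) as W @ H.

    markers: dict mapping a cell type index to a list of gene row indices
             assumed to be expressed only in that cell type (hypothetical input).
    Returns W (genes x types), H (types x samples) and column-normalized H.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, n_types)) + eps
    H = rng.random((n_types, n_samples)) + eps

    # mask[g, k] == 0 means gene g may not contribute to cell type k
    mask = np.ones((n_genes, n_types))
    for k, genes in markers.items():
        mask[genes, :] = 0.0
        mask[genes, k] = 1.0
    W *= mask

    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius error
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        W *= mask  # keep the marker constraint explicit

    # Rescale each sample's coefficients to sum to one
    props = H / (H.sum(axis=0, keepdims=True) + eps)
    return W, H, props

# Simulated mixtures: 3 cell types, 5 marker genes each (made-up data)
rng = np.random.default_rng(1)
n_genes, n_types, n_samples = 60, 3, 12
markers = {0: [0, 1, 2, 3, 4], 1: [5, 6, 7, 8, 9], 2: [10, 11, 12, 13, 14]}
W_true = rng.random((n_genes, n_types)) + 0.1
for k, genes in markers.items():
    W_true[genes, :] = 0.0
    W_true[genes, k] = 5.0
P_true = rng.dirichlet(np.ones(n_types), size=n_samples).T  # types x samples
X = W_true @ P_true

W, H, props = marker_guided_nmf(X, n_types, markers)
```

Because the rows of H are only identified up to a per-cell-type scale, normalizing its columns is a simplification; estimated proportions are better compared with true ones by correlation, as the abstract does, than by absolute value.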
Available from: bioinformatics.oxfordjournals.org
- "Most techniques extend on the linear model of (Venet, et al., 2001), which estimates the final measured gene expression as the sum of gene expression of the contributing cell types. Several approaches estimate relative fractions of individual cell types within a sample using gene expression profiles that are characteristic for each cell type (Abbas, et al., 2009; Ahn, et al., 2013; Bolen, et al., 2011; Gaujoux and Seoighe, 2012; Gong, et al., 2011; Lu, et al., 2003; Wang, et al., 2006; Zhong, et al., 2013), whereas other models estimate the characteristic gene expression profiles of each cell type using measured cell type fractions (Shen-Orr, et al., 2010; Stuart, et al., 2004). Simultaneous estimates of cell type expression profiles and cell type fraction, analogous to principal components analysis (PCA), have also been proposed (Erkkila, et al., 2010; Lahdesmaki, et al., 2005; Repsilber, et al., 2010; Venet, et al., 2001)."
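The linear model mentioned in this excerpt, and the estimation of cell type fractions from known signatures, can be sketched as follows. This is a minimal illustration under stated assumptions: the signature matrix `S` and the mixture are simulated, the helper `estimate_proportions` is hypothetical, and nonnegative least squares followed by rescaling stands in for whichever estimator a given cited method actually uses.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_proportions(x, S):
    """Estimate cell type fractions for one mixed sample.

    x: measured expression of the mixed sample (length n_genes)
    S: signature matrix, genes x cell types (assumed known)
    """
    coef, _ = nnls(S, x)           # nonnegative least squares fit of x ~ S @ coef
    total = coef.sum()
    return coef / total if total > 0 else coef  # rescale to sum to one

# Linear model: mixed expression is the fraction-weighted sum of
# cell type profiles (simulated, noise-free data)
rng = np.random.default_rng(0)
S = rng.random((50, 3)) + 0.1
p_true = np.array([0.6, 0.3, 0.1])
x = S @ p_true

p_hat = estimate_proportions(x, S)
```

On noise-free data with a full-rank signature matrix the fit recovers the true fractions; with real measurements the estimate depends on noise and on how well the signatures separate the cell types.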
ABSTRACT: One of the significant obstacles in the development of clinically relevant microarray-derived biomarkers and classifiers is tissue heterogeneity. Physical cell separation techniques, such as cell-sorting and laser-capture micro-dissection, can enrich samples for cell types of interest, but are costly, labor-intensive and can limit investigation of important interactions between different cell types.
We developed a new computational approach, called Microarray Micro-dissection with Analysis of Differences (MMAD), which performs micro-dissection in silico. Notably, MMAD (1) allows for simultaneous estimation of cell fractions and gene expression profiles of contributing cell types, (2) adjusts for microarray normalization bias, (3) utilizes the corrected Akaike Information Criterion (AICc) during model optimization to minimize overfitting, and (4) provides mechanisms for comparing gene expression and cell fractions between samples in different classes. Computational micro-dissection of simulated and experimental tissue mixture datasets showed tight correlations between predicted and measured gene expression of pure tissues as well as tight correlations between reported and estimated cell fraction for each of the individual cell types. In simulation studies, MMAD showed superior ability to detect differentially expressed genes in mixed tissue samples when compared to standard metrics, including both Significance Analysis of Microarrays (SAM) and cell-type specific significance analysis of microarrays (csSAM). Conclusions: We have developed a new computational tool called MMAD, which is capable of performing robust tissue micro-dissection in silico, and which can improve the detection of differentially expressed genes. MMAD software as implemented in MATLAB is publicly available for download at http://sourceforge.net/projects/mmad/.
Available from: Kaifang Pang
- "Several supervised and semi-supervised computational deconvolution algorithms have also been proposed to tackle this problem [17-21]. However, they require prior knowledge of either the cell type frequencies within a given tissue [19,20], or the in vitro gene expression profiles of each component cell type
Cellular heterogeneity is present in almost all gene expression profiles. However, transcriptome analysis of tissue specimens often ignores the cellular heterogeneity present in these samples. Standard deconvolution algorithms require prior knowledge of the cell type frequencies within a tissue or their in vitro expression profiles. Furthermore, these algorithms tend to report biased estimations.
Here, we describe a Digital Sorting Algorithm (DSA) for extracting cell-type specific gene expression profiles from mixed tissue samples that is unbiased and does not require prior knowledge of cell type frequencies.
The results suggest that DSA is a specific and sensitive algorithm for gene expression profile deconvolution and will be useful in studying individual cell types of complex tissues.
Available from: link.springer.com
- "We note that since we began this work, a small number of authors have published similar deconvolution algorithms using gene expression data [13-15]. The techniques are similar to the quadratic programming method we describe below in Methods for deconvolving a single sample, but none comprehensively addresses statistical properties or employs data from DNA methylation."
There has been a long-standing need in biomedical research for a method that quantifies the normally mixed composition of leukocytes beyond what is possible by simple histological or flow cytometric assessments. The latter is restricted by the labile nature of protein epitopes, requirements for cell processing, and timely cell analysis. In a diverse array of diseases and following numerous immune-toxic exposures, leukocyte composition will critically inform the underlying immuno-biology to most chronic medical conditions. Emerging research demonstrates that DNA methylation is responsible for cellular differentiation, and when measured in whole peripheral blood, serves to distinguish cancer cases from controls.
Here we present a method, similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples. We validate the fundamental idea in a cell mixture reconstruction experiment, then demonstrate our method on DNA methylation data sets from several studies, including data from a Head and Neck Squamous Cell Carcinoma (HNSCC) study and an ovarian cancer study. Our method produces results consistent with prior biological findings, thereby validating the approach.
Our method, in combination with an appropriate external validation set, promises new opportunities for large-scale immunological studies of both disease states and noxious exposures.