Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis

Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA.
Biostatistics (Impact Factor: 2.24). 10/2012; 14(2). DOI: 10.1093/biostatistics/kxs038
Source: PubMed

ABSTRACT: Motivated by studying the association between nutrient intake and human gut microbiome composition, we developed a method for structure-constrained sparse canonical correlation analysis (ssCCA) in a high-dimensional setting. ssCCA takes into account the phylogenetic relationships among bacteria, which provide important prior knowledge on evolutionary relationships among bacterial taxa. Our ssCCA formulation uses a phylogenetic structure-constrained penalty function to impose smoothness on the linear coefficients according to the phylogenetic relationships among the taxa. An efficient coordinate descent algorithm is developed for optimization. A human gut microbiome data set is used to illustrate the method. Both simulations and real data applications show that ssCCA performs better than standard sparse CCA in identifying meaningful variables when there are structures in the data.
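
The abstract does not state the penalty or the update rules, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes an l1 penalty on both coefficient vectors, a graph-Laplacian quadratic penalty on the taxa coefficients (built here from a phylogenetic distance matrix with an exponential kernel), and identity within-set covariance. All function names, parameters (`lam_u`, `lam_v`, `lam_s`, `scale`) and the kernel choice are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding operator used by l1-penalized updates."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def normalize(w):
    """Scale to unit Euclidean norm; leave an all-zero vector unchanged."""
    n = np.linalg.norm(w)
    return w / n if n > 0 else w

def laplacian_from_distances(D, scale=1.0):
    """Assumed construction: similarity exp(-distance/scale), then L = deg - W."""
    W = np.exp(-D / scale)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def penalized_update(c, A, lam, n_sweeps=50):
    """Coordinate descent for 0.5 * v'Av - c'v + lam * ||v||_1 (A symmetric, positive definite)."""
    v = np.zeros_like(c)
    for _ in range(n_sweeps):
        for j in range(len(c)):
            # Partial residual: c_j minus the off-diagonal contribution of A
            r = c[j] - A[j] @ v + A[j, j] * v[j]
            v[j] = soft_threshold(r, lam) / A[j, j]
    return v

def structured_sparse_cca(X, Y, L, lam_u=0.1, lam_v=0.1, lam_s=1.0, n_iter=50):
    """Alternate a sparse update for u (columns of X) with a sparse,
    Laplacian-smoothed update for v (columns of Y, e.g. taxa)."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    C = X.T @ Y / X.shape[0]                  # sample cross-correlation matrix
    A = np.eye(Y.shape[1]) + lam_s * L        # ridge term plus smoothness term
    v = np.linalg.svd(C, full_matrices=False)[2][0]  # leading right singular vector
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = normalize(soft_threshold(C @ v, lam_u))
        v = normalize(penalized_update(C.T @ u, A, lam_v))
    return u, v
```

Calling `structured_sparse_cca(X, Y, laplacian_from_distances(D))` with an n-by-p matrix `X`, an n-by-q matrix `Y` and a q-by-q distance matrix `D` returns one pair of unit-norm coefficient vectors. Larger `lam_s` pulls the coefficients of closely related taxa toward each other, while larger `lam_u` and `lam_v` zero out more coefficients. Real microbiome abundances are compositional and would normally need transformation first, which this sketch ignores.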

  • ABSTRACT: Most procedures that correct for multiple testing, such as the Bonferroni correction or the procedures of Bonferroni-Holm, Sidak, Hochberg or Benjamini-Hochberg, assume ideal p-values. This article considers multiple testing when the ideal p-values for the hypotheses under consideration are not available and must be approximated by Monte Carlo simulation, a scenario that occurs widely in practice. We are interested in obtaining the same rejections and non-rejections as would be obtained if the ideal p-values for all hypotheses were available. The contribution of this article is threefold. First, it introduces a new framework for this scenario, covering both a generic algorithm used to draw samples and an arbitrary multiple testing procedure used to evaluate the tests. We establish conditions on both the testing procedure and the algorithm which guarantee that the rejections and non-rejections obtained through Monte Carlo simulation alone are identical to those obtained with the ideal p-values. Second, by simplifying our condition for an arbitrary step-up or step-down procedure, we extend the applicability of our framework to a general class of step-up and step-down procedures used in practice. Third, we show how to use our framework to improve established methods without proven properties so that they yield certain theoretical guarantees on their results. These modifications are easy to implement in practice and lead to reporting classifications as three sets together with an error bound on their correctness, which we demonstrate on a real biological dataset.
  • ABSTRACT: The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. Exploring the relationship between these different types of genomic data sets is challenging. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. Conventional CCA does not work effectively if the number of samples is much smaller than the number of biomarkers, which is typical for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group). We propose a new group sparse CCA method (CCA-sparse group), along with an effective numerical algorithm, to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.
    The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with true correlated features. Compared with our proposed CCA-group sparse models, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
    BMC Bioinformatics 08/2013; 14(1):245. DOI:10.1186/1471-2105-14-245 · 2.67 Impact Factor
  • ABSTRACT: Both genetic variants and brain region abnormalities are recognized as important factors for complex diseases (e.g., schizophrenia). In this paper, we investigated the correspondence between single nucleotide polymorphisms (SNPs) and brain activity measured by functional magnetic resonance imaging (fMRI) to understand how genetic variation influences brain activity. A group sparse canonical correlation analysis method (group sparse CCA) was developed to explore the correlation between these two datasets, which are high dimensional: the number of SNPs/voxels is far greater than the number of samples. Unlike existing sparse CCA methods (sCCA), our approach can exploit structural information in the correlation analysis by introducing group constraints. A simulation study demonstrates that it outperforms the existing sCCA. We applied this method to real data and identified two pairs of significant canonical variates with average correlations of 0.4527 and 0.4292 respectively, which were used to identify genes and voxels associated with schizophrenia. The selected genes are mostly from 5 schizophrenia (SZ)-related signalling pathways. The brain mappings of the selected voxels also indicate the abnormal brain regions susceptible to schizophrenia. A gene and brain region of interest (ROI) correlation analysis was further performed to confirm the significant correlations between genes and ROIs.
    Medical image analysis 10/2013; 18(6). DOI:10.1016/ · 3.68 Impact Factor