The SVA package for removing batch effects and other unwanted variation in high-throughput experiments

Department of Biostatistics, JHU Bloomberg School of Public Health, Baltimore, MD, USA.
Bioinformatics (Impact Factor: 4.98). 01/2012; 28(6):882-3. DOI: 10.1093/bioinformatics/bts034
Source: PubMed


Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The best-known source of latent variation in genomic experiments is batch effects, which arise when samples are processed on different days, in different groups or by different people. However, many other variables may also have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function.
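
To make the idea of adjusting for a known batch concrete, the following is a minimal sketch in Python, not the sva package itself and not the ComBat algorithm. It assumes a single feature measured across samples with known batch labels, and it only re-centers each batch on the overall mean. The function name center_batches and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_batches(values, batches):
    """Location-only batch adjustment for one feature (illustrative only).

    Each value has its batch mean subtracted and the grand mean added back,
    so all batches end up centered on the same overall level.
    """
    grand_mean = mean(values)

    # Group the measurements by their batch label.
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}

    return [
        value - batch_means[batch] + grand_mean
        for value, batch in zip(values, batches)
    ]


# Hypothetical measurements of one gene, processed on two different days.
expression = [5.0, 7.0, 9.0, 11.0]
batches = ["day1", "day1", "day2", "day2"]
adjusted = center_batches(expression, batches)
print(adjusted)  # [7.0, 9.0, 7.0, 9.0]
```

This naive centering also removes any biological signal that happens to be confounded with batch, and it ignores differences in variance between batches. ComBat, by contrast, uses an empirical Bayes approach, and surrogate variable estimation targets sources of variation that are not known in advance.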

Availability: The R package sva is freely available from

Contact: jleek{at}

Supplementary information: Supplementary data are available at Bioinformatics online.



Available from: William Evan Johnson, Jan 06, 2014
  • Source
    • "The normalization and statistical analysis of the proteomics data were implemented in R. Two steps were used to normalize the raw data: 1) an intra-experimental (within plex) variation step – for each sample, each protein's raw signal was normalized to the total ion intensity of the protein within the plex, and 2) an inter-experimental (across plexes) variation step – for each sample, each protein's normalized value from step 1) was transformed by dividing it to the mean of all normalized values of the protein obtained in step 1) of all 24 samples. The Surrogate Variable Analysis (SVA) method implemented in the sva R package [18] (v3.10.0) was used to eliminate latent noise in the data not explained by the categorical factors, including batch effects, by building a set of covariates constructed directly from the proteomics dataset. Three significant surrogate variables (SVs) were identified for the proteomics data (shown in Additional file 1: Table S1)."
    ABSTRACT: Background Parkinson disease (PD) is a neurodegenerative disease characterized by the accumulation of alpha-synuclein (SNCA) and other proteins in aggregates termed “Lewy Bodies” within neurons. PD has both genetic and environmental risk factors, and while processes leading to aberrant protein aggregation are unknown, past work points to abnormal levels of SNCA and other proteins. Although several genome-wide studies have been performed for PD, these have focused on DNA sequence variants by genome-wide association studies (GWAS) and on RNA levels (microarray transcriptomics), while genome-wide proteomics analysis has been lacking. Methods This study employed two state-of-the-art technologies, three-stage Mass Spectrometry Tandem Mass Tag Proteomics (12 PD, 12 controls) and RNA-sequencing transcriptomics (29 PD, 44 controls), evaluated in the context of PD GWAS implicated loci and microarray transcriptomics (19 PD, 24 controls). The technologies applied for this study were performed in a set of overlapping prefrontal cortex (Brodmann area 9) samples obtained from PD patients and sex and age similar neurologically healthy controls. Results After appropriate filters, proteomics robustly identified 3558 unique proteins, with 283 of these (7.9 %) significantly different between PD and controls (q-value < 0.05). RNA-sequencing identified 17,580 protein-coding genes, with 1095 of these (6.2 %) significantly different (FDR p-value < 0.05); only 166 of the FDR significant protein-coding genes (0.94 %) were present among the 3558 proteins characterized. Of these 166, eight genes (4.8 %) were significant in both studies, with the same direction of effect. Functional enrichment analysis of the proteomics results strongly supports mitochondrial-related pathways, while comparable analysis of the RNA-sequencing results implicates protein folding pathways and metallothioneins. Ten of the implicated genes or proteins co-localized to GWAS loci. Evidence implicating SNCA was stronger in proteomics than in RNA-sequencing analyses. Conclusions We report the largest analysis of proteomics in PD to date, and the first to combine this technology with RNA-sequencing to investigate GWAS implicated loci. Notably, differentially expressed protein-coding genes were more likely to not be characterized in the proteomics analysis, which lessens the ability to compare across platforms. Combining multiple genome-wide platforms offers novel insights into the pathological processes responsible for this disease by identifying pathways implicated across methodologies.
    Preview · Article · Dec 2015 · BMC Medical Genomics
  • Source
    • "Probe level analysis was performed using the "affy" package of the Bioconductor project [12]. After Robust Multi-array Average normalization, batch effect removal was performed using the Surrogate Variable Analysis (SVA) package [13]. The lipid-related probe sets were selected based on the Biological Processes of the Gene Ontology annotation [14]."
    ABSTRACT: Chronic obstructive pulmonary disease (COPD) is a heterogeneous and progressive inflammatory condition that has been linked to the dysregulation of many metabolic pathways including lipid biosynthesis. How lipid metabolism could affect disease progression in smokers with COPD remains unclear. We cross-examined the transcriptomics, proteomics, metabolomics, and phenomics data available in the public domain to elucidate the mechanisms by which lipid metabolism is perturbed in COPD. We reconstructed a sputum lipid COPD (SpLiCO) signaling network utilizing active/inactive, and functional/dysfunctional lipid-mediated signaling pathways to explore how lipid metabolism could promote COPD pathogenesis in smokers. SpLiCO was further utilized to investigate signal amplifiers, distributers, propagators, feed-forward and/or -back loops that link COPD disease severity and hypoxia to disruption in the metabolism of sphingolipids, fatty acids and energy. Also, hypergraph analysis and calculations for dependency of molecules identified several important nodes in the network with modular regulatory and signal distribution activities. Our systems-based analyses indicate that arachidonic acid is a critical and early signal distributer that is upregulated by the sphingolipid signaling pathway in COPD, while hypoxia plays a critical role in the elevated dependency on glucose as a major energy source. Integration of SpLiCO and clinical data shows a strong association between hypoxia and the upregulation of sphingolipids in smokers with emphysema, vascular disease, hypertension and those with increased risk of lung cancer. DOI: 10.1016/j.bbalip.2015.07.005
    Full-text · Article · Aug 2015 · Biochimica et Biophysica Acta (BBA) - Molecular and Cell Biology of Lipids
  • Source
    • "QCs) to fit a smoothed model for the intensity levels of certain features, and then to correct all the biological samples accordingly (Dunn et al., 2011). The R package sva includes the ComBat function, which compensates the batch effects on microarray data using an empirical Bayes approach (Johnson et al., 2007; Leek et al., 2012). This method has been applied to normalize gene expression and methylation data (Chen et al., 2013; Leitch et al., 2013). "
    ABSTRACT: Liquid chromatography coupled to mass spectrometry (LC/MS) has become widely used in Metabolomics. Several artefacts have been identified during the acquisition step in large LC/MS metabolomics experiments, including ion suppression, carryover or changes in the sensitivity and intensity. Several sources have been pointed out as responsible for these effects. In this context, the drift effect of the peak intensity is one of the most frequent and may even constitute the main source of variance in the data, resulting in misleading statistical results when the samples are analysed. In this article, we propose the introduction of a methodology based on a common variance analysis before the data normalization to address this issue. This methodology was tested and compared with four other methods by calculating the Dunn and Silhouette indices of the quality control classes. The results showed that our proposed methodology performed better than any of the other four methods. As far as we know, this is the first time that this kind of approach has been applied in the metabolomics context. Availability and implementation: The source code of the methods is available as the R package intCor at Supplementary information: Supplementary data are available at Bioinformatics online.
    Full-text · Article · Jul 2014 · Bioinformatics