The SVA package for removing batch effects and other unwanted variation in high-throughput experiments

Department of Biostatistics, JHU Bloomberg School of Public Health, Baltimore, MD, USA.
Bioinformatics (Impact Factor: 4.98). 01/2012; 28(6):882-3. DOI: 10.1093/bioinformatics/bts034
Source: PubMed


Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The best-known source of latent variation in genomic experiments is batch effects, which arise when samples are processed on different days, in different groups or by different people. However, many other variables may also have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function.

Availability: The R package sva is freely available from

Contact: jleek{at}

Supplementary information: Supplementary data are available at Bioinformatics online.



Available from: William Evan Johnson, Jan 06, 2014
  • Source
    • "Probe level analysis was performed using the "affy" package of the Bioconductor project [12]. After Robust Multi-array Average normalization, batch effect removal was performed using the Surrogate Variable Analysis (SVA) package [13]. The lipid-related probe sets were selected based on the Biological Processes of the Gene Ontology annotation [14]. "
    ABSTRACT: Chronic obstructive pulmonary disease (COPD) is a heterogeneous and progressive inflammatory condition that has been linked to the dysregulation of many metabolic pathways including lipid biosynthesis. How lipid metabolism could affect disease progression in smokers with COPD remains unclear. We cross-examined the transcriptomics, proteomics, metabolomics, and phenomics data available in the public domain to elucidate the mechanisms by which lipid metabolism is perturbed in COPD. We reconstructed a sputum lipid COPD (SpLiCO) signaling network utilizing active/inactive and functional/dysfunctional lipid-mediated signaling pathways to explore how lipid metabolism could promote COPD pathogenesis in smokers. SpLiCO was further utilized to investigate signal amplifiers, distributers, propagators, and feed-forward and/or -back loops that link COPD disease severity and hypoxia to disruption in the metabolism of sphingolipids, fatty acids and energy. Also, hypergraph analysis and calculations for dependency of molecules identified several important nodes in the network with modular regulatory and signal distribution activities. Our systems-based analyses indicate that arachidonic acid is a critical and early signal distributer that is upregulated by the sphingolipid signaling pathway in COPD, while hypoxia plays a critical role in the increased dependency on glucose as a major energy source. Integration of SpLiCO and clinical data shows a strong association between hypoxia and the upregulation of sphingolipids in smokers with emphysema, vascular disease, hypertension and those with increased risk of lung cancer.
    Biochimica et Biophysica Acta (BBA) - Molecular and Cell Biology of Lipids 08/2015; 1851(10):1383-1393. DOI:10.1016/j.bbalip.2015.07.005 · 5.16 Impact Factor
  • Source
    • "QCs) to fit a smoothed model for the intensity levels of certain features, and then to correct all the biological samples accordingly (Dunn et al., 2011). The R package sva includes the ComBat function, which compensates for batch effects in microarray data using an empirical Bayes approach (Johnson et al., 2007; Leek et al., 2012). This method has been applied to normalize gene expression and methylation data (Chen et al., 2013; Leitch et al., 2013). "
    ABSTRACT: Unlabelled: Liquid chromatography coupled to mass spectrometry (LC/MS) has become widely used in metabolomics. Several artefacts have been identified during the acquisition step in large LC/MS metabolomics experiments, including ion suppression, carryover and changes in sensitivity and intensity, and several sources have been identified as responsible for these effects. In this context, drift in peak intensity is one of the most frequent artefacts and may even constitute the main source of variance in the data, resulting in misleading statistical results when the samples are analysed. In this article, we propose a methodology based on a common variance analysis before data normalization to address this issue. This methodology was tested and compared with four other methods by calculating the Dunn and Silhouette indices of the quality control classes. The results showed that our proposed methodology performed better than any of the other four methods. As far as we know, this is the first time that this kind of approach has been applied in the metabolomics context. Availability and implementation: The source code of the methods is available as the R package intCor at Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2014; 30(20). DOI:10.1093/bioinformatics/btu423 · 4.98 Impact Factor
  • Source
    • "Results with pSVA batch correction were compared to implementations of SVA in the SVA package (Leek et al., 2012). It was also compared to this package's implementation of ComBat, which also fits the model in eq (1) from estimates of both P and Γ with an empirical Bayes procedure as depicted in Figure 1 (Johnson et al., 2007). "
    ABSTRACT: Motivation: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms to remove these artifacts enhance differences between known biological covariates, but also carry the potential concern of removing intragroup biological heterogeneity, and thus any personalized genomic signatures. As a result, accurate identification of novel subtypes from batch-corrected genomics data is challenging using standard algorithms designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori. Results: Therefore, we assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), using a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even when the sample batches were highly confounded with HPV status in the training set. Availability and implementation: All analyses were performed using R version 2.15.0. The code and data used to generate the results of this manuscript are available from
    Bioinformatics 06/2014; 30(19). DOI:10.1093/bioinformatics/btu375 · 4.98 Impact Factor