The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis

Applied Bioinformatics of Cancer Research Group, Breakthrough Research Unit, Edinburgh Cancer Research Centre, Western General Hospital, Crewe Road South, Edinburgh, EH4 2XR, UK.
BMC Medical Genomics (Impact Factor: 2.87). 09/2008; 1(1):42. DOI: 10.1186/1755-8794-1-42
Source: PubMed Central


The number of gene expression studies in the public domain is rapidly increasing, representing a highly valuable resource. However, dataset-specific bias precludes meta-analysis at the raw transcript level, even when the RNA is from comparable sources and has been processed on the same microarray platform using similar protocols. Here, we demonstrate, using Affymetrix data, that much of this bias can be removed, allowing multiple datasets to be legitimately combined for meaningful meta-analyses.

A series of validation datasets comparing breast cancer and normal breast cell lines (MCF7 and MCF10A) was generated to examine the variability between datasets produced using different amounts of starting RNA, alternative protocols, different generations of Affymetrix GeneChip or different scanning hardware. We demonstrate that systematic, multiplicative biases are introduced at the RNA, hybridization and image-capture stages of a microarray experiment. Simple batch mean-centering was found to significantly reduce the level of inter-experimental variation, allowing raw transcript levels to be compared across datasets with confidence. By accounting for dataset-specific bias, we were able to assemble the largest gene expression dataset of primary breast tumours to date (1107 tumours), from six previously published studies. Using this meta-dataset, we demonstrate that combining greater numbers of datasets or tumours leads to a greater overlap in differentially expressed genes and more accurate prognostic predictions. However, this is highly dependent upon the composition of the datasets and patient characteristics.

Multiplicative, systematic biases are introduced at many stages of microarray experiments. When these are reconciled, raw data can be directly integrated from different gene expression datasets leading to new biological findings with increased statistical power.
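
The batch mean-centering named in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes expression values have been log2-transformed, so that a multiplicative bias becomes an additive offset, and that each batch has a comparable sample composition (the abstract itself cautions that results depend on dataset composition). The function name `batch_mean_center`, the gene label and the toy intensities are hypothetical.

```python
import math
from statistics import mean


def batch_mean_center(log_expr, batches):
    """Subtract each gene's per-batch mean from its log-scale values.

    log_expr: dict mapping gene -> list of log2 values, one per sample
    batches:  list of batch labels, one per sample, in the same order
    """
    centered = {}
    for gene, values in log_expr.items():
        # Mean of this gene within each batch (hypothetical helper mapping).
        batch_means = {
            b: mean(v for v, lab in zip(values, batches) if lab == b)
            for b in set(batches)
        }
        # Remove the batch-specific offset from every sample.
        centered[gene] = [v - batch_means[lab] for v, lab in zip(values, batches)]
    return centered


# Toy data (invented): the same two samples measured in batch A and again
# in batch B, where batch B carries a 4-fold multiplicative bias.
raw_a = [100.0, 400.0]
raw_b = [4 * x for x in raw_a]
batches = ["A", "A", "B", "B"]

log_expr = {"GENE1": [math.log2(x) for x in raw_a + raw_b]}
centered = batch_mean_center(log_expr, batches)
print(centered["GENE1"])
```

After centering, the two batches give matching values for the matched samples, because the 4-fold bias is a constant offset of 2 on the log2 scale and is removed with the batch mean. If the batches differed in biological composition, the same subtraction would also remove part of the genuine signal.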



Available from: Andrew H Sims, Oct 02, 2015
    • "For the gene expression analysis of 1107 primary breast cancers, a meta-analysis of six comprised Affymetrix datasets was performed as previously described [37]. Endpoints for datasets Chin et al., Pawitan et al. and Sotiriou et al. was recurrence-free survival and for Desmedt et al., Ivshina et al. and Wang et al. datasets it was disease-free survival. "
    ABSTRACT: Yes-associated protein (YAP1) is frequently reported to function as an oncogene in many types of cancer, but in breast cancer results remain controversial. We set out to clarify the role of YAP1 in breast cancer by examining gene and protein expression in subgroups of patient material and by downregulating YAP1 in vitro and studying its role in response to the widely used anti-estrogen tamoxifen. YAP1 protein intensity was scored as absent, weak, intermediate or strong in two primary breast cancer cohorts (n = 144 and n = 564) and mRNA expression of YAP1 was evaluated in a gene expression dataset (n = 1107). Recurrence-free survival was analysed using the log-rank test and Cox multivariate analysis was used to test for independence. WST-1 assay was employed to measure cell viability and a luciferase ERE (estrogen responsive element) construct was used to study the effect of tamoxifen, following downregulation of YAP1 using siRNAs. In the ER+ (Estrogen Receptor alpha positive) subgroup of the randomised cohort, YAP1 expression was inversely correlated to histological grade and proliferation (p = 0.001 and p = 0.016, respectively) whereas in the ER- (Estrogen Receptor alpha negative) subgroup YAP1 expression correlated positively to proliferation (p = 0.005). Notably, low YAP1 mRNA was independently associated with decreased recurrence-free survival in the gene expression dataset, specifically for the luminal A subgroup (p < 0.001) which includes low proliferating tumours of lower grade, usually associated with a good prognosis. This subgroup specificity led us to hypothesize that YAP1 may be important for response to endocrine therapies, such as tamoxifen, extensively used for luminal A breast cancers. In a tamoxifen randomised patient material, absent YAP1 protein expression was associated with impaired tamoxifen response which was significant upon interaction analysis (p = 0.042). 
    YAP1 downregulation resulted in increased progesterone receptor (PgR) expression and a delayed and weaker tamoxifen response, in support of the clinical data. Decreased YAP1 expression is an independent prognostic factor for recurrence in the less aggressive luminal A breast cancer subgroup, likely due to the decreased tamoxifen sensitivity conferred by YAP1 downregulation.
    BMC Cancer 02/2014; 14(1):119. DOI:10.1186/1471-2407-14-119 · 3.36 Impact Factor
    • "Affymetrix gene expression data representing a total of 1107 primary breast tumors from six previously published microarray studies [23], [24], [25], [26], [27], [28] were integrated as described previously using ComBat [29] to remove batch effects [30]. Centroid prediction [31] was used to assign the tumors from each dataset to the five Norway/Stanford subtypes (Basal, Luminal A, Luminal B, ERBB2 and Normal-like) [32]. "
    ABSTRACT: Wnt signalling has been implicated in stem cell regulation; however, its role in breast cancer stem cell regulation remains unclear. We used a panel of normal and breast cancer cell lines to assess Wnt pathway gene and protein expression, and for the investigation of Wnt signalling within stem cell-enriched populations, mRNA and protein expression was analysed after the selection of anoikis-resistant cells. Finally, cell lines and patient-derived samples were used to investigate Wnt pathway effects on stem cell activity in vitro. Wnt pathway signalling increased in cancer compared to normal breast and in both cell lines and patient samples, expression of Wnt pathway genes correlated with estrogen receptor (ER) expression. Furthermore, specific Wnt pathway genes were predictive for recurrence within subtypes of breast cancer. Canonical Wnt pathway genes were increased in breast cancer stem cell-enriched populations in comparison to normal breast stem cell-enriched populations. Furthermore, in cell lines, the ligand Wnt3a increased whilst the inhibitor DKK1 reduced mammosphere formation with the greatest inhibitory effects observed in ER+ve breast cancer cell lines. In patient-derived metastatic breast cancer samples, only ER-ve mammospheres were responsive to the ligand Wnt3a. However, the inhibitor DKK1 efficiently inhibited both ER+ve and ER-ve breast cancer but not normal mammosphere formation, suggesting that the Wnt pathway is aberrantly activated in breast cancer mammospheres. Collectively, these data highlight differential Wnt signalling in breast cancer subtypes and activity in patient-derived metastatic cancer stem-like cells indicating a potential for Wnt-targeted treatment in breast cancers.
    PLoS ONE 07/2013; 8(7):e67811. DOI:10.1371/journal.pone.0067811 · 3.23 Impact Factor
    • "Such reviews, or meta-analyses have greater statistical power to identify true effects from study-specific artefacts and, as such, are capable of identifying subtle effects that might be missed or deemed insignificant in smaller datasets. In the context of gene-expression analyses, meta-analysis of results from microarray studies has great potential, but also presents significant challenges due to differences between the platforms and analysis approaches employed in each study [1-5]. Direct integration of probe-level expression data from multiple studies is potentially even more powerful, but is further complicated due to differences in the conditions under which each dataset was generated, such as the amplification or labelling method, the scanner used or even just the date on which the samples were processed. "
    ABSTRACT: Affymetrix GeneChips and Illumina BeadArrays are the most widely used commercial single channel gene expression microarrays. Public data repositories are an extremely valuable resource, providing array-derived gene expression measurements from many thousands of experiments. Unfortunately, many of these studies are underpowered and it is desirable to improve power by combining data from more than one study; we sought to determine whether platform-specific bias precludes direct integration of probe intensity signals for combined reanalysis. Using Affymetrix and Illumina data from the microarray quality control project, from our own clinical samples, and from additional publicly available datasets, we evaluated several approaches to directly integrate intensity level expression data from the two platforms. After mapping probe sequences to Ensembl genes, we demonstrate that ComBat and cross-platform normalisation (XPN) significantly outperform mean-centering and distance-weighted discrimination (DWD) in terms of minimising inter-platform variance. In particular, we observed that DWD, a popular method used in a number of previous studies, removed systematic bias at the expense of genuine biological variability, potentially reducing legitimate biological differences from integrated datasets. Normalised and batch-corrected intensity-level data from Affymetrix and Illumina microarrays can be directly combined to generate biologically meaningful results with improved statistical power for robust, integrated reanalysis.
    BMC Medical Genomics 08/2012; 5(1):35. DOI:10.1186/1755-8794-5-35 · 2.87 Impact Factor