The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis

Applied Bioinformatics of Cancer Research Group, Breakthrough Research Unit, Edinburgh Cancer Research Centre, Western General Hospital, Crewe Road South, Edinburgh, EH4 2XR, UK.
BMC Medical Genomics (Impact Factor: 3.91). 09/2008; 1(1):42. DOI: 10.1186/1755-8794-1-42
Source: PubMed Central

ABSTRACT Background
The number of gene expression studies in the public domain is rapidly increasing, representing a highly valuable resource. However, dataset-specific bias precludes meta-analysis at the raw transcript level, even when the RNA is from comparable sources and has been processed on the same microarray platform using similar protocols. Here, we demonstrate, using Affymetrix data, that much of this bias can be removed, allowing multiple datasets to be legitimately combined for meaningful meta-analyses.

A series of validation datasets comparing breast cancer and normal breast cell lines (MCF7 and MCF10A) were generated to examine the variability between datasets generated using different amounts of starting RNA, alternative protocols, different generations of Affymetrix GeneChip or scanning hardware. We demonstrate that systematic, multiplicative biases are introduced at the RNA, hybridization and image-capture stages of a microarray experiment. Simple batch mean-centering was found to significantly reduce the level of inter-experimental variation, allowing raw transcript levels to be compared across datasets with confidence. By accounting for dataset-specific bias, we were able to assemble the largest gene expression dataset of primary breast tumours to date (1107 tumours), from six previously published studies. Using this meta-dataset, we demonstrate that combining greater numbers of datasets or tumours leads to a greater overlap in differentially expressed genes and more accurate prognostic predictions. However, this is highly dependent upon the composition of the datasets and patient characteristics.

Multiplicative, systematic biases are introduced at many stages of microarray experiments. When these are reconciled, raw data from different gene expression datasets can be directly integrated, leading to new biological findings with increased statistical power.
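
As an illustration of the batch mean-centering idea described in the abstract, the following Python sketch shows how a multiplicative, dataset-specific bias becomes an additive offset on the log2 scale and is removed by subtracting each gene's mean within each dataset. The function name `batch_mean_center`, the toy data, the bias factors and the use of NumPy are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def batch_mean_center(log_expr, batch_labels):
    """Subtract, for every gene, its mean within each batch (dataset).

    log_expr: array of shape (genes, samples) holding log2 expression values.
    batch_labels: sequence of length `samples` naming each sample's dataset.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    batch_labels = np.asarray(batch_labels)
    centered = np.empty_like(log_expr)
    for batch in np.unique(batch_labels):
        cols = batch_labels == batch  # boolean mask of samples in this batch
        gene_means = log_expr[:, cols].mean(axis=1, keepdims=True)
        centered[:, cols] = log_expr[:, cols] - gene_means
    return centered

# Hypothetical toy data: the same log2 signal measured in two datasets,
# where dataset B carries a gene-specific multiplicative bias.
rng = np.random.default_rng(0)
signal = rng.normal(8.0, 1.0, size=(5, 4))          # 5 genes x 4 samples
gene_bias = np.array([1.5, 0.8, 2.0, 1.2, 0.6])     # assumed bias factors
dataset_a = signal
dataset_b = signal + np.log2(gene_bias)[:, None]    # multiplicative -> additive in log2
combined = np.hstack([dataset_a, dataset_b])
labels = ["A"] * 4 + ["B"] * 4

centered = batch_mean_center(combined, labels)
# After centering, the two datasets agree gene by gene.
print(np.allclose(centered[:, :4], centered[:, 4:]))
```

In this toy case the two datasets contain identical samples, so centering removes the bias exactly. With real data, mean-centering also removes any genuine difference in average expression between datasets, which is one reason the composition of the datasets matters.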

    • "At this stage, we performed a second scaling normalization to set the average expression on each chip to 1000. Although this technique cannot remove all, but it can significantly reduce batch effects (Sims et al. 2008). We integrated the gene expression and clinical data using PostgreSQL, an open-source object-relational database system (www.postgresql. "
    ABSTRACT: The validation of prognostic biomarkers in large independent patient cohorts is a major bottleneck in ovarian cancer research. We implemented an online tool to assess the prognostic value of the expression levels of all microarray-quantified genes in ovarian cancer patients. First, a database was set up using gene expression data and survival information of 1287 ovarian cancer patients downloaded from Gene Expression Omnibus and The Cancer Genome Atlas (Affymetrix HG-U133A, HG-U133A 2.0, and HG-U133 Plus 2.0 microarrays). After quality control and normalization, only probes present on all three Affymetrix platforms were retained (n=22,277). To analyze the prognostic value of the selected gene, we divided the patients into two groups according to various quantile expressions of the gene. These groups were then compared using progression-free survival (n=1090) or overall survival (n=1287). A Kaplan-Meier survival plot was generated and significance was computed. The tool can be accessed online. We used this integrative data analysis tool to validate the prognostic power of 37 biomarkers identified in the literature. Of these, CA125 (MUC16; P=3.7×10⁻⁵, hazard ratio (HR)=1.4), CDKN1B (P=5.4×10⁻⁵, HR=1.4), KLK6 (P=0.002, HR=0.79), IFNG (P=0.004, HR=0.81), P16 (P=0.02, HR=0.66), and BIRC5 (P=0.00017, HR=0.75) were associated with survival. The combination of several probe sets can further increase prediction efficiency. In summary, we developed a global online biomarker validation platform that mines all available microarray data to assess the prognostic power of 22,277 genes in 1287 ovarian cancer patients. We specifically used this tool to evaluate the effect of 37 previously published biomarkers on ovarian cancer prognosis.
    Endocrine Related Cancer 01/2012; 19(2):197-208. DOI:10.1530/ERC-11-0329 · 4.91 Impact Factor
    • "However, qPCR analysis showed a similar decrease in cyclin D1 mRNA levels which may become Figure 3 Correlation of CCND1, ID1, SNAI1 and SNAI2 expression to recurrence free survival. Expression of our genes of interest in relation to recurrence-free survival was examined in a breast cancer database containing 1,107 tumours from Sims et al. (2008). Gene expression intensity was quartiled as 1-low, 2-medium low, 3-medium high and 4-high, and assessed in all patients, ER-positive and ER-negative patients, respectively (A) CCND1 quartiles (B) ID1 quartiles (C) SNAI1 quartiles (D) SNAI2 quartiles. "
    ABSTRACT: Cyclin D1 is a well-characterised cell cycle regulator with established oncogenic capabilities. Despite these properties, studies report contrasting links to tumour aggressiveness. It has previously been shown that silencing cyclin D1 increases the migratory capacity of MDA-MB-231 breast cancer cells with concomitant increase in 'inhibitor of differentiation 1' (ID1) gene expression. Id1 is known to be associated with more invasive features of cancer and with the epithelial-mesenchymal transition (EMT). Here, we sought to determine if the increase in cell motility following cyclin D1 silencing was mediated by Id1 and enhanced EMT-features. To further substantiate these findings we aimed to delineate the link between CCND1, ID1 and EMT, as well as clinical properties in primary breast cancer. Protein and gene expression of ID1, CCND1 and EMT markers were determined in MDA-MB-231 and ZR75 cells by western blot and qPCR. Cell migration and promoter occupancy were monitored by transwell and ChIP assays, respectively. Gene expression was analysed from publicly available datasets. The increase in cell migration following cyclin D1 silencing in MDA-MB-231 cells was abolished by Id1 siRNA treatment and we observed cyclin D1 occupancy of the Id1 promoter region. Moreover, ID1 and SNAI2 gene expression was increased following cyclin D1 knock-down, an effect reversed with Id1 siRNA treatment. Similar migratory and SNAI2 increases were noted for the ER-positive ZR75-1 cell line, but in an Id1-independent manner. In a meta-analysis of 1107 breast cancer samples, CCND1low/ID1high tumours displayed increased expression of EMT markers and were associated with reduced recurrence free survival. Finally, a greater percentage of CCND1low/ID1high tumours were found in the EMT-like 'claudin-low' subtype of breast cancer than in other subtypes. 
    These results indicate that increased migration of MDA-MB-231 cells following cyclin D1 silencing can be mediated by Id1 and is linked to an increase in EMT markers. Moreover, we have confirmed a relationship between cyclin D1, Id1 and EMT in primary breast cancer, supporting our in vitro findings that low cyclin D1 expression can be linked to aggressive features in subgroups of breast cancer.
    BMC Cancer 09/2011; 11(1):417. DOI:10.1186/1471-2407-11-417 · 3.32 Impact Factor
    ABSTRACT: High-throughput genomic technology has rapidly become a major tool for the study of breast cancer. Gene expression profiling has been applied to many areas of research from basic science to translational studies, with the potential to identify new targets for treatment, mechanisms of resistance and to improve on current tools for the analysis of prognosis. However, the sheer scale of the data generated along with the number of different protocols, platforms and analysis methods can make these studies difficult for clinicians to comprehend. Similarly, computational scientists and statisticians that may be called upon to analyse the data generated are often unaware of the processes involved in sample collection or the relevance and impact of genetics and pathological characteristics. There is a pressing need for better understanding of the challenges and limitations of microarray approaches, both in experimental design and data analysis. Holistic, whole-genome approaches are still relatively new and critics have been quick to highlight non-overlapping results from groups testing similar hypotheses. However, it is often subtle differences in the experimental design and technology that underpin the variation between these studies. Rather than indicating that the data are meaningless, this suggests that many findings are real, but highly context dependent. This review explores both the current state and potential of bioinformatics to bring meaning to high-throughput genomic approaches in the understanding of breast cancer.
    Journal of clinical pathology 02/2009; 62(10):879-85. DOI:10.1136/jcp.2008.060376 · 2.55 Impact Factor
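
Two steps mentioned in the excerpts above can be sketched in Python: scaling each chip so that its average expression is 1000, and assigning patients to expression quartiles (1-low to 4-high) for a gene of interest. The function names `scale_chips` and `quartile_groups`, the toy values and the tie-handling rule are assumptions for illustration only; the cited studies do not specify their implementations here.

```python
import numpy as np

def scale_chips(expr, target_mean=1000.0):
    """Rescale each chip (column) so that its mean expression equals target_mean.

    expr: array of shape (genes, chips) with non-log expression values.
    """
    expr = np.asarray(expr, dtype=float)
    return expr * (target_mean / expr.mean(axis=0, keepdims=True))

def quartile_groups(values):
    """Label each value 1 (low) to 4 (high) by expression quartile.

    Assumption: a value exactly equal to a cut point is placed in the lower group.
    """
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, [0.25, 0.5, 0.75])
    return np.searchsorted(cuts, values, side="left") + 1

# Hypothetical toy data: 3 genes measured on 2 chips of different overall intensity.
expr = np.array([[100.0, 400.0],
                 [300.0, 1200.0],
                 [200.0, 800.0]])
scaled = scale_chips(expr)
print(scaled.mean(axis=0))

# Hypothetical expression of one gene across 8 patients.
groups = quartile_groups([1, 2, 3, 4, 5, 6, 7, 8])
print(groups)
```

The resulting group labels would then be passed to a survival analysis, such as a Kaplan-Meier comparison, which is outside the scope of this sketch.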