Meta-analysis for ranked discovery datasets: theoretical framework and empirical demonstration for microarrays

Department of Biomathematics, University of Thessaly School of Medicine, Larissa, Greece.
Computational Biology and Chemistry (Impact Factor: 1.12). 03/2008; 32(1):38-46. DOI: 10.1016/j.compbiolchem.2007.09.003
Source: PubMed


Combining results from different large-scale datasets of multidimensional biological signals (such as gene expression profiling) presents a major challenge. Methodologies are needed that can efficiently combine diverse datasets and can also test the extent of diversity (heterogeneity) across the combined studies. We developed METa-analysis of RAnked DISCovery datasets (METRADISC), a generalized meta-analysis method for combining information across discovery-oriented datasets and for testing between-study heterogeneity for each biological variable of interest. The method is based on non-parametric Monte Carlo permutation testing. The tested biological variables are ranked in each study according to their level of statistical significance. For each biological variable of interest, METRADISC tests its average rank and the between-study heterogeneity of its study-specific ranks. After accounting for ties and for differences in the variables tested across studies, we randomly permute the ranks within each study and calculate the simulated metrics of average rank and heterogeneity. The procedure is repeated to generate null distributions for the metrics. The use of METRADISC is demonstrated empirically using gene expression data from seven studies comparing prostate cancer cases and normal controls. We offer a new tool for combining complex datasets derived from massive-testing, discovery-oriented research and for examining the diversity of results across the combined studies.
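
The permutation procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `permutation_meta_analysis`, the convention that rank 1 is the most significant, and the heterogeneity metric (sum of squared deviations of the study-specific ranks from the average rank) are all assumptions made for illustration. The sketch also assumes no ties and that every variable is tested in every study, whereas METRADISC itself accounts for both.

```python
import numpy as np


def permutation_meta_analysis(ranks, n_perm=1000, seed=0):
    """Permutation test of average rank and rank heterogeneity per variable.

    ranks: array of shape (n_variables, n_studies); column j holds the ranks
    assigned by study j. Assumptions (not from the paper): rank 1 is the most
    significant, there are no ties, and every variable appears in every study.
    """
    rng = np.random.default_rng(seed)
    ranks = np.asarray(ranks, dtype=float)
    n_variables, n_studies = ranks.shape

    def metrics(r):
        avg = r.mean(axis=1)
        # Assumed heterogeneity metric: sum of squared deviations of the
        # study-specific ranks from the variable's average rank.
        het = ((r - avg[:, None]) ** 2).sum(axis=1)
        return avg, het

    obs_avg, obs_het = metrics(ranks)
    low_avg = np.zeros(n_variables)   # null average rank <= observed
    high_het = np.zeros(n_variables)  # null heterogeneity >= observed

    for _ in range(n_perm):
        # Shuffle each study's ranks independently of the other studies.
        permuted = np.column_stack(
            [rng.permutation(ranks[:, j]) for j in range(n_studies)]
        )
        null_avg, null_het = metrics(permuted)
        low_avg += null_avg <= obs_avg
        high_het += null_het >= obs_het

    # Add-one correction keeps the Monte Carlo p-values strictly positive.
    return {
        "avg_rank": obs_avg,
        "heterogeneity": obs_het,
        "p_avg": (low_avg + 1) / (n_perm + 1),
        "p_het": (high_het + 1) / (n_perm + 1),
    }


# Hypothetical demo data: 50 variables ranked in 5 studies, with variable 0
# forced to rank 1 in every study.
demo_rng = np.random.default_rng(1)
demo_ranks = np.column_stack(
    [demo_rng.permutation(np.arange(1.0, 51.0)) for _ in range(5)]
)
for j in range(demo_ranks.shape[1]):
    k = int(np.argmin(demo_ranks[:, j]))  # row currently holding rank 1
    demo_ranks[[0, k], j] = demo_ranks[[k, 0], j]

result = permutation_meta_analysis(demo_ranks, n_perm=500)
```

In the demo, variable 0 is consistently top-ranked, so its average-rank p-value should be small while its heterogeneity is zero. A variable with a low average rank but large heterogeneity would instead suggest that the studies disagree about it.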

    • "In addition, power analysis may not be applicable in meta-analysis, since it is a retrospective, all-inclusive synthesis of published studies [8,38]. Nevertheless, type II errors are expected to be less common in a meta-analysis than in single studies [8,39]. Currently, no single institution alone is able to provide a sufficient number of patients, and therefore the creation of large databases from consortia where researchers share their data are required. "
    ABSTRACT: Focal adhesion (FA) family genes have been studied as candidate genes for osteoporosis, but the results of genetic association studies (GASs) are controversial. To clarify these data, a systematic assessment of GASs for FA genes in osteoporosis was conducted. We developed Cumulative Meta-Analysis of GAS-OSTEOporosis (CUMAGAS-OSTEOporosis), a web-based information system that allows the retrieval, analysis and meta-analysis (for allele contrast, recessive, dominant, additive and codominant models) of data from GASs on osteoporosis with the capability of update. GASs were identified by searching the PubMed and HuGE PubLit databases. Data from 72 studies involving 13 variants of 6 genes were analyzed and catalogued in CUMAGAS-OSTEOporosis. Twenty-two studies produced significant associations with osteoporosis risk under any genetic model. All studies were underpowered (<50%). In four studies, the controls deviated from the Hardy-Weinberg equilibrium. Eight variants were chosen for meta-analysis, and significance was shown for the variants collagen, type I, α1 (COL1A1) G2046T (all genetic models), COL1A1 G-1997T (allele contrast and dominant model) and integrin β-chain β3 (ITGB3) T176C (recessive and additive models). In COL1A1 G2046T, subgroup analysis has shown significant associations for Caucasians, adults, females, males and postmenopausal women. A differential magnitude of effect in large versus small studies (that is, indication of publication bias) was detected for the variant COL1A1 G2046T. There is evidence of an implication of FA family genes in osteoporosis. CUMAGAS-OSTEOporosis could be a useful tool for current genomic epidemiology research in the field of osteoporosis.
    Full-text · Article · Jan 2011 · BMC Medicine
    • "Previously, both MD and RP have performed very well when tested on data from microarray studies (Breitling et al., 2004; Zintzaras and Ioannidis, 2008). However, unlike in our datasets, microarray-based ranks are produced from several tens of thousands of genes, of which most are overlapping in two or more microarrays. "
    ABSTRACT: Combining heterogeneous data from plant breeding trials into a single dataset can be challenging, especially if observations have been performed only on partially overlapping sets of accessions, or if evaluations were done with different rating scales. In the present work we propose combining such data by making use of aggregate ranking approaches. To test 13 aggregate ranking methods for performance, we have simulated 16 types of datasets that resemble those observed in plant breeding trials. The evaluation of aggregate ranking methods was carried out using both distance-based measures (Kendall’s tau and Spearman’s rho) and number of rank violations caused by a proposed aggregate ranking. Our analysis indicates that methods based on Bradley-Terry or Rasch models performed better than the other tested methods when factors such as fitness of aggregate rankings, time required for analyses, and ability to analyze weak rankings were considered. Verification of the approach on real data from 19 studies indicated a substantial increase in significance (P-value dropped by a factor of 100,000) when linkage between a marker and a trait was based on aggregated data rather than on each of the individual trials. The ability to combine heterogeneous data from independent studies has important ramifications for data analysis in association studies. Results from our study indicate that this kind of meta-analysis is more powerful than individual analyses.
    Full-text · Article · Jan 2010 · Communications in Biometry and Crop Science
    • "We chose to use a rank-based method because: 1) in practice, the main purpose of microarray experiments is to rank genes rather than to obtain precise estimates of their statistical significance, since the number of statistically significant genes often greatly exceeds the number of genes that can be validated [33], 2) non-parametric analyses are more robust in general, 3) the techniques and assumptions used in the estimation of p-values and the subsequent correction for multiple hypothesis testing may be different between data sets and may not be directly comparable, and 4) using non-parametric methods to rank genes has proven highly effective in the context of genomics. Although more sophisticated rank-based procedures are available [34], the rank sum and rank product methods have been shown to give good results on microarray data [32]. Because the rank sum technique is more robust than the rank product approach and is preferable when the variance of some features may be larger than others [35], we employ the rank sum procedure. "
    ABSTRACT: Anaplastic astrocytoma (AA) and its more aggressive counterpart, glioblastoma multiforme (GBM), are the most common intrinsic brain tumors in adults and are almost universally fatal. A deeper understanding of the molecular relationship of these tumor types is necessary to derive insights into the diagnosis, prognosis, and treatment of gliomas. Although genomewide profiling of expression levels with microarrays can be used to identify differentially expressed genes between these tumor types, comparative studies so far have resulted in gene lists that show little overlap. To achieve a more accurate and stable list of the differentially expressed genes and pathways between primary GBM and AA, we performed a meta-analysis using publicly available genome-scale mRNA data sets. There were four data sets with sufficiently large sample sizes of both GBMs and AAs, all of which coincidentally used human U133 platforms from Affymetrix, allowing for easier and more precise integration of data. After scoring genes and pathways within each data set, we combined the statistics across studies using the nonparametric rank sum method to identify the features that differentiate GBMs and AAs. We found >900 statistically significant probe sets after correction for multiple testing from the >22,000 tested. We also used the rank sum approach to select >20 significant Biocarta pathways after correction for multiple testing out of >175 pathways examined. The most significant pathway was the hypoxia-inducible factor (HIF) pathway. Our analysis suggests that many of the most statistically significant genes work together in a HIF1A/VEGF-regulated network to increase angiogenesis and invasion in GBM when compared to AA. We have performed a meta-analysis of genome-scale mRNA expression data for 289 human malignant gliomas and have identified a list of >900 probe sets and >20 pathways that are significantly different between GBM and AA. 
    These feature lists could be utilized to aid in diagnosis, prognosis, and grade reduction of high-grade gliomas and to identify genes that were not previously suspected of playing an important role in glioma biology. More generally, this approach suggests that combined analysis of existing data sets can reveal new insights and that the large amount of publicly available cancer data sets should be further utilized in a similar manner.
    Full-text · Article · Oct 2009 · Molecular Cancer