Meta-analysis for ranked discovery datasets: theoretical framework and empirical demonstration for microarrays. Comput Biol Chem
ABSTRACT The combination of results from different large-scale datasets of multidimensional biological signals (such as gene expression profiling) presents a major challenge. Methodologies are needed that can efficiently combine diverse datasets, but can also test the extent of diversity (heterogeneity) across the combined studies. We developed METa-analysis of RAnked DISCovery datasets (METRADISC), a generalized meta-analysis method for combining information across discovery-oriented datasets and for testing between-study heterogeneity for each biological variable of interest. The method is based on non-parametric Monte Carlo permutation testing. The tested biological variables are ranked in each study according to the level of statistical significance. METRADISC tests for each biological variable of interest its average rank and the between-study heterogeneity of the study-specific ranks. After accounting for ties and differences in tested variables across studies, we randomly permute the ranks of each study and the simulated metrics of average rank and heterogeneity are calculated. The procedure is repeated to generate null distributions for the metrics. The use of METRADISC is demonstrated empirically using gene expression data from seven studies comparing prostate cancer cases and normal controls. We offer a new tool for combining complex datasets derived from massive testing, discovery-oriented research and for examining the diversity of results across the combined studies.
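The permutation procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the published METRADISC implementation. It assumes that rank 1 denotes the most significant variable, that every study tested the same variables with no ties, and that heterogeneity is measured by the standard deviation of a variable's ranks across studies (the abstract does not specify the statistic). The function name `metradisc_sketch` and its parameters are hypothetical.

```python
import random
from statistics import mean, pstdev


def metradisc_sketch(ranks, n_perm=2000, seed=0):
    """Permutation test on average rank and rank heterogeneity.

    ranks: list of studies; each study is a list of ranks, one per
           variable, in the same variable order (assumption: no ties,
           no missing variables, rank 1 = most significant).
    Returns one tuple per variable:
        (average rank, heterogeneity, p for low average rank,
         p for high heterogeneity)
    """
    rng = random.Random(seed)
    n_vars = len(ranks[0])

    def metrics(study_ranks):
        # Average rank and spread of ranks across studies, per variable.
        result = []
        for g in range(n_vars):
            values = [study[g] for study in study_ranks]
            result.append((mean(values), pstdev(values)))
        return result

    observed = metrics(ranks)
    low_avg = [0] * n_vars    # permutations with average rank <= observed
    high_het = [0] * n_vars   # permutations with heterogeneity >= observed

    for _ in range(n_perm):
        # Randomly permute the ranks within each study independently.
        permuted = [rng.sample(study, len(study)) for study in ranks]
        simulated = metrics(permuted)
        for g in range(n_vars):
            if simulated[g][0] <= observed[g][0]:
                low_avg[g] += 1
            if simulated[g][1] >= observed[g][1]:
                high_het[g] += 1

    # Add-one correction keeps the empirical p-values strictly above zero.
    return [
        (observed[g][0], observed[g][1],
         (low_avg[g] + 1) / (n_perm + 1),
         (high_het[g] + 1) / (n_perm + 1))
        for g in range(n_vars)
    ]
```

A variable that ranks near the top in every study would get a small p-value for average rank and a large one for heterogeneity. Handling ties and variables tested in only some studies, which the abstract mentions, is left out of this sketch.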
- Source available from: Ivan Simko
- "... Previously, both MD and RP have performed very well when tested on data from microarray studies (Breitling et al., 2004; Zintzaras and Ioannidis, 2008). However, unlike in our datasets, microarray-based ranks are produced from several tens of thousands of genes, of which most are overlapping in two or more microarrays."
ABSTRACT: Combining heterogeneous data from plant breeding trials into a single dataset can be challenging, especially if observations have been performed only on partially overlapping sets of accessions, or if evaluations were done with different rating scales. In the present work we propose combining such data by making use of aggregate ranking approaches. To test 13 aggregate ranking methods for performance, we have simulated 16 types of datasets that resemble those observed in plant breeding trials. The evaluation of aggregate ranking methods was carried out using both distance-based measures (Kendall’s tau and Spearman’s rho) and number of rank violations caused by a proposed aggregate ranking. Our analysis indicates that methods based on Bradley-Terry or Rasch models performed better than the other tested methods when factors such as fitness of aggregate rankings, time required for analyses, and ability to analyze weak rankings were considered. Verification of the approach on real data from 19 studies indicated a substantial increase in significance (P-value dropped by a factor of 100,000) when linkage between a marker and a trait was based on aggregated data rather than on each of the individual trials. The ability to combine heterogeneous data from independent studies has important ramifications for data analysis in association studies. Results from our study indicate that this kind of meta-analysis is more powerful than individual analyses. Communications in Biometry and Crop Science 01/2010
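As a rough illustration of the two evaluation criteria named in this abstract, the sketch below counts rank violations between a proposed aggregate ranking and a single trial, and derives a Kendall-style correlation from the same pair counts. It is an assumption-laden sketch, not the authors' procedure: rankings are represented as dictionaries mapping accession to rank (lower is better), only accessions present in both rankings are compared, and tied pairs are skipped, so the coefficient equals Kendall's tau only when there are no ties. The function names are hypothetical.

```python
from itertools import combinations


def rank_violations(aggregate, trial):
    """Return (violations, concordant) over accessions shared by both rankings.

    aggregate, trial: dicts mapping accession name -> rank (lower is better).
    A violation is a pair ordered one way by the trial and the opposite
    way by the aggregate ranking. Pairs tied in either ranking are skipped.
    """
    shared = sorted(set(aggregate) & set(trial))
    violations = 0
    concordant = 0
    for a, b in combinations(shared, 2):
        d_agg = aggregate[a] - aggregate[b]
        d_trial = trial[a] - trial[b]
        if d_agg == 0 or d_trial == 0:
            continue  # tied pair: carries no ordering information here
        if (d_agg > 0) != (d_trial > 0):
            violations += 1
        else:
            concordant += 1
    return violations, concordant


def kendall_style_tau(aggregate, trial):
    """(concordant - discordant) / (concordant + discordant) on shared items."""
    violations, concordant = rank_violations(aggregate, trial)
    total = violations + concordant
    return (concordant - violations) / total if total else float("nan")
```

For example, with an aggregate ranking A, B, C, D and a trial ranking A, C, B, E, the shared accessions are A, B and C; the pair (B, C) is a violation and the other two pairs agree.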
- ABSTRACT: DNA microarrays provide a high-throughput technique for the genome-wide profiling of genes at the transcript level. With large amounts of microarray data deposited on various types and aspects of malignancies, microarray technology has revolutionized the study of cancer. Such experiments aid in the discovery of novel biomarkers and provide insight into disease diagnosis, prognosis and response to treatment. Nonetheless, microarray data contain non-biological obscuring variations and systemic biases, which can distort the extraction of true aberrations in gene expression. Moreover, the number of samples generated by a single experiment is typically less than is statistically required to support the large number of genes studied. As a result, biomarker gene lists produced from independent datasets show little overlap. Therefore, to understand the pathophysiology of cancers and the influence they exert on the cellular processes they override, methods for combining data from different sources are necessary. Meta-analysis techniques have been utilized to address this issue by conducting an individual statistical analysis on each of the acquired datasets, then incorporating the results to generate a final gene list based on aggregated p-values or ranks. However, many of the publicly accessible cancer microarray datasets are unbalanced or asymmetric and therefore lack data from healthy samples. Consequently, critical and considerable amounts of data are overlooked. An integrative approach that combines data prior to analysis can incorporate asymmetric data. For this reason, a merge approach to the previously validated technique, the significance analysis of microarrays (SAM), is proposed. The merged SAM technique reproduced the known cancer literature with higher coverage than meta-analysis in the five independent cancer tissues considered.
The same methodology was extended to a database of approximately 6000 healthy and cancer samples arising from thirteen tissues. The integrative approach has allowed for the identification of key genes common to the invasive paths of multiple cancers and can aid in drug discovery. Moreover, this integrative microarray approach was applied to viral data from HIV-1, hepatitis C and influenza to investigate the effect of these infections on iron-binding proteins. Iron is crucial for proteins involved in metabolism, DNA synthesis and immunity, accentuating such proteins as direct or indirect viral targets.
- ABSTRACT: The number of published genetic association studies (GASs) is increasing tremendously due to the availability of mapped single-nucleotide polymorphisms (SNPs) and advances in genotyping technologies. A search in HuGENet illustrates the rapid accumulation of evidence for major diseases. Recently, there has been a lot of activity regarding genome-wide association studies (GWASs), and a growing number of forthcoming studies is expected. GASs and GWASs are usually underpowered to detect significant associations, and the varying quality of reporting publications befuddles researchers. A meta-analysis can increase power and provide standards of reporting results. However, the conduct of a meta-analysis of GASs faces a major obstacle, which is the structure and diversity of stored information in databases. Similar problems are expected for GWASs, though the data are not yet publicly available. The development of a Web-based system for the detailed and structured recording of GAS or GWAS data, accompanied by an estimation of the overall genetic risk effects, would enable scientists to keep track of evidence for gene-disease associations. Journal of Human Genetics 02/2008; 53(1):1-9. DOI:10.1007/s10038-007-0223-5 · 2.53 Impact Factor