Meta-analysis for ranked discovery datasets: theoretical framework and empirical demonstration for microarrays.

Department of Biomathematics, University of Thessaly School of Medicine, Larissa, Greece.
Computational Biology and Chemistry (Impact Factor: 1.6). 03/2008; 32(1):38-46. DOI: 10.1016/j.compbiolchem.2007.09.003
Source: PubMed

ABSTRACT The combination of results from different large-scale datasets of multidimensional biological signals (such as gene expression profiling) presents a major challenge. Methodologies are needed that can efficiently combine diverse datasets, but can also test the extent of diversity (heterogeneity) across the combined studies. We developed METa-analysis of RAnked DISCovery datasets (METRADISC), a generalized meta-analysis method for combining information across discovery-oriented datasets and for testing between-study heterogeneity for each biological variable of interest. The method is based on non-parametric Monte Carlo permutation testing. The tested biological variables are ranked in each study according to the level of statistical significance. METRADISC tests for each biological variable of interest its average rank and the between-study heterogeneity of the study-specific ranks. After accounting for ties and differences in tested variables across studies, we randomly permute the ranks of each study and the simulated metrics of average rank and heterogeneity are calculated. The procedure is repeated to generate null distributions for the metrics. The use of METRADISC is demonstrated empirically using gene expression data from seven studies comparing prostate cancer cases and normal controls. We offer a new tool for combining complex datasets derived from massive testing, discovery-oriented research and for examining the diversity of results across the combined studies.
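The permutation procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that every study ranks the same variables with no ties (the abstract says METRADISC accounts for both cases, which this sketch does not), that a larger rank number means "ranked higher", and that heterogeneity is measured by the sample standard deviation of the study-specific ranks; the abstract does not define the heterogeneity metric. The function name `permutation_rank_test`, its parameters, and the add-one p-value correction are likewise choices made for illustration.

```python
import numpy as np


def permutation_rank_test(ranks, n_perm=1000, seed=0):
    """Monte Carlo permutation test on a (variables x studies) rank matrix.

    Assumptions (not from the abstract): every study ranks the same
    variables with no ties, a larger rank number means "ranked higher",
    and heterogeneity is the sample standard deviation of the ranks.
    """
    rng = np.random.default_rng(seed)
    ranks = np.asarray(ranks, dtype=float)

    # Observed metrics for each variable (one row per variable).
    obs_avg = ranks.mean(axis=1)
    obs_het = ranks.std(axis=1, ddof=1)

    n_high = np.zeros(ranks.shape[0])
    n_low = np.zeros(ranks.shape[0])
    n_het = np.zeros(ranks.shape[0])

    for _ in range(n_perm):
        # Shuffle the ranks within each study (column) independently.
        sim = rng.permuted(ranks, axis=0)
        sim_avg = sim.mean(axis=1)
        sim_het = sim.std(axis=1, ddof=1)
        n_high += sim_avg >= obs_avg
        n_low += sim_avg <= obs_avg
        n_het += sim_het >= obs_het

    # Add-one correction keeps every estimated p-value strictly positive.
    denom = n_perm + 1
    return {
        "avg_rank": obs_avg,
        "p_high": (n_high + 1) / denom,
        "p_low": (n_low + 1) / denom,
        "p_heterogeneity": (n_het + 1) / denom,
    }


# Toy data (invented): 50 variables ranked in 7 studies, with variable 0
# forced to hold the top rank (50) in every study.
toy_rng = np.random.default_rng(1)
toy = np.column_stack([toy_rng.permutation(np.arange(1, 51)) for _ in range(7)])
for j in range(7):
    top = np.argmax(toy[:, j])
    toy[[0, top], j] = toy[[top, 0], j]  # swap so variable 0 holds rank 50

result = permutation_rank_test(toy, n_perm=500)
```

In this toy example, variable 0 has the maximum possible average rank and zero spread across studies, so its `p_high` value is small while its heterogeneity p-value is large. The numbers are synthetic and say nothing about the prostate cancer data analyzed in the paper.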

  • ABSTRACT: Recent advances in biotechnology and its wide applications have led to the generation of many high-dimensional gene expression data sets that can be used to address similar biological questions. Meta-analysis plays an important role in summarizing and synthesizing scientific evidence from multiple studies. When the dimensions of datasets are high, it is desirable to incorporate variable selection into meta-analysis to improve model interpretation and prediction. To our knowledge, all existing methods conduct variable selection with meta-analyzed data in an “all-in-or-all-out” fashion, that is, a gene is either selected in all studies or not selected in any study. However, because data heterogeneity commonly exists in meta-analyzed data, including choices of biospecimens, study population, and measurement sensitivity, it is possible that a gene is important in some studies while unimportant in others. In this article, we propose a novel method called meta-lasso for variable selection with high-dimensional meta-analyzed data. Through a hierarchical decomposition on regression coefficients, our method not only borrows strength across multiple data sets to boost the power to identify important genes, but also keeps the selection flexibility among data sets to take into account data heterogeneity. We show that our method possesses gene selection consistency, that is, when the sample size of each data set is large, with high probability, our method can identify all important genes and remove all unimportant genes. Simulation studies demonstrate a good performance of our method. We applied our meta-lasso method to a meta-analysis of five cardiovascular studies. The analysis results are clinically meaningful.
    Biometrics 09/2014; 70(4). DOI:10.1111/biom.12213 · 1.52 Impact Factor
  • ABSTRACT: A comprehensive software for performing meta-analysis of ranked discovery oriented datasets, such as those derived from microarrays or other high throughput technologies, and for testing between-study heterogeneity for biological variables (gene expression, microRNA, proteomic, or other high-dimensional data) is presented. The software can identify biological probes that have either very high average ranks (e.g. consistently over-expressed genes) or very low average ranks (e.g. consistently under-expressed genes). The program tests each probe's average rank and the between-study heterogeneity of the study-specific ranks. Furthermore, it performs heterogeneity analyses restricted to probes with similar average ranks. The program allows both unweighted and weighted analysis. Statistical inferences are based on Monte Carlo permutation tests.
    Computer Methods and Programs in Biomedicine 09/2012; DOI:10.1016/j.cmpb.2012.08.001 · 1.56 Impact Factor
  • ABSTRACT: With the rapid advances of various high-throughput technologies, generation of '-omics' data is commonplace in almost every biomedical field. Effective data management and analytical approaches are essential to fully decipher the biological knowledge contained in the tremendous amount of experimental data. Meta-analysis, a set of statistical tools for combining multiple studies of a related hypothesis, has become popular in genomic research. Here, we perform a systematic search from PubMed and manual collection to obtain 620 genomic meta-analysis papers, of which 333 microarray meta-analysis papers are summarized as the basis of this paper and the other 249 GWAS meta-analysis papers are discussed in the next companion paper. The review in the present paper focuses on various biological purposes of microarray meta-analysis, databases and software and related statistical procedures. Statistical considerations of such an analysis are further scrutinized and illustrated by a case study. Finally, several open questions are listed and discussed.
    Nucleic Acids Research 01/2012; 40(9):3785-99. DOI:10.1093/nar/gkr1265 · 8.81 Impact Factor