Statistical Applications in Genetics and Molecular Biology (STAT APPL GENET MOL)

Publisher: Berkeley Electronic Press

Description

Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. The focus of the papers should be on the relevant statistical issues but should contain a succinct description of the relevant biological problem being considered. The range of topics is wide and will include topics such as linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.

  • Impact factor
    1.52
  • Website
    Statistical Applications in Genetics and Molecular Biology website
  • Other titles
    Statistical applications in genetics and molecular biology, SAGMB
  • ISSN
    1544-6115
  • OCLC
    52157137
  • Material type
    Document, Periodical, Internet resource
  • Document type
    Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

Berkeley Electronic Press

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author can archive a post-print version
  • Conditions
    • On the author's non-commercial personal website, the non-commercial open-access institutional repository of the author's university or employer, and the author's non-commercial course website
    • PubMed and UK PubMed after 12 months (automatic for several journals)
    • Publisher copyright and source must be acknowledged
    • Publisher's version/PDF may be used
  • Classification
    green

Publications in this journal

  • Article: Bayesian hierarchical graph-structured model for pathway analysis using gene expression data.
    ABSTRACT: In genomic analysis, there is growing interest in network structures that represent biochemical interactions. Graph-structured or graph-constrained inference takes advantage of a known relational structure among variables to introduce smoothness and reduce complexity in modeling, especially for high-dimensional genomic data. There has been a lot of interest in its application in model regularization and selection. However, prior knowledge of the graphical structure among the variables can be limited and partial. Empirical data may suggest variations and modifications to such a graph, which could lead to new and interesting biological findings. In this paper, we propose a Bayesian random graph-constrained model, rGrace, an extension of the Grace model, to combine a priori network information with empirical evidence, for applications such as pathway analysis. Using both simulations and real data examples, we show that the new method, while leading to improved predictive performance, can identify discrepancies between data and a known prior graph structure and suggest modifications and updates.
    Statistical Applications in Genetics and Molecular Biology 05/2013;
  • Article: Block-diagonal discriminant analysis and its bias-corrected rules.
    ABSTRACT: High-throughput expression profiling allows simultaneous measurement of tens of thousands of genes. These data have motivated the development of reliable biomarkers for disease subtype identification and diagnosis. Many methods have been developed in the literature for analyzing these data, such as diagonal discriminant analysis, support vector machines, and k-nearest neighbor methods. The diagonal discriminant methods have been shown to perform well for high-dimensional data with small sample sizes. Despite the popularity of these methods, their independence assumption is unlikely to hold in practice. Recently, a gene module based linear discriminant analysis strategy has been proposed by utilizing the correlation among genes in discriminant analysis. However, the approach can be underpowered when the samples of the two classes are unbalanced. In this paper, we propose to correct the biases in the discriminant scores of block-diagonal discriminant analysis. In simulation studies, our proposed method outperforms other approaches in various settings. We also illustrate our proposed discriminant analysis method for analyzing microarray data studies.
    Statistical Applications in Genetics and Molecular Biology 05/2013;
  • Article: An extension of the Wilcoxon-Mann-Whitney test for analyzing RT-qPCR data.
    ABSTRACT: Classical approaches for analyzing reverse transcription quantitative polymerase chain reaction (RT-qPCR) data commonly require normalization before assessing differential expression (DE). Normalization often has a substantial effect on the interpretation and validity of the subsequent analysis steps, but at the same time it causes a reduction in variance and introduces dependence among the normalized outcomes. These effects can be substantial; however, they are typically ignored. Most normalization techniques and methods for DE focus on mean expression and are sensitive to outliers. Moreover, in cancer studies, for example, oncogenes are often only expressed in a subsample of the populations during sampling. This primarily affects the skewness and the tails of the distribution, and the mean is therefore not necessarily the best effect size measure within these experimental setups. In our contribution, we propose an extension of the Wilcoxon-Mann-Whitney test which incorporates a robust normalization, and the uncertainty associated with normalization is propagated into the final statistical summaries for DE. Our method relies on semiparametric regression models that focus on the probability P{Y≤Y'}, where Y and Y' denote independent responses for different subject groups. This effect size is robust to outliers, while remaining informative and intuitive when DE affects the shape of the distribution instead of only the mean. We also extend our approach for assessing DE for multiple features simultaneously. Simulation studies show that the test has a good performance, and that it is very competitive with standard methods for this platform. The method is illustrated on two neuroblastoma studies.
    Statistical Applications in Genetics and Molecular Biology 05/2013;
  • Article: Genetic model selection in genome-wide association studies: robust methods and the use of meta-analysis.
    ABSTRACT: In genetic association studies (GAS) as well as in genome-wide association studies (GWAS), the mode of inheritance (dominant, additive and recessive) is usually not known a priori. Assuming an incorrect mode of inheritance may lead to substantial loss of power, whereas on the other hand, testing all possible models may result in an increased type I error rate. The situation is even more complicated in the meta-analysis of GAS or GWAS, in which individual studies are synthesized to derive an overall estimate. Meta-analysis increases the power to detect weak genotype effects, but heterogeneity and incompatibility between the included studies complicate things further. In this review, we present a comprehensive summary of the statistical methods used for robust analysis and genetic model selection in GAS and GWAS. We then discuss the application of such methods in the context of meta-analysis. We describe the theoretical properties of the various methods and the foundations on which they are based. We also present the available software implementations of the described methods. Finally, since only a few of the available robust methods have been applied in the meta-analysis setting, we present some simple extensions that allow robust meta-analysis of GAS and GWAS. Possible extensions and proposals for future work are also discussed.
    Statistical Applications in Genetics and Molecular Biology 04/2013;
  • Article: Sensitivity to prior specification in Bayesian genome-based prediction models
    ABSTRACT: Different statistical models have been proposed for maximizing prediction accuracy in genome-based prediction of breeding values in plant and animal breeding. However, little is known about the sensitivity of these models with respect to prior and hyperparameter specification, because comparisons of prediction performance are mainly based on a single set of hyperparameters. In this study, we focused on Bayesian prediction methods using a standard linear regression model with marker covariates coding additive effects at a large number of marker loci. By comparing different hyperparameter settings, we investigated the sensitivity of four methods frequently used in genome-based prediction (Bayesian Ridge, Bayesian Lasso, BayesA and BayesB) to specification of the prior distribution of marker effects. We used datasets simulated according to a typical maize breeding program differing in the number of markers and the number of simulated quantitative trait loci affecting the trait. Furthermore, we used an experimental maize dataset, comprising 698 doubled haploid lines, each genotyped with 56110 single nucleotide polymorphism markers and phenotyped as testcrosses for the two quantitative traits grain dry matter yield and grain dry matter content. The predictive ability of the different models was assessed by five-fold cross-validation. The extent of Bayesian learning was quantified by calculation of the Hellinger distance between the prior and posterior densities of marker effects. Our results indicate that similar predictive abilities can be achieved with all methods, but with BayesA and BayesB hyperparameter settings had a stronger effect on prediction performance than with the other two methods. Prediction performance of BayesA and BayesB suffered substantially from a non-optimal choice of hyperparameters.
    Statistical Applications in Genetics and Molecular Biology 04/2013;
  • Article: Exploring the sampling universe of RNA-seq.
    ABSTRACT: How deep is deep enough? While RNA-sequencing represents a well-established technology, the required sequencing depth for detecting all expressed genes is not known. If we leave the entire biological overhead and meta-information behind, we are dealing with a classical sampling process. Such sampling processes are well known from population genetics and thoroughly investigated. Here we use the Pitman Sampling Formula to model the sampling process of RNA-sequencing. By doing so we characterize the sampling by means of two parameters which capture the conglomerate of different sequencing technologies, protocols and their associated biases. We distinguish between two levels of sampling: the number of reads per gene and the number of reads starting at each position of a specific gene. The latter approach allows us to evaluate the theoretical expectation of uniform coverage and the performance of sequencing protocols in that respect. Most importantly, given a pilot sequencing experiment we provide an estimate for the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.
    Statistical Applications in Genetics and Molecular Biology 04/2013;
  • Article: A novel method for analyzing genetic association with longitudinal phenotypes.
    ABSTRACT: Knowledge of genes influencing longitudinal patterns may offer information about predicting disease progression. We developed a systematic procedure for testing association between SNP genotypes and longitudinal phenotypes. We evaluated false positive rates and statistical power to localize genes for disease progression. We used genome-wide SNP data from the Framingham Heart Study. With longitudinal data from two real studies unrelated to Framingham, we estimated three trajectory curves from each study. We performed simulations by randomly selecting 500 individuals. In each simulation replicate, we assigned each individual to one of the three trajectory groups based on the underlying hypothesis (null or alternative), and generated corresponding longitudinal data. Individual Bayesian posterior probabilities (BPPs) for belonging to a specific trajectory curve were estimated. These BPPs were treated as a quantitative trait and tested (using the Wald test) for genome-wide association. Empirical false positive rates and power were calculated. Our method maintained the expected false positive rate for all simulation models. Also, our method achieved high empirical power for most simulations. Our work presents a method for disease progression gene mapping. This method is potentially clinically significant as it may allow doctors to predict disease progression based on genotype and determine treatment accordingly.
    Statistical Applications in Genetics and Molecular Biology 03/2013;
  • Article: Two optimization strategies of multi-stage design in clinical proteomic studies.
    ABSTRACT: We evaluated statistical approaches to facilitate and improve multi-stage designs for clinical proteomic studies which plan to transition from laboratory discovery to clinical utility. To find the design with the greatest expected number of true discoveries under constraints on cost and false discovery, the operating characteristics of the multi-stage study were optimized as a function of sample sizes and nominal type-I error rates at each stage. A nested simulated annealing algorithm was used to find the best solution in the bounded spaces constructed by multiple design parameters. This approach is demonstrated to be feasible and to lead to efficient designs. The use of biological grouping information in the study design was also investigated using synthetic datasets based on a cardiac proteomic study, and an actual dataset from a clinical immunology proteomic study. When different protein patterns were present, performance improved when the grouping was informative, with little loss in performance when the grouping was uninformative.
    Statistical Applications in Genetics and Molecular Biology 03/2013;
  • Article: Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery
    ABSTRACT: In omics studies aimed at the early detection and diagnosis of cancer, bioinformatics tools play a significant role when analyzing high dimensional, complex datasets, as well as when identifying a small set of biomarkers. However, in many cases, there are ambiguities in the robustness and the consistency of the discovered biomarker sets, since the feature selection methods often lead to irreproducible results. To address this, both the stability and the classification power of several chemometrics-based feature selection algorithms were evaluated using the Monte Carlo sampling technique, aiming at finding the most suitable feature selection methods for early cancer detection and biomarker discovery. To this end, two data sets were analyzed, which comprised of MALDI-TOF-MS and LC/TOF-MS spectra measured on serum samples in order to diagnose ovarian cancer. Using these datasets, the stability and the classification power of multiple feature subsets found by different feature selection methods were quantified by varying either the number of selected features, or the number of samples in the training set, with special emphasis placed on the property of stability. The results show that high consistency does not necessarily guarantee high predictive power. In addition, differences in the stability, as well as agreement in feature lists between several feature selection methods, depend on several factors, such as the number of available samples, feature sizes, quality of the information in the dataset, etc. Among the tested methods, only the variable importance in projection (VIP)-based method shows complementary properties, providing both highly consistent and accurate subsets of features. In addition, successive projection analysis (SPA) was excellent with regards to maintaining high stability over a wide range of experimental conditions. 
    The stability of several feature selection methods is highly variable, stressing the importance of making the proper choice among feature selection methods. Therefore, rather than evaluating the selected features using only classification accuracy, stability measurements should be examined as well to improve the reliability of biomarker discovery.
    Statistical Applications in Genetics and Molecular Biology 03/2013;
  • Article: Higher order asymptotics for negative binomial regression inferences from RNA-sequencing data.
    ABSTRACT: RNA sequencing (RNA-Seq) is the current method of choice for characterizing transcriptomes and quantifying gene expression changes. This next generation sequencing-based method provides unprecedented depth and resolution. The negative binomial (NB) probability distribution has been shown to be a useful model for frequencies of mapped RNA-Seq reads and consequently provides a basis for statistical analysis of gene expression. Negative binomial exact tests are available for two-group comparisons but do not extend to negative binomial regression analysis, which is important for examining gene expression as a function of explanatory variables and for adjusted group comparisons accounting for other factors. We address the adequacy of available large-sample tests for the small sample sizes typically available from RNA-Seq studies and consider a higher-order asymptotic (HOA) adjustment to likelihood ratio tests. We demonstrate that 1) the HOA-adjusted likelihood ratio test is practically indistinguishable from the exact test in situations where the exact test is available, 2) the type I error of the HOA test matches the nominal specification in regression settings we examined via simulation, and 3) the power of the likelihood ratio test does not appear to be affected by the HOA adjustment. This work helps clarify the accuracy of the unadjusted likelihood ratio test and the degree of improvement available with the HOA adjustment. Furthermore, the HOA test may be preferable even when the exact test is available because it does not require ad hoc library size adjustments.
    Statistical Applications in Genetics and Molecular Biology 03/2013;
  • Article: Flexible pooling in gene expression profiles: design and statistical modeling of experiments for unbiased contrasts.
    ABSTRACT: Pooling is an important resource in microarray gene expression experiments. Due to restrictions imposed by the statistical analysis, it is widespread practice to employ a fixed pool size over the whole experiment. But this limits the efficient use of experimental material. In this paper we propose a design with flexible pool sizes for mRNA pooling which includes varying numbers of experimental units per pool. Enforcing balance between the pool sizes of every treatment level, we show the unbiasedness of the generalized least squares estimator of a contrast testing for differences in gene expression between treatments. In order to model the variability of pooled observations we include random biological effects as well as a special kind of technical error (random effect for mixtures), induced by inaccuracies in blending aliquots of mRNA from different individuals into common pools. Results for one-color arrays are also extended to two-color arrays.
    Statistical Applications in Genetics and Molecular Biology 03/2013;
  • Article: Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures.
    ABSTRACT: DNA methylation is a well-recognized epigenetic mechanism that has been the subject of a growing body of literature typically focused on the identification and study of profiles of DNA methylation and their association with human diseases and exposures. In recent years, a number of unsupervised clustering algorithms, both parametric and non-parametric, have been proposed for clustering large-scale DNA methylation data. However, most of these approaches do not incorporate known biological relationships of measured features, and in some cases, rely on unrealistic assumptions regarding the nature of DNA methylation. Here, we propose a modified version of a recursively partitioned mixture model (RPMM) that integrates information related to the proximity of CpG loci within the genome to inform correlation structures on which subsequent clustering analysis is based. Using simulations and four methylation data sets, we demonstrate that integrating biologically informative correlation structures within RPMM resulted in improved goodness-of-fit, clustering consistency, and the ability to detect biologically meaningful clusters compared to methods which ignore such correlation. Integrating biologically-informed correlation structures to enhance modeling techniques is motivated by the rapid increase in resolution of DNA methylation microarrays and the increasing understanding of the biology of this epigenetic mechanism.
    Statistical Applications in Genetics and Molecular Biology 03/2013;
  • Article: Approximate Bayesian computation with functional statistics.
    ABSTRACT: Functional statistics are commonly used to characterize spatial patterns in general and spatial genetic structures in population genetics in particular. Such functional statistics also enable the estimation of parameters of spatially explicit (and genetic) models. Recently, Approximate Bayesian Computation (ABC) has been proposed to estimate model parameters from functional statistics. However, applying ABC with functional statistics may be cumbersome because of the high dimension of the set of statistics and the dependences among them. To tackle this difficulty, we propose an ABC procedure which relies on an optimized weighted distance between observed and simulated functional statistics. We applied this procedure to a simple step model, a spatial point process characterized by its pair correlation function and a pollen dispersal model characterized by genetic differentiation as a function of distance. These applications showed how the optimized weighted distance improved estimation accuracy. In the discussion, we consider the application of the proposed ABC procedure to functional statistics characterizing non-spatial processes.
    Statistical Applications in Genetics and Molecular Biology 02/2013;
  • Article: Modeling the DNA copy number aberration patterns in observational high-throughput cancer data.
    ABSTRACT: The process of occurrence of genomic aberrations over time in the genetic material of cancer cells reflects the progression of the cancer. Modern technologies like aCGH (array Comparative Genomic Hybridization) and MPS (Massive Parallel Sequencing) provide high-resolution measurements of DNA copy number aberrations that reveal the full scale of genomic aberrations. A continuous time Markov chain model is proposed to describe the accumulation of aberrations over time. Time, however, is a latent variable (with the number of aberrations as a proxy). Integrating out time yields the distribution of the observed DNA copy number data. The model parameters are estimated from high-dimensional DNA copy number data by means of penalized maximum pseudo-likelihood, penalized maximum likelihood, and method-of-moments procedures. Having fitted the model, posterior time estimates of the advancement of each sample's cancer are obtained and the most likely locations of a sample's aberrations are predicted. The three estimation methods are compared in a simulation study. The paper closes with an application of the proposed methodology on cancer data.
    Statistical Applications in Genetics and Molecular Biology 01/2013; 12(2):143-74.
  • Article: Inferring latent gene regulatory network kinetics.
    ABSTRACT: Regulatory networks consist of genes encoding transcription factors (TFs) and the genes they activate or repress. Various types of systems of ordinary differential equations (ODE) have been proposed to model these networks, ranging from linear to Michaelis-Menten approaches. In practice, a serious drawback in estimating these models is that the TFs are generally unobserved. The reason is the current lack of high-throughput techniques to measure abundance of proteins in the cell. The challenge is to infer their activity profile together with the kinetic parameters of the ODE using expression level measurements of the genes they regulate. In this work we propose a general statistical framework to infer the kinetic parameters of regulatory networks with one or more TFs using time course gene expression data. Our approach is also able to predict the activity levels of the TF. We use a penalized likelihood approach where the ODE is used as a penalty. The main advantage is that the solution of the ODE is not required explicitly, as is common in most proposed methods. This makes our approach computationally efficient and suitable for large systems with many components. We use the proposed method to study an SOS repair system in Escherichia coli. The reconstructed TF exhibits a similar behavior to experimentally measured profiles and the genetic expression data are fitted properly.
    Statistical Applications in Genetics and Molecular Biology 01/2013; 12(1):109-27.
  • Article: A family-based probabilistic method for capturing de novo mutations from high-throughput short-read sequencing data.
    ABSTRACT: Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
    Statistical Applications in Genetics and Molecular Biology 01/2012; 11(2).
  • Article: Gene filtering in the analysis of Illumina microarray experiments.
    ABSTRACT: Illumina bead arrays are microarrays that contain a random number of technical replicates (beads) for every probe (bead type) within the same array. Typically around 30 beads are placed at random positions on the array surface, which opens unique opportunities for quality control. Most preprocessing methods for Illumina bead arrays are ported from the Affymetrix microarray platform and ignore the availability of the technical replicates. The large number of beads for a particular bead type on the same array, however, should be highly correlated, otherwise they just measure noise and can be removed from the downstream analysis. Hence, filtering bead types can be considered as an important step of the preprocessing procedure for Illumina platform. This paper proposes a filtering method for Illumina bead arrays, which builds upon the mixed model framework. Bead types are called informative/non-informative (I/NI) based on a trade-off between within and between array variabilities. The method is illustrated on a publicly available Illumina Spike-in data set (Dunning et al., 2008) and we also show that filtering results in a more powerful analysis of differentially expressed genes.
    Statistical Applications in Genetics and Molecular Biology 01/2012; 11(2).
  • Article: Sample size calculations for designing clinical proteomic profiling studies using mass spectrometry.
    ABSTRACT: In cancer clinical proteomics, MALDI and SELDI profiling are used to search for biomarkers of potentially curable early-stage disease. A given number of samples must be analysed in order to detect clinically relevant differences between cancers and controls, with adequate statistical power. From clinical proteomic profiling studies, expression data for each peak (protein or peptide) from two or more clinically defined groups of subjects are typically available. Typically, both exposure and confounder information on each subject are also available, and usually the samples are not from randomized subjects. Moreover, the data is usually available in replicate. At the design stage, however, covariates are not typically available and are often ignored in sample size calculations. This leads to the use of insufficient numbers of samples and reduced power when there are imbalances in the numbers of subjects between different phenotypic groups. A method is proposed for accommodating information on covariates, data imbalances and design-characteristics, such as the technical replication and the observational nature of these studies, in sample size calculations. It assumes knowledge of a joint distribution for the protein expression values and the covariates. When discretized covariates are considered, the effect of the covariates enters the calculations as a function of the proportions of subjects with specific attributes. This makes it relatively straightforward (even when pilot data on subject covariates is unavailable) to specify and to adjust for the effect of the expected heterogeneities. The new method suggests certain experimental designs which lead to the use of a smaller number of samples when planning a study. Analysis of data from the proteomic profiling of colorectal cancer reveals that fewer samples are needed when a study is balanced than when it is unbalanced, and when the IMAC30 chip-type is used. 
    The method is implemented in the clippda package and is available in R at: http://www.bioconductor.org/help/bioc-views/release/bioc/html/clippda.html.
    Statistical Applications in Genetics and Molecular Biology 01/2012; 11(3):Article 2.
  • Article: A non-homogeneous dynamic Bayesian network with sequentially coupled interaction parameters for applications in systems and synthetic biology.
    ABSTRACT: An important and challenging problem in systems biology is the inference of gene regulatory networks from short non-stationary time series of transcriptional profiles. A popular approach that has been widely applied to this end is based on dynamic Bayesian networks (DBNs), although traditional homogeneous DBNs fail to model the non-stationarity and time-varying nature of the gene regulatory processes. Various authors have therefore recently proposed combining DBNs with multiple changepoint processes to obtain time varying dynamic Bayesian networks (TV-DBNs). However, TV-DBNs are not without problems. Gene expression time series are typically short, which leaves the model over-flexible, leading to over-fitting or inflated inference uncertainty. In the present paper, we introduce a Bayesian regularization scheme that addresses this difficulty. Our approach is based on the rationale that changes in gene regulatory processes appear gradually during an organism's life cycle or in response to a changing environment, and we have integrated this notion in the prior distribution of the TV-DBN parameters. We have extensively tested our regularized TV-DBN model on synthetic data, in which we have simulated short non-homogeneous time series produced from a system subject to gradual change. We have then applied our method to real-world gene expression time series, measured during the life cycle of Drosophila melanogaster, under artificially generated constant light condition in Arabidopsis thaliana, and from a synthetically designed strain of Saccharomyces cerevisiae exposed to a changing environment.
    Statistical Applications in Genetics and Molecular Biology 01/2012; 11(4).
  • Article: A new approach for the joint analysis of multiple ChIP-seq libraries with application to histone modification.
    ABSTRACT: Most approaches for analyzing ChIP-Seq data are focused on inferring exact protein binding sites from a single library. However, frequently multiple ChIP-Seq libraries derived from differing cell lines or tissue types from the same individual may be available. In such a situation, a separate analysis for each tissue or cell line may be inefficient. Here, we describe a novel method to analyze such data that intelligently uses the joint information from multiple related ChIP-Seq libraries. We present our method as a two-stage procedure. First, separate single cell line analysis is performed for each cell line. Here, we use a novel mixture regression approach to infer the subset of genes that are most likely to be involved in protein binding in each cell line. In the second step, we combine the separate single cell line analyses using an Empirical Bayes algorithm that implicitly incorporates inter-cell line correlation. We demonstrate the usefulness of our method using both simulated data, as well as real H3K4me3 and H3K27me3 histone methylation libraries.
    Statistical Applications in Genetics and Molecular Biology 01/2012; 11(3):Article 1.

Keywords

bootstrap, classification, data, gene, method, microarray, model, nucleosome, prediction, procedure, selection, set, Šidák, tree
