Statistical Applications in Genetics and Molecular Biology (STAT APPL GENET MOL)

Publisher: De Gruyter

Journal description

Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues while including a succinct description of the biological problem being considered. The range of topics is wide, including linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.

Current impact factor: 1.52

Impact Factor Rankings

2015 Impact Factor Available summer 2015
2010 Impact Factor 1.842
2009 Impact Factor 2.247
2008 Impact Factor 2.167
2007 Impact Factor 2.167


Additional details

5-year impact 1.70
Cited half-life 7.20
Immediacy index 0.12
Eigenfactor 0.00
Article influence 0.91
Website Statistical Applications in Genetics and Molecular Biology website
Other titles Statistical applications in genetics and molecular biology, SAGMB
ISSN 1544-6115
OCLC 52157137
Material type Document, Periodical, Internet resource
Document type Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

De Gruyter

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 12 months embargo
  • Conditions
    • Pre-print and abstract on author's personal website only
    • Author's post-print on funder's repository or funder's designated repository at the funding agency's request or as a result of legal obligation
    • Publisher's version/PDF may be used, on author's personal website, editor's personal website or institutional repository
    • Authors cannot deposit in subject repositories
    • Published source must be acknowledged
    • Must link to publisher version and article’s DOI must be given
    • Set statement to accompany deposit (see policy)
  • Classification
    • yellow

Publications in this journal

  • ABSTRACT: High-throughput sequencing techniques are increasingly affordable and produce massive amounts of data. Together with other high-throughput technologies, such as microarrays, an enormous amount of resources now resides in databases. The collection of these valuable data has been routine for more than a decade. Despite different technologies, many experiments share the same goal. For instance, the aims of RNA-seq studies often coincide with those of differential gene expression experiments based on microarrays. As such, it would be logical to utilize all available data. However, there is a lack of biostatistical tools for the integration of results obtained from different technologies. Although diverse technological platforms produce different raw data, one commonality for experiments with the same goal is that all the outcomes can be transformed into a platform-independent data format - rankings - for the same set of items. Here we present the R package TopKLists, which allows for statistical inference on the lengths of informative (top-k) partial lists, for stochastic aggregation of full or partial lists, and for graphical exploration of the input and consolidated output. A graphical user interface has also been implemented to provide access to the underlying algorithms. To illustrate the applicability and usefulness of the package, we integrated microRNA data of non-small cell lung cancer across different measurement techniques and drew conclusions. The package can be obtained from CRAN under an LGPL-3 license.
    Statistical Applications in Genetics and Molecular Biology 05/2015; DOI:10.1515/sagmb-2014-0093
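The rank-aggregation idea behind TopKLists can be illustrated with a minimal sketch (in Python rather than R, with made-up miRNA names and a simple Borda-style mean-rank rule; the package's actual algorithms are more sophisticated):

```python
# Toy illustration only: consolidate top-k partial rankings from
# different platforms into one consensus list by mean rank.

def aggregate_top_k(lists, k):
    """Aggregate several top-k partial lists by mean rank.

    Items missing from a list receive the pessimistic rank k + 1,
    a common convention for partial rankings. Ties break alphabetically.
    """
    items = {g for lst in lists for g in lst}
    scores = {}
    for g in items:
        ranks = [lst.index(g) + 1 if g in lst else k + 1 for lst in lists]
        scores[g] = sum(ranks) / len(ranks)
    return sorted(items, key=lambda g: (scores[g], g))

# Three platforms report partially disagreeing top-3 miRNA rankings.
platform_a = ["miR-21", "miR-155", "miR-31"]
platform_b = ["miR-21", "miR-31", "miR-205"]
platform_c = ["miR-155", "miR-21", "miR-31"]
consensus = aggregate_top_k([platform_a, platform_b, platform_c], k=3)
print(consensus[0])  # miR-21 has the best mean rank
```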
  • ABSTRACT: A nonparametric estimator of mutual information is proposed and shown to be asymptotically normal and efficient, with a bias decaying exponentially in sample size. The asymptotic normality and the rapidly decaying bias together offer a viable inferential tool for assessing mutual information between two random elements on finite alphabets, where the maximum likelihood estimator of mutual information greatly inflates the probability of type I error. The proposed estimator is illustrated by three examples in which the association between a pair of genes is assessed based on their expression levels. Several results of a simulation study are also provided.
    Statistical Applications in Genetics and Molecular Biology 05/2015; DOI:10.1515/sagmb-2014-0047
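For context, here is a minimal sketch of the plug-in (maximum likelihood) estimator that the abstract contrasts against, assuming paired observations on finite alphabets; the paper's bias-corrected estimator is not reproduced here:

```python
from collections import Counter
from math import log

def plugin_mutual_information(pairs):
    """Plug-in (maximum likelihood) estimate of I(X;Y) in nats for
    paired observations on finite alphabets. This is the baseline
    estimator the abstract criticizes for inflating type I error;
    the paper's estimator adds a bias correction not shown here."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint empirical frequencies
    px = Counter(x for x, _ in pairs)    # marginal of X
    py = Counter(y for _, y in pairs)    # marginal of Y
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Perfectly dependent binary sample: I(X;Y) = H(X) = log 2.
sample = [(0, 0), (1, 1)] * 50
print(round(plugin_mutual_information(sample), 4))  # 0.6931
```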
  • ABSTRACT: In genome-wide association studies (GWAS), it is of interest to identify genetic variants associated with phenotypes. For a given phenotype, the associated genetic variants are usually a sparse subset of all possible variants, so traditional Lasso-type estimation methods can be used to detect important genes. But the relationship between genotypes at one variant and a phenotype may be influenced by other variables, such as sex and lifestyle, so it is important to be able to incorporate gene-covariate interactions into the sparse regression model. In addition, because there is biological knowledge on the manner in which genes work together in structured groups, it is desirable to incorporate this information as well. In this paper, we present a novel sparse regression methodology for gene-covariate models in association studies that not only allows such interactions but also considers biological group structure. Simulation results show that our method substantially outperforms a method that considers interactions but ignores group structure. Application to data on total plasma immunoglobulin E (IgE) concentrations in the Framingham Heart Study (FHS), using sex and smoking status as covariates, yields several potentially interesting gene-covariate interactions.
    Statistical Applications in Genetics and Molecular Biology 05/2015; DOI:10.1515/sagmb-2014-0073
  • ABSTRACT: The use of fold-change (FC) to prioritize differentially expressed genes (DEGs) for post-hoc characterization is a common technique in the analysis of RNA sequencing datasets. However, FC can overlook certain populations of DEGs, such as high copy number transcripts that undergo metabolically expensive changes in expression yet fail to exceed the ratiometric FC cut-off, thereby missing potentially important biological information. Here we evaluate an alternative approach to prioritizing RNA-seq data based on absolute changes in normalized transcript counts (ΔT) between control and treatment conditions. In five pairwise comparisons with a wide range of effect sizes, rank-ordering of DEGs by the magnitude of ΔT produced a power curve-like distribution, in which 4.7-5.0% of transcripts were responsible for 36-50% of the cumulative change. Thus, differential gene expression is characterized by the high production-cost expression of a small number of genes (large ΔT genes), while the differential expression of the majority of genes involves a much smaller metabolic investment by the cell. To determine whether the large ΔT datasets are representative of coordinated changes in the transcriptional program, we evaluated large ΔT genes for enrichment of gene ontologies (GOs) and predicted protein interactions. In comparison to randomly selected DEGs, the large ΔT transcripts were significantly enriched for both GOs and predicted protein interactions. Furthermore, enrichments were consistent with the biological context of each comparison yet distinct from those produced using equal-sized populations of large FC genes, indicating that the large ΔT genes represent an orthogonal transcriptional response. Finally, the composition of the large ΔT gene sets was unique to each pairwise comparison, indicating that they represent coherent and context-specific responses to biological conditions rather than the non-specific upregulation of a family of genes. These findings suggest that the large ΔT genes are not a product of random or stochastic phenomena, but rather represent biologically meaningful changes in the transcriptional program. They furthermore imply that high abundance transcripts are associated with particular cellular states, and that as cells change in response to internal or external conditions, the relative distribution of the abundant transcripts changes accordingly. Thus, prioritization of DEGs based on the concept of metabolic cost is a simple yet powerful method for identifying biologically important transcriptional changes and provides novel insights into cellular behaviors.
    Statistical Applications in Genetics and Molecular Biology 03/2015; DOI:10.1515/sagmb-2014-0018
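The ΔT prioritization described above amounts to a simple rank-ordering by absolute change. A toy sketch with hypothetical gene names and counts (the paper's normalization pipeline is not reproduced):

```python
def rank_by_delta_t(control, treatment):
    """Rank transcripts by the absolute change in normalized counts
    (ΔT) rather than by fold change. Returns (name, delta) pairs,
    largest ΔT first."""
    deltas = {g: abs(treatment[g] - control[g]) for g in control}
    return sorted(deltas.items(), key=lambda kv: -kv[1])

# Hypothetical normalized counts in control vs. treatment.
control = {"geneA": 10000, "geneB": 40, "geneC": 500}
treatment = {"geneA": 16000, "geneB": 120, "geneC": 520}
ranked = rank_by_delta_t(control, treatment)
# geneA shows only a 1.6-fold change but dominates ΔT (6000 counts),
# while geneB shows a 3-fold change with a small absolute change (80).
print(ranked[0][0])  # geneA
```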
  • ABSTRACT: The increasing availability of ChIP-seq data demands advanced statistical tools to analyze the results of such experiments. The inherent features of high-throughput sequencing output call for a modelling framework that can account for the spatial dependency between neighboring regions of the genome and the temporal dimension that arises from observing the protein binding process at progressing time points; also, multiple biological/technical replicates of the experiment are usually produced, and methods to jointly account for them are needed. Furthermore, the antibodies used in the experiment lead to potentially different immunoprecipitation efficiencies, which can affect the capability of distinguishing between the true signal in the data and the background noise. The statistical procedure proposed consists of a discrete mixture model with an underlying latent Markov random field: the novelty of the model is to allow both spatial and temporal dependency to play a role in determining the latent state of genomic regions involved in the protein binding process, while combining all the information of the available replicates instead of treating them separately. It is also possible to take into account the different antibodies used, in order to obtain better insight into the process and exploit all the biological information available.
    Statistical Applications in Genetics and Molecular Biology 02/2015; 14(2). DOI:10.1515/sagmb-2014-0074
  • ABSTRACT: There has been much interest in reconstructing bi-directional regulatory networks linking the circadian clock to metabolism in plants. A variety of reverse engineering methods from machine learning and computational statistics have been proposed and evaluated. The emphasis of the present paper is on combining models in a model ensemble to boost the network reconstruction accuracy, and on exploring various model combination strategies to maximize the improvement. Our results demonstrate that a rich ensemble of predictors outperforms the best individual model, even if the ensemble includes poor predictors with inferior individual reconstruction accuracy. For our application to metabolomic and transcriptomic time series from various mutant plants grown in different light-dark cycles, we also show how to determine the optimal time lag between interactions, and we identify significant interactions with a randomization test. Our study predicts new statistically significant interactions between circadian clock genes and metabolites in Arabidopsis thaliana, and thus provides independent statistical evidence that the regulation of metabolism by the circadian clock is not uni-directional, but that there is a statistically significant feedback mechanism acting from metabolism back to the circadian clock.
    Statistical Applications in Genetics and Molecular Biology 02/2015; DOI:10.1515/sagmb-2014-0041
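Simple averaging of edge-confidence scores is one plausible combination strategy of the kind the abstract discusses; a minimal sketch with hypothetical clock-gene/metabolite names (the paper's actual combination schemes may differ):

```python
def ensemble_edge_scores(predictions):
    """Combine edge-confidence dictionaries (edge -> score in [0, 1])
    from several network-reconstruction models by simple averaging.
    Edges missing from a model contribute a score of 0."""
    edges = set().union(*predictions)
    return {e: sum(p.get(e, 0.0) for p in predictions) / len(predictions)
            for e in edges}

# Two hypothetical predictors score directed edges between a clock
# gene and a metabolite; the ensemble averages their confidences.
model1 = {("clockA", "metX"): 0.9, ("metX", "clockA"): 0.2}
model2 = {("clockA", "metX"): 0.7, ("metX", "clockA"): 0.6}
combined = ensemble_edge_scores([model1, model2])
print(round(combined[("clockA", "metX")], 2))  # 0.8
```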
  • ABSTRACT: An ordinal scale is commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical methodology based on statistical inference, in particular ordinal modeling, has contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) remains smaller than the sample size (n). With genomic technologies increasingly being applied for more accurate diagnosis and prognosis, high-dimensional data, where the number of covariates (p) is much larger than the number of samples (n), are generated. To meet these emerging needs, we introduce a two-stage algorithm: first, we extend the generalized monotone incremental forward stagewise (GMIFS) method to the cumulative logit ordinal model; second, we combine the GMIFS procedure with a classical mixed-effects model for classifying disease status over the course of disease progression. We demonstrate the efficiency and accuracy of the proposed models in classification using a time-course microarray dataset collected from the Inflammation and the Host Response to Injury study.
    Statistical Applications in Genetics and Molecular Biology 02/2015; 14(1):93-111. DOI:10.1515/sagmb-2014-0004
  • ABSTRACT: Experimental evolution is an important research method that allows for the study of evolutionary processes occurring in microorganisms. Here we present a novel approach to experimental evolution based on next generation sequencing. Under this approach, population-level sequencing is applied to an evolving population in which multiple first-step beneficial mutations occur concurrently. As a result, frequencies of multiple beneficial mutations are observed in each replicate of an experiment. For this new type of data we develop methods of statistical inference. In particular, we propose a method for imputing the selection coefficients of first-step beneficial mutations. The imputed selection coefficients are then used for testing the distribution of first-step beneficial mutations and for estimating the mean selection coefficient. When selection coefficients are uniformly distributed, the collected data may also be used to estimate the total number of available first-step beneficial mutations.
    Statistical Applications in Genetics and Molecular Biology 02/2015; 14(1):65-81. DOI:10.1515/sagmb-2014-0030
  • ABSTRACT: Association rule mining is a knowledge discovery technique that informs researchers about relationships between variables in data. These relationships can be focused on a specific set of response variables. We propose an augmented version of this method to discover groups of genotypes that relate to specific outcomes. We derive the methodology to find these candidate groups of genotypes and illustrate how the method works on data regarding neuroinvasive complications of West Nile virus and through simulation.
    Statistical Applications in Genetics and Molecular Biology 01/2015; DOI:10.1515/sagmb-2014-0033
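The support and confidence measures underlying association rule mining can be sketched as follows (genotype and outcome labels are purely illustrative, not from the study):

```python
def rule_support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent,
    the two standard measures behind association rule mining.

    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)
    """
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return both / n, (both / a if a else 0.0)

# Each "transaction" records a subject's genotype calls and outcome.
subjects = [
    {"SNP1=AA", "SNP2=GT", "neuroinvasive"},
    {"SNP1=AA", "SNP2=GG", "neuroinvasive"},
    {"SNP1=AG", "SNP2=GT"},
    {"SNP1=AA", "SNP2=GT", "neuroinvasive"},
]
support, confidence = rule_support_confidence(
    subjects, {"SNP1=AA"}, {"neuroinvasive"})
print(support, confidence)  # 0.75 1.0
```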
  • ABSTRACT: Complex diseases are often characterized by coordinated expression alterations of genes and proteins which are grouped together in a molecular network. Identifying such interconnected and jointly altered gene/protein groups from functional omics data and a given molecular interaction network is a key challenge in bioinformatics. We describe GenePEN, a penalized logistic regression approach for sample classification via convex optimization, using a newly designed Pairwise Elastic Net penalty that favors the selection of discriminative genes/proteins according to their connectedness in a molecular interaction graph. An efficient implementation of the method finds provably optimal solutions on high-dimensional omics data in a few seconds and is freely available at
    Statistical Applications in Genetics and Molecular Biology 01/2015; DOI:10.1515/sagmb-2014-0045
  • ABSTRACT: We present a multiple testing method for hypotheses that are ordered in space or time. Given such hypotheses, the elementary hypotheses as well as regions of consecutive hypotheses are of interest. These region hypotheses not only have intrinsic meaning, but testing them also has the advantage that (potentially small) signals across a region are combined in one test. Because the expected number and length of potentially interesting regions are usually not available beforehand, we propose a method that tests all possible region hypotheses as well as all individual hypotheses in a single multiple testing procedure that controls the familywise error rate. We start by testing the global null hypothesis, and when it can be rejected we continue by further specifying the exact location(s) of the effect. The method is implemented in the R package cherry and is illustrated on a DNA copy number dataset.
    Statistical Applications in Genetics and Molecular Biology 12/2014; 14(1). DOI:10.1515/sagmb-2013-0075
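How small signals across a region of consecutive hypotheses can combine into one test is easy to see with Stouffer's p-value combination; note this sketch shows only the combination idea, not the closed testing procedure with FWER control implemented in cherry:

```python
from statistics import NormalDist

def region_pvalue(pvals):
    """Combine p-values of consecutive hypotheses into a single
    region p-value with Stouffer's method: convert each p-value to a
    z-score, sum, rescale, and convert back."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in pvals) / len(pvals) ** 0.5
    return 1 - nd.cdf(z)

# Four weak, individually non-significant probes on a consecutive
# stretch of the genome combine into a clearly significant region.
region = [0.06, 0.08, 0.05, 0.07]
print(region_pvalue(region) < 0.01)  # True
```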
  • ABSTRACT: The binding behavior of molecules in the nuclei of living cells can be studied through the analysis of images from fluorescence recovery after photobleaching (FRAP) experiments. However, there is still a lack of methodology for the statistical evaluation of FRAP data, especially for the joint analysis of multiple dynamic images. We propose a hierarchical Bayesian nonlinear model with mixed-effect priors based on local compartment models, in order to obtain joint parameter estimates for all nuclei as well as to account for the heterogeneity of the nuclei population. We apply our method to a series of FRAP experiments on DNA methyltransferase 1 tagged with green fluorescent protein expressed in a somatic mouse cell line, and compare the results with the application of three different fixed-effects models to the same series of FRAP experiments. With the proposed model, we obtain estimates of the off-rates of the interactions of the molecules under study together with credible intervals, and additionally gain information about the variability between nuclei. The proposed model is superior to and more robust than the tested fixed-effects models. Therefore, it can be used for the joint analysis of data from FRAP experiments on various similar nuclei.
    Statistical Applications in Genetics and Molecular Biology 12/2014; 14(1). DOI:10.1515/sagmb-2014-0013
  • ABSTRACT: Chromatin interactions mediated by a particular protein are of interest for studying gene regulation, especially the regulation of genes that are associated with, or known to be causative of, a disease. A recent molecular technique, chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), which uses chromatin immunoprecipitation (ChIP) and high-throughput paired-end sequencing, is able to detect such chromatin interactions genome-wide. However, ChIA-PET may generate noise (i.e., pairings of DNA fragments by random chance) in addition to true signal (i.e., pairings of DNA fragments by interactions). In this paper, we propose MC_DIST, based on a mixture modeling framework, to identify true chromatin interactions from ChIA-PET count data (counts of DNA fragment pairs). The model is cast into a Bayesian framework to take into account the dependency among the data and the available information on protein binding sites and gene promoters to reduce false positives. A simulation study showed that MC_DIST outperforms the previously proposed hypergeometric model in terms of both power and type I error rate. A real data study showed that MC_DIST may identify potential chromatin interactions between protein binding sites and gene promoters that may be missed by the hypergeometric model. An R package implementing the MC_DIST model is available at
    Statistical Applications in Genetics and Molecular Biology 12/2014; 14(1). DOI:10.1515/sagmb-2014-0029
  • ABSTRACT: It has recently been proposed that differentially variable CpG methylation (DVC) may contribute to transcriptional aberrations in human diseases. In large-scale epigenetic studies, potential confounders could affect the observed methylation variabilities and need to be accounted for. In this paper, we develop a robust statistical model for DVC analysis that accounts for potential confounding covariates by utilizing the propensity score method. Our method is based on a weighted score test on strata generated by propensity score stratification. To the best of our knowledge, this is the first proposed statistical method for detecting DVCs that adjusts for confounding covariates. We show that this method is robust against model misspecification and achieves good operating characteristics, based on extensive simulations and a case study.
    Statistical Applications in Genetics and Molecular Biology 10/2014; 13(6). DOI:10.1515/sagmb-2013-0072
  • ABSTRACT: Conservative statistical tests are often used in complex multiple testing settings in which computing the type I error may be difficult. In such tests, the reported p-value for a hypothesis can understate the evidence against the null hypothesis, and consequently statistical power may be lost. False Discovery Rate adjustments, used in multiple comparison settings, can worsen this unfavorable effect. We present a computationally efficient and test-agnostic calibration technique that can substantially reduce the conservativeness of such tests. As a consequence, a lower sample size might be sufficient to reject the null hypothesis for true alternatives, and experimental costs can be lowered. We apply the calibration technique to the results of DESeq, a popular method for detecting differentially expressed genes from RNA sequencing data. The increase in power may be particularly high in small sample size experiments, often used in preliminary experiments and funding applications.
    Statistical Applications in Genetics and Molecular Biology 10/2014; 13(6). DOI:10.1515/sagmb-2013-0074
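A generic sketch of the calibration idea, assuming null p-values are available from permutation or simulation (this is the general recipe, not the paper's specific technique):

```python
def calibrate(p_observed, null_pvalues):
    """Recalibrate a conservative p-value via the empirical CDF of
    p-values obtained under the null (e.g. from permutations). If the
    test is conservative, null p-values pile up near 1, so the
    calibrated value is smaller than the reported one. The +1 terms
    give the standard finite-sample permutation p-value."""
    m = len(null_pvalues)
    return (1 + sum(1 for q in null_pvalues if q <= p_observed)) / (m + 1)

# A conservative test yields null p-values stochastically larger than
# uniform; an observed p = 0.10 is then more extreme than it looks.
null_sims = [min(1.0, 2 * (i + 1) / 100) for i in range(100)]
print(calibrate(0.10, null_sims))  # about 0.059, smaller than 0.10
```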
  • ABSTRACT: Approaches to Bayesian inference for problems with intractable likelihoods have become increasingly important in recent years. Approximate Bayesian computation (ABC) and "likelihood free" Markov chain Monte Carlo techniques are popular methods for tackling inference in these scenarios, but such techniques are computationally expensive. In this paper we compare the two approaches to inference, with a particular focus on parameter inference for stochastic kinetic models, widely used in systems biology. Discrete time transition kernels for models of this type are intractable for all but the most trivial systems, yet forward simulation is usually straightforward. We discuss the relative merits and drawbacks of each approach whilst considering the computational cost implications and efficiency of these techniques. In order to explore the properties of each approach we examine a range of observation regimes using two example models. We use a Lotka-Volterra predator-prey model to explore the impact of full or partial species observations using various time course observations under the assumption of known and unknown measurement error. Further investigation into the impact of observation error is then made using a Schlögl system, a test case which exhibits bi-modal state stability in some regions of parameter space.
    Statistical Applications in Genetics and Molecular Biology 10/2014; 14(2). DOI:10.1515/sagmb-2014-0072
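The "likelihood free" principle is easiest to see in a plain ABC rejection sampler. The toy model below (exponential waiting times with a uniform prior) merely stands in for the stochastic kinetic models discussed, which are far more involved:

```python
import random

def abc_rejection(observed, simulate, prior_sample, distance, eps, n):
    """Minimal ABC rejection sampler: draw parameters from the prior,
    forward-simulate, and keep draws whose simulated summary falls
    within eps of the observation. No likelihood is ever evaluated,
    only forward simulation."""
    accepted = []
    while len(accepted) < n:
        theta = prior_sample()
        if distance(simulate(theta), observed) <= eps:
            accepted.append(theta)
    return accepted

random.seed(1)
# Toy example: infer the mean of exponential waiting times from an
# observed sample mean of 5.0, under a Uniform(0, 10) prior.
observed_mean = 5.0
simulate = lambda lam: sum(random.expovariate(1 / lam) for _ in range(20)) / 20
prior_sample = lambda: random.uniform(0, 10)
posterior = abc_rejection(observed_mean, simulate, prior_sample,
                          lambda s, o: abs(s - o), eps=0.5, n=200)
print(4 < sum(posterior) / len(posterior) < 6)  # posterior centers near 5
```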