Statistical Applications in Genetics and Molecular Biology (STAT APPL GENET MOL)

Publisher: De Gruyter

Journal description

Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues but should also contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.

Current impact factor: 1.13

Impact Factor Rankings

2015 Impact Factor: Available summer 2016
2014 Impact Factor: 1.127
2010 Impact Factor: 1.842
2009 Impact Factor: 2.247
2008 Impact Factor: 2.167
2007 Impact Factor: 2.167

Additional details

5-year impact: 1.54
Cited half-life: 9.70
Immediacy index: 0.10
Eigenfactor: 0.00
Article influence: 0.89
Other titles: Statistical applications in genetics and molecular biology, SAGMB
ISSN: 1544-6115
OCLC: 52157137
Material type: Document, Periodical, Internet resource
Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

De Gruyter

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author cannot archive a post-print version
  • Restrictions
    • 12 months embargo
  • Conditions
    • Pre-print and abstract on author's personal website only
    • Author's post-print on funder's repository or funder's designated repository at the funding agency's request or as a result of legal obligation.
    • Publisher's version/PDF may be used, on author's personal website, editor's personal website or institutional repository
    • Authors cannot deposit in subject repositories
    • Published source must be acknowledged
    • Must link to publisher version and article's DOI must be given
    • Set statement to accompany deposit (see policy)

Publications in this journal

  • ABSTRACT: Splitting extended families into their component nuclear families to apply a genetic association method designed for nuclear families is a widespread practice in familial genetic studies. Dependence among genotypes and phenotypes of nuclear families from the same extended family arises because of genetic linkage of the tested marker with a risk variant or because of familial specificity of genetic effects due to gene-environment interaction. This raises concerns about the validity of inference conducted under the assumption of independence of the nuclear families. We indeed prove theoretically that, in a conditional logistic regression analysis applicable to disease cases and their genotyped parents, the naive model-based estimator of the variance of the coefficient estimates underestimates the true variance. However, simulations with realistic effect sizes of risk variants and variation of this effect from family to family reveal that the underestimation is negligible. The simulations also show the greater efficiency of the model-based variance estimator compared to a robust empirical estimator. Our recommendation is therefore to use the model-based estimator of variance for inference on effects of genetic variants.
    Statistical Applications in Genetics and Molecular Biology 11/2015; DOI:10.1515/sagmb-2015-0056
  • ABSTRACT: We are concerned with statistical inference for 2×C×K contingency tables in the context of genetic case-control association studies. Multivariate methods based on asymptotic Gaussianity of vectors of test statistics require information about the asymptotic correlation structure among these test statistics under the global null hypothesis. In the case of C=2, we show that for a wide variety of test statistics this asymptotic correlation structure is given by the standardized linkage disequilibrium matrix of the K loci under investigation. Three popular choices of test statistics are discussed for illustration. In the case of C=3, the standardized composite linkage disequilibrium matrix is the limiting correlation matrix of the K locus-specific Cochran-Armitage trend test statistics.
    Statistical Applications in Genetics and Molecular Biology 10/2015; 14(5). DOI:10.1515/sagmb-2015-0024
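For readers unfamiliar with the Cochran-Armitage trend test named in the abstract above, the following is a minimal sketch of the standard single-locus statistic for a 2×3 genotype table. It is not the authors' multivariate procedure; the additive scores (0, 1, 2), the function name, and the counts are assumptions chosen for illustration.

```python
from math import sqrt

def cochran_armitage_z(cases, controls, scores=(0, 1, 2)):
    """Standardized Cochran-Armitage trend statistic for one locus.

    cases, controls: genotype counts per column (e.g. aa, Aa, AA).
    scores: column weights; (0, 1, 2) is an assumed additive coding.
    """
    R, S = sum(cases), sum(controls)          # row totals
    N = R + S
    cols = [r + s for r, s in zip(cases, controls)]  # column totals
    T = sum(t * (r * S - s * R) for t, r, s in zip(scores, cases, controls))
    var = sum(t * t * n * (N - n) for t, n in zip(scores, cols))
    var -= 2 * sum(
        scores[i] * scores[j] * cols[i] * cols[j]
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    )
    var *= R * S / N
    return T / sqrt(var)

# Hypothetical counts: identical genotype proportions give no trend.
z_null = cochran_armitage_z((10, 20, 30), (20, 40, 60))
# Hypothetical counts: cases enriched for the higher-score genotypes.
z_trend = cochran_armitage_z((10, 20, 30), (30, 20, 10))
```

Under the null hypothesis the statistic is approximately standard normal, which is what makes a limiting correlation matrix across K loci meaningful.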
  • ABSTRACT: Sample size calculations for gene expression microarray and NGS-RNA-Seq experiments are challenging because the overall power depends on unknown quantities such as the proportion of true null hypotheses and the distribution of the effect sizes under the alternative. We propose a two-stage design with an adaptive interim analysis where these quantities are estimated from the interim data. The second stage sample size is chosen based on these estimates to achieve a specific overall power. The proposed procedure controls the power in all considered scenarios except for very low first stage sample sizes. The false discovery rate (FDR) is controlled despite the data-dependent choice of sample size. The two-stage design can be a useful tool to determine the sample size of high-dimensional studies if in the planning phase there is high uncertainty regarding the expected effect sizes and variability.
    Statistical Applications in Genetics and Molecular Biology 10/2015; 14(5). DOI:10.1515/sagmb-2014-0025
  • ABSTRACT: In association studies of quantitative traits, the association of each genetic marker with the trait of interest is typically tested using the F-test assuming an additive genetic model. In practice, the true model is rarely known, and specifying an incorrect model can lead to a loss of power. For case-control studies, the maximum of test statistics optimal for additive, dominant, and recessive models has been shown to be robust to model misspecification. The approach has later been extended to quantitative traits. However, the existing procedures assume that the trait is normally distributed and may not maintain correct type I error rates and can also have reduced power when the assumption of normality is violated. Here, we introduce a maximum (MAX3) test that is based on ranks and is therefore distribution-free. We examine the behavior of the proposed method using a Monte Carlo simulation with both normal and non-normal data and compare the results to the usual parametric procedures and other nonparametric alternatives. We show that the rank-based maximum test has favorable properties relative to other tests, especially in the case of symmetric distributions with heavy tails. We illustrate the method with data from a real association study of symmetric dimethylarginine (SDMA).
    Statistical Applications in Genetics and Molecular Biology 10/2015; 14(5). DOI:10.1515/sagmb-2014-0050
  • ABSTRACT: In cellular biology, node-and-edge graph or "network" data collection often uses bait-prey technologies such as co-immunoprecipitation (CoIP). Bait-prey technologies assay relationships or "interactions" between protein pairs, with CoIP specifically measuring protein complex co-membership. Analyses of CoIP data frequently focus on estimating protein complex membership. Due to budgetary and other constraints, exhaustive assay of the entire network using CoIP is not always possible. We describe a stratified sampling scheme to select baits for CoIP experiments when protein complex estimation is the main goal. Expanding upon the classic framework in which nodes represent proteins and edges represent pairwise interactions, we define generalized nodes as sets of adjacent nodes with identical adjacency outside the set and use these as strata from which to select the next set of baits. Strata are redefined at each round of sampling to incorporate accumulating data. This scheme maintains user-specified quality thresholds for protein complex estimates and, relative to simple random sampling, leads to a marked increase in the number of correctly estimated complexes at each round of sampling. The R package seqSample contains all source code and is available at
    Statistical Applications in Genetics and Molecular Biology 08/2015; 14(4):391-411. DOI:10.1515/sagmb-2015-0007
  • ABSTRACT: Recent results in Markov chain Monte Carlo (MCMC) show that a chain based on an unbiased estimator of the likelihood can have a stationary distribution identical to that of a chain based on exact likelihood calculations. In this paper we develop such an estimator for elliptically contoured distributions, a large family of distributions that includes and generalizes the multivariate normal. We then show how this estimator, combined with pseudorandom realizations of an elliptically contoured distribution, can be used to run MCMC in a way that replicates the stationary distribution of a likelihood based chain, but does not require explicit likelihood calculations. Because many elliptically contoured distributions do not have closed form densities, our simulation based approach enables exact MCMC based inference in a range of cases where previously it was impossible.
    Statistical Applications in Genetics and Molecular Biology 07/2015; 14(4). DOI:10.1515/sagmb-2014-0063
  • ABSTRACT: As one of the most recent advanced technologies developed for biomedical research, the next generation sequencing (NGS) technology has opened more opportunities for scientific discovery of genetic information. The NGS technology is particularly useful in elucidating a genome for the analysis of DNA copy number variants (CNVs). The study of CNVs is important as many genetic studies have led to the conclusion that cancer development, genetic disorders, and other diseases are usually related to CNVs on the genome. One way to analyze the NGS data for detecting boundaries of CNV regions on a chromosome or a genome is to phrase the problem as a statistical change point detection problem in the read count data. We therefore provide a statistical change point model to help detect CNVs using the NGS read count data. We use a Bayesian approach to incorporate possible parameter changes in the underlying distribution of the NGS read count data. Posterior probabilities for the change point inferences are derived. Extensive simulation studies have shown advantages of our proposed methods. The proposed methods are also applied to a publicly available lung cancer cell line NGS dataset, and CNV regions on this cell line are successfully identified.
    Statistical Applications in Genetics and Molecular Biology 07/2015; 14(4). DOI:10.1515/sagmb-2014-0054
  • ABSTRACT: Copy number alteration (CNA) data have been collected to study disease related chromosomal amplifications and deletions. The CUSUM procedure and related plots have been used to explore CNA data. In practice, it is possible to observe outliers. Then, modifications of the CUSUM procedure may be required. An outlier reset modification of the CUSUM (ORCUSUM) procedure is developed in this paper. The threshold value for detecting outliers or significant CUSUMs can be derived using results for sums of independent truncated normal random variables. Bartels' non-parametric test for autocorrelation is also introduced to the analysis of copy number variation data. Our simulation results indicate that the ORCUSUM procedure can still be used even when the level of autocorrelation is low. Furthermore, the results show the outlier's impact on the traditional CUSUM's performance and illustrate the advantage of the ORCUSUM's outlier reset feature. Additionally, we discuss how the ORCUSUM can be applied to examine CNA data with a simulated data set. To illustrate the procedure, recently collected single nucleotide polymorphism (SNP) based CNA data from The Cancer Genome Atlas (TCGA) Research Network are analyzed. The method is applied to a data set collected in an ovarian cancer study. Three cytogenetic bands (cytobands) are considered to illustrate the method. The cytobands 11q13 and 9p21 have been shown to be related to ovarian cancer. They are presented as positive examples. The cytoband 3q22, which is less likely to be disease related, is presented as a negative example. These results illustrate the usefulness of the ORCUSUM procedure as an exploratory tool for the analysis of SNP based CNA data.
    Statistical Applications in Genetics and Molecular Biology 06/2015; 14(4). DOI:10.1515/sagmb-2014-0027
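As background for the abstract above, here is a minimal sketch of a plain CUSUM path (cumulative sums of deviations from the overall mean). It does not reproduce the ORCUSUM outlier reset or the truncated-normal thresholds described by the authors; the function name and the toy series are assumptions for illustration only.

```python
def cusum(values):
    """Cumulative sums of deviations from the overall mean."""
    mean = sum(values) / len(values)
    running = 0.0
    path = []
    for v in values:
        running += v - mean
        path.append(running)
    return path

# Hypothetical copy-number-like series with a level shift halfway through.
signal = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]
path = cusum(signal)

# In this toy series the extreme of the path falls on the last
# observation before the shift.
change_index = max(range(len(path)), key=lambda i: abs(path[i]))
```

A single large outlier would add a step to this path that never decays, which is the kind of distortion a reset rule is meant to limit.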
  • ABSTRACT: Adaptive transmission disequilibrium test (aTDT) and MAX3 test are two robust-efficient association tests for case-parent family trio data. Both tests incorporate information of common genetic models including recessive, additive and dominant models and are efficient in power and robust to genetic model specifications. The aTDT uses information on departure from Hardy-Weinberg equilibrium to identify the potential genetic model underlying the data and then applies the corresponding TDT-type test, and the MAX3 test is defined as the maximum of the absolute value of three TDT-type tests under the three common genetic models. In this article, we propose three robust Bayes procedures, the aTDT based Bayes factor, MAX3 based Bayes factor and Bayes model averaging (BMA), for association analysis with case-parent trio design. The asymptotic distributions of aTDT under the null and alternative hypotheses are derived in order to calculate its Bayes factor. Extensive simulations show that the Bayes factors and the p-values of the corresponding tests are generally consistent and these Bayes factors are robust to genetic model specifications, especially so when the priors on the genetic models are equal. When equal priors are used for the underlying genetic models, the Bayes factor method based on aTDT is more powerful than those based on MAX3 and Bayes model averaging. When the prior places a small (large) probability on the true model, the Bayes factor based on aTDT (BMA) is more powerful. Analysis of simulated data on RA from GAW15 is presented to illustrate applications of the proposed methods.
    Statistical Applications in Genetics and Molecular Biology 06/2015; 14(3):253-264. DOI:10.1515/sagmb-2014-0051
  • ABSTRACT: A nonparametric estimator of mutual information is proposed and is shown to have asymptotic normality and efficiency, and a bias decaying exponentially in sample size. The asymptotic normality and the rapidly decaying bias together offer a viable inferential tool for assessing mutual information between two random elements on finite alphabets where the maximum likelihood estimator of mutual information greatly inflates the probability of type I error. The proposed estimator is illustrated by three examples in which the association between a pair of genes is assessed based on their expression levels. Results of a simulation study are also provided.
    Statistical Applications in Genetics and Molecular Biology 05/2015; DOI:10.1515/sagmb-2014-0047
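To make the baseline in the abstract above concrete, the following is a minimal sketch of the maximum likelihood (plug-in) estimator of mutual information on finite alphabets, which substitutes observed frequencies for probabilities. It is the estimator the abstract argues against, not the authors' proposed one; the function name and the toy samples are assumptions.

```python
from collections import Counter
from math import log

def plugin_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # (c/n) * log( (c/n) / ((px/n) * (py/n)) )
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi

# Hypothetical samples: a perfectly balanced independent design gives 0,
# while identical sequences give the entropy of a fair binary variable.
mi_independent = plugin_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
mi_identical = plugin_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
```

Because this estimate can never be negative, sampling noise tends to push it above zero even when the two variables are independent, which is one way to see why tests built on it can reject too often.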
  • ABSTRACT: In genome-wide association studies (GWAS), it is of interest to identify genetic variants associated with phenotypes. For a given phenotype, the associated genetic variants are usually a sparse subset of all possible variants. Traditional Lasso-type estimation methods can therefore be used to detect important genes. But the relationship between genotypes at one variant and a phenotype may be influenced by other variables, such as sex and lifestyle. Hence it is important to be able to incorporate gene-covariate interactions into the sparse regression model. In addition, because there is biological knowledge on the manner in which genes work together in structured groups, it is desirable to incorporate this information as well. In this paper, we present a novel sparse regression methodology for gene-covariate models in association studies that not only allows such interactions but also considers biological group structure. Simulation results show that our method substantially outperforms another method, in which interaction is considered, but group structure is ignored. Application to data on total plasma immunoglobulin E (IgE) concentrations in the Framingham Heart Study (FHS), using sex and smoking status as covariates, yields several potentially interesting gene-covariate interactions.
    Statistical Applications in Genetics and Molecular Biology 05/2015; DOI:10.1515/sagmb-2014-0073

  • Statistical Applications in Genetics and Molecular Biology 04/2015; 14(2). DOI:10.1515/sagmb-2014-0100
  • ABSTRACT: The use of fold-change (FC) to prioritize differentially expressed genes (DEGs) for post-hoc characterization is a common technique in the analysis of RNA sequencing datasets. However, the use of FC can overlook certain populations of DEGs, such as high copy number transcripts which undergo metabolically expensive changes in expression yet fail to exceed the ratiometric FC cut-off, thereby missing potentially important biological information. Here we evaluate an alternative approach to prioritizing RNAseq data based on absolute changes in normalized transcript counts (ΔT) between control and treatment conditions. In five pairwise comparisons with a wide range of effect sizes, rank-ordering of DEGs based on the magnitude of ΔT produced a power curve-like distribution, in which 4.7-5.0% of transcripts were responsible for 36-50% of the cumulative change. Thus, differential gene expression is characterized by the high production-cost expression of a small number of genes (large ΔT genes), while the differential expression of the majority of genes involves a much smaller metabolic investment by the cell. To determine whether the large ΔT datasets are representative of coordinated changes in the transcriptional program, we evaluated large ΔT genes for enrichment of gene ontologies (GOs) and predicted protein interactions. In comparison to randomly selected DEGs, the large ΔT transcripts were significantly enriched for both GOs and predicted protein interactions. Furthermore, enrichments were consistent with the biological context of each comparison yet distinct from those produced using equal-sized populations of large FC genes, indicating that the large ΔT genes represent an orthogonal transcriptional response. Finally, the composition of the large ΔT gene sets was unique to each pairwise comparison, indicating that they represent coherent and context-specific responses to biological conditions rather than the non-specific upregulation of a family of genes. These findings suggest that the large ΔT genes are not a product of random or stochastic phenomena, but rather represent biologically meaningful changes in the transcriptional program. They furthermore imply that high abundance transcripts are associated with particular cellular states, and as cells change in response to internal or external conditions, the relative distribution of the abundant transcripts changes accordingly. Thus, prioritization of DEGs based on the concept of metabolic cost is a simple yet powerful method to identify biologically important transcriptional changes and provide novel insights into cellular behaviors.
    Statistical Applications in Genetics and Molecular Biology 03/2015; DOI:10.1515/sagmb-2014-0018
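The contrast between fold-change and ΔT ranking in the abstract above can be seen in a small sketch. The gene names and counts are invented for illustration, and the use of an absolute log2 ratio for FC is an assumption; the abstract does not specify the exact FC definition or normalization used by the authors.

```python
from math import log2

# Hypothetical normalized transcript counts: (control, treatment).
counts = {
    "geneA": (10000.0, 13000.0),  # abundant transcript, modest ratio
    "geneB": (5.0, 40.0),         # rare transcript, large ratio
    "geneC": (200.0, 900.0),
    "geneD": (3000.0, 2900.0),
}

def fold_change(control, treatment):
    """Absolute log2 ratio of treatment to control (assumed FC measure)."""
    return abs(log2(treatment / control))

def delta_t(control, treatment):
    """Absolute change in normalized transcript count."""
    return abs(treatment - control)

def rank_by(score):
    """Gene names ordered from largest to smallest score."""
    return sorted(counts, key=lambda g: score(*counts[g]), reverse=True)

fc_ranking = rank_by(fold_change)
dt_ranking = rank_by(delta_t)

# Share of the total absolute change carried by the top-ranked ΔT gene.
total_change = sum(delta_t(*pair) for pair in counts.values())
top_share = delta_t(*counts[dt_ranking[0]]) / total_change
```

In this toy table the rare transcript leads the FC ranking while the abundant transcript leads the ΔT ranking and carries most of the cumulative change, which mirrors the kind of divergence the abstract describes.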
  • ABSTRACT: The increasing availability of ChIP-seq data demands advanced statistical tools to analyze the results of such experiments. The inherent features of high-throughput sequencing output call for a modelling framework that can account for the spatial dependency between neighboring regions of the genome and the temporal dimension that arises from observing the protein binding process at progressing time points; also, multiple biological/technical replicates of the experiment are usually produced and methods to jointly account for them are needed. Furthermore, the antibodies used in the experiment lead to potentially different immunoprecipitation efficiencies, which can affect the capability of distinguishing between the true signal in the data and the background noise. The proposed statistical procedure consists of a discrete mixture model with an underlying latent Markov random field: the novelty of the model is to allow both spatial and temporal dependency to play a role in determining the latent state of genomic regions involved in the protein binding process, while combining all the information of the replicates available instead of treating them separately. It is also possible to take into account the different antibodies used, in order to obtain better insight into the process and exploit all the biological information available.
    Statistical Applications in Genetics and Molecular Biology 02/2015; 14(2). DOI:10.1515/sagmb-2014-0074