Statistical Applications in Genetics and Molecular Biology (STAT APPL GENET MOL BIOL)

Publisher: Berkeley Electronic Press

Description

Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising in computational biology. Papers should focus on the relevant statistical issues while containing a succinct description of the biological problem being considered. The range of topics is wide, including linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.

  • Impact factor
    1.52
  • 5-year impact
    1.70
  • Cited half-life
    7.20
  • Immediacy index
    0.12
  • Eigenfactor
    0.00
  • Article influence
    0.91
  • Website
    Statistical Applications in Genetics and Molecular Biology website
  • Other titles
    Statistical applications in genetics and molecular biology, SAGMB
  • ISSN
    1544-6115
  • OCLC
    52157137
  • Material type
    Document, Periodical, Internet resource
  • Document type
    Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

Berkeley Electronic Press

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author can archive a post-print version
  • Conditions
    • On the author's non-commercial personal website, in the author's university's or employer's open-access institutional repository, and on the author's non-commercial course website
    • PubMed and UK PubMed after 12 months (automatic for several journals)
    • Publisher copyright and source must be acknowledged
    • Publisher's version/PDF may be used
  • Classification
    green

Publications in this journal

  • ABSTRACT: We present a novel characterization of the generalized family-wise error rate, kFWER. This interpretation allows researchers to view kFWER as a function of the test statistics rather than of p-values, as in current methods. Using this interpretation, we present several theorems and methods (parametric and non-parametric) for estimating kFWER in various data settings. With this version of kFWER, researchers will have an estimate of kFWER in addition to knowing which tests are significant at the estimated kFWER. Additionally, we present methods that use empirical null distributions in place of parametric distributions in standard p-value-based kFWER-controlling schemes. These advancements represent an improvement over common kFWER methods, which rest on parametric assumptions and merely report the tests that are significant under a given value of kFWER.
    Statistical Applications in Genetics and Molecular Biology 03/2014;
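As background for this abstract, the classical p-value-based kFWER-controlling procedure (the Lehmann-Romano generalized Bonferroni rule) can be sketched in a few lines; this is the baseline the paper improves on, not the paper's own statistic-based method, and the simulated p-values are purely illustrative:

```python
import numpy as np

def kfwer_bonferroni(pvals, k=1, alpha=0.05):
    """Generalized Bonferroni (Lehmann-Romano) procedure: rejecting
    H_i whenever p_i <= k * alpha / m controls the probability of k
    or more false rejections (the kFWER) at level alpha."""
    pvals = np.asarray(pvals)
    return pvals <= k * alpha / pvals.size

rng = np.random.default_rng(0)
# 1000 null p-values plus 20 very small ones standing in for true signals
p = np.concatenate([rng.uniform(size=1000), rng.uniform(0, 1e-6, size=20)])
rejected = kfwer_bonferroni(p, k=5, alpha=0.05)
```

With k=1 this reduces to the ordinary Bonferroni correction; larger k relaxes the threshold by the factor k.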
  • ABSTRACT: Risk prediction models can link high-dimensional molecular measurements, such as DNA methylation, to clinical endpoints. For biological interpretation, a sparse fit is often desirable. Different molecular aggregation levels, such as considering DNA methylation at the CpG, gene, or chromosome level, might demand different degrees of sparsity. Hence, model building and estimation techniques should be able to adapt their sparsity to the setting. Additionally, underestimation of coefficients, a typical problem of sparse techniques, should be addressed. We propose a comprehensive approach, based on a boosting technique, that allows a flexible adaptation of model sparsity and addresses these problems in an integrative way. The main motivation is to have automatic sparsity adaptation. In a simulation study, we show that this approach reduces underestimation in sparse settings and selects more adequate model sizes than the corresponding non-adaptive boosting technique in non-sparse settings. Using different aggregation levels of DNA methylation data from a study in kidney carcinoma patients, we illustrate how automatically selected values of the sparsity tuning parameter can reflect the underlying structure of the data. In addition, prediction performance and variable selection stability are compared to the non-adaptive boosting approach.
    Statistical Applications in Genetics and Molecular Biology 03/2014;
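The boosting component of this abstract can be illustrated with a generic componentwise L2 boosting sketch, where the step factor `nu` and the number of steps play the role of the sparsity tuning parameters discussed above. This is the non-adaptive baseline, not the paper's adaptive algorithm, and the simulated data are purely illustrative:

```python
import numpy as np

def componentwise_boost(X, y, nu=0.1, n_steps=200):
    """Componentwise L2 boosting on centered data: each step refits
    every column to the current residual, updates only the coefficient
    giving the largest RSS reduction by a small factor nu; sparsity
    comes from stopping early."""
    w = np.zeros(X.shape[1])
    r = y - y.mean()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_steps):
        b = X.T @ r / col_sq                     # per-column least-squares fits
        j = int(np.argmax((b ** 2) * col_sq))    # column with largest RSS drop
        w[j] += nu * b[j]
        r -= nu * b[j] * X[:, j]
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
X -= X.mean(axis=0)                              # center predictors
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.standard_normal(100)
w = componentwise_boost(X, y)
```

Early stopping (small `n_steps`) yields sparser fits but stronger coefficient shrinkage, which is exactly the underestimation-versus-sparsity trade-off the abstract addresses.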
  • ABSTRACT: In association mapping of quantitative traits, the F-test based on an assumed genetic model is a basic statistical tool for testing association of each candidate locus with the trait of interest. However, the true underlying genetic model is often unknown, and using an incorrect model may cause serious loss of power. For case-control studies, it is known that the combination of several tests that are optimal for different models is robust to model misspecification. In this paper, we extend the test combination approach to quantitative trait association. We first derive the exact correlations among transformed test statistics and discuss interesting special cases. We then propose and evaluate a multivariate normality based approximation to the joint distribution of test statistics, such that the marginal distributions and pairwise correlations among test statistics are accounted for. Through simulations, we show that the sizes of the resulting approximate combined tests are accurate for practical purposes under a variety of situations. We find that the combination of the tests from the additive model and the genotypic model performs well, because it demonstrates both robustness to incorrect models and satisfactory power. A mouse lipoprotein data set is used to demonstrate the method.
    Statistical Applications in Genetics and Molecular Biology 03/2014;
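The combination idea can be illustrated with a brute-force permutation sketch that takes the larger of the additive-model and genotypic-model F statistics; the paper instead derives exact correlations and a multivariate normal approximation, which this sketch does not reproduce, and the simulated genotypes are purely illustrative:

```python
import numpy as np

def f_stat(y, X):
    """F statistic for testing the non-intercept coefficients of the
    linear model y = X b + e (first column of X is the intercept)."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    rss1 = np.sum((y - X @ b) ** 2)
    rss0 = np.sum((y - y.mean()) ** 2)
    return ((rss0 - rss1) / (p - 1)) / (rss1 / (n - p))

def combined_pvalue(y, g, n_perm=1000, seed=2):
    """Permutation version of a combined test: the statistic is the
    larger of the additive-model and genotypic-model F statistics,
    with the null distribution obtained by permuting the trait."""
    rng = np.random.default_rng(seed)
    one = np.ones(len(y))
    X_add = np.column_stack([one, g])                             # additive 0/1/2 coding
    X_gen = np.column_stack([one, g == 1, g == 2]).astype(float)  # 2-df genotypic coding
    stat = lambda yy: max(f_stat(yy, X_add), f_stat(yy, X_gen))
    null = np.array([stat(rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(null >= stat(y))) / (1 + n_perm)

rng = np.random.default_rng(3)
g = rng.integers(0, 3, size=200).astype(float)
y = 0.5 * g + rng.standard_normal(200)     # additive per-allele effect of 0.5
pval = combined_pvalue(y, g)
```

Permutation makes the maximum of differently scaled statistics valid without working out their joint distribution, at the computational cost the paper's analytic approximation avoids.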
  • ABSTRACT: A tiling array yields a series of abundance measurements across the genome using evenly spaced probes. These data can be used for detecting sequences that exhibit a particular behavior. Scanning window statistics are often employed for testing each probe while accounting for local correlation and smoothing noisy measurements. However, window testing may yield false probe discoveries around the sequences and false non-discoveries within the sequences, resulting in biased predicted intervals. We propose to avoid this problem by stipulating that a sequence of interest can appear at most once within a defined region, such as a gene; thus, only one window statistic is considered per region. This substantially reduces the number of tests and hence, is potentially more powerful. We compare this approach to a genome-wise scan that does not require pre-defined search regions, but considers clumps of adjacent probe discoveries. Simulations show that the gene-wise search maintains the nominal FDR level, while the genome-wise scan yields FDR that exceeds the nominal level for low interval effects, and achieves slightly less power. Using arrays to map introns in yeast, we identified 71% of the previously published introns, detected nine previously undiscovered introns, and observed no false intron discoveries by either method.
    Statistical Applications in Genetics and Molecular Biology 02/2014;
  • ABSTRACT: We have developed a modified Patient Rule-Induction Method (PRIM) as an alternative strategy for analyzing representative samples of non-experimental human data to estimate and test the role of genomic variations as predictors of disease risk in etiologically heterogeneous sub-samples. The proposed strategy reaches a computational limit when the number of genomic variations (predictor variables) under study is large (>500), because permutations are used to generate a null distribution for testing the significance of a term (defined by values of particular variables) that characterizes a sub-sample of individuals through the peeling and pasting processes. As an alternative, in this paper we introduce a theoretical strategy that facilitates quick calculation of the Type I and Type II errors involved in evaluating terms during the peeling and pasting processes of a PRIM analysis; these errors are, respectively, under-estimated and unavailable when a permutation-based hypothesis test is employed. The resulting savings in computational time make it possible to consider larger numbers of genomic variations (an example genome-wide association study is given) when selecting statistically significant terms for PRIM prediction models.
    Statistical Applications in Genetics and Molecular Biology 02/2014;
  • ABSTRACT: In this study, we propose a novel statistical framework for detecting progressive changes in molecular traits in response to a pathogenic stimulus. In particular, we propose to employ Bayesian hierarchical models to analyse changes in the mean level, variance, and correlation of metabolic traits in relation to covariates. To illustrate our approach we investigate changes in urinary metabolic traits in response to cadmium exposure, a toxic environmental pollutant. With the application of the proposed approach, previously unreported variations in the metabolism of urinary metabolites in relation to urinary cadmium were identified. Our analysis highlights the potential effect of urinary cadmium on the variance and correlation of a number of metabolites involved in the metabolism of choline, as well as changes in urinary alanine. The results illustrate the potential of the proposed approach to investigate the gradual effect of a pathogenic stimulus on molecular traits.
    Statistical Applications in Genetics and Molecular Biology 02/2014;
  • ABSTRACT: Through integration of genomic data from multiple sources, we may obtain a more accurate and complete picture of the molecular mechanisms underlying tumorigenesis. We discuss the integration of DNA copy number and mRNA gene expression data from an observational integrative genomics study involving cancer patients. The two molecular levels involved are linked through the central dogma of molecular biology. DNA copy number aberrations abound in the cancer cell, and here we investigate how they affect gene expression levels within a pathway. In particular, we aim to identify differential edges between the regulatory networks of two groups at these molecular levels. Motivated by the rate equations, the regulatory mechanism between DNA copy number aberrations and gene expression levels within a pathway is modeled by a simultaneous-equations model, for both the one- and two-group case. The latter facilitates the identification of differential interactions between the two groups. Model parameters are estimated by penalized least squares using the lasso (L1) penalty to obtain a sparse pathway topology. Simulations show that the inclusion of DNA copy number data benefits the discovery of gene-gene interactions. In addition, the simulations reveal that cis-effects tend to be over-estimated in a univariate (single-gene) analysis. In the application to real data from integrative oncogenomic studies, we show that inclusion of prior information on the regulatory network architecture benefits the reproducibility of all edges. Furthermore, analyses of the TP53 and TGFb signaling pathways between ER+ and ER- samples from an integrative genomics breast cancer study identify reproducible differential regulatory patterns that corroborate existing literature.
    Statistical Applications in Genetics and Molecular Biology 02/2014;
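The penalized-estimation step in this abstract rests on the lasso; a minimal numpy coordinate-descent sketch of the generic solver (not the paper's simultaneous-equations model, and with purely illustrative simulated data) is:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=300):
    """Cyclic coordinate descent for the lasso:
    minimize (1/(2n)) * ||y - X w||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]       # residual with w_j removed
            rho = X[:, j] @ r / n
            # soft-thresholding update for coordinate j
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[[0, 5, 12]] = [3.0, -2.0, 1.5]            # sparse true effects
y = X @ w_true + 0.5 * rng.standard_normal(100)
w = lasso_cd(X, y, lam=0.1)
```

The soft-thresholding step is what zeroes out weak edges and yields the sparse pathway topology the abstract mentions.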
  • ABSTRACT: RNA-seq studies allow for the quantification of transcript expression by aligning millions of short reads to a reference genome. However, transcripts share much of their sequence, so that many reads map to more than one place and their origin remains uncertain. This problem can be dealt with using mixtures of distributions, in which estimating transcript expression reduces to estimating the weights of the mixture. In this paper, variational Bayesian (VB) techniques are used to approximate the posterior distribution of transcript expression. VB has previously been shown to be more computationally efficient for this problem than Markov chain Monte Carlo. VB methodology can precisely estimate the posterior means, but leads to variance underestimation. For this reason, a novel approach is introduced which integrates the latent allocation variables out of the VB approximation. It is shown that this modification leads to a better marginal likelihood bound and an improved estimate of the posterior variance. A set of simulation studies and applications to real RNA-seq datasets highlight the improved performance of the proposed method.
    Statistical Applications in Genetics and Molecular Biology 01/2014;
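The mixture-weight estimation problem described here can be illustrated with plain EM, the point-estimate baseline that both MCMC and the paper's VB approach refine. `L[i, t]` below is a hypothetical read-by-transcript likelihood matrix, not data from the paper:

```python
import numpy as np

def em_mixture_weights(L, n_iter=200):
    """EM for the weights of a fixed-component mixture.  L[i, t] is
    p(read i | transcript t); the read's transcript of origin is
    latent.  Returns theta maximizing sum_i log sum_t theta_t L[i, t]."""
    theta = np.full(L.shape[1], 1.0 / L.shape[1])
    for _ in range(n_iter):
        r = L * theta                          # E-step: responsibilities
        r /= r.sum(axis=1, keepdims=True)
        theta = r.mean(axis=0)                 # M-step: update the weights
    return theta

# 30 reads unique to transcript A, 10 unique to B, 20 mapping to both
L = np.array([[1.0, 0.0]] * 30 + [[0.0, 1.0]] * 10 + [[0.5, 0.5]] * 20)
theta = em_mixture_weights(L)
print(theta)   # ~ [0.75, 0.25]: ambiguous reads carry no weight information
```

EM returns only a point estimate; the VB approach in the abstract additionally approximates the posterior over `theta`, which is where the variance-underestimation issue arises.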
  • ABSTRACT: Complex traits result from an interplay between genes and environment. A better understanding of their joint effects can help refine understanding of the epidemiology of the trait. Various tests have been proposed to assess the statistical interaction between genes and the environment (G×E) in case-parent trio data. However, these tests can lose power when the form of G×E departs from that for which the test was developed. To address this limitation, we propose a data-smoothing approach to estimate and test G×E between a single nucleotide polymorphism and a continuous environmental covariate. For estimating G×E, we fit a generalized additive model using penalized likelihood. The resulting point- and interval-estimates of G×E lead to a graphical display, which can serve as a visualization tool for exploring the form of interaction. For testing G×E, we propose a permutation approach, which accounts for the extra uncertainty introduced by the smoothing process. We investigate the statistical properties of the proposed methods through simulation. We also illustrate the use of the approach with an example data set. We conclude that the approach is useful for exploring novel interactions in data-rich settings.
    Statistical Applications in Genetics and Molecular Biology 01/2014;
  • ABSTRACT: To locate multiple interacting quantitative trait loci (QTL) influencing a trait of interest within experimental populations, methods such as Cockerham's model are usually applied. Within this framework, interactions are understood as the part of the joint effect of several genes that cannot be explained as the sum of their additive effects. However, if a change in the phenotype (such as disease) is caused by Boolean combinations of genotypes at several QTL, Cockerham's approach is often not capable of identifying them properly. To detect such interactions more efficiently, we propose a logic regression framework. Even though the logic regression approach requires considering a larger number of models (and thus more stringent multiple testing correction), the efficient representation of higher-order logic interactions in logic regression models leads to a significant increase in power to detect such interactions compared to Cockerham's approach. The increase in power is demonstrated analytically for a simple two-way interaction model and illustrated in more complex settings with a simulation study and real data analysis.
    Statistical Applications in Genetics and Molecular Biology 01/2014;
  • ABSTRACT: Fernández-Durán (2004, "Circular distributions based on nonnegative trigonometric sums," Biometrics, 60, 499-503) developed a family of univariate circular distributions based on nonnegative trigonometric sums. In this work, we extend this family of distributions to the multivariate case by using multiple nonnegative trigonometric sums to model the joint distribution of a vector of angular random variables. Practical examples of vectors of angular random variables include the wind direction at different monitoring stations, the directions taken by an animal on different occasions, the times at which a person performs different daily activities, and the dihedral angles of a protein molecule. We apply the proposed new family of multivariate distributions to three real data sets: two for the study of protein structure and one for genomics. The first is related to the study of a bivariate vector of dihedral angles in proteins. In the second data set, we compare the fit of the proposed multivariate model with the bivariate generalized von Mises model of Shieh et al. (2011, "Modeling and comparing the organization of circular genomes," Bioinformatics, 27(7), 912-918) in a problem related to orthologous genes in pairs of circular genomes. The third data set consists of observed values of three dihedral angles in γ-turns in a protein and serves as an example of trivariate angular data. In addition, a simulation algorithm is presented to generate realizations from the proposed multivariate angular distribution.
    Statistical Applications in Genetics and Molecular Biology 01/2014;
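The univariate NNTS density from Fernández-Durán (2004), the building block this paper extends to the multivariate case, can be evaluated directly. A sketch with an arbitrary illustrative coefficient vector (the normalization check uses an exact Riemann sum for trigonometric polynomials):

```python
import numpy as np

def nnts_density(theta, c):
    """Univariate NNTS density f(theta) = |sum_k c_k e^{i k theta}|^2 / (2 pi),
    with the complex coefficients c_k scaled so that sum_k |c_k|^2 = 1,
    which makes f nonnegative by construction and integrate to one."""
    c = np.asarray(c, dtype=complex)
    c = c / np.sqrt((np.abs(c) ** 2).sum())   # enforce the normalization
    k = np.arange(c.size)
    s = np.exp(1j * np.outer(theta, k)) @ c   # the trigonometric sum
    return np.abs(s) ** 2 / (2 * np.pi)

grid = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
f = nnts_density(grid, [1.0, 0.6 + 0.3j, 0.2j])   # illustrative coefficients
total = f.sum() * (2.0 * np.pi / grid.size)       # Riemann sum over the circle
```

Nonnegativity for free (the density is a squared modulus) is exactly what makes this family convenient compared with raw trigonometric series.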
  • ABSTRACT: DNA microarray experiments require multiple hypothesis testing procedures because thousands of hypotheses are tested simultaneously. We deal with this problem from a Bayesian decision theory perspective. We propose a decision criterion based on an estimation of the number of false null hypotheses (FNH), taking as an error measure the proportion of the posterior expected number of false positives with respect to the estimated number of true null hypotheses. The methodology is applied to a Gaussian model when testing bilateral hypotheses. The procedure is illustrated with both simulated and real data examples, and the results are compared to those obtained by the Bayes rule when an additive loss function is considered for each joint action and the generalized 0-1 loss function for each individual action. Our procedure significantly reduced the percentage of false negatives, whereas the percentage of false positives remained at an acceptable level.
    Statistical Applications in Genetics and Molecular Biology 12/2013;
  • ABSTRACT: Mass spectrometry is an important high-throughput technique for profiling small molecular compounds in biological samples and is widely used to identify potential diagnostic and prognostic compounds associated with disease. Data generated by mass spectrometry commonly contain many missing values, which arise when a compound is absent from a sample or is present at a concentration below the detection limit. Several strategies are available for statistically analyzing data with missing values. The accelerated failure time (AFT) model assumes all missing values result from censoring below a detection limit. Under a mixture model, missing values can result from a combination of censoring and the absence of a compound. We compare power and estimation of a mixture model to an AFT model. Based on simulated data, we found the AFT model to have greater power to detect differences in means and point-mass proportions between groups. However, the AFT model yielded biased estimates, with the bias increasing as the proportion of observations in the point mass increased, whereas estimates from the mixture model were unbiased unless all missing observations came from censoring. These findings suggest using the AFT model for hypothesis testing and the mixture model for estimation. We demonstrated this approach through application to glycomics data of serum samples from women with ovarian cancer and matched controls.
    Statistical Applications in Genetics and Molecular Biology 12/2013; 12(6):703-722.
  • ABSTRACT: We propose a non-parametric regression methodology, Random Forests on Distance Matrices (RFDM), for detecting genetic variants associated with quantitative phenotypes, obtained using neuroimaging techniques, representing the human brain's structure or function. RFDM, which is an extension of decision forests, requires a distance matrix as the response that encodes all pair-wise phenotypic distances in the random sample. We discuss ways to learn such distances directly from the data using manifold learning techniques, and how to define such distances when the phenotypes are non-vectorial objects such as brain connectivity networks. We also describe an extension of RFDM to detect epistatic effects while keeping the computational complexity low. Extensive simulation results and an application to an imaging genetics study of Alzheimer's disease are presented and discussed.
    Statistical Applications in Genetics and Molecular Biology 12/2013; 12(6):757-786.
  • ABSTRACT: MicroRNAs (miRNAs) are short non-coding RNAs that play critical roles in numerous cellular processes through post-transcriptional functions. The aberrant role of miRNAs has been reported in a number of diseases. A robust computational method is vital for discovering novel miRNAs where the level of noise varies dramatically across the different miRNAs. In this paper, we propose a flexible rank-based procedure for estimating a weighted log partial area under the receiver operating characteristic (ROC) curve statistic for selecting differentially expressed miRNAs. The statistic combines the partial area under the curve (pAUC) with its corresponding variance. The proposed method does not involve complicated formulas and does not require advanced programming skills. Two real datasets are analyzed to illustrate the method, and a simulation study is carried out to assess the performance of different miRNA ranking statistics. We conclude that the proposed method offers robust results with large samples for miRNA expression data, and that it can be used as an alternative analytical tool for identifying a list of target miRNAs for further biological and clinical investigation.
    Statistical Applications in Genetics and Molecular Biology 12/2013; 12(6):743-755.
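The pAUC building block of the proposed statistic can be computed directly from the ROC step curve. This sketch assumes untied scores and does not reproduce the paper's weighted log-pAUC statistic or its variance estimate:

```python
import numpy as np

def roc_points(pos, neg):
    """ROC step curve from positive- and negative-group scores,
    assuming no tied scores between the two groups."""
    scores = np.concatenate([pos, neg])
    is_pos = np.concatenate([np.ones(len(pos), bool), np.zeros(len(neg), bool)])
    is_pos = is_pos[np.argsort(-scores)]         # sweep thresholds downward
    tpr = np.concatenate([[0.0], np.cumsum(is_pos) / len(pos)])
    fpr = np.concatenate([[0.0], np.cumsum(~is_pos) / len(neg)])
    return fpr, tpr

def partial_auc(pos, neg, max_fpr=0.2):
    """Area under the ROC step curve restricted to FPR in [0, max_fpr]."""
    fpr, tpr = roc_points(pos, neg)
    area = 0.0
    for i in range(1, len(fpr)):
        lo, hi = fpr[i - 1], min(fpr[i], max_fpr)
        if hi > lo:                   # a false-positive step inside the window
            area += tpr[i] * (hi - lo)
    return area

# perfect separation: pAUC over FPR in [0, 0.5] equals 0.5
pauc = partial_auc([4.0, 3.0, 2.0], [1.0, 0.5], max_fpr=0.5)
```

Restricting to a small FPR window focuses the ranking on specificity, which is why pAUC is preferred over full AUC for screening applications.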
  • ABSTRACT: With the increasing availability of experimental data on gene interactions, modeling of gene regulatory pathways has gained special attention. Gradient descent algorithms have been widely used for regression and classification applications. Unfortunately, results obtained after training a model by gradient descent are often highly variable. In this paper, we present a new second-order learning rule based on Newton's method for inferring optimal gene regulatory pathways. Unlike the gradient descent method, the proposed optimization rule is independent of the learning parameter. The flow vectors are estimated based on biomass conservation. A set of constraints is formulated incorporating weighting coefficients. The method calculates the maximal expression of the target gene starting from a given initial gene through these weighting coefficients. Our algorithm has been benchmarked and validated on certain types of functions and on some gene regulatory networks gathered from the literature. The proposed method has been found to perform better than gradient descent learning. Extensive performance comparison with the extreme pathway analysis method has underlined the effectiveness of our proposed methodology.
    Statistical Applications in Genetics and Molecular Biology 11/2013;
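The learning-parameter-free property of Newton's method claimed above is easy to demonstrate on a toy quadratic objective; this sketch illustrates only that property and does not reproduce the paper's network-inference rule:

```python
import numpy as np

def newton_minimize(grad, hess, x0, n_iter=25):
    """Newton's method for unconstrained minimization.  The step
    -H(x)^{-1} g(x) uses curvature information, so no learning-rate
    parameter is needed, unlike gradient descent."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Quadratic objective f(x) = 0.5 x^T A x - b^T x with minimizer A^{-1} b;
# Newton's method reaches it in a single step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton_minimize(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```

With gradient descent the same problem needs a step size tuned to the eigenvalues of A; the Newton step removes that tuning entirely, which is the point the abstract makes.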
  • ABSTRACT: Multiple comparisons, or multiple testing, has been viewed as a thorny issue in genetic association studies aiming to detect disease-associated genetic variants from a large number of genotyped variants. We alleviate the problem of multiple comparisons by proposing a hierarchical modeling approach that is fundamentally different from the existing methods. The proposed hierarchical models simultaneously fit as many variables as possible and shrink unimportant effects towards zero. Thus, the hierarchical models yield more efficient estimates of parameters than the traditional methods that analyze genetic variants separately, and also coherently address the multiple comparisons problem by largely reducing the effective number of genetic effects and the number of statistically "significant" effects. We develop a method for computing the effective number of genetic effects in hierarchical generalized linear models, and propose a new adjustment for multiple comparisons, the hierarchical Bonferroni correction, based on the effective number of genetic effects. Our approach not only increases the power to detect disease-associated variants but also controls the Type I error. We illustrate and evaluate our method with real and simulated data sets from genetic association studies. The method has been implemented in our freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
    Statistical Applications in Genetics and Molecular Biology 11/2013;
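For the flat (non-hierarchical) case, the idea of an "effective number of tests" is commonly illustrated with Nyholt's eigenvalue heuristic. Note that the paper derives its effective number within hierarchical GLMs (implemented in the R package BhGLM); this numpy sketch shows only the older generic heuristic, as an illustration of the concept:

```python
import numpy as np

def effective_tests(R):
    """Nyholt's eigenvalue heuristic for the effective number of
    independent tests among M correlated markers:
        M_eff = 1 + (M - 1) * (1 - Var(lambda) / M),
    where Var is the sample variance of the eigenvalues of the
    marker correlation matrix R.  Independent markers give M_eff = M;
    perfectly correlated markers give M_eff = 1."""
    lam = np.linalg.eigvalsh(R)
    M = R.shape[0]
    return 1 + (M - 1) * (1 - np.var(lam, ddof=1) / M)

# Bonferroni threshold adjusted by the effective number of tests
alpha_eff = 0.05 / effective_tests(np.eye(10))   # = 0.05 / 10 here
```

The hierarchical Bonferroni correction in the abstract plays an analogous role, with the effective number coming from the fitted shrinkage model rather than from marker correlations.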
  • ABSTRACT: In many biological applications it is necessary to cluster DNA sequences into groups that represent underlying organismal units, such as named species or genera. In metagenomics this grouping typically needs to be achieved on the basis of relatively short sequences which contain different types of errors, making a statistical modeling approach desirable. Here we introduce a novel method for this purpose by developing a stochastic partition model that clusters Markov chains of a given order. The model is based on a Dirichlet process prior, and we use conjugate priors for the Markov chain parameters, which enables an analytical expression for comparing the marginal likelihoods of any two partitions. To find a good candidate for the posterior mode in the partition space, we use a hybrid computational approach which combines the EM algorithm with a greedy search. This is demonstrated to be faster and to yield highly accurate results compared to earlier suggested clustering methods for the metagenomics application. Our model is fairly generic and could also be used for clustering other types of sequence data for which Markov chains provide a reasonable way to compress information, as illustrated by experiments on shotgun sequence type data from an Escherichia coli strain.
    Statistical Applications in Genetics and Molecular Biology 11/2013;
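The analytical marginal-likelihood comparison at the heart of this method can be illustrated in a deliberately simplified setting: first-order chains over two states with symmetric Dirichlet priors on the transition-matrix rows (the Dirichlet process prior over partitions and the EM-plus-greedy search are omitted), and invented transition counts:

```python
import math
import numpy as np

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of Markov-chain transition counts under
    independent symmetric Dirichlet(alpha) priors on each row of the
    transition matrix -- the conjugacy that allows any two partitions
    to be compared analytically."""
    counts = np.asarray(counts, dtype=float)
    K = counts.shape[1]
    total = 0.0
    for row in counts:
        total += math.lgamma(K * alpha) - math.lgamma(K * alpha + row.sum())
        total += sum(math.lgamma(alpha + x) - math.lgamma(alpha) for x in row)
    return total

# Two sequences with similar transition counts: merging their clusters
# yields a higher marginal likelihood than keeping them separate.
N1 = np.array([[90, 10], [10, 90]])
N2 = np.array([[85, 15], [12, 88]])
merge = log_marginal(N1 + N2)
split = log_marginal(N1) + log_marginal(N2)
```

Because merge and split scores are both closed-form, a greedy search over partitions only ever needs these count-based quantities, which is what makes the approach fast.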

Related Journals