Andrew B. Nobel

University of North Carolina at Chapel Hill, North Carolina, United States

Publications (98) · 284.77 Total impact

  • John Palowitch · Shankar Bhamidi · Andrew B. Nobel
    ABSTRACT: Community detection is the process of grouping strongly connected nodes in a network. Many community detection methods for un-weighted networks have a theoretical basis in a null model, which provides an interpretation of resulting communities in terms of statistical significance. In this paper, we introduce a null for sparse weighted networks called the continuous configuration model. We prove a Central Limit Theorem for sums of edge weights under the model, and propose a community extraction method called CCME which combines this result with an iterative multiple testing framework. To benchmark the method, we provide a simulation framework that incorporates the continuous configuration model as a way to plant null or "background" nodes in weighted networks with communities. We show CCME to be competitive with existing methods in accurately identifying both disjoint and overlapping communities, while being particularly effective in ignoring background nodes when they exist. We present two real-world data sets with potential background nodes and analyze them with CCME, yielding results that correspond to known features of the data.
    No preview · Article · Jan 2016
  • Kevin McGoff · Andrew Nobel
    ABSTRACT: We study the limiting behavior of the average per-state cost when trajectories of a topological dynamical system are used to track a trajectory from an observed ergodic system. We establish a variational characterization of the limiting average cost in terms of dynamically invariant couplings, also known as joinings, of the two dynamical systems, and we show that the set of optimal joinings is convex and compact in the weak topology. Using these results, we establish a general convergence theorem for the limiting behavior of statistical inference procedures based on optimal tracking. The setting considered here is general enough to encompass traditional statistical problems with weakly dependent, real-valued observations. As applications of the general inference result, we consider the consistency of regression estimation under ergodic sampling and of system identification from quantized observations.
    No preview · Article · Jan 2016
  • Source
    ABSTRACT: Aging is one of the most important biological processes and is a known risk factor for many age-related diseases in humans. Studying age-related transcriptomic changes in tissues across the whole body can provide valuable information for a holistic understanding of this fundamental process. In this work, we catalogue age-related gene expression changes in nine tissues from nearly two hundred individuals collected by the Genotype-Tissue Expression (GTEx) project. In general, we find the aging gene expression signatures are very tissue specific. However, enrichment for some well-known aging components such as mitochondria biology is observed in many tissues. Different levels of cross-tissue synchronization of age-related gene expression changes are observed, and some essential tissues (e.g., heart and lung) show much stronger "co-aging" than other tissues based on a principal component analysis. The aging gene signatures and complex disease genes show a complex overlapping pattern, and only in some cases do we see that they overlap significantly in the tissues affected by the corresponding diseases. In summary, our analyses provide novel insights into the co-regulation of age-related gene expression in multiple tissues; they also present a tissue-specific view of the link between aging and age-related diseases.
    Full-text · Article · Oct 2015 · Scientific Reports
  • Shankar Bhamidi · Jimmy Jin · Andrew Nobel
    ABSTRACT: Inspired by empirical data on real world complex networks, the last few years have seen an explosion in proposed generative models to understand and explain observed properties of real world networks, including power law degree distribution and "small world" distance scaling. In this context, a natural question is the phenomenon of change point, understanding how abrupt changes in parameters driving the network model change structural properties of the network. We study this phenomenon in one popular class of dynamically evolving networks: preferential attachment models. We derive asymptotic properties of various functionals of the network including the degree distribution as well as maximal degree asymptotics, in essence showing that the change point does affect the degree distribution but does not change the degree exponent. This provides further evidence for long range dependence and sensitive dependence of the evolution of the process on the initial evolution of the process in such self-reinforced systems. We then propose an estimator for the change point and prove consistency properties of this estimator. The methodology developed highlights the effect of the non-ergodic nature of the evolution of the network on classical change point estimators.
    Full-text · Article · Aug 2015
  • Source
    ABSTRACT: Understanding the functional consequences of genetic variation, and how it affects complex human disease and quantitative traits, remains a critical challenge for biomedicine. We present an analysis of RNA sequencing data from 1641 samples across 43 tissues from 175 individuals, generated as part of the pilot phase of the Genotype-Tissue Expression (GTEx) project. We describe the landscape of gene expression across tissues, catalog thousands of tissue-specific and shared regulatory expression quantitative trait loci (eQTL) variants, describe complex network relationships, and identify signals from genome-wide association studies explained by eQTLs. These findings provide a systematic understanding of the cellular and biological consequences of human genetic variation and of the heterogeneity of such effects among a diverse set of human tissues.
    Full-text · Article · May 2015 · Science
  • Gen Li · Dan Yang · Andrew B. Nobel · Haipeng Shen
    ABSTRACT: A supervised singular value decomposition (SupSVD) model has been developed for supervised dimension reduction where the low rank structure of the data of interest is potentially driven by additional variables measured on the same set of samples. The SupSVD model can make use of the information in the additional variables to accurately extract underlying structures that are more interpretable. The model is general and includes the principal component analysis model and the reduced rank regression model as two extreme cases. The model is formulated in a hierarchical fashion using latent variables, and a modified expectation-maximization algorithm for parameter estimation is developed, which is computationally efficient. The asymptotic properties for the estimated parameters are derived. We use comprehensive simulations and a real data example to illustrate the advantages of the SupSVD model.
    No preview · Article · Mar 2015 · Journal of Multivariate Analysis
  • Jeremy Sabourin · Andrew B. Nobel · William Valdar
    ABSTRACT: Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single-nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow-up studies. Current multi-SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA-dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single-marker analysis; when additive-only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive-only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.
    No preview · Article · Nov 2014 · Genetic Epidemiology
  • Source
    ABSTRACT: A common and important problem arising in the study of networks is how to divide the vertices of a given network into one or more groups, called communities, in such a way that vertices of the same community are more interconnected than vertices belonging to different ones. We propose and investigate a testing based community detection procedure called Extraction of Statistically Significant Communities (ESSC). The ESSC procedure is based on p-values for the strength of connection between a single vertex and a set of vertices under a reference distribution derived from a conditional configuration network model. The procedure automatically selects both the number of communities in the network, and their size. Moreover, ESSC can handle overlapping communities and, unlike the majority of existing methods, identifies "background" vertices that do not belong to a well-defined community. The method has only one parameter, which controls the stringency of the hypothesis tests. We investigate the performance and potential use of ESSC, and compare it with a number of existing methods, through a validation study using four real network datasets. In addition, we carry out a simulation study to assess the effectiveness of ESSC in networks with various types of community structure including networks with overlapping communities and those with background vertices. These results suggest that ESSC is an effective exploratory tool for the discovery of relevant community structure in complex network systems.
    Full-text · Article · Sep 2014 · The Annals of Applied Statistics
  • Jeremy A. Sabourin · William Valdar · Andrew B. Nobel
    ABSTRACT: We describe a simple, efficient, permutation based procedure for selecting the penalty parameter in the LASSO. The procedure, which is intended for applications where variable selection is the primary focus, can be applied in a variety of structural settings, including generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of three real data sets in which permutation selection is compared with cross-validation (CV), the Bayesian information criterion (BIC), and a selection method based on recently developed testing procedures for the LASSO.
    Preview · Article · Apr 2014 · Biometrics
  • Terrence M. Adams · Andrew B. Nobel
    ABSTRACT: We define a notion of entropy for an infinite family $\mathcal{C}$ of measurable sets in a probability space. We show that the mean ergodic theorem holds uniformly for $\mathcal{C}$ under every ergodic transformation if and only if $\mathcal{C}$ has zero entropy. When the entropy of $\mathcal{C}$ is positive, we establish a strong converse showing that the uniform mean ergodic theorem fails generically in every isomorphism class, including the isomorphism classes of Bernoulli transformations. As a corollary of these results, we establish that every strong mixing transformation is uniformly strong mixing on $\mathcal{C}$ if and only if the entropy of $\mathcal{C}$ is zero, and obtain a corresponding result for weak mixing transformations.
    Preview · Article · Mar 2014
  • Vonn Walter · Fred A. Wright · Andrew B. Nobel
    ABSTRACT: We consider the detection and identification of recurrent departures from stationary behaviour in genomic or similarly arranged data containing measurements at an ordered set of variables. Our primary focus is on departures that occur only at a single variable, or within a small window of contiguous variables, but involve more than one sample. This encompasses the identification of aberrant markers in genome-wide measurements of DNA copy number and DNA methylation, as well as meta-analyses of genome-wide association studies. We propose and analyse a cyclic shift-based procedure for testing recurrent departures from stationarity. Our analysis establishes the consistency of cyclic shift $p$-values for datasets with a fixed set of samples as the number of observed variables tends to infinity, under the assumption that each sample is an independent realization of a stationary Markov chain. Our results apply to any test statistic satisfying a simple invariance condition.
    Full-text · Article · Mar 2014 · Biometrika
  • James D. Wilson · Shankar Bhamidi · Andrew B. Nobel
    ABSTRACT: Partitioning a network into different communities so that vertices of the same community share meaningful density- and pattern-based similarities is an important area of research in the field of network science. For directed networks identifying communities turns out to be especially challenging since the directed nature of the edges makes it difficult to evaluate and interpret the significance of a candidate community. In this paper, we consider the strength of connections from a single vertex to a prespecified collection of vertices in directed networks. We propose a methodology to measure the statistical significance of these connections through the use of p-values derived from a directed configuration null model. We derive the asymptotic distribution of the number of edges between a vertex and a community under the null model and show how to calculate p-values using this reference distribution. Using both simulated and real data sets we show that these conditionally based p-values can provide novel insights into the local structure of directed networks.
    Full-text · Conference Paper · Dec 2013
  • Gen Li · Andrey A. Shabalin · Ivan Rusyn · Fred A. Wright · Andrew B. Nobel
    ABSTRACT: Expression quantitative trait loci (eQTL) analysis identifies single nucleotide polymorphisms (SNPs) that are associated with the expression of a gene. To date, most eQTL studies have considered the connection between genetic variation and expression in a single tissue. Multi-tissue eQTL analysis has the potential to improve the findings of single tissue analyses by borrowing strength across tissues, and the potential to elucidate the genotypic basis of differences between tissues. In this paper we introduce and study a multivariate hierarchical Bayesian model (MT-eQTL) for multi-tissue eQTL analysis. MT-eQTL directly models the vector of correlations between expression and genotype across tissues. The model explicitly captures patterns of variation in the presence or absence of eQTLs, as well as the heterogeneity of effect sizes across tissues. Moreover, the MT-eQTL model is applicable to complex designs in which the set of donors can vary from tissue to tissue, and can exhibit incomplete overlap between tissues. The model also possesses the desirable property that the model for a subset of tissues can be obtained from the full model via marginalization. Fitting of the MT-eQTL model is carried out via empirical Bayes, using an approximate EM algorithm. Inferences concerning eQTL detection and configuration are derived from adaptive thresholding of local false discovery rates, and maximum a-posteriori estimation, respectively. We investigate the method through a simulation study using parameters derived from an ongoing analysis of real data.
    Preview · Article · Nov 2013
  • Source
    ABSTRACT: We consider the asymptotic consistency of maximum likelihood parameter estimation for dynamical systems observed with noise. Under suitable conditions on the dynamical systems and the observations, we show that maximum likelihood parameter estimation is consistent. Our proof involves ideas from both information theory and dynamical systems. Furthermore, we show how some well-studied properties of dynamical systems imply the general statistical properties related to maximum likelihood estimation. Finally, we exhibit classical families of dynamical systems for which maximum likelihood estimation is consistent. Examples include shifts of finite type with Gibbs measures and Axiom A attractors with SRB measures.
    Preview · Article · Jun 2013 · The Annals of Statistics
  • Source
    ABSTRACT: Genome-wide association studies have identified thousands of loci for common diseases, but, for the majority of these, the mechanisms underlying disease susceptibility remain unknown. Most associated variants are not correlated with protein-coding changes, suggesting that polymorphisms in regulatory regions probably contribute to many disease phenotypes. Here we describe the Genotype-Tissue Expression (GTEx) project, which will establish a resource database and associated tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues.
    Full-text · Article · May 2013 · Nature Genetics
  • Vonn Walter · Andrew B. Nobel · D. Neil Hayes · Fred A. Wright
    ABSTRACT: Genetic mutations and alterations are the hallmark of cancer. When these alterations change the expression or protein product of a gene, increased invasiveness into surrounding tissue can result from unchecked cell cycle progression and improper regulation of cell death. In turn, these can contribute to tumor genesis, development, and expansion. A variety of somatic mutations can occur in tumor tissue, including point mutations, changes in methylation status, and gains and losses of chromosomal regions. Here we will focus primarily on mutations of the last type, which are termed DNA copy number aberrations (CNAs). Many CNAs can arise due to general genomic instability, and occur sporadically in locations throughout the genome. A smaller subset of CNAs appears to be recurrent, occurring repeatedly in the same region across multiple individuals. Recurrent CNAs are thought to be due to regional chromosome structure, or to a selection effect in which gain or loss of important regions leads to increased tumor growth rate. The identification of true recurrent CNAs is important, because these regions may play a role in the initiation and progression of tumors, perhaps even highlighting individual genes for further study or targeted treatment. The detection of recurrent CNAs is largely a statistical problem, and a number of methods have been proposed to address this problem. In this chapter, we survey several methods for analyzing DNA copy number data in tumors, with the DiNAMIC approach [1] described in some detail. The nomenclature in the literature on DNA copy number mutations has sometimes been inconsistent, so we begin by providing relevant definitions. Next we discuss the biological changes that can lead to alterations in tumor DNA copy number, as well as how tumors can result from these changes. The analysis of DNA copy number relies crucially on genomic technologies, and we survey platforms for assaying copy number, noting some of the challenges associated with these data. In Sections 13.3 and 13.4, we survey some of the available methods for analyzing DNA copy number data. Methods for detecting recurrent CNAs share common features including computation of summary statistics in genomic regions, use of resampling in order to create “null” distributions, and adjustments for multiple comparisons. Some of these methods require specific preprocessing steps, and we describe these as well. Section 13.5 is devoted to DiNAMIC. This method uses a novel permutation scheme called cyclic shift to compute its null distribution, and we describe comparisons to other permutation schemes. Although some of the issues may seem technical, the simulation results of Walter et al. [1] suggest that DiNAMIC's cyclic shift procedure is attractive in comparison to other permutation schemes, and leads to proper control of error rates under a variety of realistic marker correlation structures. We conclude by introducing a confidence interval procedure for recurrent CNAs [2]. Publicly available tumor datasets were analyzed with DiNAMIC and the confidence interval procedure, and the results briefly surveyed here and in [2] have underlying biological support.
    No preview · Chapter · Apr 2013
  • Eric F Lock · Katherine A Hoadley · J S Marron · Andrew B Nobel
    ABSTRACT: Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
    Full-text · Article · Mar 2013 · The Annals of Applied Statistics
  • Xing Sun · Andrew B Nobel
    ABSTRACT: We investigate the maximal size of distinguished submatrices of a Gaussian random matrix. Of interest are submatrices whose entries have an average greater than or equal to a positive constant, and submatrices whose entries are well fit by a two-way ANOVA model. We identify size thresholds and associated (asymptotic) probability bounds for both large-average and ANOVA-fit submatrices. Probability bounds are obtained when the matrix and submatrices of interest are square and, in rectangular cases, when the matrix and submatrices of interest have fixed aspect ratios. Our principal result is an almost sure interval concentration result for the size of large average submatrices in the square case.
    Preview · Article · Feb 2013 · Bernoulli
  • Shankar Bhamidi · Partha S. Dey · Andrew B. Nobel
    ABSTRACT: The problem of finding large average submatrices of a real-valued matrix arises in the exploratory analysis of data from a variety of disciplines, ranging from genomics to social sciences. In this paper we provide a detailed asymptotic analysis of large average submatrices of an $n \times n$ Gaussian random matrix. The first part of the paper addresses global maxima. For fixed $k$ we identify the average and the joint distribution of the $k \times k$ submatrix having largest average value. As a dual result, we establish that the size of the largest square sub-matrix with average bigger than a fixed positive constant is, with high probability, equal to one of two consecutive integers that depend on the threshold and the matrix dimension $n$. The second part of the paper addresses local maxima. Specifically we consider submatrices with dominant row and column sums that arise as the local optima of iterative search procedures for large average submatrices. For fixed $k$, we identify the limiting average value and joint distribution of a $k \times k$ submatrix conditioned to be a local maximum. In order to understand the density of such local optima and explain the quick convergence of such iterative procedures, we analyze the number $L_n(k)$ of local maxima, beginning with exact asymptotic expressions for the mean and fluctuation behavior of $L_n(k)$. For fixed $k$, the mean of $L_{n}(k)$ is $\Theta(n^{k}/(\log{n})^{(k-1)/2})$ while the standard deviation is $\Theta(n^{2k^2/(k+1)}/(\log{n})^{k^2/(k+1)})$. Our principal result is a Gaussian central limit theorem for $L_n(k)$ that is based on a new variant of Stein's method.
    Full-text · Article · Nov 2012
  • Source
    ABSTRACT: Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of Stability Selection.
    Preview · Article · Jul 2012 · Genetic Epidemiology

Publication Stats

10k Citations
284.77 Total Impact Points

Institutions

  • 1995-2015
    • University of North Carolina at Chapel Hill
      • Department of Statistics and Operations Research
      • Department of Biostatistics
      North Carolina, United States
  • 2006
    • University of North Carolina at Charlotte
      Charlotte, North Carolina, United States
  • 1994-1995
    • University of Illinois, Urbana-Champaign
      • Beckman Institute for Advanced Science and Technology
      Urbana, Illinois, United States
  • 1992
    • Stanford University
      • Information Systems Laboratory
      Palo Alto, California, United States