Andrew B. Nobel

University of North Carolina at Chapel Hill, Chapel Hill, NC, United States

Publications (84) · 279.28 Total impact

  • Jeremy Sabourin, Andrew B. Nobel, William Valdar
    ABSTRACT: Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single-nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow-up studies. Current multi-SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA-dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single-marker analysis; when additive-only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive-only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.
    Genetic Epidemiology 11/2014; · 4.02 Impact Factor
  • Source
    ABSTRACT: We describe a simple, efficient, permutation based procedure for selecting the penalty parameter in the LASSO. The procedure, which is intended for applications where variable selection is the primary focus, can be applied in a variety of structural settings, including generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of three real data sets in which permutation selection is compared with cross-validation (CV), the Bayesian information criterion (BIC), and a selection method based on recently developed testing procedures for the LASSO.
    04/2014;
  • Source
    Terrence M. Adams, Andrew B. Nobel
    ABSTRACT: We define a notion of entropy for an infinite family $\mathcal{C}$ of measurable sets in a probability space. We show that the mean ergodic theorem holds uniformly for $\mathcal{C}$ under every ergodic transformation if and only if $\mathcal{C}$ has zero entropy. When the entropy of $\mathcal{C}$ is positive, we establish a strong converse showing that the uniform mean ergodic theorem fails generically in every isomorphism class, including the isomorphism classes of Bernoulli transformations. As a corollary of these results, we establish that every strong mixing transformation is uniformly strong mixing on $\mathcal{C}$ if and only if the entropy of $\mathcal{C}$ is zero, and obtain a corresponding result for weak mixing transformations.
    03/2014;
  • Source
    Vonn Walter, Fred A. Wright, Andrew B. Nobel
    ABSTRACT: Genomic aberrations, such as somatic copy number alterations, are frequently observed in tumor tissue. Recurrent aberrations, occurring in the same region across multiple subjects, are of interest because they may highlight genes associated with tumor development or progression. A number of tools have been proposed to assess the statistical significance of recurrent DNA copy number aberrations, but their statistical properties have not been carefully studied. Cyclic shift testing, a permutation procedure using independent random shifts of genomic marker observations on the genome, has been proposed to identify recurrent aberrations, and is potentially useful for a wider variety of purposes, including identifying regions with methylation aberrations or overrepresented in disease association studies. For data following a countable-state Markov model, we prove the asymptotic validity of cyclic shift $p$-values under a fixed sample size regime as the number of observed markers tends to infinity. We illustrate cyclic shift testing for a variety of data types, producing biologically relevant findings for three publicly available datasets.
    03/2014;
  • Source
    ABSTRACT: Expression quantitative trait loci (eQTL) analysis identifies single nucleotide polymorphisms (SNPs) that are associated with the expression of a gene. To date, most eQTL studies have considered the connection between genetic variation and expression in a single tissue. Multi-tissue eQTL analysis has the potential to improve the findings of single tissue analyses by borrowing strength across tissues, and the potential to elucidate the genotypic basis of differences between tissues. In this paper we introduce and study a multivariate hierarchical Bayesian model (MT-eQTL) for multi-tissue eQTL analysis. MT-eQTL directly models the vector of correlations between expression and genotype across tissues. The model explicitly captures patterns of variation in the presence or absence of eQTLs, as well as the heterogeneity of effect sizes across tissues. Moreover, the MT-eQTL model is applicable to complex designs in which the set of donors can vary from tissue to tissue, and can exhibit incomplete overlap between tissues. The model also possesses the desirable property that the model for a subset of tissues can be obtained from the full model via marginalization. Fitting of the MT-eQTL model is carried out via empirical Bayes, using an approximate EM algorithm. Inferences concerning eQTL detection and configuration are derived from adaptive thresholding of local false discovery rates, and maximum a-posteriori estimation, respectively. We investigate the method through a simulation study using parameters derived from an ongoing analysis of real data.
    11/2013;
  • Source
    ABSTRACT: A common and important problem arising in the study of networks is how to divide the vertices of a given network into one or more groups, called communities, in such a way that vertices of the same community are more interconnected than vertices belonging to different ones. We propose and investigate a testing-based community detection procedure called Extraction of Statistically Significant Communities (ESSC). The ESSC procedure is based on p-values for the strength of connection between a single vertex and a set of vertices under a reference distribution derived from a conditional configuration network model. The procedure automatically selects both the number of communities in the network, and their size. Moreover, ESSC can handle overlapping communities and, unlike the majority of existing methods, identifies "background" vertices that do not belong to a well-defined community. The method has only one parameter, which controls the stringency of the hypothesis tests. We investigate the performance and potential use of ESSC, and compare it with a number of existing methods, through a validation study using four real network datasets. In addition, we carry out a simulation study to assess the effectiveness of ESSC in networks with various types of community structure including networks with overlapping communities and those with background vertices. These results suggest that ESSC is an effective exploratory tool for the discovery of relevant community structure in complex network systems.
    08/2013;
  • Source
    ABSTRACT: We consider the asymptotic consistency of maximum likelihood parameter estimation for dynamical systems observed with noise. Under suitable conditions on the dynamical systems and the observations, we show that maximum likelihood parameter estimation is consistent. Our proof involves ideas from both information theory and dynamical systems. Furthermore, we show how some well-studied properties of dynamical systems imply the general statistical properties related to maximum likelihood estimation. Finally, we exhibit classical families of dynamical systems for which maximum likelihood estimation is consistent. Examples include shifts of finite type with Gibbs measures and Axiom A attractors with SRB measures.
    06/2013;
  • Source
    ABSTRACT: Genome-wide association studies have identified thousands of loci for common diseases, but, for the majority of these, the mechanisms underlying disease susceptibility remain unknown. Most associated variants are not correlated with protein-coding changes, suggesting that polymorphisms in regulatory regions probably contribute to many disease phenotypes. Here we describe the Genotype-Tissue Expression (GTEx) project, which will establish a resource database and associated tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues.
    Nature Genetics 05/2013; · 35.21 Impact Factor
  • Source
    Nature Genetics 05/2013; 45(6):580-585. · 35.21 Impact Factor
  • Source
    ABSTRACT: Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
    The Annals of Applied Statistics 03/2013; 7(1):523-542. · 2.24 Impact Factor
  • Source
    Xing Sun, Andrew B Nobel
    ABSTRACT: We investigate the maximal size of distinguished submatrices of a Gaussian random matrix. Of interest are submatrices whose entries have an average greater than or equal to a positive constant, and submatrices whose entries are well fit by a two-way ANOVA model. We identify size thresholds and associated (asymptotic) probability bounds for both large-average and ANOVA-fit submatrices. Probability bounds are obtained when the matrix and submatrices of interest are square and, in rectangular cases, when the matrix and submatrices of interest have fixed aspect ratios. Our principal result is an almost sure interval concentration result for the size of large average submatrices in the square case.
    Bernoulli 01/2013; 19(1):275-294. · 0.94 Impact Factor
  • Source
    Shankar Bhamidi, Partha S. Dey, Andrew B. Nobel
    ABSTRACT: The problem of finding large average submatrices of a real-valued matrix arises in the exploratory analysis of data from a variety of disciplines, ranging from genomics to social sciences. In this paper we provide a detailed asymptotic analysis of large average submatrices of an $n \times n$ Gaussian random matrix. The first part of the paper addresses global maxima. For fixed $k$ we identify the average and the joint distribution of the $k \times k$ submatrix having largest average value. As a dual result, we establish that the size of the largest square submatrix with average bigger than a fixed positive constant is, with high probability, equal to one of two consecutive integers that depend on the threshold and the matrix dimension $n$. The second part of the paper addresses local maxima. Specifically we consider submatrices with dominant row and column sums that arise as the local optima of iterative search procedures for large average submatrices. For fixed $k$, we identify the limiting average value and joint distribution of a $k \times k$ submatrix conditioned to be a local maximum. In order to understand the density of such local optima and explain the quick convergence of such iterative procedures, we analyze the number $L_n(k)$ of local maxima, beginning with exact asymptotic expressions for the mean and fluctuation behavior of $L_n(k)$. For fixed $k$, the mean of $L_{n}(k)$ is $\Theta(n^{k}/(\log{n})^{(k-1)/2})$ while the standard deviation is $\Theta(n^{2k^2/(k+1)}/(\log{n})^{k^2/(k+1)})$. Our principal result is a Gaussian central limit theorem for $L_n(k)$ that is based on a new variant of Stein's method.
    11/2012;
  • Source
    ABSTRACT: Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of Stability Selection.
    Genetic Epidemiology 04/2012; 36(5):451-62. · 4.02 Impact Factor
  • Source
    ABSTRACT: Breast cancer is a heterogeneous disease with known expression-defined tumor subtypes. DNA copy number studies have suggested that tumors within gene expression subtypes share similar DNA copy number aberrations (CNA) and that CNA can be used to further sub-divide expression classes. To gain further insights into the etiologies of the intrinsic subtypes, we classified tumors according to gene expression subtype and next identified subtype-associated CNA using a novel method called SWITCHdna, using a training set of 180 tumors and a validation set of 359 tumors. Fisher's exact tests, Chi-square approximations, and Wilcoxon rank-sum tests were performed to evaluate differences in CNA by subtype. To assess the functional significance of loss of a specific chromosomal region, individual genes were knocked down by shRNA, and drug sensitivity and DNA repair foci assays were performed. Most tumor subtypes exhibited specific CNA. The Basal-like subtype was the most distinct with common losses of the regions containing RB1, BRCA1, INPP4B, and the greatest overall genomic instability. One Basal-like subtype-associated CNA was loss of 5q11-35, which contains at least three genes important for BRCA1-dependent DNA repair (RAD17, RAD50, and RAP80); these genes were predominantly lost as a pair, or all three simultaneously. Loss of two or three of these genes was associated with significantly increased genomic instability and poor patient survival. RNAi knockdown of RAD17, or RAD17/RAD50, in immortalized human mammary epithelial cell lines caused increased sensitivity to a PARP inhibitor and carboplatin, and inhibited BRCA1 foci formation in response to DNA damage. These data suggest a possible genetic cause for genomic instability in Basal-like breast cancers and a biological rationale for the use of DNA repair inhibitor related therapeutics in this breast cancer subtype.
    Breast Cancer Research and Treatment 11/2011; 133(3):865-80. · 4.47 Impact Factor
  • Source
    Vonn Walter, Andrew B Nobel, Fred A Wright
    ABSTRACT: DNA copy number gains and losses are commonly found in tumor tissue, and some of these aberrations play a role in tumor genesis and development. Although high resolution DNA copy number data can be obtained using array-based techniques, no single method is widely used to distinguish between recurrent and sporadic copy number aberrations. Here we introduce Discovering Copy Number Aberrations Manifested In Cancer (DiNAMIC), a novel method for assessing the statistical significance of recurrent copy number aberrations. In contrast to competing procedures, the testing procedure underlying DiNAMIC is carefully motivated, and employs a novel cyclic permutation scheme. Extensive simulation studies show that DiNAMIC controls false positive discoveries in a variety of realistic scenarios. We use DiNAMIC to analyze two publicly available tumor datasets, and our results show that DiNAMIC detects multiple loci that have biological relevance. Source code implemented in R, as well as text files containing examples and sample datasets are available at http://www.bios.unc.edu/research/genomic_software/DiNAMIC.
    Bioinformatics 03/2011; 27(5):678-85. · 5.47 Impact Factor
  • Source
    Terrence M. Adams, Andrew B. Nobel
    ABSTRACT: For any family of measurable sets in a probability space, we show that either (i) the family has infinite Vapnik-Chervonenkis (VC) dimension or (ii) for every epsilon > 0 there is a finite partition pi such that the pi-boundary of each set has measure at most epsilon. Immediate corollaries include the fact that a family with finite VC dimension has finite bracketing numbers, and satisfies uniform laws of large numbers for every ergodic process. From these corollaries, we derive analogous results for VC major and VC graph families of functions.
    Bernoulli 10/2010; · 0.94 Impact Factor
  • Source
    Terrence M. Adams, Andrew B. Nobel
    ABSTRACT: We show that if $\mathcal{X}$ is a complete separable metric space and $\mathcal{C}$ is a countable family of Borel subsets of $\mathcal{X}$ with finite VC dimension, then, for every stationary ergodic process with values in $\mathcal{X}$, the relative frequencies of sets $C\in\mathcal{C}$ converge uniformly to their limiting probabilities. Beyond ergodicity, no assumptions are imposed on the sampling process, and no regularity conditions are imposed on the elements of $\mathcal{C}$. The result extends existing work of Vapnik and Chervonenkis, among others, who have studied uniform convergence for i.i.d. and strongly mixing processes. Our method of proof is new and direct: it does not rely on symmetrization techniques, probability inequalities or mixing conditions. The uniform convergence of relative frequencies for VC-major and VC-graph classes of functions under ergodic sampling is established as a corollary of the basic result for sets. Comment: Published at http://dx.doi.org/10.1214/09-AOP511 in the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org)
    The Annals of Probability 10/2010; · 1.38 Impact Factor
  • Source
    ABSTRACT: Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon. We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus in order to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence assumption-based gene set testing procedure produces very high false positive rates when applied to data sets for which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data. These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling based gene set testing criteria in the peer reviewed biomedical literature.
    BMC Genomics 10/2010; 11:574. · 4.40 Impact Factor
  • Source
    Andrey Shabalin, Andrew Nobel
    ABSTRACT: In this paper we study the problem of reconstruction of a low-rank matrix observed with additive Gaussian noise. First we show that under mild assumptions (about the prior distribution of the signal matrix) we can restrict our attention to reconstruction methods that are based on the singular value decomposition of the observed matrix and act only on its singular values (preserving the singular vectors). Then we determine the effect of noise on the SVD of low-rank matrices by building a connection between the matrix reconstruction problem and the spiked population model in random matrix theory. Based on this knowledge, we propose a new reconstruction method, called RMT, that is designed to reverse the effect of the noise on the singular values of the signal matrix and adjust for its effect on the singular vectors. With an extensive simulation study we show that the proposed method outperforms even oracle versions of both soft and hard thresholding methods and closely matches the performance of a general oracle scheme.
    Journal of Multivariate Analysis 07/2010; · 1.06 Impact Factor
  • Source
    Terrence M. Adams, Andrew B. Nobel
    ABSTRACT: We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing. Our results are based on recent work concerning uniform laws of averages for VC classes under ergodic sampling.
    07/2010;

Publication Stats

7k Citations
279.28 Total Impact Points

Institutions

  • 1994–2013
    • University of North Carolina at Chapel Hill
      • Department of Statistics and Operations Research
      • Department of Biostatistics
      • Department of Environmental Sciences and Engineering
      • Department of Genetics
      • Department of Computer Science
      Chapel Hill, NC, United States
  • 2009
    • Johns Hopkins Medicine
      • Department of Biostatistics
      Baltimore, MD, United States
  • 2008
    • Florida State University
      Tallahassee, Florida, United States
  • 2006
    • University of North Carolina at Pembroke
      North Carolina, United States
  • 1994–1995
    • University of Illinois, Urbana-Champaign
      • Beckman Institute for Advanced Science and Technology
      Urbana, Illinois, United States
  • 1993
    • Stanford University
      Palo Alto, California, United States