
Charles Lawrence
- PhD
- Professor (Full) at Brown University
About
179
Publications
27,396
Reads
12,506
Citations
Introduction
With collaborators we developed a series of models to infer the ages of marine sediments using HMMs and profile HMMs. More recently we developed Gaussian process (GP) regression models to infer paleo sea surface temperatures (SST) and GP state space models to align marine sediments. Previously we developed an HMM to infer events in the regulation of gene expression, and a probabilistic sampling algorithm to infer the Boltzmann ensemble of RNA secondary structures.
Current institution
Additional affiliations
September 2004 - September 2014
Publications (179)
We present the energy-balance model–Kalman filter (EBM–KF), a hybrid model projecting and assimilating the global mean surface temperature (GMST) and ocean heat content anomaly (OHCA). It combines an annual energy-balance model (difference equations) with 17 parameters drawn from the literature and a statistical extended Kalman filter assimilating...
Plain Language Summary
To determine the age of deep‐sea sediments, often the oxygen isotope ratios of microfossils are measured and compared to a previously compiled global benchmark. Recently, one of the most widely used oxygen isotope benchmarks has been challenged based on a comparison with several Atlantic records. In this study we assess sever...
Temporal offsets (“lags”) between benthic δ¹⁸O (δ¹⁸Ob) signals in different locations are not only a source of age model uncertainty during δ¹⁸Ob stratigraphic alignment but also provide an opportunity to improve reconstructions of deep ocean circulation change during Termination 1 (T1). While methods based on the visual identification of identical...
The geological record encodes the relationship between climate and atmospheric carbon dioxide (CO₂) over long and short timescales, as well as potential drivers of evolutionary transitions. However, reconstructing CO₂ beyond direct measurements requires the use of paleoproxies and herein lies the challenge, as proxies differ in their assumptions...
Previously developed software packages that generate probabilistic age models for ocean sediment cores are designed to either interpolate between different age proxies at discrete depths (e.g., radiocarbon, tephra layers, or tie points) or perform a probabilistic stratigraphic alignment to a dated target (e.g., of benthic δ18O) and cannot combine a...
Full preprint text available at nearby DOI.
We present the Energy Balance Model – Kalman Filter (EBM-KF), a hybrid model of the global mean surface temperature (GMST) and ocean heat content anomaly (OHCA). It combines an energy balance model with parameters drawn from the literature and a statistical Extended Kalman Filter assimilating observed an...
Climate variability over the past 800,000 years has long been described as being dominated by ~100-kyr glacial cycles, and researchers have debated whether these glacial cycles were driven by Earth’s orbital cycles of eccentricity, obliquity and precession. Some recent studies have suggested that these ~100-kyr glacial cycles are best characterized...
Reconstructing past climate events relies on the relevant proxies and how they are related. Depending only on such relationships, however, may not be robust because only a few proxy observations are usually available at each age. A state-space model employs a prior to make the hidden past climate events correlated with one another so that extreme i...
Previously developed software packages that generate probabilistic age models for ocean sediment cores are designed to use either age proxies (e.g., radiocarbon or tephra layers) or stratigraphic alignment (e.g., of benthic δ18O) and cannot combine age inferences from both techniques. Furthermore, many radiocarbon dating packages are not specifical...
Because interpretations of paleoclimate records from ocean sediment cores rely on age models to estimate age as a function of core depth, age model uncertainty can affect the conclusions drawn from paleoclimate data. To assess the age model uncertainty associated with three methods, we compare the age models they generate over the last glacial cycl...
Uncovering how transcription factors regulate their targets at DNA, RNA and protein levels over time is critical to define gene regulatory networks (GRNs) and assign mechanisms in normal and diseased states. RNA-seq is a standard method measuring gene regulation using an established set of analysis stages. However, none of the currently available p...
A common transcriptome assembly error is to mistake different transcripts of the same gene as transcripts from multiple closely related genes. This error is difficult to identify during assembly, but in a phylogenetic analysis such errors can be diagnosed from gene phylogenies where they appear as clades of tips from the same species with improbabl...
Uncovering how transcription factors (TFs) regulate their targets at the DNA, RNA and protein levels over time is critical to define gene regulatory networks (GRNs) in normal and diseased states. RNA-seq has become a standard method to measure gene regulation using an established set of analysis steps. However, none of the currently available pipel...
Key features of late Neogene climate remain uncertain due to conflicting records derived from different sea surface temperature (SST) proxies. To understand scenarios in which proxy‐derived temperature estimates can be used interchangeably or are instead measuring different aspects of the same system, it is necessary to explore both the consistenci...
Correlation does not necessarily imply causation, but in climatology and paleoclimatology, correlation is used to identify potential cause-and-effect relationships because linking mechanisms are difficult to observe. Confounding by an often unknown outside variable that drives the sets of observables is one of the major factors that lead to corre...
To restore the historical sea surface temperatures (SSTs) better, it is important to construct a good calibration model for the associated proxies. In this paper, we introduce a new model for alkenone (${\rm{U}}_{37}^{\rm{K}'}$) based on the heteroscedastic Gaussian process (GP) regression method. Our nonparametric approach not only deals with the...
Abstract—To restore the historical sea surface temperatures (SSTs) better, it is important to construct a good calibration model for the associated proxies. In this paper, we introduce a new model for alkenone (${\rm{U}}_{37}^{\rm{K}'}$) based on the heteroscedastic Gaussian process (GP) regression method. Our nonparametric approach not only deals with the variable...
Ages in ocean sediment cores are often inferred using either benthic ${\delta}^{18}{\rm{O}}$ or planktonic ${}^{14}{\rm{C}}$ of foraminiferal calcite. Existing probabilistic dating methods infer ages in two distinct approaches: ages are either inferred directly using radionuclides, e.g. Bacon [Blaauw and Christen (2011)]; or indirectly based on the...
Background
Procedures for controlling the false discovery rate (FDR) are widely applied as a solution to the multiple comparisons problem of high-dimensional statistics. Current FDR-controlling procedures require accurately calculated p-values and rely on extrapolation into the unknown and unobserved tails of the null distribution. Both of these in...
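For orientation, here is a minimal sketch of the kind of p-value-based procedure the abstract above calls "current FDR-controlling procedures". It uses the standard Benjamini-Hochberg step-up rule as a familiar example. It is not the method proposed in the paper, and the function name and example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard Benjamini-Hochberg step-up rule (illustrative only):
    find the largest rank k with p_(k) <= k * alpha / m, then reject
    the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank  # remember the largest rank that passes
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, chosen only to exercise the function.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, alpha=0.05)
```

The rule depends directly on the smallest p-values being accurate, which is the dependence the abstract points to.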
Background:
The nearest neighbor model and associated dynamic programming algorithms allow for the efficient estimation of the RNA secondary structure Boltzmann ensemble. However because a given RNA secondary structure only contains a fraction of the possible helices that could form from a given sequence, the Boltzmann ensemble is multimodal. Seve...
Understanding the mechanisms behind any changes in the climate system often requires establishing the timing of events imprinted on the geological record. However, these proxy records are prone to large uncertainties, which may preclude meaningful conclusions about the relative timing of events. In this study, we put forth a framework to estimate t...
Motivation
One of the most common transcriptome assembly errors is to mistake different transcripts of the same gene as transcripts from multiple closely related genes. It is difficult to identify these errors during assembly, but in a phylogenetic analysis these errors can be diagnosed from gene trees containing clades of tips from the same specie...
Background
Repetitive elements are now known to have relevant cellular functions, including self-complementary sequences that form double stranded (ds) RNA. There are numerous pathways that determine the fate of endogenous dsRNA, and misregulation of endogenous dsRNA is a driver of autoimmune disease, particularly in the brain. Unfortunately, the a...
We present a 5-Myr probabilistic stack, which we name the Prob-stack. It is constructed from 180 globally distributed benthic δ¹⁸O records using a profile hidden Markov model (HMM). Benthic stacks have been extensively employed to estimate ages in marine sediment cores and lead-lag relationships between paleoclimate proxies. Because this stack is p...
Molecular phylogenetics is the study of evolutionary relationships between biological sequences, often to infer the evolutionary relationships of organisms. These studies require many analysis components, including sequence assembly, identification of homologous sequences, gene tree inference, and species tree inference. At present, each component...
Non-allelic homologous recombination (NAHR) is a common mechanism for generating genome rearrangements and is implicated in numerous genetic disorders, but its detection in high-throughput sequencing data poses a serious challenge. We present a probabilistic model of NAHR and demonstrate its ability to find NAHR in low-coverage sequencing data from...
The assessment of age uncertainty in stratigraphically aligned records is a pressing need in paleoceanographic research. The alignment of ocean sediment cores is used to develop mutually consistent age models for climate proxies and is often based on the δ18O of calcite from benthic foraminifera, which records a global ice volume and deep water tem...
The accurate and thorough genome-wide detection of adenosine-to-inosine editing, a biologically indispensable process, has proven challenging. Here, we present a discovery pipeline in adult Drosophila, with 3,581 high-confidence editing sites identified with an estimated accuracy of 87%. The target genes and specific sites highlight global biologic...
Validation and reproducibility of results is a central and pressing issue in genomics. Several recent embarrassing incidents involving the irreproducibility of high-profile studies have illustrated the importance of this issue and the need for rigorous methods for the assessment of reproducibility.
Here, we describe an existing statistical model th...
We describe an efficient, exact Bayesian algorithm applicable to both variable selection and model averaging problems. A fully Bayesian approach provides a more complete characterization of the posterior ensemble of possible sub-models, but presents a computational challenge as the number of candidate variables increases. While several approximatio...
Simulated sequence data
Data sets:
• 100 sets of simulated sequences were generated from a phylogenetic tree (γ-proteobacterial or yeast).
• 0 to 4 simulated sites were planted in the simulated sequences: Crp sites for γ-proteobacterial and STB5p sites for yeast.
• Four different algorithms were tested:
− Gibbs Recursive Sampler (provides MAP solut...
Auto-regulatory feedback loops are a common molecular strategy used to optimize protein function. In Drosophila, many messenger RNAs involved in neuro-transmission are re-coded at the RNA level by the RNA-editing enzyme, dADAR, leading to the incorporation of amino acids that are not directly encoded by the genome. dADAR also re-codes its own trans...
Abstract:
In this paper, we introduce the Bayesian Change Point and Variable Selection algorithm which utilizes dynamic programming recursions to draw direct samples from a very high-dimensional space in a computationally efficient manner, and apply this algorithm to a geoscience problem that concerns the Earth’s history of glaciation. Strong evide...
In this paper, we introduce the Bayesian Change Point and Variable Selection algorithm which utilizes dynamic programming recursions to draw direct samples from a very high-dimensional space in a computationally efficient manner, and apply this algorithm to a geoscience problem that concerns the Earth's history of glaciation. Strong evidence exists...
Stratigraphic alignment is the primary way in which long marine climate records are placed on a common age model. We present a probabilistic pairwise alignment algorithm based on Hidden Markov models (HMM) to estimate alignment uncertainty and apply it to the alignment of Pleistocene benthic δ18O records. This probabilistic algorithm improves upon...
RNA secondary structure plays an important role in the function of many RNAs, and structural features are often key to their interaction with other cellular components. Thus, there has been considerable interest in the prediction of secondary structures for RNA families. In this article, we present a new global structural alignment algorithm, RNAG,...
Chromatin structure affects the accessibility of DNA to transcription, repair, and replication. Changes in chromatin structure occur during development, but less is known about changes during aging. We examined the state of chromatin structure and its effect on gene expression during aging in Drosophila at the whole genome and cellular level using...
Although different paleoenvironmental time series resolve past climatic change at different time scales, nearly all share one characteristic: they are nonstationary over the length of the record sampled. We describe a recursive dynamic programming change point algorithm that is well suited to identify shifts in the Earth system's variability, as it...
Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on...
A recently developed DNaseI assay has given us our first genome-wide view of chromatin structure. In addition to cataloging DNaseI hypersensitive sites, these data allow us to more completely characterize overall features of chromatin accessibility. We employed a Bayesian hierarchical change-point model (CPM), a generalization of a hidden Markov M...
Gene, ND95, and P-Quantile information on all 20 sequence pairs.
(0.04 MB DOC)
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain,...
Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet, the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As it is well known, statistical decision theory shows that maximum...
RNA interference (RNAi) mediated by small interfering RNAs (siRNAs) or short hairpin RNAs (shRNAs) has become a powerful tool for gene knockdown studies. However, the levels of knockdown vary greatly. Here, we examine the effect of target disruption energy, a novel measure of target accessibility, along with other parameters that may affect RNAi ef...
Identification of functionally conserved regulatory elements in sequence data from closely related organisms is becoming feasible, due to the rapid growth of public sequence databases. Closely related organisms are most likely to have common regulatory motifs; however, the recent speciation of such organisms results in the high degree of correlatio...
The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. The Gibbs Centroid Sampler reports a centroid alignment, i.e. an alignment that has the minimum total distance to the set of samples chosen from the a posteriori probability distribution of transcription factor binding-site alignments....
Under the assumption that a significant motivation for sequencing the genomes of mammals is the resulting ability to help us locate and characterize functional DNA segments shared with humans, we have developed a statistical analysis to quantify the expected advantage. Examining uncertainty in terms of the width of a confidence interval, we show th...
Approaches based upon sequence weights, to construct a position weight matrix of nucleotides from aligned inputs, are popular but little effort has been expended to measure their quality. We derive optimal sequence weights that minimize the sum of the variances of the estimators of base frequency parameters for sequences related by a phylogenetic tr...
The Gibbs Motif Sampler (Gibbs) is a software package used to predict conserved elements in biopolymer sequences. Although the software can be used to locate conserved motifs in protein sequences, its most common use is the prediction of transcription factor binding sites (TFBSs) in promoters upstream of gene sequences. We will describe approaches...
When transcription factor binding sites are known for a particular transcription factor, it is possible to construct a motif model that can be used to scan sequences for additional sites. However, few statistically significant sites are revealed when a transcription factor binding site motif model is used to scan a genome-scale database.
We have de...
Supplementary Table 4. This table lists the sites and the q-values for each of the PurR binding site prediction experiments in Table 1 of the text.
Additional Information. This file includes legends for Supplementary Tables 2–4, which are included as additional files (see below). It includes samples of calculations described in Methods.
Supplementary Table 2. This table lists the orthologs and the orthologous intergenic regions used in this study.
Supplementary Table 3. This table lists the sites and the q-values for each of the Crp binding site prediction experiments in Table 1 of the text.
There is growing evidence of translational gene regulation at the mRNA level, and of the important roles of RNA secondary structure in these regulatory processes. Because mRNAs likely exist in a population of structures, the popular free energy minimization approach may not be well suited to prediction of mRNA structures in studies of post-transcri...
Histone amino termini are post-translationally modified by both transcriptional coactivators and corepressors, but the extent to which the relevant histone modifications contribute to gene expression, and the mechanisms by which they do so, are incompletely understood. To address this issue, we have examined the contributions of the histone H3 and...
The energy landscape of RNA secondary structures is often complex, and the Boltzmann-weighted ensemble usually contains distinct clusters. Furthermore, the minimum free energy structure often lies outside of the cluster containing the structure determined by comparative sequence analysis. We have developed procedures to characterize an...
Prediction of RNA secondary structure by free energy minimization has been the standard for over two decades. Here we describe a novel method that forsakes this paradigm for predictions based on Boltzmann-weighted structure ensemble. We introduce the notion of a centroid structure as a representative for a set of structures and describe a procedure...
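The centroid idea in the abstract above can be sketched in a few lines. This assumes each sampled structure is a set of base pairs `(i, j)` and that the distance between two structures is the number of base pairs in which they differ. Under that distance, the set of base pairs present in more than half of the samples minimizes the total distance to the sample. The function names and toy sample are hypothetical, not taken from the authors' software.

```python
from collections import Counter


def base_pair_distance(a, b):
    """Number of base pairs found in exactly one of the two structures."""
    return len(a ^ b)


def centroid_structure(samples):
    """Return the base pairs that occur in more than half of the samples.

    samples: list of sets of (i, j) tuples, one set per sampled structure.
    Two conflicting pairs cannot both exceed 50% frequency, so the result
    never pairs one base with two partners.
    """
    counts = Counter(pair for structure in samples for pair in structure)
    half = len(samples) / 2
    return {pair for pair, n in counts.items() if n > half}


# Toy sample of three structures (hypothetical, for illustration only).
toy_samples = [
    {(1, 10), (2, 9), (3, 8)},
    {(1, 10), (2, 9)},
    {(1, 10), (4, 7)},
]
toy_centroid = centroid_structure(toy_samples)
toy_total = sum(base_pair_distance(toy_centroid, s) for s in toy_samples)
```

In the toy sample, `(1, 10)` appears in all three structures and `(2, 9)` in two, so both are kept. `(3, 8)` and `(4, 7)` each appear only once and are dropped.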
The Gibbs Motif Sampler (Gibbs) is a software package for discovering conserved elements in biopolymer sequences. This unit describes the basic operation of the Web-based interface to Gibbs, along with advanced examples of its use, and the Web interface to dscan, a sequence database search program.
The γ-proteobacterium Shewanella oneidensis strain MR-1 is a metabolically versatile organism that can reduce a wide range of organic compounds, metal ions, and radionuclides. Similar to most other sequenced organisms, ≈40% of the predicted ORFs in the S. oneidensis genome were annotated as uncharacterized “hypothetical” genes. We implemented an in...
Introduction
In eukaryotic organisms, RNA interference (RNAi) is the sequence-specific gene silencing that is induced by double-stranded RNA (dsRNA) homologous to the silenced gene. In the cytoplasm of mammalian cells, long dsRNAs (>30 nt) can activate the potent interferon and a protein kinase-mediated pathway, which lead to non-sequence-specific...
Approaches based upon sequence weights, to construct a position weight matrix of nucleotides from aligned inputs, are popular but little effort has been expended to measure their quality. We derive optimal sequence weights that minimize the sum of the variances of the estimators of base frequency parameters for sequences related by a phylogenetic t...
Rhodopseudomonas palustris, an alpha-proteobacterium, carries out three of the chemical reactions that support life on this planet: the conversion of sunlight to chemical-potential energy; the absorption of carbon dioxide, which it converts to cellular material; and the fixation of atmospheric nitrogen into ammonia. Insight into the transcription-r...
Clusters of transcription factor binding sites (TFBSs) which direct gene expression constitute cis-regulatory modules (CRMs). We present a novel algorithm, based on Gibbs sampling, which locates, de novo, the cis features of these CRMs, their component TFBSs, and the properties of their spatial distribution. The algorithm finds 69% of experimentall...
The Sfold web server provides user-friendly access to Sfold, a recently developed nucleic acid folding software package, via the World Wide Web (WWW). The software is based on a new statistical sampling paradigm for the prediction of RNA secondary structure. One of the main objectives of this software is to offer computational tools for the rationa...
An RNA molecule, particularly a long-chain mRNA, may exist as a population of structures. Furthermore, multiple structures have been demonstrated to play important functional roles. Thus, a representation of the ensemble of probable structures is of interest. We present a statistical algorithm to sample rigorously and exactly from the Boltzmann en...
The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simult...
The Valencia International Meetings on Bayesian Statistics, held every four years, provide the main forum for researchers in the area of Bayesian Statistics to come together to present and discuss frontier developments in the field. The resulting Proceedings provide a definitive, up-to-date overview encompassing a wide range of theoretical and appl...
The identification of co-regulated genes and their transcription-factor binding sites (TFBS) are key steps toward understanding transcription regulation. In addition to effective laboratory assays, various computational approaches for the detection of TFBS in promoter regions of coexpressed genes have been developed. The availability of complete ge...
The clustering problem has attracted much attention from both statisticians and computer scientists in the past fifty years. Methods such as hierarchical clustering and the K-means method are convenient and competitive first choices off the shelf for the scientist. Gaussian mixture modeling is another popular but computationally expensive clusterin...
As the number of sequenced genomes has grown, the questions of which species are most useful and how many genomes are sufficient for comparison have become increasingly important for comparative genomics studies. We have systematically addressed these questions with respect to phylogenetic footprinting of transcription factor (TF) binding sites in...
Particle classification is an important component of multivariate statistical analysis methods that has been used extensively to extract information from electron micrographs of single particles. Here we describe a new Bayesian Gibbs sampling algorithm for the classification of such images. This algorithm, which is applied after dimension reduction...
The alpha/beta hydrolases constitute a large protein superfamily that mainly consists of enzymes that catalyze a diverse range of reactions. These proteins exhibit the alpha/beta hydrolase fold, the essential features of which have recently been delineated: the presence of at least five parallel beta-strands, a catalytic triad in a specific order (...
The α/β hydrolases constitute a large protein superfamily that mainly consists of enzymes that catalyze a diverse range of reactions. These proteins exhibit the α/β hydrolase fold, the essential features of which have recently been delineated: the presence of at least five parallel β-strands, a catalytic triad in a specific order (nucleophile-acid-...
The Smith-Waterman algorithm yields a single alignment, which, albeit optimal, can be strongly affected by the choice of the scoring matrix and the gap penalties. Additionally, the scores obtained are dependent upon the lengths of the aligned sequences, requiring a post-analysis conversion. To overcome some of these shortcomings, we developed a Bay...
A maximum likelihood approach has been proposed for finding protein binding sites on strands of DNA [G.D. Stormo, G.W. Hartzell, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183]. We formulate an optimization model for the problem and present calculations with experimental sequence data to study the behavior of this site id...
Single-stranded regions in RNA secondary structure are important for RNA-RNA and RNA-protein interactions. We present a probability profile approach for the prediction of these regions based on a statistical algorithm for sampling RNA secondary structures. For the prediction of phylogenetically-determined single-stranded regions in secondary struct...
Like many transposons the bacterial insertion sequence IS903 was thought to insert randomly. However, using both genetic and statistical approaches, we have derived a target site for IS903 that is used 84% of the time. Computational and genetic analyses of multiple IS903 insertion sites predicted a preferred target consisting of a 21 bp palindromic...
Toward the goal of identifying complete sets of transcription factor (TF)-binding sites in the genomes of several gamma proteobacteria, and hence describing their transcription regulatory networks, we present a phylogenetic footprinting method for identifying these sites. Probable transcription regulatory sites upstream of Escherichia coli genes we...
Questions
Question (1)
Bayesian Adaptive Sequence Alignment Algorithms. Zhu J, Liu JS, Lawrence CE. Bioinformatics, 14:25-39, 1998.
The Bayesian Change Point and Variable Selection Algorithm: Application to the δ18O Proxy Record of the Plio-Pleistocene (2014) Ruggieri, Eric, Lawrence, C. E., Journal of Computational and Graphical Statistics, Volume 23, Number 1, Pages 87–110 DOI: 10.1080/10618600.2012.707852