-
ABSTRACT: Because of the increasing diversity of data sets and measurement techniques in biology, a growing spectrum of modeling methods is being developed. It is generally recognized that it is critical to pick the appropriate method to exploit the amount and type of biological data available for a given system. Here, we describe a method for use in situations where temporal data from a network is collected over multiple time points, and in which little prior information is available about the interactions, mathematical structure, and statistical distribution of the network. Our method results in models that we term Nonparametric exterior derivative estimation Ordinary Differential Equation (NODE) models. We illustrate the method's utility using spatiotemporal gene expression data from Drosophila melanogaster embryos. We demonstrate that the NODE model's use of the temporal characteristics of the network leads to quantifiable improvements in its predictive ability over nontemporal models that only rely on the spatial characteristics of the data. The NODE model provides exploratory visualizations of network behavior and structure, which can identify features that suggest additional experiments. A new extension is also presented that uses the NODE model to generate a comb diagram, a figure that presents a list of possible network structures ranked by plausibility. By quantifying a continuum of interaction likelihoods, this helps to direct future experiments.
Methods in cell biology 01/2012; 110:243-61. · 2.05 Impact Factor
-
ABSTRACT: Collinearity and near-collinearity of predictors cause difficulties when
doing regression. In these cases, variable selection becomes untenable because
of mathematical issues concerning the existence and numerical stability of the
regression coefficients, and interpretation of the coefficients is ambiguous
because gradients are not defined. Using a differential geometric
interpretation, in which the regression coefficients are interpreted as
estimates of the exterior derivative of a function, we develop a new method to
do regression in the presence of collinearities. Our regularization scheme can
improve estimation error, and it can be easily modified to include lasso-type
regularization. These estimators also have simple extensions to the "large $p$,
small $n$" context.
03/2011;
-
ABSTRACT: In a smooth semiparametric estimation problem, the marginal posterior for the
parameter of interest is expected to be asymptotically normal and satisfy
frequentist criteria of optimality if the model is endowed with a suitable
prior. It is shown that, under certain straightforward and interpretable
conditions, the assertion of Le Cam's acclaimed, but strictly parametric,
Bernstein-von Mises theorem [Univ. California Publ. Statist. 1 (1953) 277-329]
holds in the semiparametric situation as well. As a consequence, Bayesian
point-estimators achieve efficiency, for example, in the sense of Hájek's
convolution theorem [Z. Wahrsch. Verw. Gebiete 14 (1970) 323-330]. The model is
required to satisfy differentiability and metric entropy conditions, while the
nuisance prior must assign nonzero mass to certain Kullback-Leibler
neighborhoods [Ghosal, Ghosh and van der Vaart Ann. Statist. 28 (2000)
500-531]. In addition, the marginal posterior is required to converge at
parametric rate, which appears to be the most stringent condition in examples.
The results are applied to estimation of the linear coefficient in partial
linear regression, with a Gaussian prior on a smoothness class for the
nuisance.
07/2010;
-
ABSTRACT: The correlation between the expression levels of transcription factors and their target genes can be used to infer interactions within animal regulatory networks, but current methods are limited in their ability to make correct predictions.
Here we describe a novel approach which uses nonparametric statistics to generate ordinary differential equation (ODE) models from expression data. Compared to other dynamical methods, our approach requires minimal information about the mathematical structure of the ODE; it does not use qualitative descriptions of interactions within the network; and it employs new statistics to protect against over-fitting. It generates spatio-temporal maps of factor activity, highlighting the times and spatial locations at which different regulators might affect target gene expression levels. We identify an ODE model for eve mRNA pattern formation in the Drosophila melanogaster blastoderm and show that this reproduces the experimental patterns well. Compared to a non-dynamic, spatial-correlation model, our ODE gives 59% better agreement to the experimentally measured pattern. Our model suggests that protein factors frequently have the potential to behave as both an activator and inhibitor for the same cis-regulatory module depending on the factors' concentration, and implies different modes of activation and repression.
Our method provides an objective quantification of the regulatory potential of transcription factors in a network, is suitable for both low- and moderate-dimensional gene expression datasets, and includes improvements over existing dynamic and static models.
BMC Bioinformatics 01/2010; 11:413. · 2.75 Impact Factor
-
Stewart MacArthur,
Xiao-Yong Li,
Jingyi Li,
James B Brown,
Hou Cheng Chu,
Lucy Zeng,
Brandi P Grondona,
Aaron Hechmer,
Lisa Simirenko,
Soile V E Keränen,
David W Knowles,
Mark Stapleton, Peter Bickel,
Mark D Biggin,
Michael B Eisen
ABSTRACT: We previously established that six sequence-specific transcription factors that initiate anterior/posterior patterning in Drosophila bind to overlapping sets of thousands of genomic regions in blastoderm embryos. While regions bound at high levels include known and probable functional targets, more poorly bound regions are preferentially associated with housekeeping genes and/or genes not transcribed in the blastoderm, and are frequently found in protein coding sequences or in less conserved non-coding DNA, suggesting that many are likely non-functional.
Here we show that an additional 15 transcription factors that regulate other aspects of embryo patterning show a similar quantitative continuum of function and binding to thousands of genomic regions in vivo. Collectively, the 21 regulators show a surprisingly high overlap in the regions they bind given that they belong to 11 DNA binding domain families, specify distinct developmental fates, and can act via different cis-regulatory modules. We demonstrate, however, that quantitative differences in relative levels of binding to shared targets correlate with the known biological and transcriptional regulatory specificities of these factors.
It is likely that the overlap in binding of biochemically and functionally unrelated transcription factors arises from the high concentrations of these proteins in nuclei, which, coupled with their broad DNA binding specificities, directs them to regions of open chromatin. We suggest that most animal transcription factors will be found to show a similar broad overlapping pattern of binding in vivo, with specificity achieved by modulating the amount, rather than the identity, of bound factor.
Genome biology 08/2009; 10(7):R80. · 6.63 Impact Factor
-
ABSTRACT: Some astronomy projects require a blind search through a vast number of hypotheses to detect objects of interest. The number of hypotheses to test can be in the billions. A naive blind search over every single hypothesis would be far too costly computationally. We propose a hierarchical scheme for blind search, using various "resolution" levels. At lower resolution levels, "regions" of interest in the search space are singled out with a low computational cost. These regions are refined at intermediate resolution levels and only the most promising candidates are finally tested at the original fine resolution. The optimal search strategy is found by dynamic programming. We demonstrate the procedure for pulsar search from satellite gamma-ray observations and show that the power of the naive blind search can almost be matched with the hierarchical scheme while reducing the computational burden by more than three orders of magnitude. Comment: Published at http://dx.doi.org/10.1214/08-AOAS180 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
12/2007;
-
ABSTRACT: This paper treats the problem of detecting periodicity in a sequence of photon arrival times, which occurs, for example, in attempting to detect gamma-ray pulsars. A particular focus is on how auxiliary information, typically source intensity, background intensity, and incidence angles and energies associated with each photon arrival should be used to maximize the detection power. We construct a class of likelihood-based tests, score tests, which give rise to event weighting in a principled and natural way, and derive expressions quantifying the power of the tests. These results can be used to compare the efficacies of different weight functions, including cuts in energy and incidence angle. The test is targeted toward a template for the periodic lightcurve, and we quantify how deviation from that template affects the power of detection.
06/2007;
-
ABSTRACT: Transcription factors and many other DNA-binding proteins recognize more than one specific sequence. Among sequences recognized by a given DNA-binding protein, different positions exhibit varying degrees of conservation. The reason is that base pairs that are more extensively contacted by the protein tend to be more conserved. This observation can be used in the discovery of transcription factor binding sites. Here we present a rigorous means to accomplish this. In particular, we constrain the order of the information (entropy) in the columns of the position specific weight matrix (PWM) which characterizes the motif being sought. We then show how to compute the maximum likelihood estimate of a PWM under such order restrictions. This computation is easily integrated with the EM algorithm or the Gibbs sampler to enhance performance in the search for motifs in unaligned sequences. We demonstrate our method on a well-known data set of binding sites of the transcription factor Crp in E. coli.
Statistical Applications in Genetics and Molecular Biology 02/2007; 4(1):1-1. · 1.52 Impact Factor
-
ABSTRACT: State space models have long played an important role in signal processing. The Gaussian case can be treated algorithmically using the famous Kalman filter. Similarly, since the 1970s there has been extensive application of hidden Markov models (HMMs) in speech recognition, with prediction being the most important goal. The basic theoretical work here, for the case of $X$ and $Y$ finite (and small), providing both algorithms and asymptotic analysis for inference, is that of Baum and colleagues. During the last 30-40 years these general models have proved of great value in applications ranging from genomics to finance. Unless $X,Y$ are jointly Gaussian or $X$ is finite and small, exact calculation of the distributions discussed and of the likelihood is numerically intractable, and if $Y$ is not finite, asymptotic analysis becomes much more difficult. Some new developments have been the construction of so-called "particle filter" (Monte Carlo type) methods for approximate calculation of these distributions (see, for instance, Doucet et al. [4]) and general asymptotic methods for the analysis of statistical methods in HMMs ([2] and other authors). We will discuss these methods and results in the light of exponential mixing properties of the conditional (posterior) distribution of $(X_1,X_2,...)$ given $(Y_1,Y_2,...)$, already noted by Baum and Petrie, and recent work of the authors Bickel, Ritov and Ryden, Del Moral and Jacod, Douc and Matias.
12/2002;
-
ABSTRACT: This paper gives an overview of some of our activities, focusing on gathering statistics on traffic flow over the network of freeways in Los Angeles and on the prediction of travel times over this network. The paper is organized as follows: The freeway system of Los Angeles is equipped with a densely deployed array of sensors, loop detectors, which we describe in the next section. Information from these sensors is captured in real time, displayed, and archived by the Freeway Performance Measurement System, as described in section 1.3. In section 1.4 we briefly describe our attempts to globally model the evolution of the fascinating spatial-temporal field of traffic flow. Ultimately, however, rather than trying to fit and update such comprehensive models, we found it preferable to use simpler, direct methods. These are described in section 1.5 for the purpose of predicting the particular functional of interest, travel time. Section 1.6 contains final remarks.
01/2002;
-
Katherine Campbell,
Sallie Keller-McNulty,
Elizabeth Kelly,
Richard A. Berk,
Robert Fovell,
Rodman Linn,
Frederic Schoenberg,
Nagui Rouphail,
Jerome Sacks,
Byungkyu Park,
Alan Perelson, Peter Bickel
ABSTRACT: As decision- and policy-makers come to rely increasingly on estimates and simulations produced by computerized models of the world, in areas as diverse as climate prediction, transportation planning, economic policy and civil engineering, the need for objective evaluation of the accuracy and utility of such models likewise becomes more urgent. This article summarizes a two-day workshop that took place in Santa Fe, New Mexico in December 1999, whose focus was the evaluation of complex computer models. Approximately half of the workshop was taken up with formal presentation of four computer models by their creators, each paired with an initial assessment by a statistician. These prepared papers are presented, in shortened form, in Section 3 of this paper. The remainder of the workshop was devoted to introductory and summary comments, short contributed descriptions of related models, and a great deal of floor discussion, which was recorded by assigned rapporteurs. These are presented in Sections 2 and 4 in the paper. In the introductory and concluding sections we attempt to summarize the progress made by the workshop and suggest next steps. Key words and phrases: model accuracy, model evaluation, model validation, uncertainty analysis, computer experiments, statistically equivalent models, model-based decisions.
08/2001;
-
ABSTRACT: This paper presents an approach to estimate future travel times on a freeway using flow and occupancy data from single loop detectors and historical travel time information. The work uses linear regression with stepwise variable selection and more advanced tree-based methods. The analysis considers forecasts ranging from a few minutes into the future up to an hour ahead. Leave-a-day-out cross-validation was used to evaluate the prediction errors without under-estimation. The current traffic state proved to be a good predictor for the near future, up to 20 minutes, while historical data is more informative for longer-range predictions. Tree-based methods and linear regression both performed satisfactorily, showing slightly different qualitative behaviors for each condition examined in this analysis. Unlike preceding works that rely on simulation, this study uses real traffic data. Although the current implementation uses measured travel times from probe vehicles, the ultimate goal of this research is an autonomous system that relies strictly on detector data. In the course of presenting the prediction system, the paper examines how travel times change from day to day and develops several metrics to quantify these changes. The metrics can be used as input for travel time prediction, but they should also be beneficial for other applications such as calibrating traffic models and planning models. Keywords: loop detectors, travel time prediction, advanced traveler information systems (ATIS), regression, cross-validation.
02/2001;
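The leave-a-day-out cross-validation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each record is a hypothetical `(day, x, y)` tuple, where `x` is a single predictor (for example the current travel time) and `y` is the travel time to be predicted, and it swaps in a one-variable least-squares line for the paper's stepwise and tree-based models. All function names here are illustrative and do not come from the paper.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate case: all x equal, so predict the mean of y.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def leave_a_day_out_rmse(records):
    """Hold out one whole day at a time, fit on the remaining days,
    and return the root-mean-square prediction error over all held-out points."""
    by_day = defaultdict(list)
    for day, x, y in records:
        by_day[day].append((x, y))
    if len(by_day) < 2:
        raise ValueError("need at least two distinct days")

    squared_errors = []
    for held_out in by_day:
        # Training set: every observation from every other day.
        train = [p for d, pts in by_day.items() if d != held_out for p in pts]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        for x, y in by_day[held_out]:
            squared_errors.append((y - (slope * x + intercept)) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5
```

Holding out entire days, rather than individual observations, keeps strongly correlated same-day measurements out of the training set, which is one way to avoid the under-estimated errors the abstract refers to.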
-
ABSTRACT: As advanced traveler information systems become increasingly prevalent the importance of accurately estimating link travel times grows. Unfortunately, the predominant source of highway traffic information comes from single-trap loop detectors which do not directly measure vehicle speed. The conventional method of estimating speed, and hence travel time, from the single-trap data is to make a common vehicle length assumption and to use a resulting identity relating density, flow, and speed. Hall and Persaud [Hall and Persaud, 1989] and Pushkar, Hall, and Acha-Daza [Pushkar et al., 1994] presented results that suggest that these speed estimates may be flawed. In this paper we present a methodology to estimate link travel times directly from the single-trap loop detector flow and occupancy data without heavy reliance on the possibly flawed speed calculations. Our methods arise naturally from an intuitive stochastic model of traffic flow. We demonstrate by example on data collected on I-880 dat...
04/1999;
-
ABSTRACT: In this paper, we conduct an empirical comparison of travel time estimation methods based on single-loop detector data. The methods of concern are the regression method based on an intuitive stochastic model as proposed by Petty et al. in [7], and the conventional method of using an identity relating speed, flow and occupancy with the assumption of a common vehicle length. The analysis is tailored to fit within the limitations imposed by available field data sets. We also introduce several variations of the regression method and give examples which suggest directions for future work to further improve the regression method. The comparison is composed of three interrelated parts, each with a different focus: a local comparison (concerning a single link of freeway), a comparison of estimated section travel times over a prolonged stretch of freeway with multiple links, and a visualized approach which enables investigation of performance patterns in time and space of the estimation methods.
Institute of Transportation Studies, UC Berkeley, Institute of Transportation Studies, Research Reports, Working Papers, Proceedings. 01/1999;
-
ABSTRACT: As advanced traveler information systems become increasingly prevalent, the importance of accurately estimating link travel times grows. Unfortunately, the predominant source of highway traffic information is single-loop detectors, which do not directly measure vehicle speed. The conventional method of estimating speed, and hence travel time, from the single-loop data is to make a common vehicle length assumption and to use a resulting identity relating density, flow, and speed. Hall and Persaud (Transportation Research Record 1232, 9-16, 1989) and Pushkar et al. (Transportation Research Record 1457, 149-157, 1994) show that these speed estimates are flawed. In this paper we present a methodology to estimate link travel times directly from the single-loop detector flow and occupancy data without heavy reliance on the flawed speed calculations. Our methods arise naturally from an intuitive stochastic model of traffic flow. We demonstrate by example on data collected on I-880 (Skabardonis et al. Technical Report UCB-ITS-PRR-95-S, Institute of Transportation Studies, University of California, 1994) that when the loop detector data has a fine resolution (about one second), the single-loop based estimates of travel time can accurately track the true travel time through many degrees of congestion. Probe vehicle data and double-loop based travel time estimates corroborate the accuracy of our methods in our examples.
Transportation Research Part A Policy and Practice 01/1998; 32(1):1-17. · 2.35 Impact Factor
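For orientation, the conventional speed estimate that the abstract above argues against can be sketched as follows. The sketch assumes the standard relations flow = density × speed and occupancy = density × effective vehicle length, which give speed = flow × length / occupancy. The 6 m default vehicle length and the function names are illustrative assumptions, not values or code from the paper.

```python
def speed_from_single_loop(flow_veh_per_hr, occupancy, vehicle_length_m=6.0):
    """Conventional single-loop speed estimate in km/h.

    flow_veh_per_hr: vehicles per hour counted by the detector.
    occupancy: fraction of time (0..1] the detector is occupied.
    vehicle_length_m: assumed common effective vehicle length (hypothetical default).
    """
    if occupancy <= 0:
        raise ValueError("occupancy must be positive")
    length_km = vehicle_length_m / 1000.0
    # density = occupancy / length, so speed = flow / density.
    return flow_veh_per_hr * length_km / occupancy


def link_travel_time_s(link_length_km, speed_kmh):
    """Travel time in seconds over a link traversed at the given speed."""
    return 3600.0 * link_length_km / speed_kmh


if __name__ == "__main__":
    v = speed_from_single_loop(1800, 0.12)  # roughly 90 km/h under the 6 m assumption
    print(round(v, 1), round(link_travel_time_s(1.0, v), 1))
```

Because the estimate scales linearly with the assumed vehicle length, any mismatch between that assumption and the actual vehicle mix passes straight into the speed and travel time figures.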
-
ABSTRACT: As advanced traveler information systems become increasingly prevalent, the importance of accurately estimating link travel times grows. Unfortunately, the predominant source of highway traffic information is single-loop detectors, which do not directly measure vehicle speed. The conventional method of estimating speed, and hence travel time, from the single-loop data is to make a common vehicle length assumption and to use a resulting identity relating density, flow, and speed. Hall and Persaud (Transportation Research Record 1232, 9–16, 1989) and Pushkar et al. (Transportation Research Record 1457, 149–157, 1994) show that these speed estimates are flawed. In this paper we present a methodology to estimate link travel times directly from the single-loop detector flow and occupancy data without heavy reliance on the flawed speed calculations. Our methods arise naturally from an intuitive stochastic model of traffic flow. We demonstrate by example on data collected on I-880 (Skabardonis et al. Technical Report UCB-ITS-PRR-95-S, Institute of Transportation Studies, University of California, 1994) that when the loop detector data has a fine resolution (about one second), the single-loop based estimates of travel time can accurately track the true travel time through many degrees of congestion. Probe vehicle data and double-loop based travel time estimates corroborate the accuracy of our methods in our examples.
Transportation Research Part A: Policy and Practice.
-
Richard A. Berk, Peter Bickel,
Katherine Campbell,
Robert Fovell,
Sallie Keller-McNulty,
Elizabeth Kelly,
Rodman Linn,
Byungkyu Park,
Alan Perelson,
Nagui Rouphail,
Jerome Sacks,
Frederic Schoenberg
ABSTRACT: As decision- and policy-makers come to rely increasingly on estimates and simulations produced by computerized models of the world, in areas as diverse as climate prediction, transportation planning, economic policy and civil engineering, the need for objective evaluation of the accuracy and utility of such models likewise becomes more urgent. This article summarizes a two-day workshop that took place in Santa Fe, New Mexico in December 1999, whose focus was the evaluation of complex computer models. Approximately half of the workshop was taken up with formal presentation of four computer models by their creators, each paired with an initial assessment by a statistician. These prepared papers are presented, in shortened form, in Section 3 of this paper. The remainder of the workshop was devoted to introductory and summary comments, short contributed descriptions of related models and a great deal of floor discussion, which was recorded by assigned rapporteurs. These are presented in Sections 2 and 4 in the paper. In the introductory and concluding sections we attempt to summarize the progress made by the workshop and suggest next steps.