Multiple Hypotheses Testing on Dependent Count Data with Covariate Effects

The dynamics of a sequence of count data are usually not only affected by the underlying hidden states to be detected, but also quite likely associated with other static or dynamically changing covariates. The multiple hypotheses testing procedure developed here takes these covariates into consideration through a Poisson regression model. A hidden Markov process is also applied to model the switches between the null and non-null states as well as the dependence across counts. All model parameters are estimated through Bayesian computation. While a simple distribution is assumed under the null state, the observation distribution under the non-null state usually requires more flexibility; here a mixture of parametric distributions is assumed. The number of mixture components is decided by model selection criteria, including the Bayesian Information Criterion as well as marginal likelihood methods. Simulation studies are carried out to evaluate the performance of the proposed model and of the model selection methods. The real data example shows the application of the proposed model, and its inference goal differs from that of previous testing procedures in which no covariate effects are considered.
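As a rough sketch of the data-generating process the abstract describes (all parameter values below are hypothetical, not taken from the paper), one can simulate counts whose log-mean combines a state-specific intercept from a two-state hidden Markov chain with a covariate effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: state 0 = null, state 1 = non-null.
A = np.array([[0.95, 0.05],    # transition matrix of the hidden chain
              [0.10, 0.90]])
beta = np.array([0.2, 1.5])    # state-specific intercepts (log scale)
gamma = 0.5                    # shared covariate coefficient

T = 500
x = rng.normal(size=T)         # a covariate series
states = np.empty(T, dtype=int)
counts = np.empty(T, dtype=int)

states[0] = 0
for t in range(T):
    if t > 0:
        states[t] = rng.choice(2, p=A[states[t - 1]])
    # Poisson regression: log-mean = state intercept + covariate effect
    lam = np.exp(beta[states[t]] + gamma * x[t])
    counts[t] = rng.poisson(lam)
```

The testing problem is then to recover, for each t, whether the hidden state was null or non-null given only `counts` and `x`.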

We introduce the weighted likelihood bootstrap (WLB) as a way to simulate approximately from a posterior distribution. This method is often easy to implement, requiring only an algorithm for calculating the maximum likelihood estimator, such as iteratively reweighted least squares. In the generic weighting scheme, the WLB is first order correct under quite general conditions. Inaccuracies can be removed by using the WLB as a source of samples in the sampling-importance resampling (SIR) algorithm, which also allows incorporation of particular prior information. The SIR-adjusted WLB can be a competitive alternative to other integration methods in certain models. Asymptotic expansions elucidate the second-order properties of the WLB, which is a generalization of Rubin’s Bayesian bootstrap [D. B. Rubin, Ann. Stat. 9, 130-134 (1981)]. The calculation of approximate Bayes factors for model comparison is also considered. We note that, given a sample simulated from the posterior distribution, the required marginal likelihood may be simulation consistently estimated by the harmonic mean of the associated likelihood values; a modification of this estimator that avoids instability is also noted. These methods provide simple ways of calculating approximate Bayes factors and posterior model probabilities for a very wide class of models.
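A minimal sketch of the WLB idea for an i.i.d. Poisson model, where the weighted maximum likelihood estimate has a closed form (the weighted mean), so no iterative solver is needed; the data and weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(4.0, size=200)          # observed counts, true mean 4

# Weighted likelihood bootstrap: each draw reweights the log-likelihood
# with Dirichlet(1, ..., 1) weights and maximizes it.  For an i.i.d.
# Poisson(lam) model the weighted MLE is simply the weighted mean.
B = 2000
w = rng.dirichlet(np.ones(len(y)), size=B)   # B x n weight vectors
posterior_draws = w @ y                      # one weighted MLE per draw

est = posterior_draws.mean()   # approximate posterior mean of lam
```

The spread of `posterior_draws` then approximates posterior uncertainty about the Poisson mean without any MCMC.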
Poisson regression models the noisy output of a counting function as a Poisson random variable, with a log-mean parameter that is a linear function of the input vector. In this work, we analyze Poisson regression in a Bayesian setting, by introducing a prior distribution on the weights of the linear function. Since exact inference is analytically unobtainable, we derive a closed-form approximation to the predictive distribution of the model. We show that the predictive distribution can be kernelized, enabling the representation of non-linear log-mean functions. We also derive an approximate marginal likelihood that can be optimized to learn the hyperparameters of the kernel. We then relate the proposed approximate Bayesian Poisson regression to Gaussian processes. Finally, we present experimental results using Bayesian Poisson regression for crowd counting from low-level features.
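Since the paper's closed-form approximation is not reproduced here, the following sketch illustrates the generic ingredients in one dimension: a Gaussian prior on the weight, a Newton search for the posterior mode, and a Laplace (Gaussian) approximation built from the curvature at that mode. All data and values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# One-dimensional Poisson regression y_t ~ Poisson(exp(w * x_t)) with a
# N(0, sigma2) prior on the weight w; synthetic data, not the paper's.
x = rng.normal(size=300)
true_w = 0.8
y = rng.poisson(np.exp(true_w * x))
sigma2 = 10.0

# Newton's method on the (concave) log posterior gives the MAP estimate;
# the negative inverse Hessian there is the Laplace variance.
w = 0.0
for _ in range(50):
    grad = np.sum(x * (y - np.exp(w * x))) - w / sigma2
    hess = -np.sum(x**2 * np.exp(w * x)) - 1.0 / sigma2
    w -= grad / hess

laplace_var = -1.0 / hess   # Gaussian approximation: N(w, laplace_var)
```

The kernelized, multivariate version in the paper follows the same pattern with a covariance function in place of the scalar prior variance.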
Wearable devices including accelerometers are increasingly being used to collect high-frequency human activity data in situ. There is tremendous potential to use such data to inform medical decision making and public health policies. However, modeling such data is challenging as they are high-dimensional, heterogeneous, and subject to informative missingness, e.g., zero readings when the device is removed by the participant. We propose a flexible and extensible continuous-time hidden Markov model to extract meaningful activity patterns from human accelerometer data. To facilitate estimation with massive data we derive an efficient learning algorithm that exploits the hierarchical structure of the parameters indexing the proposed model. We also propose a bootstrap procedure for interval estimation. The proposed methods are illustrated using data from the 2003–2004 and 2005–2006 National Health and Nutrition Examination Survey.
Multiple testing on dependent count data faces two basic modelling elements: the choice of distributions under the null and the non-null states and the modelling of the dependence structure across observations. A Bayesian hidden Markov model is constructed for Poisson count data to handle these two issues. The proposed Bayesian method is based on the posterior probability of the null state and exhibits the property of an optimal test procedure, which has the lowest false-negative rate with the false discovery rate under control. Furthermore, the model uses either a single Poisson distribution or a mixture of Poisson distributions under the non-null state. Model selection methods are employed here to decide the number of components in the mixture. Different approaches of calculating marginal likelihood are discussed. Extensive simulation studies and a case study are employed to examine and compare a collection of commonly used testing procedures and model selection criteria.
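The decision rule implied by such posterior probabilities — reject hypotheses in increasing order of the posterior null probability while the running average, a Bayes estimate of the FDR, stays within the nominal level — can be sketched generically as follows (the probabilities below are illustrative, and this is not the authors' code):

```python
import numpy as np

def bayes_fdr_decisions(p_null, alpha=0.10):
    """Reject hypotheses in increasing order of the posterior null
    probability while the running average (an estimate of the FDR)
    stays at or below alpha.  Returns a boolean rejection vector."""
    p_null = np.asarray(p_null, dtype=float)
    order = np.argsort(p_null)
    running_avg = np.cumsum(p_null[order]) / np.arange(1, len(p_null) + 1)
    k = np.sum(running_avg <= alpha)        # largest admissible cutoff
    reject = np.zeros(len(p_null), dtype=bool)
    reject[order[:k]] = True
    return reject

decisions = bayes_fdr_decisions([0.01, 0.02, 0.40, 0.90, 0.03], alpha=0.10)
```

Because the running average of sorted probabilities is nondecreasing, counting the entries at or below alpha identifies the largest valid rejection set.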
Hidden Markov models are stochastic models in which the observations are assumed to follow a mixture distribution, but the parameters of the components are governed by a Markov chain which is unobservable. The issues related to the estimation of Poisson-hidden Markov models, in which the observations come from a mixture of Poisson distributions and the parameters of the component Poisson distributions are governed by an m-state Markov chain with an unknown transition probability matrix, are explained here. These methods were applied to the data on Vibrio cholerae counts reported every month over an 11-year span at Christian Medical College, Vellore, India. Using the Viterbi algorithm, the best estimate of the state sequence was obtained and hence the transition probability matrix. The mean passage times between the states were estimated. The 95% confidence interval for the mean passage time was estimated via Monte Carlo simulation. The three hidden states of the estimated Markov chain are labelled as 'Low', 'Moderate' and 'High' with mean counts of 1.4, 6.6 and 20.2 and an estimated average duration of stay of 3, 3 and 4 months, respectively. Environmental risk factors were studied using Markov ordinal logistic regression analysis. No significant association was found between disease severity levels and climate components.
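A minimal sketch of the Viterbi decoding step for a Poisson-hidden Markov model (a toy two-state example with hypothetical 'Low'/'High' means, not the cholera data):

```python
import numpy as np
from math import lgamma, log

def poisson_logpmf(y, lam):
    return y * log(lam) - lam - lgamma(y + 1)

def viterbi_poisson(counts, log_A, log_pi, lams):
    """Most likely state sequence of a Poisson hidden Markov model.
    log_A: log transition matrix, log_pi: log initial distribution,
    lams: Poisson mean of each hidden state."""
    m, T = len(lams), len(counts)
    delta = np.empty((T, m))          # best log-score ending in each state
    back = np.empty((T, m), dtype=int)
    for s in range(m):
        delta[0, s] = log_pi[s] + poisson_logpmf(counts[0], lams[s])
    for t in range(1, T):
        for s in range(m):
            scores = delta[t - 1] + log_A[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + poisson_logpmf(counts[t], lams[s])
    path = np.empty(T, dtype=int)     # backtrack the best path
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy run: 'Low' vs 'High' means, counts that clearly switch regimes.
A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
pi = np.log(np.array([0.5, 0.5]))
path = viterbi_poisson([1, 2, 1, 15, 18, 20, 2, 1], A, pi, [1.5, 18.0])
```

On this toy input the decoded path tracks the obvious low/high regimes in the counts.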
An optimal and flexible multiple hypotheses testing procedure is constructed for dependent data based on Bayesian techniques, aiming at handling two challenges, namely the dependence structure and non-null distribution specification. Ignoring dependence among hypothesis tests may lead to loss of efficiency and biased decisions. Misspecification of the non-null distribution, on the other hand, can result in both false positive and false negative errors. Hidden Markov models are used to accommodate the dependence structure among the tests. A Dirichlet process mixture prior is applied to the non-null distribution to overcome the potential pitfalls of distribution misspecification. The testing algorithm based on Bayesian techniques optimizes the false negative rate (FNR) while controlling the false discovery rate (FDR). The procedure is applied to pointwise and clusterwise analysis. Its performance is compared with existing approaches using both simulated and real data examples.
This paper surveys several econometric techniques for dealing with switching regressions. More general formulations, designed to produce maximum likelihood estimates, are introduced, and the problem of numerical optimization discussed. Also examined are extensions to Markov models, simultaneous equations, and switching of causal directions.
Let p_i(w), i = 1, 2, be two densities with common support, where each density is known up to a normalizing constant: p_i(w) = q_i(w)/c_i. We have draws from each density (e.g., via Markov chain Monte Carlo), and we want to use these draws to estimate the ratio of the normalizing constants, c1/c2. Such a computational problem is often encountered in likelihood and Bayesian inference, and arises in fields such as physics and genetics. Many methods proposed in the statistical and other literature (e.g., computational physics) for dealing with this problem are based on various special cases of the following simple identity: c1/c2 = E2[q1(w)α(w)] / E1[q2(w)α(w)]. Here E_i denotes the expectation with respect to p_i (i = 1, 2), and α is an arbitrary function such that the denominator is non-zero. A main purpose of this paper is to provide a theoretical study of the usefulness of this identity, with focus on (asymptotically) optimal and practical choices of α. Using a simple but informative example, we demonstrate that with sensible (not necessarily optimal) choices of α, we can reduce the simulation error by orders of magnitude when compared to the conventional importance sampling method, which corresponds to α = 1/q2. We also introduce several generalizations of this identity for handling more complicated settings (e.g., estimating several ratios simultaneously) and pose several open problems that appear to have practical as well as theoretical value. Furthermore, we discuss related theoretical and empirical work.
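The identity can be checked numerically on a toy case where both normalizing constants are known; the α below is one simple stabilizing choice among the many admissible functions the paper discusses:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two unnormalized Gaussian densities with known normalizing constants,
# so the estimate can be checked: q_i(w) = exp(-w^2 / (2 s_i^2)) with
# c_i = s_i * sqrt(2 * pi), hence c1/c2 = s1/s2 = 0.5.
s1, s2 = 1.0, 2.0
q1 = lambda w: np.exp(-w**2 / (2 * s1**2))
q2 = lambda w: np.exp(-w**2 / (2 * s2**2))

w1 = rng.normal(0, s1, size=200_000)   # draws from p1
w2 = rng.normal(0, s2, size=200_000)   # draws from p2

# Identity: c1/c2 = E2[q1 * alpha] / E1[q2 * alpha].
alpha = lambda w: 1.0 / (q1(w) + q2(w))   # a simple stabilizing choice
ratio_hat = np.mean(q1(w2) * alpha(w2)) / np.mean(q2(w1) * alpha(w1))
```

Substituting the definitions shows both expectations integrate q1 q2 α against the common support, so their ratio is exactly c1/c2.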
Time-series of count data are generated in many different contexts, such as web access logging, freeway traffic monitoring, and security logs associated with buildings. Since this data measures the aggregated behavior of individual human beings, it typically exhibits a periodicity in time on a number of scales (daily, weekly, etc.) that reflects the rhythms of the underlying human activity and makes the data appear non-homogeneous. At the same time, the data is often corrupted by a number of bursty periods of unusual behavior such as building events, traffic accidents, and so forth. The data mining problem of finding and extracting these anomalous events is made difficult by both of these elements. In this paper we describe a framework for unsupervised learning in this context, based on a time-varying Poisson process model that can also account for anomalous events. We show how the parameters of this model can be learned from count time series using statistical estimation techniques. We demonstrate the utility of this model on two data sets for which we have partial ground truth in the form of known events, one from freeway traffic data and another from building access data, and show that the model performs significantly better than a non-probabilistic, threshold-based technique. We also describe how the model can be used to investigate different degrees of periodicity in the data, including systematic day-of-week and time-of-day effects, and make inferences about the detected events (e.g., popularity or level of attendance). Our experimental results indicate that the proposed time-varying Poisson model provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
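A toy sketch of the core idea — estimate a periodic baseline rate per hour of day, then flag counts whose Poisson tail probability under that baseline is very small. The data, rates, and threshold below are all hypothetical, and this simple plug-in baseline stands in for the paper's full probabilistic model:

```python
import numpy as np
from math import lgamma, log, exp

def poisson_sf(y, lam):
    """P(Y >= y) for Y ~ Poisson(lam), by summing the pmf up to y - 1."""
    cdf = sum(exp(k * log(lam) - lam - lgamma(k + 1)) for k in range(y))
    return 1.0 - cdf

# Hypothetical hourly counts over two weeks; the baseline rate is higher
# during working hours, mimicking a daily periodicity.
rng = np.random.default_rng(3)
hours = np.arange(14 * 24) % 24
base = 5 + 10 * (hours >= 8) * (hours <= 18)    # busier working hours
counts = rng.poisson(base)
counts[100] += 40                               # inject a bursty "event"

# Periodic baseline: one empirical rate per hour of day.
rate_by_hour = np.array([counts[hours == h].mean() for h in range(24)])
pvals = np.array([poisson_sf(c, rate_by_hour[h]) for c, h in zip(counts, hours)])
events = np.flatnonzero(pvals < 1e-4)           # flag very unlikely counts
```

The injected burst at index 100 is far in the tail of its hour-of-day baseline, so it is flagged while ordinary periodic variation is not.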
The paper considers the problem of multiple testing under dependence in a compound decision theoretic framework. The observed data are assumed to be generated from an underlying two-state hidden Markov model. We propose oracle and asymptotically optimal data-driven procedures that aim to minimize the false non-discovery rate (FNR) subject to a constraint on the false discovery rate (FDR). It is shown that the performance of a multiple-testing procedure can be substantially improved by adaptively exploiting the dependence structure among hypotheses, and hence conventional FDR procedures that ignore this structural information are inefficient. Both theoretical properties and numerical performances of the procedures proposed are investigated. It is shown that the procedures proposed control FDR at the desired level, enjoy certain optimality properties and are especially powerful in identifying clustered non-null cases. The new procedure is applied to an influenza-like illness surveillance study for detecting the timing of epidemic periods. Copyright (c) 2009 Royal Statistical Society.
The specification of Smooth Transition Regression models consists of a sequence of tests, which are typically based on the assumption of i.i.d. errors. In this paper we examine the impact of conditional heteroskedasticity and investigate the performance of several heteroskedasticity-robust versions. Simulation evidence indicates that conventional tests can frequently result in finding spurious nonlinearity. Conversely, when the true process is nonlinear in mean, the tests appear to have low size-adjusted power and can lead to the selection of misspecified models. The above deficiencies also hold for tests based on Heteroskedasticity Consistent Covariance Matrix Estimators but not for the Fixed Design Wild Bootstrap. We highlight the importance of robust inference through empirical applications.
Bayesian and classical analysis of Poisson regression
  • G M El-Sayyad
El-Sayyad GM (1973) Bayesian and classical analysis of Poisson regression. Journal of the Royal Statistical Society, Series B: Methodological 35(3): 445-451
Ohio Supercomputer Center
  • Ohio Supercomputer Center
Ohio Supercomputer Center (1987) Ohio Supercomputer Center. Columbus, OH: Ohio Supercomputer Center. ark:/19495/f5s1ph73
HMMtesting 2.0: Bayesian hidden Markov models for dependent multiple hypotheses testing
  • X Wang
Wang X (2020) HMMtesting 2.0: Bayesian hidden Markov models for dependent multiple hypotheses testing. 10.13140/RG.2.2.33172.53129
HMMtesting 2.1: Bayesian hidden Markov models for dependent multiple hypotheses testing
  • X Wang
Wang X (2021) HMMtesting 2.1: Bayesian hidden Markov models for dependent multiple hypotheses testing. 10.13140/RG.2.2.33172.53129/1