We present a novel nonparametric method for bioassay and benchmark analysis in risk assessment, which averages isotonic MLEs based on disjoint subgroups of dosages. The asymptotic theory for the methodology is derived, showing that the mean integrated squared errors (MISEs) of the estimates of both the dose-response curve F and its inverse F^{-1} achieve the optimal rate O(N^{-4/5}). Also, we compute the asymptotic distribution of the estimate ζ̃_p of the effective dosage ζ_p = F^{-1}(p), which is shown to have an optimally small asymptotic variance.
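A minimal sketch of the averaging idea, assuming scikit-learn's IsotonicRegression as the isotonic fit and an odd/even split of the dosage levels; the subgrouping scheme, sample sizes, and true curve are illustrative assumptions, not the paper's:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
doses = np.linspace(0.1, 1.0, 40)               # dosage levels
resp = rng.binomial(25, doses**2) / 25          # observed response rates, F(d) = d^2

# fit an isotonic MLE on each disjoint subgroup of dosages, then average
grid = np.linspace(0.1, 1.0, 200)
fits = []
for idx in (slice(0, None, 2), slice(1, None, 2)):   # two disjoint subgroups
    iso = IsotonicRegression(out_of_bounds="clip").fit(doses[idx], resp[idx])
    fits.append(iso.predict(grid))
F_hat = np.mean(fits, axis=0)                   # averaged estimate of F
```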
We demonstrate that clinical trials using response-adaptive randomized treatment assignment rules are subject to substantial bias if there are time trends in unknown prognostic factors and standard methods of analysis are used. We develop a general class of randomization tests based on generating the null distribution of a general test statistic by repeating the adaptive randomized treatment assignment rule holding fixed the sequence of outcome values and covariate vectors actually observed in the trial. We develop broad conditions on the adaptive randomization method and the stochastic mechanism by which outcomes and covariate vectors are sampled that ensure that the type I error is controlled at the level of the randomization test. These conditions ensure that the use of the randomization test protects the type I error against time trends that are independent of the treatment assignments. Under some conditions in which the prognosis of future patients is determined by knowledge of the current randomization weights, the type I error is not strictly protected. We show that response-adaptive randomization can result in substantial reduction in statistical power when the type I error is preserved. Our results also ensure that type I error is controlled at the level of the randomization test for adaptive stratification designs used for balancing covariates.
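A toy illustration of the randomization test, with an assumed urn-style response-adaptive rule standing in for a trial's actual rule; the test statistic, the rule, and all parameters are illustrative:

```python
import numpy as np

def adaptive_assign(outcomes, rng):
    # Re-run a response-adaptive rule over the fixed outcome sequence:
    # the probability of arm 1 grows with arm 1's success rate so far.
    arms = np.empty(len(outcomes), dtype=int)
    succ, tot = np.ones(2), np.ones(2)          # add-one smoothing
    for i, y in enumerate(outcomes):
        r0, r1 = succ / tot
        a = rng.binomial(1, r1 / (r0 + r1))
        arms[i] = a
        succ[a] += y
        tot[a] += 1
    return arms

def stat(outcomes, arms):
    if arms.min() == arms.max():                # degenerate draw: one arm empty
        return 0.0
    return outcomes[arms == 1].mean() - outcomes[arms == 0].mean()

rng = np.random.default_rng(1)
y_obs = rng.binomial(1, 0.5, size=100).astype(float)    # fixed outcome sequence
t_obs = stat(y_obs, adaptive_assign(y_obs, rng))        # observed statistic

# null distribution: repeat the assignment rule, holding outcomes fixed
null = np.array([stat(y_obs, adaptive_assign(y_obs, np.random.default_rng(s)))
                 for s in range(2000)])
p_value = np.mean(np.abs(null) >= abs(t_obs))
```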
Consider the correlation between two random variables (X, Y), neither of which is directly observed. One only observes X̃ = φ_1(U)X + φ_2(U) and Ỹ = ψ_1(U)Y + ψ_2(U), where all four functions {φ_l(·), ψ_l(·), l = 1, 2} are unknown/unspecified smooth functions of an observable covariate U. We consider consistent estimation of the correlation between the unobserved variables X and Y, adjusted for the above general dual additive and multiplicative effects of U, based on the observed data (X̃, Ỹ, U).
Assessing agreement is often of interest in biomedical sciences to evaluate the similarity of measurements produced by different raters or methods on the same subjects. We investigate the agreement structure for a class of frailty models that are commonly used for analyzing correlated survival outcomes. Conditional on the shared frailty, bivariate survival times are assumed to be independent with Weibull baseline hazard distribution. We present the analytic expressions for the concordance correlation coefficient (CCC) for several commonly used frailty distributions. Furthermore, we develop a time-dependent CCC for measuring agreement between survival times among subjects who survive beyond a specified time point. We characterize the temporal pattern in the time-dependent CCC for various frailty distributions. Our results provide a better understanding of the agreement structure implied by different frailty models.
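A Monte Carlo sketch of the (unconditional) CCC under a shared gamma frailty with Weibull conditional hazards; the analytic expressions themselves are not reproduced here, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, shape, lam, gam = 200_000, 2.0, 1.0, 1.5

# shared gamma frailty W with mean 1; given W, the paired times are
# independent Weibull with conditional survival S(t|W) = exp(-W (lam*t)^gam)
W = rng.gamma(shape, 1 / shape, size=n)
T1 = (-np.log(rng.uniform(size=n)) / W) ** (1 / gam) / lam
T2 = (-np.log(rng.uniform(size=n)) / W) ** (1 / gam) / lam

# concordance correlation coefficient (Lin, 1989)
m1, m2 = T1.mean(), T2.mean()
cov = np.mean((T1 - m1) * (T2 - m2))
ccc = 2 * cov / (T1.var() + T2.var() + (m1 - m2) ** 2)
print(ccc)
```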
We define conditions under which sums of dependent spatial data will be approximately normally distributed. A theorem on the asymptotic distribution of a sum of dependent random variables defined on a 3-dimensional lattice is presented, together with examples.
In this paper we address the problem of learning discrete Bayesian networks from noisy data. We consider a graphical model based on a mixture of Gaussian distributions with a categorical mixing structure coming from a discrete Bayesian network. Network learning is formulated as a maximum likelihood estimation problem and performed by employing an EM algorithm. The proposed approach is relevant to a variety of statistical problems for which Bayesian network models are suitable, from simple regression analysis to learning gene/protein regulatory networks from microarray data.
We focus on Bayesian variable selection in regression models. One challenge is to search the huge model space adequately, while identifying high posterior probability regions. In the past decades, the main focus has been on the use of Markov chain Monte Carlo (MCMC) algorithms for these purposes. In this article, we propose a new computational approach based on sequential Monte Carlo (SMC), which we refer to as particle stochastic search (PSS). We illustrate PSS through applications to linear regression and probit models.
Let {S_n : n ≥ 0} (with S_0 = 0) denote the successive sums of independent non-negative random variates, of possibly differing distributions. Define (1) the number N_b = inf{n ≥ 0 : S_n > b} of sums in the interval [0, b], and (2) the overshoot R_b = S_{N_b} − b. This paper bounds the tail P{R_b > c} and the moments E[R_b^k].
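A quick simulation of N_b and R_b; the Exp(1) step distribution is chosen purely for illustration (for this choice the overshoot is itself Exp(1) by memorylessness, a handy sanity check):

```python
import numpy as np

rng = np.random.default_rng(3)

def overshoot(b, rng):
    # walk S_n = X_1 + ... + X_n with nonnegative steps until S_n > b,
    # then return the overshoot R_b = S_{N_b} - b
    s = 0.0
    while s <= b:
        s += rng.exponential(1.0)
    return s - b

b, c = 50.0, 2.0
R = np.array([overshoot(b, rng) for _ in range(20_000)])
print("P{R_b > c} ~", np.mean(R > c))     # tail probability (about exp(-2) here)
print("E[R_b^2]  ~", np.mean(R**2))       # second moment
```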
We present an asymptotic exponential bound for the deviation of the survival function estimator of the Cox model. We show that the bound holds even when the proportional hazards assumption does not hold.
Prediction models that use gene expression levels are now being proposed for personalized treatment of cancer, but building accurate models that are easy to interpret remains a challenge. In this paper, we describe an integrative clinical-genomic approach that combines both genomic pathway and clinical information. First, we summarize information from genes in each pathway using Supervised Principal Components (SPCA) to obtain pathway-based genomic predictors. Next, we build a prediction model based on clinical variables and pathway-based genomic predictors using Random Survival Forests (RSF). Our rationale for this two-stage procedure is that the underlying disease process may be influenced by environmental exposure (measured by clinical variables) and perturbations in different pathways (measured by pathway-based genomic variables), as well as their interactions. Using two cancer microarray datasets, we show that the pathway-based clinical-genomic model outperforms gene-based clinical-genomic models, with improved prediction accuracy and interpretability.
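A rough sketch of the two-stage pipeline with stand-ins: correlation screening plus an ordinary first principal component in place of SPCA, and a scikit-learn regression forest on a continuous outcome in place of Random Survival Forests (scikit-learn has no survival forest); the pathway definitions, dimensions, and all names are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(10)
n, n_genes = 200, 120
X_genes = rng.normal(size=(n, n_genes))
X_clin = rng.normal(size=(n, 3))                       # clinical variables
y = X_genes[:, :5].sum(axis=1) + X_clin[:, 0] + rng.normal(size=n)

pathways = {"pw1": range(0, 40), "pw2": range(40, 80), "pw3": range(80, 120)}

# stage 1: per pathway, screen genes on the outcome (the "supervised" step),
# then summarize the retained genes by their first principal component
scores = []
for genes in pathways.values():
    Xg = X_genes[:, list(genes)]
    corr = np.abs([np.corrcoef(Xg[:, j], y)[0, 1] for j in range(Xg.shape[1])])
    keep = Xg[:, corr > np.quantile(corr, 0.75)]
    scores.append(PCA(n_components=1).fit_transform(keep).ravel())

# stage 2: forest on clinical variables plus pathway-based genomic predictors
Z = np.hstack([X_clin, np.column_stack(scores)])
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(Z, y)
```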
In this note, we address the problem of surrogacy using a causal modelling framework that differs substantially from the potential outcomes model that pervades the biostatistical literature. The framework comes from econometrics and conceptualizes direct effects of the surrogate endpoint on the true endpoint. While this framework can incorporate the so-called semi-competing risks data structure, we also derive a fundamental non-identifiability result. Relationships to existing causal modelling frameworks are also discussed.
We consider the problem of estimation in semiparametric varying coefficient models where the covariate modifying the varying coefficients is functional and is modeled nonparametrically. We develop a kernel-based estimator of the nonparametric component and a profiling estimator of the parametric component of the model and derive their asymptotic properties. Specifically, we show the consistency of the nonparametric functional estimates and derive the asymptotic expansion of the estimates of the parametric component. We illustrate the performance of our methodology using a simulation study and a real data application.
Quantitative trait loci mapping is focused on identifying the positions and effects of genes underlying an observed trait. We present a collaborative targeted maximum likelihood estimator in a semiparametric model, using a newly proposed two-part super learning algorithm, to find quantitative trait loci genes in Listeria data. Results are compared to the parametric composite interval mapping approach.
We propose a permutation test for non-inferiority of the linear discriminant function to the optimal combination of multiple tests, based on the Mann-Whitney statistic estimate of the area under the receiver operating characteristic curve. Monte Carlo simulations show its good performance.
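A simplified sketch of a paired permutation test for an AUC difference between two score vectors on the same subjects (framed as a two-sided equality test rather than non-inferiority, with simulated rather than fitted scores):

```python
import numpy as np

rng = np.random.default_rng(4)

def auc_mw(scores, labels):
    # Mann-Whitney estimate of the area under the ROC curve
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

n = 300
labels = rng.binomial(1, 0.5, n)
s_lda = labels + rng.normal(0, 1.0, n)     # linear discriminant scores
s_opt = labels + rng.normal(0, 0.9, n)     # competing combination scores

t_obs = auc_mw(s_opt, labels) - auc_mw(s_lda, labels)

# paired permutation: randomly swap the two scores within each subject
null = []
for _ in range(2000):
    swap = rng.binomial(1, 0.5, n).astype(bool)
    a, b = s_lda.copy(), s_opt.copy()
    a[swap], b[swap] = s_opt[swap], s_lda[swap]
    null.append(auc_mw(b, labels) - auc_mw(a, labels))
p_value = np.mean(np.abs(null) >= abs(t_obs))
```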
Multiple alternative diagnostic tests for one disease are commonly available to clinicians. It is important to use all the good diagnostic predictors simultaneously to establish a new predictor with higher statistical utility. Under the generalized linear model for binary outcomes, the linear combination of multiple predictors in the link function is proven to be optimal in the sense that the area under the receiver operating characteristic (ROC) curve of this combination is the largest among all possible linear combinations. The result is applied to analysis of data from the Study of Osteoporotic Fractures (SOF), with comparison to Su and Liu's approach.
A conditionally specified joint model is convenient to use in fields such as spatial data modeling, Gibbs sampling, and missing data imputation. One potential problem with such an approach is that the conditionally specified models may be incompatible, which can lead to serious problems in applications. We propose an odds ratio representation of a joint density to study the issue and derive conditions under which conditionally specified distributions are compatible and yield a joint distribution. Our conditions are the simplest to verify compared with those proposed in the literature. The proposal also explicitly constructs joint densities that are fully compatible with the conditionally specified densities when the conditional densities are compatible, and partially compatible with the conditional densities when they are incompatible. The construction result is then applied to checking the compatibility of the conditionally specified models. Ways to modify the conditionally specified models based on the construction of the joint models are also discussed when the conditionally specified models are incompatible.
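In the 2x2 case the odds ratio representation gives a particularly direct check: two strictly positive binary conditionals are compatible exactly when the cross-product (odds) ratios they imply for the joint coincide. A small illustration with assumed tables:

```python
import numpy as np

def cross_ratio(c):
    # cross-product ratio of a 2x2 conditional table c[a, b] = P(A=a | B=b);
    # for a compatible pair this equals the odds ratio of the joint density
    return (c[1, 1] * c[0, 0]) / (c[0, 1] * c[1, 0])

px_given_y = np.array([[0.8, 0.4],      # P(X=x | Y=y), columns indexed by y
                       [0.2, 0.6]])
py_given_x = np.array([[0.7, 0.3],      # P(Y=y | X=x), columns indexed by x
                       [0.3, 0.7]])

or_x, or_y = cross_ratio(px_given_y), cross_ratio(py_given_x)
print("compatible" if np.isclose(or_x, or_y) else "incompatible", or_x, or_y)
```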
We prove uniform consistency of Random Survival Forests (RSF), a newly introduced forest ensemble learner for analysis of right-censored survival data. Consistency is proven under general splitting rules, bootstrapping, and random selection of variables; that is, under true implementation of the methodology. Under this setting we show that the forest ensemble survival function converges uniformly to the true population survival function. To prove this result we make one key assumption regarding the feature space: we assume that all variables are factors. Doing so ensures that the feature space has finite cardinality and enables us to exploit counting process theory and the uniform consistency of the Kaplan-Meier survival function.
We consider the problem of testing for a constant nonparametric effect in a general semi-parametric regression model when there is the potential for interaction between the parametrically and nonparametrically modeled variables. The work was originally motivated by a unique testing problem in genetic epidemiology (Chatterjee et al., 2006) that involved a typical generalized linear model but with an additional term reminiscent of the Tukey one-degree-of-freedom formulation, and their interest was in testing for main effects of the genetic variables, while gaining statistical power by allowing for a possible interaction between genes and the environment. Later work (Maity et al., 2009) involved the possibility of modeling the environmental variable nonparametrically, but they focused on whether there was a parametric main effect for the genetic variables. In this paper, we consider the complementary problem, where the interest is in testing for the main effect of the nonparametrically modeled environmental variable. We derive a generalized likelihood ratio test for this hypothesis, show how to implement it, and provide evidence that our method can improve statistical power when compared to standard partially linear models with main effects only. We use the method for the primary purpose of analyzing data from a case-control study of colorectal adenoma.
Continuous time random walks impose random waiting times between particle jumps. This paper computes the fractal dimensions of their process limits, which represent particle traces in anomalous diffusion.
Crossing hazard functions have extensive applications in modeling survival data. However, existing studies in the literature mainly focus on comparing crossed hazard functions and estimating the time at which the hazard functions cross, and there is little theoretical work on conditions under which hazard functions from a model will have a crossing. In this paper, we investigate crossing status of hazard functions from the proportional hazards (PH) model, the accelerated hazard (AH) model, and the accelerated failure time (AFT) model. We provide and prove conditions under which the hazard functions from the AH and the AFT models have no crossings or a single crossing. A few examples are also provided to demonstrate how the conditions can be used to determine crossing status of hazard functions from the three models.
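A numeric illustration of locating a single crossing, assuming a unimodal log-logistic baseline hazard and an accelerated hazard (AH) alternative h1(t) = h0(ct); the parameter values are arbitrary:

```python
import numpy as np
from scipy.optimize import brentq

def h0(t, alpha=1.0, beta=2.0):
    # log-logistic hazard: increases then decreases, so an AH shift crosses it
    return (beta / alpha) * (t / alpha) ** (beta - 1) / (1 + (t / alpha) ** beta)

c = 2.0                               # acceleration factor in the AH model
g = lambda t: h0(t) - h0(c * t)       # sign change marks a hazard crossing
t_cross = brentq(g, 0.1, 5.0)
print(t_cross)                        # = 1/sqrt(2) for these parameters
```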
Minimax optimal designs can be useful for estimating response surfaces, but they are notoriously difficult to study analytically. We provide formulae for three types of minimax optimal designs over a user-specified region. We focus on polynomial models with various types of heteroscedastic errors, but the design strategy is applicable to other types of linear models and optimality criteria. Relationships among the three types of minimax optimal designs are discussed.
Conditional independence assumptions are very important in causal inference modelling as well as in dimension reduction methodologies. These two statistical literatures are strikingly different, and we study links between them in this article. The concept of covariate sufficiency plays an important role, and we provide theoretical justification for when dimension reduction and partial least squares methods will allow valid causal inference to be performed. The methods are illustrated with application to a medical study and to simulated data.
We develop an extension of some standard likelihood-based procedures to heteroscedastic nonlinear regression models under scale mixtures of skew-normal (SMSN) distributions. We derive a simple EM-type algorithm for iteratively computing maximum likelihood (ML) estimates, and the observed information matrix is obtained analytically. Simulation studies demonstrate the robustness of this flexible class against outlying and influential observations, as well as the good asymptotic properties of the proposed EM-type ML estimates. Finally, the methodology is illustrated using ultrasonic calibration data.
We study the class of general step-down multiple testing procedures, which contains the usually considered procedures determined by a nondecreasing sequence of thresholds (we call them threshold step-down, or TSD, procedures) as a parametric subclass. We show that all procedures in this class satisfying the natural condition of monotonicity and controlling the family-wise error rate (FWER) at a prescribed level are dominated by one of them: the classical Holm procedure. This generalizes an earlier result pertaining to the subclass of TSD procedures (Lehmann and Romano, Testing Statistical Hypotheses, 3rd ed., 2005). We also derive a relation between the levels at which a monotone step-down procedure controls the FWER and the generalized FWER (the probability of k or more false rejections).
We consider repeated measures interval-observed data with informative dropouts. We model the repeated outcomes via an unobserved random intercept, and the probability of dropout during the study period is assumed to be linearly related to the random intercept on the complementary log-log scale. Assuming the random effect follows the power variance function (PVF) family suggested by Hougaard (2000), we derive the marginal likelihood in closed form. We evaluate the performance of maximum likelihood estimation via simulation studies and apply the proposed method to a real data set.
We address the problem of selecting the best linear unbiased predictor (BLUP) of the latent value (e.g., serum glucose fasting level) of sample subjects with heteroskedastic measurement errors. Using a simple example, we compare the usual mixed model BLUP to a similar predictor based on a mixed model framed in a finite population (FPMM) setup with two sources of variability, the first of which corresponds to simple random sampling and the second, to heteroskedastic measurement errors. Under this last approach, we show that when measurement errors are subject-specific, the BLUP shrinkage constants are based on a pooled measurement error variance, as opposed to the individual ones generally considered for the usual mixed model BLUP. In contrast, when the heteroskedastic measurement errors are measurement condition-specific, the FPMM BLUP involves different shrinkage constants. We also show that in this setup, when measurement errors are subject-specific, the usual mixed model predictor is biased but has a smaller mean squared error than the FPMM BLUP, which points to some difficulties in the interpretation of such predictors.
We investigate the existence of the posterior distribution for one-way random effect probit models when the uniform prior is applied to the overall mean and a class of noninformative priors is applied to the variance parameter. Sufficient conditions to ensure the propriety of the posterior are given for the cases with replicates at some factor levels. It is shown that the posterior distribution is never proper if there is only one observation at each factor level. For this case, however, a class of proper priors for the variance parameter can provide the necessary and sufficient conditions for the propriety of the posterior.
We describe a novel approach to nonparametric point and interval estimation of a treatment effect in the presence of many continuous confounders. We show the problem can be reduced to that of point and interval estimation of the expected conditional covariance between treatment and response given the confounders. Our estimators are higher order U-statistics. The approach applies equally to the regular case where the expected conditional covariance is root-n estimable and to the irregular case where slower non-parametric rates prevail.
In the presence of interference, the exposure of one individual may affect the outcomes of others. We provide new effect partitioning results under interference that express the overall effect as a sum of (i) the indirect (or spillover) effect and (ii) a contrast between two direct effects.
We study the marginal longitudinal nonparametric regression problem and some of its semiparametric extensions. We point out that, while several elaborate methods for efficient estimation have been proposed, a relatively simple and straightforward one, based on penalized splines, has not. After describing our approach, we explain how Gibbs sampling and the BUGS software can be used to achieve quick and effective implementation. Illustrations are provided for nonparametric regression and additive models.
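A minimal frequentist counterpart of the penalized-spline fit in plain NumPy (a truncated-line basis with a ridge penalty on the knot coefficients), in place of the Gibbs/BUGS implementation described above; the knot count and penalty are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, lam = 200, 20, 1.0
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# truncated-line basis: intercept, slope, and K knot terms (x - kappa_k)_+
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
X = np.column_stack([np.ones(n), x] + [np.maximum(x - k, 0) for k in knots])

# penalized least squares: ridge penalty on the knot coefficients only
D = np.diag([0.0, 0.0] + [1.0] * K)
beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
f_hat = X @ beta                      # fitted nonparametric regression curve
```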
It is increasingly common to be faced with longitudinal or multi-level data sets that have large numbers of predictors and/or a large sample size. Current methods of fitting and inference for mixed effects models tend to perform poorly in such settings. When there are many variables, it is appealing to allow uncertainty in subset selection and to obtain a sparse characterization of the data. Bayesian methods are available to address these goals using Markov chain Monte Carlo (MCMC), but MCMC is very computationally expensive and can be infeasible in large p and/or large n problems. As a fast approximate Bayes solution, we recommend a novel approximation to the posterior relying on variational methods. Variational methods are used to approximate the posterior of the parameters in a decomposition of the variance components, with priors chosen to obtain a sparse solution that allows selection of random effects. The method is evaluated through a simulation study, and applied to an epidemiological application.
Rosenblatt's transformation has been used extensively for evaluation of model goodness-of-fit, but it only applies to models whose joint distribution is continuous. In this paper we generalize the transformation so that it applies to arbitrary probability models. The transformation is simple, but has a wide range of possible applications, providing a tool for exploratory data analysis and formal goodness-of-fit testing for a very general class of probability models. The method is demonstrated with specific examples.
Titterington (Journal of the Royal Statistical Society, Series B, 46, 257-267, 1984) proposed a recursive parameter estimation algorithm for finite mixture models. However, due to the well-known problem of singularities and of multiple maxima, minima, and saddle points on the likelihood surface, convergence analysis has seldom been carried out. In this paper, under mild conditions, we show the global convergence of Titterington's recursive estimator and its MAP variant for mixture models from the full regular exponential family.
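A sketch of the recursion in the simplest setting, a two-component Gaussian mixture with known components and unknown mixing weight p, where Titterington's update (the score scaled by the inverse complete-data information) reduces to a plain stochastic-approximation step:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, p_true = 50_000, 0.3
comp = rng.uniform(size=n) < p_true
x = np.where(comp, rng.normal(2.0, 1.0, n), rng.normal(-2.0, 1.0, n))

f1, f0 = norm.pdf(x, 2.0, 1.0), norm.pdf(x, -2.0, 1.0)   # known components

# p_{n+1} = p_n + (w_{n+1} - p_n)/(n+1), where w is the posterior
# responsibility of component 1 for the incoming observation
p = 0.5
for i in range(n):
    w = p * f1[i] / (p * f1[i] + (1 - p) * f0[i])
    p += (w - p) / (i + 1)
print(p)        # converges to p_true = 0.3
```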
A squared multiple correlation ratio of a random vector y on another random vector x is defined by η^2(y, x) = V(E(y|x))/V(y). The advantages of the present multiple correlation ratio over the one defined by Sampson (1984) are pointed out.
Starr (1979) proved that in the species problem, Robbins' estimate in a search of size n + m has negative correlation with the quantity of interest U_n, m ≥ 1; here U_n, the probability of the unobserved species in a search of size n, is a random function of a parameter. Nevertheless, the question of whether or not one could do better was left open. An estimate V_{n,m} is provided that has high positive correlation with U_n. The form of the proposed estimate V_{n,m} suggests a natural class of estimates that are highly positively correlated with the random function of the parameter that is of interest in any given situation where additional sampling is allowed. Finally, a class of functions of the probabilities of the unobserved species that cannot be estimated in an unbiased way without additional searches is also offered.
Recently, a new multidimensional stochastic order named the correlation order was proposed to examine how dependence among individual risks affects the riskiness of portfolios. This paper aims to discuss the relationship between the correlation order and the supermodular order in general. Moreover, it also compares the closeness of the distributions of order statistics of random vectors ordered by the correlation order. Such a comparison leads to a result more general than that in Hu and Hu (1997, Statist. Probab. Lett. 37, 1-6).
In this paper, we study a regression model in which the explanatory variables are sampling points of a continuous-time process. We propose an estimator of the regression function by means of a functional principal component analysis analogous to the one introduced by Bosq (1991, NATO ASI Series, pp. 509-529) in the case of Hilbertian AR processes. Both convergence in probability and almost sure convergence of this estimator are established.
We construct some self-similar processes with continuous paths by using the local time on hyperplanes of a d-dimensional symmetric α-stable Lévy process, d ⩾ 2 and 1 < α ⩽ 2, and its stochastic integral with respect to Gaussian white noise. Our construction gives a certain higher-dimensional extension of the previous work of Kesten and Spitzer (1979).
We provide a simple and explicit construction of a family of stochastic exponentials with expectation k ∈ (0, 1). Our family of stochastic exponentials can be constructed to be either strictly positive or merely non-negative.
Let τ be a regular metric, as defined below, for the space D = D[0,1]. Even when (D, τ) is not a separable and complete metric space, we show (i) that the usual conditions on a sequence of probability measures in (D, τ) ensure its weak convergence, and (ii) that Prohorov's theorem in (D, τ) can be derived as a consequence of our results.
A random graph G_n(x) is constructed on independent random points U_1, ..., U_n distributed uniformly on [0,1]^d, d ⩾ 1, in which two distinct such points are joined by an edge if the l_∞-distance between them is at most some prescribed value 0 < x < 1. A strong law is established for the connectivity distance c_n, the smallest x for which G_n(x) is connected. For d ⩾ 2, the random graph G_n(x) behaves like a d-dimensional version of the random graphs of Erdős and Rényi, despite the fact that its edges are not independent: c_n/d_n → 1 a.s. as n → ∞, where d_n is the largest nearest-neighbor link, the smallest x for which G_n(x) has no isolated vertices.
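A small simulation consistent with the d ⩾ 2 result: under the l∞ metric, c_n is the longest edge of a minimum spanning tree (the bottleneck connectivity threshold) and d_n is the largest nearest-neighbor link; n and d are illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(7)
n, d = 2000, 2
pts = rng.uniform(size=(n, d))
D = squareform(pdist(pts, metric="chebyshev"))   # l-infinity distances

np.fill_diagonal(D, np.inf)
d_n = D.min(axis=1).max()                        # largest nearest-neighbor link
np.fill_diagonal(D, 0.0)

c_n = minimum_spanning_tree(D).data.max()        # connectivity distance
print(c_n / d_n)                                 # tends to 1 a.s. for d >= 2
```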
This paper provides a strong law of large numbers for independent and nonidentically distributed random variables taking their values in the space D[0,1], where D[0,1] is equipped with the uniform topology.
The weak convergence of the empirical process of strong mixing or associated random variables is studied in L^p(0,1). We find minimal rates of convergence to zero of the mixing coefficients or the covariances, in either case supposing stationarity of the underlying variables. The rates obtained improve, for p not too large, the corresponding results in the classical D(0,1) framework.
An ARIMA(p,1,0) signal contaminated by MA(q) noise is a restricted ARIMA(p,1,p + q + 1) process. For this model restricted by nonlinear constraints, it is shown that the maximum likelihood estimator of the unit root is strongly consistent and its limiting distribution is the same as that of the least squares estimator of the unit root in an AR(1) process tabulated by Dickey and Fuller.
Jin et al. (2001) proposed a clever resampling method useful for calculating a variance estimate of the solution to an estimating equation. The method multiplies each independent subject's contribution to the estimating equation by a randomly sampled random variable with mean 1 and variance 1. They showed that this resampling technique gives consistent variance estimates under mild conditions. Rubin (1981, Ann. Statist. 9, 130-134) proposed the Bayesian bootstrap as a modification of the usual bootstrap. In this note, we show that the Bayesian bootstrap is a special case of Jin et al.'s resampling approach.
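A toy version of the multiplier resampling for the simplest estimating equation, the sample mean, using Exp(1) multipliers (which have mean 1 and variance 1); the data and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(2.0, 1.5, size=500)
B = 4000

# estimating equation sum_i w_i (x_i - mu) = 0 gives mu* = sum(w x)/sum(w)
mu_star = np.empty(B)
for b in range(B):
    w = rng.exponential(1.0, size=x.size)
    mu_star[b] = np.sum(w * x) / np.sum(w)

print(mu_star.var(ddof=1))        # multiplier-resampling variance estimate
print(x.var(ddof=1) / x.size)     # usual analytic estimate, for comparison
```

Since mu* is invariant to rescaling the weights, dividing the Exp(1) multipliers by their sum changes nothing, and the normalized weights are exactly the flat Dirichlet weights of Rubin's Bayesian bootstrap, which illustrates the special-case relationship at issue.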
We give sharp upper and lower bounds for the median of the Γ(n+1,1) distribution, thus providing an immediate proof of two conjectures by Chen and Rubin (Statist. Probab. Lett. 4 (1986) 281) referring to the median of the Poisson distribution. Our approach uses a differential calculus for not necessarily smooth functions of the standard Poisson process, together with the central limit theorem.
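A quick numeric check, assuming the sharp bounds in question are the limiting values suggested by the Chen and Rubin conjectures, namely 2/3 and log 2 for the median of Γ(n+1,1) minus n:

```python
import numpy as np
from scipy.stats import gamma

# median(Gamma(n+1, 1)) - n should decrease from log 2 toward 2/3
for n in [0, 1, 5, 20, 100]:
    print(n, gamma.ppf(0.5, n + 1) - n)   # shape n+1, scale 1
print("bounds:", 2 / 3, np.log(2))
```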
This paper considers some structural properties of the Box-Cox transformed threshold GARCH(1,1) process. First, a necessary and sufficient condition for the strict stationarity of this threshold GARCH process is given. Second, some simple conditions for the existence of the moments of the threshold GARCH process are derived. Finally, we describe the tail of the marginal distribution of the threshold GARCH process, giving a precise meaning to the statement “light-tailed input causes heavy-tailed output”.
In this paper, we look at a simple relationship between a random vector having a continuous distribution on the unit m-sphere and m random variables, m-1 of which have a distribution on the interval (-1,1), while the final random variable is a discrete one taking on the values -1 and 1. This relationship can be particularly useful when these m random variables are independently distributed. In this case, it can be used to construct distributions on the unit m-sphere having specific features as well as to generate random vectors having these distributions.
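One standard construction consistent with the relationship described (whether it matches this paper's map exactly is an assumption): the m−1 variables on (−1,1) fix the first m−1 coordinates recursively, and the sign variable fixes the last one:

```python
import numpy as np

rng = np.random.default_rng(9)

def sphere_point(t, s):
    # map t_1,...,t_{m-1} in (-1,1) and a sign s in {-1,1} to a unit vector:
    # y_k = t_k * prod_{j<k} sqrt(1 - t_j^2), y_m = s * prod_j sqrt(1 - t_j^2)
    t = np.asarray(t, dtype=float)
    tails = np.concatenate([[1.0], np.cumprod(np.sqrt(1 - t**2))])
    return np.concatenate([t * tails[:-1], [s * tails[-1]]])

# independent uniform t's and a fair sign give a point on the unit sphere
y = sphere_point(rng.uniform(-1, 1, size=3), rng.choice([-1, 1]))
print(y, np.sum(y**2))    # squared norm is exactly 1
```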