The Annals of Statistics

Published by Institute of Mathematical Statistics
Print ISSN: 0090-5364
A topological multiple testing scheme for one-dimensional domains is proposed where, rather than testing every spatial or temporal location for the presence of a signal, tests are performed only at the local maxima of the smoothed observed sequence. Assuming unimodal true peaks with finite support and Gaussian stationary ergodic noise, it is shown that the algorithm with Bonferroni or Benjamini-Hochberg correction provides asymptotic strong control of the familywise error rate and false discovery rate, and is power consistent, as the search space and the signal strength get large, where the search space may grow exponentially faster than the signal strength. Simulations show that error levels are maintained for nonasymptotic conditions, and that power is maximized when the smoothing kernel is close in shape and bandwidth to the signal peaks, akin to the matched filter theorem in signal processing. The methods are illustrated in an analysis of electrical recordings of neuronal cell activity.
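The testing scheme in this abstract can be sketched as follows: smooth the sequence, keep only its local maxima, attach a p-value to each peak height, and apply the Benjamini-Hochberg step-up rule. The abstract does not say how the p-values are computed, so this sketch substitutes a Monte Carlo null built from simulated unit-variance white noise; that substitution, the Gaussian kernel, and all function names are assumptions made for illustration only.

```python
import numpy as np

def smooth(y, bandwidth):
    """Convolve y with a Gaussian kernel scaled to unit L2 norm, so that
    smoothed unit-variance white noise keeps variance 1 away from the edges."""
    half = int(np.ceil(4 * bandwidth))
    grid = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (grid / bandwidth) ** 2)
    kernel /= np.sqrt(np.sum(kernel ** 2))
    return np.convolve(y, kernel, mode="same")

def local_maxima(z):
    """Indices of strict interior local maxima of z."""
    return np.flatnonzero((z[1:-1] > z[:-2]) & (z[1:-1] > z[2:])) + 1

def benjamini_hochberg(pvals, alpha):
    """Boolean rejection mask from the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.flatnonzero(below))  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

def detect_peaks(y, bandwidth, alpha, rng, null_length=200_000):
    """Test only the local maxima of the smoothed sequence.
    Assumption: the noise is white with unit variance, and the null law of
    peak heights is estimated by simulation rather than derived analytically."""
    z = smooth(y, bandwidth)
    peaks = local_maxima(z)
    null = smooth(rng.standard_normal(null_length), bandwidth)
    null_heights = np.sort(null[local_maxima(null)])
    # Number of simulated null peak heights at least as large as each observed one.
    n_ge = len(null_heights) - np.searchsorted(null_heights, z[peaks], side="left")
    pvals = (1 + n_ge) / (1 + len(null_heights))
    return peaks[benjamini_hochberg(pvals, alpha)]
```

Calling `detect_peaks(y, bandwidth=3.0, alpha=0.05, rng=np.random.default_rng(0))` returns the indices of the local maxima declared significant. Choosing the bandwidth close to the width of the true peaks mirrors the matched-filter remark in the abstract.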
We study semiparametric varying-coefficient partially linear models when some linear covariates are not observed, but ancillary variables are available. Semiparametric profile least-squares-based estimation procedures are developed for parametric and nonparametric components after we calibrate the error-prone covariates. Asymptotic properties of the proposed estimators are established. We also propose the profile least-squares-based ratio test and Wald test to identify significant parametric and nonparametric components. To improve accuracy of the proposed tests for small or moderate sample sizes, a wild bootstrap version is also proposed to calculate the critical values. Intensive simulation experiments are conducted to illustrate the proposed approaches.
We consider the problem of model selection and estimation in situations where the number of parameters diverges with the sample size. When the dimension is high, an ideal method should have the oracle property (Fan and Li, 2001; Fan and Peng, 2004) which ensures the optimal large sample performance. Furthermore, the high-dimensionality often induces the collinearity problem which should be properly handled by the ideal method. Many existing variable selection methods fail to achieve both goals simultaneously. In this paper, we propose the adaptive Elastic-Net that combines the strengths of the quadratic regularization and the adaptively weighted lasso shrinkage. Under weak regularity conditions, we establish the oracle property of the adaptive Elastic-Net. We show by simulations that the adaptive Elastic-Net deals with the collinearity problem better than the other oracle-like methods, thus enjoying much improved finite sample performance.
We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is "small" relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method.
The patterns of the nonparametric functions of BMI and Age (solid lines) with ± s.e. (shaded areas) using the R function, gam, for the Pima study. 
We study generalized additive partial linear models, proposing the use of polynomial spline smoothing for estimation of nonparametric functions, and deriving quasi-likelihood based estimators for the linear parameters. We establish asymptotic normality for the estimators of the parametric components. The procedure avoids solving large systems of equations as in kernel-based procedures and thus results in gains in computational simplicity. We further develop a class of variable selection procedures for the linear parameters by employing a nonconcave penalized quasi-likelihood, which is shown to have an asymptotic oracle property. Monte Carlo simulations and an empirical example are presented for illustration.
Variable selection is fundamental to high-dimensional statistical modeling. Many variable selection techniques may be implemented by maximum penalized likelihood using various penalty functions. Optimizing the penalized likelihood function is often challenging because it may be nondifferentiable and/or nonconcave. This article proposes a new class of algorithms for finding a maximizer of the penalized likelihood for a broad class of penalty functions. These algorithms operate by perturbing the penalty function slightly to render it differentiable, then optimizing this differentiable function using a minorize-maximize (MM) algorithm. MM algorithms are useful extensions of the well-known class of EM algorithms, a fact that allows us to analyze the local and global convergence of the proposed algorithm using some of the techniques employed for EM algorithms. In particular, we prove that when our MM algorithms converge, they must converge to a desirable point; we also discuss conditions under which this convergence may be guaranteed. We exploit the Newton-Raphson-like aspect of these algorithms to propose a sandwich estimator for the standard errors of the estimators. Our method performs well in numerical tests.
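The perturb-then-MM idea described in this abstract can be illustrated on the simplest case. The sketch below assumes a least-squares loss and an L1 penalty, replaces λ|β| by the differentiable perturbation λ(|β| − ε log(1 + |β|/ε)), and minimizes the negative penalized criterion (majorize-minimize, the mirror image of the minorize-maximize formulation in the abstract). The loss, the penalty, the ridge starting value, and the names `mm_lasso` and `perturbed_l1_objective` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def perturbed_l1_objective(beta, X, y, lam, eps):
    """Least-squares loss plus the eps-perturbed (differentiable) L1 penalty."""
    n = len(y)
    resid = y - X @ beta
    a = np.abs(beta)
    penalty = lam * np.sum(a - eps * np.log1p(a / eps))
    return 0.5 * resid @ resid / n + penalty

def mm_lasso(X, y, lam, eps=1e-4, n_iter=200):
    """MM iterations: each step majorizes the perturbed penalty by a quadratic
    with weights lam / (eps + |beta_j|) and solves the resulting ridge-like
    linear system exactly."""
    n, p = X.shape
    gram = X.T @ X / n
    xty = X.T @ y / n
    beta = np.linalg.solve(gram + lam * np.eye(p), xty)  # ridge starting value
    history = [perturbed_l1_objective(beta, X, y, lam, eps)]
    for _ in range(n_iter):
        weights = lam / (eps + np.abs(beta))
        beta = np.linalg.solve(gram + np.diag(weights), xty)
        history.append(perturbed_l1_objective(beta, X, y, lam, eps))
    return beta, np.array(history)
```

Because each surrogate lies above the perturbed objective and touches it at the current iterate, the recorded `history` should never increase. Coefficients of irrelevant covariates shrink to values of order ε rather than to exact zeros, which is the price of making the penalty differentiable.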
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina (2004) show that the Fisher discriminant performs poorly due to diverging spectra and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as badly as random guessing. Thus, it is of paramount importance to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
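The two ingredients named in this abstract, screening by two-sample t-statistics followed by the independence rule on the retained features, can be sketched as below. The number of retained features `m` is passed in by hand here, whereas the abstract chooses it from an upper bound on the classification error; the function names and the use of a pooled variance are assumptions for illustration.

```python
import numpy as np

def fit_fair(X1, X2, m):
    """Keep the m features with the largest absolute two-sample t-statistics
    and store what the independence (diagonal) rule needs for them."""
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    v1, v2 = X1.var(axis=0, ddof=1), X2.var(axis=0, ddof=1)
    t = (mu1 - mu2) / np.sqrt(v1 / n1 + v2 / n2)
    keep = np.argsort(-np.abs(t))[:m]
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return {"keep": keep, "mu1": mu1[keep], "mu2": mu2[keep], "var": pooled[keep]}

def predict_fair(model, X):
    """Independence rule: a linear discriminant that ignores all
    correlations between features. Returns class labels 1 or 2."""
    Z = X[:, model["keep"]]
    midpoint = 0.5 * (model["mu1"] + model["mu2"])
    direction = (model["mu1"] - model["mu2"]) / model["var"]
    score = (Z - midpoint) @ direction
    return np.where(score > 0, 1, 2)
```

With `X1` and `X2` holding the training samples of the two classes (rows are subjects), `predict_fair(fit_fair(X1, X2, m=10), X_new)` labels new rows. Setting `m` equal to the full dimension recovers the plain independence rule that the abstract warns can degrade toward random guessing.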
The variance covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods of directly exploiting sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on the strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming sparse error covariance matrix, we allow the presence of the cross-sectional correlation even after taking out common factors, and it enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied.
Standard statistical techniques often require transforming data to have mean 0 and standard deviation 1. Typically, this process of "standardization" or "normalization" is applied across subjects when each subject produces a single number. High throughput genomic and financial data often come as rectangular arrays, where each coordinate in one direction concerns subjects, who might have different status (case or control, say); and each coordinate in the other designates "outcome" for a specific feature, for example "gene," "polymorphic site," or some aspect of financial profile. It may happen when analyzing data that arrive as a rectangular array that one requires BOTH the subjects and features to be "on the same footing." Thus, there may be a need to standardize across rows and columns of the rectangular matrix. There arises the question as to how to achieve this double normalization. We propose and investigate the convergence of what seems to us a natural approach to successive normalization, which we learned from colleague Bradley Efron. We also study the implementation of the method on simulated data and also on data that arose from scientific experimentation.
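A minimal sketch of the successive normalization described in this abstract: standardize every row, then every column, and repeat until the array stops changing. The use of the population standard deviation (divisor equal to the number of entries), the stopping rule, and the function names are assumptions made for illustration; the abstract does not fix these details.

```python
import numpy as np

def standardize(A, axis):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = A.mean(axis=axis, keepdims=True)
    std = A.std(axis=axis, keepdims=True)
    return (A - mean) / std

def successive_normalization(A, tol=1e-10, max_iter=1000):
    """Alternate row and column standardization until successive iterates
    differ by less than `tol` in every entry (or max_iter is reached)."""
    A = np.asarray(A, dtype=float)
    for iteration in range(1, max_iter + 1):
        B = standardize(standardize(A, axis=1), axis=0)  # rows, then columns
        change = np.max(np.abs(B - A))
        A = B
        if change < tol:
            break
    return A, iteration
```

After the final pass the columns are standardized exactly, and the rows are standardized up to the convergence tolerance. A row or column that is constant would produce a division by zero, so this sketch presumes generic real-valued data.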
The receiver operating characteristic (ROC) curve, the positive predictive value (PPV) curve and the negative predictive value (NPV) curve are three measures of performance for a continuous diagnostic biomarker. The ROC, PPV and NPV curves are often estimated empirically to avoid assumptions about the distributional form of the biomarkers. Recently, there has been a push to incorporate group sequential methods into the design of diagnostic biomarker studies. A thorough understanding of the asymptotic properties of the sequential empirical ROC, PPV and NPV curves will provide more flexibility when designing group sequential diagnostic biomarker studies. In this paper we derive asymptotic theory for the sequential empirical ROC, PPV and NPV curves under case-control sampling using sequential empirical process theory. We show that the sequential empirical ROC, PPV and NPV curves converge to the sum of independent Kiefer processes and show how these results can be used to derive asymptotic results for summaries of the sequential empirical ROC, PPV and NPV curves.
We consider tests of hypotheses when the parameters are not identifiable under the null in semiparametric models, where regularity conditions for profile likelihood theory fail. Exponential average tests based on integrated profile likelihood are constructed and shown to be asymptotically optimal under a weighted average power criterion with respect to a prior on the nonidentifiable aspect of the model. These results extend existing results for parametric models, which involve more restrictive assumptions on the form of the alternative than do our results. Moreover, the proposed tests accommodate models with infinite dimensional nuisance parameters which either may not be identifiable or may not be estimable at the usual parametric rate. Examples include tests of the presence of a change-point in the Cox model with current status data and tests of regression parameters in odds-rate models with right censored data. Optimal tests have not previously been studied for these scenarios. We study the asymptotic distribution of the proposed tests under the null, fixed contiguous alternatives and random contiguous alternatives. We also propose a weighted bootstrap procedure for computing the critical values of the test statistics. The optimal tests perform well in simulation studies, where they may exhibit improved power over alternative tests.
Quantile regression has been advocated in survival analysis to assess evolving covariate effects. However, challenges arise when the censoring time is not always observed and may be covariate-dependent, particularly in the presence of continuously-distributed covariates. In spite of several recent advances, existing methods either involve algorithmic complications or impose a probability grid. The former leads to difficulties in the implementation and asymptotics, whereas the latter introduces undesirable grid dependence. To resolve these issues, we develop fundamental and general quantile calculus on cumulative probability scale in this article, upon recognizing that probability and time scales do not always have a one-to-one mapping given a survival distribution. These results give rise to a novel estimation procedure for censored quantile regression, based on estimating integral equations. A numerically reliable and efficient Progressive Localized Minimization (PLMIN) algorithm is proposed for the computation. This procedure reduces exactly to the Kaplan-Meier method in the k-sample problem, and to standard uncensored quantile regression in the absence of censoring. Under regularity conditions, the proposed quantile coefficient estimator is uniformly consistent and converges weakly to a Gaussian process. Simulations show good statistical and algorithmic performance. The proposal is illustrated in the application to a clinical study.
We develop methodology for a multistage-decision problem with flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-function belongs to the approximation space, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We simulate a multistage clinical trial with flexible number of stages and apply the proposed censored-Q-learning algorithm to find individualized treatment regimens. The methodology presented in this paper has implications in the design of personalized medicine trials in cancer and in other life-threatening diseases.
We study nonparametric estimation with two types of data structures. In the first data structure n i.i.d. copies of (C, N(C)) are observed, where N is a finite state counting process jumping at time-variables of interest and C a random monitoring time. In the second data structure n i.i.d. copies of (C ∧ T, I(T ≤ C), N(C ∧ T)) are observed, where N is a counting process with a final jump at time T (e.g., death). This data structure includes observing right-censored data on T and a marker variable at the censoring time. In these data structures, easy to compute estimators, namely (weighted)-pool-adjacent-violator estimators for the marginal distributions of the unobservable time variables, and the Kaplan-Meier estimator for the time T till the final observable event, are available. These estimators ignore seemingly important information in the data. In this paper we prove that, at many continuous data generating distributions the ad hoc estimators yield asymptotically efficient estimators of √n-estimable parameters.
In the analysis of cluster data the regression coefficients are frequently assumed to be the same across all clusters. This hampers the ability to study the varying impacts of factors on each cluster. In this paper, a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. It achieves parsimony of parametrization and allows the exploration of nonlinear interactions. The random effect in the semiparametric model also accounts for within-cluster correlation. A local linear estimation procedure is proposed for estimating functional coefficients, residual variance, and the within-cluster correlation matrix. The asymptotic properties of the proposed estimators are established, and a method for constructing simultaneous confidence bands is proposed and studied. In addition, relevant hypothesis testing problems are addressed. Simulation studies are carried out to demonstrate the performance of the proposed methods in finite samples. The proposed model and methods are used to analyse the second birth interval in Bangladesh, leading to some interesting findings.
Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study.
Motivated by recent work studying massive imaging data in the neuroimaging literature, we propose multivariate varying coefficient models (MVCM) for modeling the relation between multiple functional responses and a set of covariates. We develop several statistical inference procedures for MVCM and systematically study their theoretical properties. We first establish the weak convergence of the local linear estimate of coefficient functions, as well as its asymptotic bias and variance, and then we derive asymptotic bias and mean integrated squared error of smoothed individual functions and their uniform convergence rate. We establish the uniform convergence rate of the estimated covariance function of the individual functions and its associated eigenvalue and eigenfunctions. We propose a global test for linear hypotheses of varying coefficient functions, and derive its asymptotic distribution under the null hypothesis. We also propose a simultaneous confidence band for each individual effect curve. We conduct Monte Carlo simulation to examine the finite-sample performance of the proposed procedures. We apply MVCM to investigate the development of white matter diffusivities along the genu tract of the corpus callosum in a clinical study of neurodevelopment.
Plot of estimated intercept function of dietary beta-carotene: (a) the estimated intercept function by LS method; (b) the estimated intercept function by CQR method with q = 7.  
Summary of the ratio of MSE (RMSE) over 400 simulations
The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.
Simulation results for the constant η and the time-varying η(t) models. In each figure, the true model of η (solid), the constant η model (dotted) and the time-varying η(t) model (dash-dotted) are plotted and compared.
The constant parameter estimation results
This article considers estimation of constant and time-varying coefficients in nonlinear ordinary differential equation (ODE) models where analytic closed-form solutions are not available. The numerical solution-based nonlinear least squares (NLS) estimator is investigated in this study. A numerical algorithm such as the Runge-Kutta method is used to approximate the ODE solution. The asymptotic properties are established for the proposed estimators considering both numerical error and measurement error. The B-spline is used to approximate the time-varying coefficients, and the corresponding asymptotic theories in this case are investigated under the framework of the sieve approach. Our results show that if the maximum step size of the p-order numerical algorithm goes to zero at a rate faster than n^{-1/(p∧4)}, the numerical error is negligible compared to the measurement error. This result provides theoretical guidance for selecting the step size in numerical evaluations of ODEs. Moreover, we have shown that the numerical solution-based NLS estimator and the sieve NLS estimator are strongly consistent. The sieve estimator of constant parameters is asymptotically normal with the same asymptotic covariance as that of the case where the true ODE solution is exactly known, while the estimator of the time-varying parameter has the optimal convergence rate under some regularity conditions. The theoretical results are also developed for the case when the step size of the ODE numerical solver does not go to zero fast enough or the numerical error is comparable to the measurement error. We illustrate our approach with both simulation studies and clinical data on HIV viral dynamics.
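The numerical solution-based NLS estimator for a constant parameter can be sketched on a toy problem. The example below assumes the scalar ODE dx/dt = −θx with a known initial value, solves it with the classical fourth-order Runge-Kutta scheme, and picks the θ on a grid that minimizes the residual sum of squares. The toy ODE, the grid search in place of a proper optimizer, and all names are illustrative assumptions; the abstract's models are nonlinear and may have time-varying coefficients.

```python
import numpy as np

def rk4_solve(f, x0, t_grid, theta, substeps=4):
    """Classical fourth-order Runge-Kutta solution of dx/dt = f(t, x, theta),
    evaluated at the times in t_grid. theta may be a 1-D array, in which case
    one solution path per theta value is computed by broadcasting."""
    x = np.broadcast_to(np.asarray(x0, dtype=float), np.shape(theta)).copy()
    path = [x.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = (t1 - t0) / substeps  # solver step, finer than the sampling interval
        t = t0
        for _ in range(substeps):
            k1 = f(t, x, theta)
            k2 = f(t + h / 2, x + h / 2 * k1, theta)
            k3 = f(t + h / 2, x + h / 2 * k2, theta)
            k4 = f(t + h, x + h * k3, theta)
            x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            t += h
        path.append(x.copy())
    return np.array(path)

def decay(t, x, theta):
    """Toy right-hand side (an assumption of this sketch): dx/dt = -theta * x."""
    return -theta * x

def nls_grid(t_obs, y_obs, x0, theta_grid):
    """Least-squares estimate of theta over a grid of candidate values."""
    paths = rk4_solve(decay, x0, t_obs, theta_grid)  # shape (n_times, n_theta)
    sse = np.sum((paths - y_obs[:, None]) ** 2, axis=0)
    return theta_grid[np.argmin(sse)]
```

The `substeps` argument controls the solver step size relative to the sampling interval, which is the quantity the step-size condition in the abstract is about: the finer the step, the smaller the numerical error relative to the measurement error.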
The Cox model with time-dependent coefficients has been studied by a number of authors recently. In this paper, we develop empirical likelihood (EL) pointwise confidence regions for the time-dependent regression coefficients via local partial likelihood smoothing. The EL simultaneous confidence bands for a linear combination of the coefficients are also derived based on the strong approximation methods. The empirical likelihood ratio is formulated through the local partial log-likelihood for the regression coefficient functions. Our numerical studies indicate that the EL pointwise/simultaneous confidence regions/bands have satisfactory finite sample performances. Compared with the confidence regions derived directly based on the asymptotic normal distribution of the local constant estimator, the EL confidence regions are overall tighter and can better capture the curvature of the underlying regression coefficient functions. Two data sets, the gastric cancer data and the Mayo Clinic primary biliary cirrhosis data, are analyzed using the proposed method.
We study nonparametric estimation for current status data with competing risks. Our main interest is in the nonparametric maximum likelihood estimator (MLE), and for comparison we also consider a simpler 'naive estimator'. Groeneboom, Maathuis and Wellner [8] proved that both types of estimators converge globally and locally at rate n^{1/3}. We use these results to derive the local limiting distributions of the estimators. The limiting distribution of the naive estimator is given by the slopes of the convex minorants of correlated Brownian motion processes with parabolic drifts. The limiting distribution of the MLE involves a new self-induced limiting process. Finally, we present a simulation study showing that the MLE is superior to the naive estimator in terms of mean squared error, both for small sample sizes and asymptotically.
A number of settings arise in which it is of interest to predict Principal Component (PC) scores for new observations using data from an initial sample. In this paper, we demonstrate that naive approaches to PC score prediction can be substantially biased towards 0 in the analysis of large matrices. This phenomenon is largely related to known inconsistency results for sample eigenvalues and eigenvectors as both dimensions of the matrix increase. For the spiked eigenvalue model for random matrices, we expand the generality of these results, and propose bias-adjusted PC score prediction. In addition, we compute the asymptotic correlation coefficient between PC scores from sample and population eigenvectors. Simulation and real data examples from the genetics literature show the improved bias and numerical properties of our estimators.
We find limiting distributions of the nonparametric maximum likelihood estimator (MLE) of a log-concave density, i.e. a density of the form f_0 = exp φ_0, where φ_0 is a concave function on ℝ. Existence, form, characterizations and uniform rates of convergence of the MLE are given by Rufibach (2006) and Dümbgen and Rufibach (2007). The characterization of the log-concave MLE in terms of distribution functions is the same (up to sign) as the characterization of the least squares estimator of a convex density on [0, ∞) as studied by Groeneboom, Jongbloed and Wellner (2001b). We use this connection to show that the limiting distributions of the MLE and its derivative are, under comparable smoothness assumptions, the same (up to sign) as in the convex density estimation problem. In particular, changing the smoothness assumptions of Groeneboom, Jongbloed and Wellner (2001b) slightly by allowing some higher derivatives to vanish at the point of interest, we find that the pointwise limiting distributions depend on the second and third derivatives at 0 of H_k, the "lower invelope" of an integrated Brownian motion process minus a drift term depending on the number of vanishing derivatives of φ_0 = log f_0 at the point of interest. We also establish the limiting distribution of the resulting estimator of the mode M(f_0) and establish a new local asymptotic minimax lower bound which shows the optimality of our mode estimator in terms of both rate of convergence and dependence of constants on population values.
For time-to-event data with finitely many competing risks, the proportional hazards model has been a popular tool for relating the cause-specific outcomes to covariates [Prentice et al. Biometrics 34 (1978) 541-554]. This article studies an extension of this approach to allow a continuum of competing risks, in which the cause of failure is replaced by a continuous mark only observed at the failure time. We develop inference for the proportional hazards model in which the regression parameters depend nonparametrically on the mark and the baseline hazard depends nonparametrically on both time and mark. This work is motivated by the need to assess HIV vaccine efficacy, while taking into account the genetic divergence of infecting HIV viruses in trial participants from the HIV strain that is contained in the vaccine, and adjusting for covariate effects. Mark-specific vaccine efficacy is expressed in terms of one of the regression functions in the mark-specific proportional hazards model. The new approach is evaluated in simulations and applied to the first HIV vaccine efficacy trial.
Let A and B be independent, central Wishart matrices in p variables with common covariance and having m and n degrees of freedom, respectively. The distribution of the largest eigenvalue of (A + B)^{-1}B has numerous applications in multivariate statistics, but is difficult to calculate exactly. Suppose that m and n grow in proportion to p. We show that after centering and scaling, the distribution is approximated to second order, O(p^{-2/3}), by the Tracy-Widom law. The results are obtained for both complex and then real-valued data by using methods of random matrix theory to study the largest eigenvalue of the Jacobi unitary and orthogonal ensembles. Asymptotic approximations of Jacobi polynomials near the largest zero play a central role.
This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the application, sparsity may occur a priori on the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (s_n log p_n / n)^{1/2}, where s_n is the number of nonzero elements, p_n is the size of the covariance matrix and n is the sample size. This explicitly spells out that the contribution of high dimensionality is merely a logarithmic factor. The conditions on the rate with which the tuning parameter λ_n goes to 0 have been made explicit and compared under different penalties. As a result, for the L_1-penalty, to guarantee the sparsistency and optimal rate of convergence, the number of nonzero elements should be small: s_n' = O(p_n) at most, among O(p_n^2) parameters, for estimating a sparse covariance or correlation matrix, a sparse precision or inverse correlation matrix or a sparse Cholesky factor, where s_n' is the number of nonzero off-diagonal entries. On the other hand, using the SCAD or hard-thresholding penalty functions, there is no such restriction.
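To make the contrast between the penalties concrete, the sketch below writes out the L_1, SCAD and hard-thresholding penalty functions for a single coefficient. The formulas are the standard ones from Fan and Li (2001), with the conventional SCAD constant a = 3.7; the abstract names the penalties but does not state them, so treating these as the intended forms is an assumption.

```python
import numpy as np

def l1_penalty(theta, lam):
    """L1 penalty: grows linearly without bound."""
    return lam * np.abs(theta)

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty: linear near zero, quadratic transition on (lam, a*lam],
    constant beyond a*lam."""
    t = np.abs(theta)
    linear = lam * t
    quadratic = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    constant = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

def hard_threshold_penalty(theta, lam):
    """Hard-thresholding penalty: lam^2 - (|theta| - lam)^2 for |theta| < lam,
    and the constant lam^2 otherwise."""
    t = np.abs(theta)
    return lam ** 2 - (t - lam) ** 2 * (t < lam)
```

Both SCAD and the hard-thresholding penalty level off for large coefficients, so large entries are not shrunk further, whereas the L_1 penalty keeps growing. This difference in bias for large entries is the usual intuition for why the L_1 penalty needs the stronger sparsity restriction mentioned in the abstract.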
We study estimation of multivariate densities p of the form p(x) = h(g(x)) for x ∈ ℝ^d and for a fixed monotone function h and an unknown convex function g. The canonical example is h(y) = e^{-y} for y ∈ ℝ; in this case, the resulting class of densities [Formula: see text] is well known as the class of log-concave densities. Other functions h allow for classes of densities with heavier tails than the log-concave class. We first investigate when the maximum likelihood estimator p̂ exists for the class P(h) for various choices of monotone transformations h, including decreasing and increasing functions h. The resulting models for increasing transformations h extend the classes of log-convex densities studied previously in the econometrics literature, corresponding to h(y) = exp(y). We then establish consistency of the maximum likelihood estimator for fairly general functions h, including the log-concave class P(e^{-y}) and many others. In a final section, we provide asymptotic minimax lower bounds for the estimation of p and its vector of derivatives at a fixed point x_0 under natural smoothness hypotheses on h and g. The proofs rely heavily on results from convex analysis.
Cook's (Cook, 1977) distance is one of the most important diagnostic tools for detecting influential individual or subsets of observations in linear regression for cross-sectional data. However, for many complex data structures (e.g., longitudinal data), no rigorous approach has been developed to address a fundamental issue: deleting subsets with different numbers of observations introduces different degrees of perturbation to the current model fitted to the data and the magnitude of Cook's distance is associated with the degree of the perturbation. The aim of this paper is to address this issue in general parametric models with complex data structures. We propose a new quantity for measuring the degree of the perturbation introduced by deleting a subset. We use stochastic ordering to quantify the stochastic relationship between the degree of the perturbation and the magnitude of Cook's distance. We develop several scaled Cook's distances to resolve the comparison of Cook's distance for different subset deletions. Theoretical and numerical examples are examined to highlight the broad spectrum of applications of these scaled Cook's distances in a formal influence analysis.
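For reference, the classical single-case Cook's distance that this abstract takes as its starting point can be sketched for simple linear regression as below. This is a minimal plain-Python illustration, not the scaled Cook's distances proposed in the paper; the helper names `fit_line` and `cooks_distances` and the toy data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def cooks_distances(xs, ys):
    """Single-case Cook's distances D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2)."""
    n, p = len(xs), 2  # p = number of regression coefficients
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    out = []
    for x, e in zip(xs, resid):
        h = 1.0 / n + (x - x_bar) ** 2 / sxx  # leverage of this observation
        out.append(e * e * h / (p * s2 * (1.0 - h) ** 2))
    return out


# Toy data (hypothetical): the last point has high leverage and lies off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]
d = cooks_distances(xs, ys)
most_influential = max(range(len(d)), key=lambda i: d[i])
print(most_influential, [round(v, 3) for v in d])
```

Each deletion here removes exactly one observation, so the distances are directly comparable; the comparison problem described in the abstract arises once the deleted subsets differ in size.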
We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can also be proved by our method. However, the presented oracle inequalities are sharper, since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so that the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities.
We study the Cox models with semiparametric relative risk, which can be partially linear with one nonparametric component, or multiple additive or nonadditive nonparametric components. A penalized partial likelihood procedure is proposed to simultaneously estimate the parameters and select variables for both the parametric and the nonparametric parts. Two penalties are applied sequentially. The first penalty, governing the smoothness of the multivariate nonlinear covariate effect function, provides a smoothing spline ANOVA framework that is exploited to derive an empirical model selection tool for the nonparametric part. The second penalty, either the smoothly-clipped-absolute-deviation (SCAD) penalty or the adaptive LASSO penalty, achieves variable selection in the parametric part. We show that the resulting estimator of the parametric part possesses the oracle property, and that the estimator of the nonparametric part achieves the optimal rate of convergence. The proposed procedures are shown to work well in simulation experiments, and then applied to a real data example on sexually transmitted diseases.
Data summary with number of non-zero elements reported on the whole data set and 
High-throughput genetic sequencing arrays, with thousands of measurements per sample and a large amount of related censored clinical data, have increased the need for better measurement-specific model selection. In this paper we establish strong oracle properties of non-concave penalized methods for non-polynomial (NP) dimensional data with censoring in the framework of Cox's proportional hazards model. A class of folded-concave penalties is employed, and both LASSO and SCAD are discussed specifically. We address the question of under which dimensionality and correlation restrictions an oracle estimator can be constructed and attained. It is demonstrated that non-concave penalties lead to a significant reduction of the "irrepresentable condition" needed for LASSO model selection consistency. A large deviation result for martingales, of interest in its own right, is developed for characterizing the strong oracle property. Moreover, the non-concave regularized estimator is shown to asymptotically achieve the information bound of the oracle estimator. A coordinate-wise algorithm is developed for finding the grid of solution paths for penalized hazard regression problems, and its performance is evaluated on simulated and gene association study examples.
We develop an efficient estimation procedure for identifying and estimating the central subspace. Using a new way of parameterization, we convert the problem of identifying the central subspace to the problem of estimating a finite dimensional parameter in a semiparametric model. This conversion allows us to derive an efficient estimator which reaches the optimal semiparametric efficiency bound. The resulting efficient estimator can exhaustively estimate the central subspace without imposing any distributional assumptions. Our proposed efficient estimation also provides a possibility for making inference of parameters that uniquely identify the central subspace. We conduct simulation studies and a real data analysis to demonstrate the finite sample performance in comparison with several existing methods.
We propose a new penalized method for variable selection and estimation that explicitly incorporates the correlation patterns among predictors. This method is based on a combination of the minimax concave penalty and Laplacian quadratic associated with a graph as the penalty function. We call it the sparse Laplacian shrinkage (SLS) method. The SLS uses the minimax concave penalty for encouraging sparsity and Laplacian quadratic penalty for promoting smoothness among coefficients associated with the correlated predictors. The SLS has a generalized grouping property with respect to the graph represented by the Laplacian quadratic. We show that the SLS possesses an oracle property in the sense that it is selection consistent and equal to the oracle Laplacian shrinkage estimator with high probability. This result holds in sparse, high-dimensional settings with p ≫ n under reasonable conditions. We derive a coordinate descent algorithm for computing the SLS estimates. Simulation studies are conducted to evaluate the performance of the SLS method and a real data example is used to illustrate its application.
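The two ingredients of the SLS penalty named above can be sketched as follows. This is only an illustration under stated assumptions: the minimax concave penalty is written in its usual form, while the signed edge weights, the factor of one half on the Laplacian term, and the function names `mcp`, `laplacian_quadratic` and `sls_penalty` are choices made for this example, not details given in the abstract.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty evaluated at |t| (lam > 0, gamma > 0)."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat beyond gamma * lam


def laplacian_quadratic(beta, edges):
    """Sum over graph edges of |a_jk| * (beta_j - sign(a_jk) * beta_k)^2.

    `edges` maps index pairs (j, k) to a signed edge weight a_jk.
    """
    total = 0.0
    for (j, k), a in edges.items():
        sign = 1.0 if a >= 0 else -1.0
        total += abs(a) * (beta[j] - sign * beta[k]) ** 2
    return total


def sls_penalty(beta, edges, lam1, lam2, gamma):
    """MCP sparsity term plus a Laplacian smoothness term (illustrative form)."""
    sparsity = sum(mcp(b, lam1, gamma) for b in beta)
    return sparsity + 0.5 * lam2 * laplacian_quadratic(beta, edges)


# Hypothetical example: predictors 0 and 1 are positively correlated (one edge).
edges = {(0, 1): 0.9}
smooth = sls_penalty([1.0, 1.0, 0.0], edges, lam1=1.0, lam2=1.0, gamma=3.0)
rough = sls_penalty([1.0, -1.0, 0.0], edges, lam1=1.0, lam2=1.0, gamma=3.0)
print(smooth, rough)
```

Both coefficient vectors have the same sparsity cost, so the difference between the two values comes entirely from the Laplacian term, which penalizes unequal coefficients on linked predictors.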
This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as "screening" and the last stage as "cleaning." We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.
Researchers are often interested in drawing inferences regarding the order between two experimental groups on the basis of multivariate response data. Since standard multivariate methods are designed for two sided alternatives they may not be ideal for testing for order between two groups. In this article we introduce the notion of the linear stochastic order and investigate its properties. Statistical theory and methodology are developed to both estimate the direction which best separates two arbitrary ordered distributions and to test for order between the two groups. The new methodology generalizes Roy's classical largest root test to the nonparametric setting and is applicable to random vectors with discrete and/or continuous components. The proposed methodology is illustrated using data obtained from a 90-day pre-chronic rodent cancer bioassay study conducted by the National Toxicology Program (NTP).
Proportions of employees having job grade higher than 4 
(a) Asymptotic null distribution (solid) and estimated true null distribution (dotted) for the Poisson model. (b) The power function at significance levels α = 0.01, 0.05 and 0.1. The captions for (c) and (d) are the same as those in (a) and (b) except that the Bernoulli model is now used.
(a) Fitted coefficient function $\alpha_1(\cdot)$. (b) Fitted coefficient function $\alpha_2(\cdot)$. (c) Scatter plot of 'TotalYrsExp' against 'Age'. (d) Standardized residuals against the variable 'Age'.
One-step estimate of parametric components with different bandwidths 
The generalized varying coefficient partially linear model with a growing number of predictors arises in many contemporary scientific endeavors. In this paper we address both the theoretical and practical sides of profile likelihood estimation and inference. When the number of parameters grows with sample size, the existence and asymptotic normality of the profile likelihood estimator are established under some regularity conditions. Profile likelihood ratio inference for the growing number of parameters is proposed and the Wilks phenomenon is demonstrated. A new algorithm, called the accelerated profile-kernel algorithm, for computing the profile-kernel estimator is proposed and investigated. Simulation studies show that the resulting estimates are as efficient as the fully iterative profile-kernel estimates. For moderate sample sizes, our proposed procedure saves much computational time over the fully iterative profile-kernel one and gives more stable estimates. A set of real data is analyzed using our proposed algorithm.
The dynamic properties and independence structure of stochastic kinetic models (SKMs) are analyzed. An SKM is a highly multivariate jump process used to model chemical reaction networks, particularly those in biochemical and cellular systems. We identify SKM subprocesses with the corresponding counting processes and propose a directed, cyclic graph (the kinetic independence graph or KIG) that encodes the local independence structure of their conditional intensities. Given a partition [A, D, B] of the vertices, the graphical separation A ⊥ B|D in the undirected KIG has an intuitive chemical interpretation and implies that A is locally independent of B given A ∪ D. It is proved that this separation also results in global independence of the internal histories of A and B conditional on a history of the jumps in D which, under conditions we derive, corresponds to the internal history of D. The results enable mathematical definition of a modularization of an SKM using its implied dynamics. Graphical decomposition methods are developed for the identification and efficient computation of nested modularizations. Application to an SKM of the red blood cell advances understanding of this biochemical system.
We study a marginal empirical likelihood approach in scenarios when the number of variables grows exponentially with the sample size. The marginal empirical likelihood ratios as functions of the parameters of interest are systematically examined, and we find that the marginal empirical likelihood ratio evaluated at zero can be used to differentiate whether an explanatory variable is contributing to a response variable or not. Based on this finding, we propose a unified feature screening procedure for linear models and generalized linear models. Unlike most existing feature screening approaches, which rely on the magnitudes of some marginal estimators to identify true signals, the proposed screening approach is capable of further incorporating the level of uncertainty of such estimators. This merit is inherited from the self-studentization property of the empirical likelihood approach, and extends the insights of existing feature screening methods. Moreover, we show that our screening approach is less restrictive in its distributional assumptions, and can be conveniently adapted to a broad range of scenarios such as models specified using general moment conditions. Our theoretical results and extensive numerical examples by simulations and data analysis demonstrate the merits of the marginal empirical likelihood approach.
Fan & Li (2001) propose a family of variable selection methods via penalized likelihood using concave penalty functions. The nonconcave penalized likelihood estimators enjoy the oracle properties, but maximizing the penalized likelihood function is computationally challenging, because the objective function is nondifferentiable and nonconcave. In this article we propose a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions. Convergence and other theoretical properties of the LLA algorithm are established. A distinguishing feature of the LLA algorithm is that at each LLA step, the LLA estimator can naturally adopt a sparse representation. Thus we suggest using the one-step LLA estimator from the LLA algorithm as the final estimate. Statistically, we show that if the regularization parameter is appropriately chosen, the one-step LLA estimates enjoy the oracle properties with good initial estimators. Computationally, the one-step LLA estimation methods dramatically reduce the computational cost of maximizing the nonconcave penalized likelihood. We conduct Monte Carlo simulations to assess the finite sample performance of the one-step sparse estimation methods. The results are very encouraging.
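The idea of a single LLA step can be sketched in a deliberately simplified setting. The code below assumes a least-squares problem with an orthonormal design, so that the linearized (weighted lasso) subproblem separates across coordinates and is solved by soft-thresholding; it uses the SCAD penalty with the conventional value a = 3.7. The function names and the numerical inputs are hypothetical, and this is not the general algorithm of the paper.

```python
def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty at |t| (Fan & Li, 2001), for a > 2."""
    t = abs(t)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def soft_threshold(z, thresh):
    """Minimizer of 0.5 * (z - b)^2 + thresh * |b| over b."""
    if z > thresh:
        return z - thresh
    if z < -thresh:
        return z + thresh
    return 0.0


def one_step_lla(z, lam, a=3.7):
    """One LLA step started from the initial estimate z.

    Assumes an orthonormal design: each coordinate gets the lasso weight
    scad_derivative(|z_j|) and is then soft-thresholded by that weight.
    """
    return [soft_threshold(zj, scad_derivative(zj, lam, a)) for zj in z]


initial = [0.5, -2.0, 5.0]  # hypothetical initial (least-squares) estimate
print(one_step_lla(initial, lam=1.0))
```

In this sketch, small coefficients are set exactly to zero, moderate ones are shrunk, and large ones are left untouched because the SCAD derivative vanishes beyond a·λ.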
Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we propose two novel nonparametric estimators for the genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated by the data from the MicroArray Quality Control (MAQC) project.
In partially linear single-index models, we obtain the semiparametrically efficient profile least-squares estimators of regression coefficients. We also employ the smoothly clipped absolute deviation penalty (SCAD) approach to simultaneously select variables and estimate regression coefficients. We show that the resulting SCAD estimators are consistent and possess the oracle property. Subsequently, we demonstrate that a proposed tuning parameter selector, BIC, identifies the true model consistently. Finally, we develop a linear hypothesis test for the parametric coefficients and a goodness-of-fit test for the nonparametric component, respectively. Monte Carlo studies are also presented.
The multiple testing procedure plays an important role in detecting the presence of spatial signals for large scale imaging data. Typically, the spatial signals are sparse but clustered. This paper provides empirical evidence that for a range of commonly used control levels, the conventional FDR procedure can lack the ability to detect statistical significance, even if the p-values under the true null hypotheses are independent and uniformly distributed; more generally, ignoring the neighboring information of spatially structured data will tend to diminish the detection effectiveness of the FDR procedure. This paper first introduces a scalar quantity to characterize the extent to which the "lack of identification phenomenon" (LIP) of the FDR procedure occurs. Second, we propose a new multiple comparison procedure, called FDR(L), to accommodate the spatial information of neighboring p-values, via a local aggregation of p-values. Theoretical properties of the FDR(L) procedure are investigated under weak dependence of p-values. It is shown that the FDR(L) procedure alleviates the LIP of the FDR procedure, thus substantially facilitating the selection of more stringent control levels. Simulation evaluations indicate that the FDR(L) procedure improves the detection sensitivity of the FDR procedure with little loss in detection specificity. The computational simplicity and detection effectiveness of the FDR(L) procedure are illustrated through a real brain fMRI dataset.
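As a point of reference, the conventional FDR procedure, taken here to be the Benjamini-Hochberg step-up rule, together with a simple local aggregation of neighboring p-values, can be sketched as below. The use of a running median on a one-dimensional grid, the neighborhood half-width, and the function names are assumptions made for illustration; the calibration that the FDR(L) procedure applies to the aggregated values is not reproduced here.

```python
from statistics import median


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return sorted indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank  # keep the largest rank that passes
    return sorted(order[:k_max])


def local_median(pvalues, half_width=1):
    """Replace each p-value by the median over its 1-D neighborhood."""
    m = len(pvalues)
    return [
        median(pvalues[max(0, i - half_width): min(m, i + half_width + 1)])
        for i in range(m)
    ]


# Hypothetical p-values on a 1-D grid.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, alpha=0.05)
aggregated = local_median([0.9, 0.01, 0.8], half_width=1)
print(rejected, aggregated)
```

Aggregated p-values are no longer uniformly distributed under the null, so the plain Benjamini-Hochberg cutoff should not be applied to them without the adjustment developed in the paper.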
Histograms and scatterplots of $\hat{\theta}_n$ and $\hat{\beta}_n$ in the correctly specified case for H = 0.3 (top row), H = 0.5 (middle row) and H = 0.7 (bottom row), based on 500 samples of size n = 20.
Same as Figure 2 except in the partially misspecified case.
Gene expression example: histograms of $\hat{\theta}^*_n$ based on 1000 residual bootstrap samples and simulated responses with σ = 0.01 (left), σ = 0.03 (middle) and σ = 0.1 (right).
This paper develops a point impact linear regression model in which the trajectory of a continuous stochastic process, when evaluated at a "sensitive time point", is associated with a scalar response. The proposed model complements and is more interpretable than the functional linear regression approach that has become popular in recent years. The trajectories are assumed to have fractal (self-similar) properties in common with a fractional Brownian motion with an unknown Hurst exponent. Bootstrap confidence intervals based on the least-squares estimator of the sensitive time point are developed. Misspecification of the point impact model by a functional linear model is also investigated. Non-Gaussian limit distributions and rates of convergence determined by the Hurst exponent play an important role.
In this work we deal with correlated failure time (age at onset) data arising from population-based case-control studies, where case and control probands are selected by population-based sampling and an array of risk factor measures is collected for both cases and controls and their relatives. Parameters of interest are effects of risk factors on the failure time hazard function and within-family dependencies among failure times after adjusting for the risk factors. Due to the retrospective sampling scheme, large sample theory for existing methods has not been established. We develop a novel technique for estimating the parameters of interest under a general semiparametric shared frailty model. We also present a simple, easily computed, and non-iterative nonparametric estimator for the cumulative baseline hazard function. We provide rigorous large sample theory for the proposed method. We also present simulation results and a real data example for illustrating the utility of the proposed method.
Because many illnesses show heterogeneous response to treatment, there is increasing interest in individualizing treatment to patients [11]. An individualized treatment rule is a decision rule that recommends treatment according to patient characteristics. We consider the use of clinical trial data in the construction of an individualized treatment rule leading to highest mean response. This is a difficult computational problem because the objective function is the expectation of a weighted indicator function that is non-concave in the parameters. Furthermore, there are frequently many pretreatment variables that may or may not be useful in constructing an optimal individualized treatment rule, yet cost and interpretability considerations imply that only a few variables should be used by the individualized treatment rule. To address these challenges we consider estimation based on $\ell_1$-penalized least squares. This approach is justified via a finite sample upper bound on the difference between the mean response due to the estimated individualized treatment rule and the mean response due to the optimal individualized treatment rule.
Generalized linear models and the quasi-likelihood method extend ordinary regression models to accommodate more general conditional distributions of the response. Nonparametric methods need no explicit parametric specification, and the resulting model is completely determined by the data themselves. However, nonparametric estimation schemes, such as the local polynomial smoothing estimation of nonparametric generalized linear models studied in Fan, Heckman and Wand (1995), generally have a slower convergence rate. In this work, we propose two parametrically guided nonparametric estimation schemes by incorporating prior shape information on the link transformation of the response variable's conditional mean in terms of the predictor variable. Asymptotic results and numerical simulations demonstrate the improvement of our new estimation schemes over the original nonparametric counterpart.
In this paper, we are concerned with how to select significant variables in semiparametric modeling. Variable selection for semiparametric regression models consists of two components: model selection for the nonparametric components and selection of significant variables for the parametric portion. Thus, it is much more challenging than variable selection for parametric models such as linear models and generalized linear models, because traditional variable selection procedures, including stepwise regression and best subset selection, require model selection for the nonparametric components for each submodel. This leads to a very heavy computational burden. In this paper, we propose a class of variable selection procedures for semiparametric regression models using nonconcave penalized likelihood. The newly proposed procedures are distinguished from the traditional ones in that they delete insignificant variables and estimate the coefficients of significant variables simultaneously. This allows us to establish the sampling properties of the resulting estimate. We first establish the rate of convergence of the resulting estimate. With proper choices of penalty functions and regularization parameters, we then establish the asymptotic normality of the resulting estimate, and further demonstrate that the proposed procedures perform as well as an oracle procedure. A semiparametric generalized likelihood ratio test is proposed to select significant variables in the nonparametric component. We investigate the asymptotic behavior of the proposed test and demonstrate that its limiting null distribution follows a chi-squared distribution, which is independent of the nuisance parameters. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedures.
Suppose we have samples of a subset of a collection of random variables. No additional information is provided about the number of latent variables, nor of the relationship between the latent and observed variables. Is it possible to discover the number of hidden components, and to learn a statistical model over the entire collection of variables? We address this question in the setting in which the latent and observed variables are jointly Gaussian, with the conditional statistics of the observed variables conditioned on the latent variables being specified by a graphical model. As a first step we give natural conditions under which such latent-variable Gaussian graphical models are identifiable given marginal statistics of only the observed variables. Essentially these conditions require that the conditional graphical model among the observed variables is sparse, while the effect of the latent variables is “spread out” over most of the observed variables. Next we propose a tractable convex program based on regularized maximum-likelihood for model selection in this latent-variable setting; the regularizer uses both the $\ell_1$ norm and the nuclear norm. Our modeling framework can be viewed as a combination of dimensionality reduction (to identify latent variables) and graphical modeling (to capture remaining statistical structure not attributable to the latent variables), and it consistently estimates both the number of hidden components and the conditional graphical model structure among the observed variables. These results are applicable in the high-dimensional setting in which the number of latent/observed variables grows with the number of samples of the observed variables. The geometric properties of the algebraic varieties of sparse matrices and of low-rank matrices play an important role in our analysis.
Consider n i.i.d. random elements on C[0, 1]. We show that, under an appropriate strengthening of the domain of attraction condition, natural estimators of the extreme-value index (which is now a continuous function) and of the normalizing functions have a Gaussian process as limiting distribution. A key tool is the weak convergence of a weighted tail empirical process, which makes it possible to obtain the results uniformly on [0, 1]. Detailed examples are also presented.
Top-cited authors
Jerome H. Friedman
  • Stanford University
Trevor Hastie
  • Stanford University
Robert Tibshirani
  • Stanford University
Jianqing Fan
  • Princeton University
Yoav Benjamini
  • Tel Aviv University