Preprint

Test and Measure for Partial Mean Dependence Based on Machine Learning Methods

Taylor & Francis
Journal of the American Statistical Association

Abstract

It is of importance to investigate the significance of a subset of covariates W for the response Y given covariates Z in regression modeling. To this end, we propose a significance test for the partial mean independence problem based on machine learning methods and data splitting. The test statistic converges to the standard chi-squared distribution under the null hypothesis while it converges to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability are also discussed. If the null hypothesis is rejected, we propose a partial Generalized Measure of Correlation (pGMC) to measure the partial mean dependence of Y given W after controlling for the nonlinear effect of Z. We present the appealing theoretical properties of the pGMC and establish the asymptotic normality of its estimator with the optimal root-N convergence rate. Furthermore, the valid confidence interval for the pGMC is also derived. As an important special case when there are no conditional covariates Z, we introduce a new test of overall significance of covariates for the response in a model-free setting. Numerical studies and real data analysis are also conducted to compare with existing approaches and to demonstrate the validity and flexibility of our proposed procedures.
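The construction below sketches the general data-splitting recipe from the abstract: estimate E[Y | Z] and E[Y | W, Z] with a machine learning regressor on one half of the sample, then compare squared prediction errors on the other half. This is an illustration of the idea only, not the authors' exact statistic (which has a chi-squared null limit); the random-forest learner, the even split, and the one-sided normal calibration are assumptions made for this sketch.

```python
# Illustrative sketch of a sample-splitting test of E[Y|W,Z] = E[Y|Z];
# NOT the paper's exact statistic. W, Z are 2-D arrays, Y is 1-D.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

def partial_mean_independence_sketch(W, Z, Y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    d1, d2 = idx[: len(Y) // 2], idx[len(Y) // 2 :]

    # Fit both conditional means on the first half.
    full = RandomForestRegressor(random_state=0).fit(
        np.column_stack([W[d1], Z[d1]]), Y[d1])
    reduced = RandomForestRegressor(random_state=0).fit(Z[d1], Y[d1])

    # Compare squared prediction errors on the second half.
    e_full = (Y[d2] - full.predict(np.column_stack([W[d2], Z[d2]]))) ** 2
    e_red = (Y[d2] - reduced.predict(Z[d2])) ** 2
    diff = e_red - e_full              # positive on average if W matters
    t = np.sqrt(len(diff)) * diff.mean() / diff.std(ddof=1)
    return t, stats.norm.sf(t)         # one-sided p-value under the null
```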


... Guerre and Lavergne [13] further provided the optimal minimax rate for the smoothing parameter that ensures the rate optimality of the test in the context of testing the specification of a nonlinear parametric regression function. Conditional on a subset of covariates in regression modeling, Cai et al. [14] proposed a significance test for the partial mean independence problem based on machine learning methods and data splitting. Tan and Zhu [15] proposed a residual-marked empirical process that adapts to the underlying model, forming the basis of a goodness-of-fit test for parametric single-index models with a diverging number of predictors. ...
Article
Full-text available
A model specification test is a statistical procedure used to assess whether a given statistical model accurately represents the underlying data-generating process. The smoothing-based nonparametric specification test is widely used due to its efficiency against “singular” local alternatives. However, large modern datasets create various computational problems when implementing the nonparametric specification test. The divide-and-conquer algorithm is highly effective for handling large datasets, as it can break down a large dataset into more manageable datasets. By applying divide-and-conquer, the nonparametric specification test can handle the computational problems induced by the massive size of the modern datasets, leading to improved scalability and efficiency and reduced processing time. However, the selection of smoothing parameters for optimal power of the distributed algorithm is an important problem. The rate of the smoothing parameter that ensures rate optimality of the test in the context of testing the specification of a nonlinear parametric regression function is studied in the literature. In this paper, we verify the uniqueness of the rate of the smoothing parameter that ensures the rate optimality of divide-and-conquer-based tests. By employing a penalty method to select the smoothing parameter, we obtain a test with an asymptotic normal null distribution and adaptiveness properties. The performance of this test is further illustrated through numerical simulations.
... In Assumption 4, mild moment conditions for the error term are imposed. Assumption 5 is a mild condition on the rate of estimating the nuisance function g(·), which is available for many ML methods (Chernozhukov et al. 2018; Williamson et al. 2021; Chi et al. 2022; Bauer and Kohler 2019; Cai et al. 2024). ...
Article
Full-text available
In this paper, we consider tests for high-dimensional partially linear regression models. The presence of high-dimensional nuisance covariates and the unknown nuisance function makes the inference problem very challenging. We adopt machine learning methods to estimate the unknown nuisance function and introduce quadratic-form test statistics. Interestingly, though the machine learning methods can be very complex, under suitable conditions, we establish the asymptotic normality of our introduced test statistics under the null hypothesis and local alternative hypotheses. We further propose a power-enhanced procedure to improve the performance of the test statistics. Two thresholding determination methods are provided for the proposed power-enhanced procedure. We show that the power enhancement procedure is powerful in detecting signals under either sparse or dense alternatives, while still controlling the type I error asymptotically under the null hypothesis. Numerical studies are carried out to illustrate the empirical performance of our introduced procedures.
Article
Full-text available
In this paper, we propose semiparametric efficient estimators of genetic relatedness between two traits in a model-free framework. Most existing methods require specifying certain parametric models involving the traits and genetic variants. However, the bias due to model misspecification may yield misleading statistical results. Moreover, the semiparametric efficient bounds for estimators of genetic relatedness are still lacking. In this paper, we develop semiparametric efficient estimators with machine learning methods and construct valid confidence intervals for two important measures of genetic relatedness: genetic covariance and genetic correlation, allowing both continuous and discrete responses. Based on the derived efficient influence functions of genetic relatedness, we propose a consistent estimator of the genetic covariance as long as one of the genetic values is consistently estimated. The data of two traits may be collected from the same group or different groups of individuals. To validate our approach, we conduct various numerical studies to illustrate the introduced estimation procedures. Additionally, we apply our proposed methodologies to analyze data from the Carworth Farms White mice genome-wide association study.
Preprint
Full-text available
The effectiveness of Earth Pressure Balance (EPB) Tunnel Boring Machines (TBMs) in urban underground construction relies on understanding and optimizing their performance under variable geotechnical conditions. This study investigates the key parameters impacting TBM efficiency during the construction of the Jakarta Mass Rapid Transit (MRT) Underground Section CP106. Data from TBM operation were analyzed using statistical and machine learning techniques, including Mutual Information (MI), Partial Dependence Plots (PDP), and Analysis of Variance (ANOVA), to identify influential parameters such as Tensile Strength, Uniaxial Strength, Spacing, and Penetration. Predictive models, including Gradient Boosting Regressor, Random Forest Regressor, and Linear Regression, were evaluated based on error metrics and R-squared values, with Gradient Boosting Regressor showing the highest predictive accuracy. Clustering analyses using K-Means and Principal Component Analysis (PCA) further classified operational states, identifying conditions that optimize energy efficiency and reduce mechanical wear. The findings suggest that TBM configurations with lower Specific Energy, Normal Force, and Rolling Force contribute to more efficient, less force-intensive tunneling. These insights provide a basis for refining TBM operations and predictive modeling in urban tunneling projects.
Article
Full-text available
Testing the significance of predictors in a regression model is one of the most important topics in statistics. This problem is especially difficult without any parametric assumptions on the data. This paper aims to test the null hypothesis that given confounding variables Z, X does not significantly contribute to the prediction of Y under the model-free setting, where X and Z are possibly high dimensional. We propose a general framework that first fits nonparametric machine learning regression algorithms for Y on (X, Z) and for Y on Z alone, then compares the prediction power of the two models. The proposed method allows us to leverage the strength of the most powerful regression algorithms developed in the modern machine learning community. The P value for the test can be easily obtained by permutation. In simulations, we find that the proposed method is more powerful compared to existing methods. The proposed method allows us to draw biologically meaningful conclusions from two gene expression data analyses without strong distributional assumptions: 1) testing the prediction power of sequencing RNA for the proteins in cellular indexing of transcriptomes and epitopes by sequencing data and 2) identification of spatially variable genes in spatially resolved transcriptomics data.
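The comparison-plus-permutation recipe in this abstract is easy to sketch. The hedged illustration below uses gradient boosting as the plug-in learner and permutes X to recompute the gain in test-set accuracy; the learner choice, the single split, and the permutation count are assumptions of this sketch, not the authors' implementation.

```python
# Sketch: does X add prediction power for Y beyond Z? Calibrate by
# permuting X and refitting. X, Z are 2-D arrays, Y is 1-D.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def added_mse_gain(X, Z, Y, tr, te, mse_z):
    m = GradientBoostingRegressor(random_state=0).fit(
        np.column_stack([X[tr], Z[tr]]), Y[tr])
    mse_xz = np.mean((Y[te] - m.predict(np.column_stack([X[te], Z[te]]))) ** 2)
    return mse_z - mse_xz                     # prediction power added by X

def perm_pvalue(X, Z, Y, n_perm=100, seed=0):
    rng = np.random.default_rng(seed)
    tr = rng.random(len(Y)) < 0.5
    te = ~tr
    m_z = GradientBoostingRegressor(random_state=0).fit(Z[tr], Y[tr])
    mse_z = np.mean((Y[te] - m_z.predict(Z[te])) ** 2)
    obs = added_mse_gain(X, Z, Y, tr, te, mse_z)
    null = [added_mse_gain(rng.permutation(X), Z, Y, tr, te, mse_z)
            for _ in range(n_perm)]           # X permuted row-wise
    return (1 + sum(g >= obs for g in null)) / (1 + n_perm)
```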
Article
Full-text available
Inference for the parameters indexing generalised linear models is routinely based on the assumption that the model is correct and a priori specified. This is unsatisfactory because the chosen model is usually the result of a data‐adaptive model selection process, which may induce excess uncertainty that is not usually acknowledged. Moreover, the assumptions encoded in the chosen model rarely represent some a priori known, ground truth, making standard inferences prone to bias, but also failing to give a pure reflection of the information that is contained in the data. Inspired by developments on assumption‐free inference for so‐called projection parameters, we here propose novel nonparametric definitions of main effect estimands and effect modification estimands. These reduce to standard main effect and effect modification parameters in generalised linear models when these models are correctly specified, but have the advantage that they continue to capture respectively the (conditional) association between two variables, or the degree to which two variables interact in their association with outcome, even when these models are misspecified. We achieve an assumption‐lean inference for these estimands on the basis of their efficient influence function under the nonparametric model while invoking flexible data‐adaptive (e.g. machine learning) procedures.
Article
Full-text available
An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with a black-box nature. Significance testing is promising to address the black-box issue and explore novel scientific insights and interpretations of the decision-making process based on a deep learning model. However, testing for a neural network poses a challenge because of its black-box nature and unknown limiting distributions of parameter estimates, while existing methods require strong assumptions or excessive computation. In this article, we derive one-split and two-split tests relaxing the assumptions and computational complexity of existing black-box tests and extending them to examine the significance of a collection of features of interest in a dataset of possibly a complex type, such as an image. The one-split test estimates and evaluates a black-box model based on estimation and inference subsets through sample splitting and data perturbation. The two-split test further splits the inference subset into two but requires no perturbation. Also, we develop their combined versions by aggregating the p-values based on repeated sample splitting. By deflating the bias-sd-ratio, we establish asymptotic null distributions of the test statistics and the consistency in terms of Type 2 error. Numerically, we demonstrate the utility of the proposed tests on seven simulated examples and six real datasets. Accompanying this article is our python library dnn-inference (https://dnn-inference.readthedocs.io/en/latest/) that implements the proposed tests.
Article
Full-text available
It is a common saying that testing for conditional independence, i.e., testing whether X is independent of Y, given Z, is a hard statistical problem if Z is a continuous random variable. In this paper, we prove that conditional independence is indeed a particularly difficult hypothesis to test for. Statistical tests are required to have a size that is smaller than a predefined significance level, and different tests usually have power against a different class of alternatives. We prove that a valid test for conditional independence does not have power against any alternative. Given the non-existence of a uniformly valid conditional independence test, we argue that tests must be designed so their suitability for a particular problem setting may be judged easily. To address this need, we propose, in the case where X and Y are univariate, to nonlinearly regress X on Z, and Y on Z, and then compute a test statistic based on the sample covariance between the residuals, which we call the generalised covariance measure (GCM). We prove that validity of this form of test relies almost entirely on the weak requirement that the regression procedures are able to estimate the conditional means of X given Z and of Y given Z at a slow rate. We extend the methodology to handle settings where X and Y may be multivariate or even high-dimensional. While our general procedure can be tailored to the setting at hand by combining it with any regression technique, we develop the theoretical guarantees for kernel ridge regression. A simulation study shows that the test based on the GCM is competitive with state-of-the-art conditional independence tests. Code will be available as an R package.
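Because the GCM is fully specified by the abstract (regress X on Z and Y on Z, then studentise the mean of the residual products), a compact version can be written down directly; the random-forest regressions below are merely one admissible choice of regression method.

```python
# Minimal GCM for univariate X and Y; Z is a 2-D array of conditioning
# variables. Any regression estimating the conditional means at a
# sufficiently fast rate can replace the forests.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

def gcm_test(X, Y, Z):
    rx = X - RandomForestRegressor(random_state=0).fit(Z, X).predict(Z)
    ry = Y - RandomForestRegressor(random_state=1).fit(Z, Y).predict(Z)
    R = rx * ry                                    # residual products
    T = np.sqrt(len(R)) * R.mean() / R.std()       # studentised statistic
    return T, 2 * stats.norm.sf(abs(T))            # two-sided p-value
```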
Article
Full-text available
Convolutional Neural Networks (CNNs) have been proven very effective for human demographics estimation by a number of recent studies. However, the proposed solutions significantly vary in different aspects, leaving many open questions on how to choose an optimal CNN architecture and which training strategy to use. In this work, we shed light on some of these questions by improving the existing CNN-based approaches for gender and age prediction and providing practical hints for future studies. In particular, we analyse four important factors of CNN training for gender recognition and age estimation: (1) the target age encoding and loss function, (2) the CNN depth, (3) the need for pretraining, and (4) the training strategy: mono-task or multi-task. As a result, we design state-of-the-art gender recognition and age estimation models evaluated on three popular benchmarks: LFW, MORPH-II and FG-NET. Moreover, our best model won the ChaLearn Apparent Age Estimation Challenge 2016, significantly outperforming the solutions of other participants.
Article
Full-text available
A partial correlation-based variable selection method was proposed for normal linear regression models by Bühlmann, Kalisch and Maathuis (2010) as an alternative to regularization methods for variable selection. This paper addresses issues related to (a) whether the method is sensitive to the normality assumption, and (b) whether the method is valid when the dimension of the predictors increases at an exponential rate in the sample size. To address (a), we study the method for elliptical linear regression models. Our finding indicates that the original proposal can lead to inferior performance when the marginal kurtosis of the predictors is not close to that of the normal distribution, and simulation results confirm this. To ensure the superior performance of the partial correlation-based variable selection procedure, we propose a thresholded partial correlation (TPC) approach to select significant variables in linear regression models. We establish the selection consistency of the TPC in the presence of ultrahigh dimensional predictors. Since the TPC procedure includes the original proposal as a special case, our results address issue (b) directly. As a by-product, the sure screening property of the first step of TPC is obtained. Numerical examples illustrate that the TPC is comparable to the commonly-used regularization methods for variable selection.
Article
Full-text available
Motivated by applications in biological science, we propose a novel test to assess the conditional mean dependence of a response variable on a large number of covariates. Our procedure is built on the martingale difference divergence recently proposed in Shao and Zhang (2014), and it is able to detect a certain type of departure from the null hypothesis of conditional mean independence without making any specific model assumptions. Theoretically, we establish the asymptotic normality of the proposed test statistic under suitable assumptions on the eigenvalues of a Hermitian operator, which is constructed based on the characteristic function of the covariates. These conditions can be simplified under banded dependence structure on the covariates or Gaussian design. To account for heterogeneity within the data, we further develop a testing procedure for conditional quantile independence at a given quantile level and provide an asymptotic justification. Empirically, our test of conditional mean independence delivers comparable results to the competitor, which was constructed under the linear model framework, when the underlying model is linear. It significantly outperforms the competitor when the conditional mean admits a nonlinear form.
Conference Paper
Full-text available
In this paper we tackle the estimation of apparent age in still face images with deep learning. Our convolutional neural networks (CNNs) use the VGG-16 architecture and are pretrained on ImageNet for image classification. In addition, due to the limited number of apparent age annotated images, we explore the benefit of finetuning over crawled Internet face images with available age. We crawled 0.5 million images of celebrities from IMDB and Wikipedia that we make public. This is the largest public dataset for age prediction to date. We pose the age regression problem as a deep classification problem followed by a softmax expected value refinement and show improvements over direct regression training of CNNs. Our proposed method, Deep EXpectation (DEX) of apparent age, first detects the face in the test image and then extracts the CNN predictions from an ensemble of 20 networks on the cropped face. The CNNs of DEX were finetuned on the crawled images and then on the provided images with apparent age annotations. DEX does not use explicit facial landmarks. Our DEX is the winner (1st place) of the ChaLearn LAP 2015 challenge on apparent age estimation with 115 registered teams, significantly outperforming the human reference.
Article
Full-text available
Statistical inference on conditional dependence is essential in many fields, including genetic association studies and graphical models. The classic measures focus on linear conditional correlations and are incapable of characterizing non-linear conditional relationships, including non-monotonic ones. To overcome this limitation, we introduce a nonparametric measure of conditional dependence for multivariate random variables with arbitrary dimensions. Our measure possesses the necessary and intuitive properties of a correlation index. Briefly, it is zero almost surely if and only if two multivariate random variables are conditionally independent given a third random variable. More importantly, the sample version of this measure can be expressed elegantly as the root of a V- or U-process with random kernels and has desirable theoretical properties. Based on the sample version, we propose a test for conditional independence, which is proven to be more powerful than some recently developed tests through our numerical simulations. The advantage of our test is even greater when the relationship between the multivariate random variables given the third random variable cannot be expressed in a linear or monotonic function of one random variable versus the other. We also show that the sample measure is consistent and weakly convergent, and the test statistic is asymptotically normal. By applying our test in a real data analysis, we are able to identify two conditionally associated gene expressions, which otherwise cannot be revealed. Thus, our measure of conditional dependence is not only an ideal concept, but also has important practical utility.
Article
Full-text available
Distance covariance and distance correlation are scalar coefficients that characterize independence of random vectors in arbitrary dimension. Properties, extensions, and applications of distance correlation have been discussed in the recent literature, but the problem of defining the partial distance correlation has remained an open question of considerable interest. The problem of partial distance correlation is more complex than partial correlation partly because the squared distance covariance is not an inner product in the usual linear space. For the definition of partial distance correlation we introduce a new Hilbert space where the squared distance covariance is the inner product. We define the partial distance correlation statistics with the help of this Hilbert space, and develop and implement a test for zero partial distance correlation. Our intermediate results provide an unbiased estimator of squared distance covariance, and a neat solution to the problem of distance correlation for dissimilarities rather than distances.
Article
Full-text available
This paper investigates a computationally simple variant of boosting, L2Boost, which is constructed from a functional gradient descent algorithm with the L2-loss function. Like other boosting algorithms, L2Boost repeatedly applies a pre-chosen fitting method, called the learner, in an iterative fashion. Based on the explicit expression for the refitting of residuals in L2Boost, the case of (symmetric) linear learners is studied in detail for both regression and classification. In particular, with the boosting iteration m working as the smoothing or regularization parameter, a new exponential bias-variance trade-off is found, with the variance (complexity) term increasing very slowly as m tends to infinity. When the learner is a smoothing spline, an optimal rate of convergence result holds for both regression and classification, and the boosted smoothing spline even adapts to higher-order, unknown smoothness. Moreover, a simple expansion of a (smoothed) 0-1 loss function is derived to reveal the importance of the decision boundary, bias reduction, and the impossibility of an additive bias-variance decomposition in classification. Finally, simulation and real data set results are obtained to demonstrate the attractiveness of L2Boost. In particular, we demonstrate that L2Boosting with a novel component-wise cubic smoothing spline is both practical and effective in the presence of high-dimensional predictors.
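A few lines capture the functional gradient descent loop that defines L2Boost: repeatedly fit the learner to the current residuals and take a shrunken step. The stump learner and step size nu below are illustrative choices (the paper emphasizes linear learners such as smoothing splines).

```python
# Sketch of L2Boost: iterative refitting of residuals under squared loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost(X, y, m_stop=200, nu=0.1):
    offset, learners = y.mean(), []
    fit = np.full(len(y), offset)
    for _ in range(m_stop):                     # m acts as the regularizer
        stump = DecisionTreeRegressor(max_depth=1).fit(X, y - fit)
        fit += nu * stump.predict(X)            # small shrunken step
        learners.append(stump)
    return lambda Xn: offset + nu * sum(s.predict(Xn) for s in learners)
```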
Article
Full-text available
We study the asymptotic properties of the adaptive Lasso estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase with the sample size. We consider variable selection using the adaptive Lasso, where the L1 norms in the penalty are re-weighted by data-dependent weights. We show that, if a reasonable initial estimator is available, under appropriate conditions, the adaptive Lasso correctly selects covariates with nonzero coefficients with probability converging to one, and that the estimators of nonzero coefficients have the same asymptotic distribution they would have if the zero coefficients were known in advance. Thus, the adaptive Lasso has an oracle property in the sense of J. Fan and R. Li [J. Am. Stat. Assoc. 96, No. 456, 1348–1360 (2001; Zbl 1073.62547)] and J. Fan and H. Peng [Ann. Stat. 32, No. 3, 928–961 (2004; Zbl 1092.62031)]. In addition, under a partial orthogonality condition in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients, marginal regression can be used to obtain the initial estimator. With this initial estimator, the adaptive Lasso has the oracle property even when the number of covariates is much larger than the sample size.
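The re-weighted L1 penalty admits a standard reduction to an ordinary Lasso by rescaling the design columns, which the sketch below uses; the ridge initial estimator and gamma = 1 are illustrative assumptions rather than prescriptions from the paper.

```python
# Adaptive Lasso via column rescaling: penalize coefficient j by w_j * |beta_j|.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def adaptive_lasso(X, y, alpha=0.1, gamma=1.0, eps=1e-6):
    beta0 = Ridge(alpha=1.0).fit(X, y).coef_      # initial estimator
    w = 1.0 / (np.abs(beta0) ** gamma + eps)      # data-dependent weights
    tilde = Lasso(alpha=alpha).fit(X / w, y)      # ordinary Lasso on X / w
    return tilde.coef_ / w                        # map back to original scale
```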
Article
Full-text available
This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example.
Article
Full-text available
This paper proposes a test for selecting explanatory variables in nonparametric regression. The test does not need to estimate the conditional expectation function given all the variables, but only those which are significant under the null hypothesis. This feature is computationally convenient and solves, in part, the problem of the “curse of dimensionality” when selecting regressors in a nonparametric context. The proposed test statistic is based on functionals of a U-process. Contiguous alternatives, converging to the null at rate n^{-1/2}, can be detected. The asymptotic null distribution of the statistic depends on certain features of the data generating process, and asymptotic tests are difficult to implement except in rare circumstances. We justify the consistency of two easy-to-implement bootstrap tests which exhibit good level accuracy for fairly small samples, according to the reported Monte Carlo simulations. These results are also applicable to test other interesting restrictions on nonparametric curves, like partial linearity and conditional independence.
Article
Many testing problems are readily amenable to randomized tests, such as those employing data splitting. However, despite their usefulness in principle, randomized tests have obvious drawbacks. Firstly, two analyses of the same dataset may lead to different results. Secondly, the test typically loses power because it does not fully utilize the entire sample. As a remedy to these drawbacks, we study how to combine the test statistics or p-values resulting from multiple random realizations, such as through random data splits. We develop rank-transformed subsampling as a general method for delivering large-sample inference about the combined statistic or p-value under mild assumptions. We apply our methodology to a wide range of problems, including testing unimodality in high-dimensional data, testing goodness-of-fit of parametric quantile regression models, testing no direct effect in a sequentially randomized trial and calibrating cross-fit double machine learning confidence intervals. In contrast to existing p-value aggregation schemes that can be highly conservative, our method enjoys Type I error control that asymptotically approaches the nominal level. Moreover, compared to using the ordinary subsampling, we show that our rank transform can remove the first-order bias in approximating the null under alternatives and greatly improve power.
Thesis
This thesis concerns the ubiquitous statistical problem of variable significance testing. The first chapter contains an account of classical approaches to variable significance testing, including different perspectives on how to formalise the notion of 'variable significance'. The historical development is contrasted with more recent methods that are adapted both to the scale of modern datasets and to the power of advanced machine learning techniques. This chapter also includes a description of and motivation for the theoretical framework that permeates the rest of the thesis: providing theoretical guarantees that hold uniformly over large classes of distributions. The second chapter deals with testing the null that Y ⊥ X | Z where X and Y take values in separable Hilbert spaces, with a focus on applications to functional data. The first main result of the chapter shows that for functional data it is impossible to construct a non-trivial test for conditional independence even when assuming that the data are jointly Gaussian. A novel regression-based test, called the Generalised Hilbertian Covariance Measure (GHCM), is presented, and theoretical guarantees for uniform asymptotic Type I error control are provided, with the key assumption requiring that the product of the mean squared errors of regressing Y on Z and X on Z converges faster than n^{-1}, where n is the sample size. A power analysis is conducted under the same assumptions to illustrate that the test has uniform power over local alternatives where the expected conditional covariance operator has a Hilbert–Schmidt norm going to 0 at a √n-rate. The chapter also contains extensive empirical evidence in the form of simulations demonstrating the validity and power properties of the test. The usefulness of the test is demonstrated by using the GHCM to construct confidence intervals for the boundary point in a truncated functional linear model and to detect edges in a graphical model for an EEG dataset. The third and final chapter analyses the problem of nonparametric variable significance testing by testing for conditional mean independence, that is, testing the null that E(Y | X, Z) = E(Y | Z) for real-valued Y. A test, called the Projected Covariance Measure (PCM), is derived by considering a family of studentised test statistics and choosing a member of this family in a data-driven way that balances robustness and power properties of the resulting test. The test is regression-based and is computed by splitting a set of observations of (X, Y, Z) into two sets of equal size, where one half is used to learn a projection of Y onto X and Z (nonparametrically) and the second half is used to test for vanishing expected conditional correlation, given Z, between the projection and Y. The chapter contains general conditions that ensure uniform asymptotic Type I error control of the resulting test by imposing conditions on the mean squared error of the involved regressions. A modification of the PCM using additional sample splitting and employing spline regression is shown to achieve the minimax optimal separation rate between null and alternative under Hölder smoothness assumptions on the regression functions and the conditional density of X given Z = z. The chapter also shows through simulation studies that the test maintains the strong Type I error control of methods like the Generalised Covariance Measure (GCM) but has power against a broader class of alternatives.
Article
A classical result indicates that the arithmetic average of p-values multiplied by the factor of 2 is a valid p-value under arbitrary dependence among p-values. Moreover, this constant factor cannot be improved in general without additional assumptions. Given this classical result, we study the average of p-values under exchangeability, which is a natural generalization of the i.i.d. assumption. Somewhat surprisingly, we prove that exchangeability is not enough to improve the constant factor of 2. This negative result motivates us to explore other conditions under which it is possible to obtain a smaller constant factor. Finally, we discuss certain benefits of the average of p-values over the average of statistics in terms of statistical power and provide empirical results that verify our theoretical findings.
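The valid combination rule discussed above is a one-liner; it is shown here for concreteness.

```python
# Twice the arithmetic mean of p-values is a valid p-value under
# arbitrary dependence among the individual p-values.
import numpy as np

def avg_combine(pvals):
    return min(1.0, 2.0 * float(np.mean(pvals)))
```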
Article
Empirical researchers are increasingly faced with rich data sets containing many controls or instrumental variables, making it essential to choose an appropriate approach to variable selection. In this paper, we provide results for valid inference after post- or orthogonal L2-boosting is used for variable selection. We consider treatment effects after selecting among many control variables and instrumental variable models with potentially many instruments. To achieve this, we establish new results for the rate of convergence of iterated post-L2-boosting and orthogonal L2-boosting in a high-dimensional setting similar to Lasso, i.e., under approximate sparsity without assuming the beta-min condition. These results are extended to the 2SLS framework and valid inference is provided for treatment effect analysis. We give extensive simulation results for the proposed methods and compare them with Lasso. In an empirical application, we construct efficient IVs with our proposed methods to estimate the effect of pre-merger overlap of bank branch networks in the US on the post-merger stock returns of the acquirer bank.
Article
In this article, we test for the effects of high-dimensional covariates on the response. In many applications, different components of covariates usually exhibit various levels of variation, which is ubiquitous in high-dimensional data. To simultaneously accommodate such heteroscedasticity and high dimensionality, we propose a novel test based on an aggregation of the marginal cumulative covariances, requiring no prior information on the specific form of regression models. Our proposed test statistic is scale-invariant, tuning-free and convenient to implement. The asymptotic normality of the proposed statistic is established under the null hypothesis. We further study the asymptotic relative efficiency of our proposed test with respect to state-of-the-art universal tests in two different settings: one is designed for the high-dimensional linear model and the other is introduced in a completely model-free setting. A remarkable finding reveals that, thanks to the scale-invariance property, even under high-dimensional linear models, our proposed test is asymptotically much more powerful than existing competitors for covariates with heterogeneous variances while maintaining high efficiency for homoscedastic ones. Supplementary materials for this article are available online.
Article
In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response — in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.
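A hedged plug-in illustration of the predictiveness-contrast definition follows: the importance of a feature group S is the drop in cross-fitted R² when S is removed. This naive plug-in is not the efficient one-step estimator with valid confidence intervals that the abstract describes; the forest learner and 5-fold scheme are assumptions of the sketch.

```python
# Plug-in sketch of algorithm-agnostic variable importance for a group S
# of column indices: contrast of cross-fitted predictiveness (R^2).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def importance_r2(X, y, S):
    pred_full = cross_val_predict(
        RandomForestRegressor(random_state=0), X, y, cv=5)
    X_red = np.delete(X, S, axis=1)               # drop the group of interest
    pred_red = cross_val_predict(
        RandomForestRegressor(random_state=0), X_red, y, cv=5)
    r2 = lambda p: 1 - np.mean((y - p) ** 2) / np.var(y)
    return r2(pred_full) - r2(pred_red)           # predictiveness contrast
```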
Article
Recent results in nonparametric regression show that deep learning, that is, neural network estimates with many hidden layers, are able to circumvent the so-called curse of dimensionality in case that suitable restrictions on the structure of the regression function hold. One key feature of the neural networks used in these results is that their network architecture has a further constraint, namely the network sparsity. In this paper, we show that we can get similar results also for least squares estimates based on simple fully connected neural networks with ReLU activation functions. Here, either the number of neurons per hidden layer is fixed and the number of hidden layers tends to infinity suitably fast for sample size tending to infinity, or the number of hidden layers is bounded by some logarithmic factor in the sample size and the number of neurons per hidden layer tends to infinity suitably fast for sample size tending to infinity. The proof is based on new approximation results concerning deep neural networks.
Article
We propose the holdout randomization test (HRT), an approach to feature selection using black box predictive models. The HRT is a specialized version of the conditional randomization test (CRT) (Candes et al., 2018) that uses data splitting for feasible computation. The HRT works with any predictive model and produces a valid p-value for each feature. To make the HRT more practical, we propose a set of extensions to maximize power and speed up computation. In simulations, these extensions lead to greater power than a competing knockoffs-based approach, without sacrificing control of the error rate. We apply the HRT to two case studies from the scientific literature where heuristics were originally used to select important features for predictive models. The results illustrate how such heuristics can be misleading relative to principled methods like the HRT. Code is available at https://github.com/tansey/hrt.
Article
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often sub‐optimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data‐generating mechanism. Specifically, we discuss a generalization of the ANOVA variable importance measure, and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.
Article
We propose a novel estimator of error variance and establish its asymptotic properties based on ridge regression and random matrix theory. The proposed estimator is valid under both low- and high-dimensional models, and performs well not only in nonsparse cases, but also in sparse ones. The finite-sample performance of the proposed method is assessed through an intensive numerical study, which indicates that the method is promising compared with its competitors in many interesting scenarios.
Article
This paper proposes general methods for the problem of multiple testing of a single hypothesis, with a standard goal of combining a number of p-values without making any assumptions about their dependence structure. A result by Rüschendorf (1982) and, independently, Meng (1993) implies that the p-values can be combined by scaling up their arithmetic mean by a factor of 2, and no smaller factor is sufficient in general. A similar result by Mattner about the geometric mean replaces 2 by e. Based on more recent developments in mathematical finance, specifically, robust risk aggregation techniques, we extend these results to generalized means; in particular, we show that K p-values can be combined by scaling up their harmonic mean by a factor of log K asymptotically as K tends to infinity. This leads to a generalized version of the Bonferroni–Holm procedure. We also explore methods using weighted averages of p-values. Finally, we discuss the efficiency of various methods of combining p-values and how to choose a suitable method in light of data and prior information.
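The harmonic-mean rule stated in the abstract is equally short; note that the log K factor is an asymptotic result, so the snippet below should be read as a heuristic at small K.

```python
# K p-values combined by scaling their harmonic mean by log K (asymptotic).
import numpy as np

def harmonic_combine(pvals):
    p = np.asarray(pvals, dtype=float)
    hm = len(p) / np.sum(1.0 / p)                 # harmonic mean
    return min(1.0, np.log(len(p)) * hm)
```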
Article
Methods for the construction of hypothesis tests based on multiple data splitting are presented. The tests combine p-values and exhibit overall Type 1 error control, but it is also shown that multiple data splitting may have worse power than single splitting.
Article
Assuming that a smoothness condition and a suitable restriction on the structure of the regression function hold, it is shown that least squares estimates based on multilayer feedforward neural networks are able to circumvent the curse of dimensionality in nonparametric regression. The proof is based on new approximation results concerning multilayer feedforward neural networks with bounded weights and a bounded number of hidden neurons. The estimates are compared with various other approaches by using simulated data.
Article
Combining individual p-values to aggregate multiple small effects has a long-standing interest in statistics, dating back to the classic Fisher's combination test. In modern large-scale data analysis, correlation and sparsity are common features and efficient computation is a necessary requirement for dealing with massive data. To overcome these challenges, we propose a new test that takes advantage of the Cauchy distribution. Our test statistic has a simple form and is defined as a weighted sum of Cauchy transformation of individual p-values. We prove a non-asymptotic result that the tail of the null distribution of our proposed test statistic can be well approximated by a Cauchy distribution under arbitrary dependency structures. Based on this theoretical result, the p-value calculation of our proposed test is not only accurate, but also as simple as the classic z-test or t-test, making our test well suited for analyzing massive data. We further show that the power of the proposed test is asymptotically optimal in a strong sparsity setting. Extensive simulations demonstrate that the proposed test has both strong power against sparse alternatives and a good accuracy with respect to p-value calculations, especially for very small p-values. The proposed test has also been applied to a genome-wide association study of Crohn's disease and compared with several existing tests.
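The Cauchy combination statistic has a closed form that can be transcribed directly from the abstract's description; equal weights are used below for simplicity.

```python
# Cauchy combination test: a weighted sum of Cauchy-transformed p-values,
# calibrated against the standard Cauchy tail.
import numpy as np
from scipy import stats

def cauchy_combine(pvals, weights=None):
    p = np.asarray(pvals, dtype=float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights)
    T = np.sum(w * np.tan((0.5 - p) * np.pi))
    return stats.cauchy.sf(T)                     # approximate combined p-value
```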
Article
Automatic gender and age prediction has become relevant to an increasing number of applications, particularly with the rise of social platforms and social media. However, the performance of existing methods on real-world images is still not as satisfactory as expected, especially when compared to that of face recognition. The reason is that facial images for gender and age prediction have inherently small inter-class and large intra-class differences; for instance, two images with different skin colors and the same age category label exhibit a large intra-class difference. However, most existing methods have not constructed representations discriminative enough to capture these inherent characteristics. In this paper, a method based on multi-stage learning is proposed: The first stage marks the object regions with an encoder-decoder based segmentation network. Specifically, the segmentation network classifies each pixel into two classes, “people” and others, and only the “people” regions are used for subsequent processing. The second stage precisely predicts the gender and age information with the proposed prediction network, which encodes global information, local region information and the interactions among different local regions into the final representation, and then finalizes the prediction. Additionally, we evaluate our method on three public and challenging datasets, and the experimental results verify the effectiveness of our proposed method.
Article
Testing a hypothesis for high-dimensional regression coefficients is of fundamental importance in statistical theory and applications. In this paper, we develop a new test for the overall significance of coefficients in high-dimensional linear regression models based on an estimated U-statistic of order two. With the aid of the martingale central limit theorem, we prove that the asymptotic distributions of the proposed test statistic are normal under two different distribution assumptions. Refitted cross-validation (RCV) variance estimation is utilized to avoid overestimation of the variance and enhance the empirical power. We examine the finite-sample performance of the proposed test via Monte Carlo simulations, which show that the new test based on the RCV estimator achieves higher power, especially in sparse cases. We also demonstrate an application by an empirical analysis of a microarray data set on Yorkshire gilts.
Article
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to log n-factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential parameters being much bigger than the sample size. The analysis gives some insights why multilayer feedforward neural networks perform well in practice. Interestingly, the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that scaling the network depth with the logarithm of the sample size is natural.
Article
We revisit the classic semiparametric problem of inference on a low-dimensional parameter θ_0 in the presence of high-dimensional nuisance parameters η_0. We depart from the classical setting by allowing for η_0 to be so high-dimensional that the traditional assumptions, such as Donsker properties, that limit complexity of the parameter space for this object break down. To estimate η_0, we consider the use of statistical or machine learning (ML) methods, which are particularly well-suited to estimation in modern, very high-dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating η_0 cause a heavy bias in estimators of θ_0 that are obtained by naively plugging ML estimators of η_0 into estimating equations for θ_0. This bias results in the naive estimator failing to be N^{-1/2}-consistent, where N is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest θ_0 can be removed by using two simple, yet critical, ingredients: (1) using Neyman-orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate θ_0, and (2) making use of cross-fitting, which provides an efficient form of data-splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in an N^{-1/2}-neighborhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements. The generic statistical theory of DML is elementary and simultaneously relies on only weak theoretical requirements, which will admit the use of a broad array of modern ML methods for estimating the nuisance parameters, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by applying it to provide theoretical properties of DML applied to learn the main regression parameter in a partially linear regression model, DML applied to learn the coefficient on an endogenous variable in a partially linear instrumental variables model, DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness, and DML applied to learn the local average treatment effect in an instrumental variables setting. In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples.
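For the partially linear model Y = D·θ_0 + g(X) + ε, the two DML ingredients (a Neyman-orthogonal score plus cross-fitting) reduce to residual-on-residual regression, sketched below; the forest learners and the use of two folds are illustrative assumptions.

```python
# DML sketch for the partially linear model: cross-fit the nuisances
# E[Y|X] and E[D|X], then regress Y-residuals on D-residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(Y, D, X, n_folds=2, seed=0):
    ry = np.zeros(len(Y))
    rd = np.zeros(len(Y))
    for tr, te in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        ry[te] = Y[te] - RandomForestRegressor(random_state=0).fit(
            X[tr], Y[tr]).predict(X[te])
        rd[te] = D[te] - RandomForestRegressor(random_state=0).fit(
            X[tr], D[tr]).predict(X[te])
    theta = np.sum(rd * ry) / np.sum(rd * rd)     # residual-on-residual slope
    eps = ry - theta * rd
    se = np.sqrt(np.mean(rd**2 * eps**2) / len(Y)) / np.mean(rd**2)
    return theta, se                              # estimate and standard error
```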
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
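A minimal usage example of the system via its Python package follows (assuming xgboost is installed; the toy data and hyperparameter values are arbitrary).

```python
# Fit an XGBoost regressor on synthetic data and report held-out MSE.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] ** 2 + rng.normal(size=500)

model = xgb.XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X[:400], y[:400])
print(np.mean((y[400:] - model.predict(X[400:])) ** 2))  # held-out MSE
```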
Article
We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows construction of prediction bands for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite sample marginal coverage even when the assumptions do not hold. We analyze and compare, both empirically and theoretically, two major variants of our conformal procedure: the full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called rank-one-out conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with varying local width, in order to adapt to heteroskedasticity in the data distribution. Lastly, we propose a model-free notion of variable importance, called leave-one-covariate-out or LOCO inference. Accompanying our paper is an R package conformalInference that implements all of the proposals we have introduced. In the spirit of reproducibility, all empirical results in this paper can be easily (re)generated using this package.
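Split conformal inference, one of the variants compared above, is short enough to sketch in full; the forest regressor is an arbitrary choice, since the finite-sample coverage guarantee comes from the rank of the calibration residuals, not from the model.

```python
# Split conformal prediction band: fit on one half, calibrate on the other.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal(X, y, X_new, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    fit_idx, cal_idx = idx[: len(y) // 2], idx[len(y) // 2 :]
    model = RandomForestRegressor(random_state=0).fit(X[fit_idx], y[fit_idx])
    resid = np.sort(np.abs(y[cal_idx] - model.predict(X[cal_idx])))
    k = int(np.ceil((1 - alpha) * (len(cal_idx) + 1)))   # conformal rank
    q = resid[min(k, len(resid)) - 1]                    # calibration quantile
    mu = model.predict(X_new)
    return mu - q, mu + q                                # (1 - alpha) band
```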
Working Paper
Local smoothing testing based on multivariate non-parametric regression estimation is one of the main model checking methodologies in the literature. However, the relevant tests suffer from the typical curse of dimensionality, resulting in slow rates of convergence to their limits under the null hypothesis and a limited ability to detect deviations from the null under alternative hypotheses. This problem prevents tests from maintaining the level of significance well and makes tests less sensitive to alternative hypotheses. In this paper, a model adaptation concept in lack-of-fit testing is introduced and a dimension reduction model-adaptive test procedure is proposed for parametric single-index models. The test behaves like a local smoothing test, as if the model were univariate. It is consistent against any global alternative hypothesis and can detect local alternative hypotheses distinct from the null hypothesis at a fast rate that existing local smoothing tests can achieve only when the model is univariate. Simulations are conducted to examine the performance of our methodology. An analysis of real data is shown for illustration. The method can be readily extended to global smoothing methodology and other testing problems.
Article
A dimension reduction-based adaptive-to-model test is proposed for significance of a subset of covariates in the context of a nonparametric regression model. Unlike existing local smoothing significance tests, the new test behaves like a local smoothing test as if the number of covariates were just that under the null hypothesis, and it can detect local alternatives distinct from the null at a rate that is only related to the number of covariates under the null hypothesis. Thus, the curse of dimensionality is largely alleviated when nonparametric estimation is inevitably required. In the cases where there are many insignificant covariates, the improvement of the new test is very significant over existing local smoothing tests on the significance level maintenance and power enhancement. Simulation studies and a real data analysis are conducted to examine the finite sample performance of the proposed test.
Article
We introduce the partial martingale difference correlation, a scalar-valued measure of conditional mean dependence of Y given X, adjusting for the nonlinear dependence on Z, where X, Y and Z are random vectors of arbitrary dimensions. At the population level, partial martingale difference correlation is a natural extension of partial distance correlation developed recently by Szekely and Rizzo (14), which characterizes the dependence of Y and X after controlling for the nonlinear effect of Z. It extends the martingale difference correlation first introduced in Shao and Zhang (10) just as partial distance correlation extends the distance correlation in Szekely, Rizzo and Bakirov (13). Sample partial martingale difference correlation is also defined, building on some new results on equivalent expressions of sample martingale difference correlation. Numerical results demonstrate the effectiveness of these new dependence measures in the context of variable selection and dependence testing.
Article
The residual variance and the proportion of explained variation are important quantities in many statistical models and model fitting procedures. They play an important role in regression diagnostics and model selection procedures, as well as in determining the performance limits in many problems. In this paper we propose new method-of-moments-based estimators for the residual variance, the proportion of explained variation and other related quantities, such as the ℓ2 signal strength. The proposed estimators are consistent and asymptotically normal in high-dimensional linear models with Gaussian predictors and errors, where the number of predictors d is proportional to the number of observations n; in fact, consistency holds even in settings where d/n → ∞. Existing results on residual variance estimation in high-dimensional linear models depend on sparsity in the underlying signal. Our results require no sparsity assumptions and imply that the residual variance and the proportion of explained variation can be consistently estimated even when d>n and the underlying signal itself is nonestimable. Numerical work suggests that some of our distributional assumptions may be relaxed. A real-data analysis involving gene expression data and single nucleotide polymorphism data illustrates the performance of the proposed methods.
Article
In this article, we propose a new metric, the so-called martingale difference correlation, to measure the departure from conditional mean independence between a scalar response variable V and a vector predictor variable U. Our metric is a natural extension of distance correlation proposed by Székely, Rizzo, and Bakirov, which is used to measure the dependence between V and U. The martingale difference correlation and its empirical counterpart inherit a number of desirable features of distance correlation and sample distance correlation, such as algebraic simplicity and elegant theoretical properties. We further use martingale difference correlation as a marginal utility for high-dimensional variable screening, to screen out variables that do not contribute to the conditional mean of the response given the covariates. An extension to conditional quantile screening is also described in detail, and sure screening properties are rigorously justified. Both simulation results and real data illustrations demonstrate the effectiveness of martingale difference correlation-based screening procedures in comparison with existing counterparts. Supplementary materials for this article are available online.
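As a usage illustration, here is a hedged sketch of the screening idea: rank each candidate predictor by its marginal sample martingale difference divergence with the response and keep the top few. The statistic is the same plug-in V-statistic as above, specialized to a scalar predictor; function names and the top-k rule are ours, and the paper's normalized (correlation) version and screening thresholds are omitted.

```python
import numpy as np

def mdd_sq(v, u):
    # plug-in V-statistic: -n^{-2} sum (v_k - vbar)(v_l - vbar) |u_k - u_l|
    vc = v - v.mean()
    return -(np.outer(vc, vc) * np.abs(u[:, None] - u[None, :])).mean()

def mdd_screen(Y, X, keep):
    """Rank the columns of X by marginal sample MDD with Y; keep the top `keep`."""
    scores = np.array([mdd_sq(Y, X[:, j]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:keep], scores

rng = np.random.default_rng(2)
n, p = 300, 50
X = rng.normal(size=(n, p))
# the conditional mean of Y depends on columns 3 and 7 only, nonlinearly
Y = 2 * X[:, 3] ** 2 + np.sin(X[:, 7]) + 0.5 * rng.normal(size=n)
top, _ = mdd_screen(Y, X, keep=5)
print(top)   # columns 3 and 7 should rank near the top
```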
Article
We propose a novel technique to boost the power of testing a high-dimensional vector H₀: θ = 0 against sparse alternatives under which the null hypothesis is violated by only a few components. Existing tests based on quadratic forms, such as the Wald statistic, often suffer from low power due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives, such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or the bootstrap to derive the null distribution, and often suffer from size distortions due to slow convergence. Based on a screening technique, we introduce a "power enhancement component", which is zero under the null hypothesis with high probability but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions and is completely determined by that of the pivotal statistic. As a byproduct, the power enhancement component also consistently identifies the components that violate the null hypothesis. As specific applications, the proposed methods are applied to testing factor pricing models and validating cross-sectional independence in panel data models.
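The construction can be illustrated schematically. In the toy setting below we observe independent estimates θ̂_j ~ N(θ_j, 1/n); a pivotal quadratic-form statistic J1 is combined with a screening component J0 that is zero with high probability under the null but diverges under sparse alternatives. The threshold and normalizations here are illustrative choices of ours, not the paper's.

```python
import numpy as np

def power_enhanced_stat(theta_hat, n):
    """Schematic power-enhanced statistic for H0: theta = 0, given
    theta_hat_j ~ N(theta_j, 1/n) independently across j = 1..d.
    J1: standardized quadratic form, approximately N(0,1) under H0.
    J0: screening component, zero w.h.p. under H0, diverging under
        sparse alternatives (threshold is illustrative)."""
    d = len(theta_hat)
    chi = n * theta_hat ** 2                       # chi^2_1 terms under H0
    J1 = (chi.sum() - d) / np.sqrt(2 * d)          # pivotal part
    # keep only coordinates exceeding a high threshold, rarely crossed under H0
    delta = np.sqrt(2 * np.log(d) * np.log(np.log(d))) / np.sqrt(n)
    S = np.abs(theta_hat) > delta
    J0 = np.sqrt(d) * n * np.sum(theta_hat[S] ** 2)
    return J1 + J0, S   # S also flags which coordinates violate H0

rng = np.random.default_rng(3)
n, d = 400, 1000
null = rng.normal(size=d) / np.sqrt(n)   # H0 true
sparse = null.copy()
sparse[:3] += 0.5                        # three violating coordinates
print(power_enhanced_stat(null, n)[0])   # on the N(0,1) scale
print(power_enhanced_stat(sparse, n)[0]) # very large
print(power_enhanced_stat(sparse, n)[1][:5])  # first coordinates flagged
```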
Article
We consider testing regression coefficients in high-dimensional generalized linear models. An investigation of the test of Goeman et al. (2011) reveals that if the inverse of the link function is unbounded, high dimensionality in the covariates can adversely affect the power of the test. We propose a test formulation that avoids this adverse impact of high dimensionality. When the inverse of the link function is bounded, as in logistic or probit regression, the proposed test is as good as that of Goeman et al. (2011). The proposed tests provide p-values for testing the significance of gene sets, as demonstrated in a case study on an acute lymphoblastic leukemia dataset.
Article
This survey collects developments in goodness-of-fit testing for regression models during the last 20 years, from the first proposals based on ideas from density and distribution testing to the most recent advances for complex data and models. Far from being exhaustive, the paper focuses on two main classes of test statistics: smoothing-based (kernel-based) tests and tests based on empirical regression processes, although other tests based on maximum likelihood ideas are also considered. Starting from the simplest case of testing a parametric family for the regression curves, the contributions in this field also provide testing procedures for semiparametric, nonparametric, and functional models, dealing as well with more complex settings such as those involving dependent or incomplete data.
Article
The applicability of Pearson correlation as a measure of explained variance is by now well understood. One of its limitations is that it does not account for asymmetry in explained variance. Aiming to develop broadly applicable correlation measures, we study a pair of generalized measures of correlation (GMC) which deal with asymmetries in explained variances and with linear or nonlinear relations between random variables. We present examples under which the paired measures are identical and become a symmetric correlation measure equal to the squared Pearson correlation coefficient; as a result, Pearson correlation is a special case of GMC. Theoretical properties of GMC show that it is applicable in numerous settings and can lead to more meaningful conclusions and decision making. For statistical inference, the joint asymptotics of the kernel-based estimators for GMC are derived and used to test whether two random variables are symmetric in explaining variances. The testing results give important guidance in practical model selection problems. The efficiency of the test statistics is illustrated in simulation examples. In real data analysis, we present an important application of GMC to explained variances and market movements among three important economic and financial monetary indicators.
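Recall that GMC(Y|X) = 1 − E{(Y − E(Y|X))²}/Var(Y), the proportion of the variance of Y explained by the conditional mean, and that the pair (GMC(Y|X), GMC(X|Y)) is generally asymmetric. The sketch below estimates GMC with a leave-one-out Nadaraya-Watson regression; the bandwidth rule and function names are illustrative choices of ours, not the paper's kernel estimators.

```python
import numpy as np

def nw_fit(x, y, h):
    """Leave-one-out Nadaraya-Watson estimate of E(y|x) at the sample points."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)   # leave-one-out to reduce overfitting
    return (K @ y) / K.sum(axis=1)

def gmc(y, x, h=None):
    """GMC(y|x) = 1 - E{(y - E(y|x))^2} / Var(y), via kernel regression."""
    h = h or 1.06 * x.std() * len(x) ** (-1 / 5)   # rule-of-thumb bandwidth
    resid = y - nw_fit(x, y, h)
    return 1.0 - np.mean(resid ** 2) / y.var()

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
y = x ** 2 + 0.3 * rng.normal(size=n)   # y is a noisy function of x
print(gmc(y, x))   # high: x explains most of the variance of y
print(gmc(x, y))   # near zero, since E(x|y) = 0 by symmetry: asymmetry
```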
Article
We propose simultaneous tests for coefficients in high-dimensional linear regression models with factorial designs. The proposed tests are designed for "large p, small n" situations where the conventional F-test is no longer applicable. We derive the asymptotic distribution of the proposed test statistic under the high-dimensional null hypothesis and under various scenarios of alternatives, which allows power evaluations. We also evaluate the power of the F-test for models of moderate dimension. The proposed tests are employed to analyze a microarray dataset on Yorkshire gilts to find gene ontology terms that are significantly associated with thyroid hormone after accounting for the design of the experiment.
Article
Variance estimation is a fundamental problem in statistical modelling. In ultrahigh-dimensional linear regression, where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable, but recent advances in variable selection make the problem accessible. One of the major problems in ultrahigh-dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noise is partly predicted when extra irrelevant variables are selected, leading to serious underestimation of the noise level. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows the mean regression function in advance. Simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared; their performance can be improved by the proposed refitted cross-validation method.
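The two-stage idea can be sketched in a few lines: split the sample in half, select variables on one half, refit OLS with only the selected variables on the other half, estimate the variance from those refitted residuals, then swap the halves and average. In the sketch below the stage-one selector is simple marginal-correlation screening, an illustrative stand-in for the lasso or SCAD selectors studied in the paper; function names are ours.

```python
import numpy as np

def rcv_variance(X, y, keep=10, rng=None):
    """Refitted cross-validation estimate of the error variance.
    Stage one: screening (here, top-`keep` columns by |marginal correlation|,
    an illustrative selector). Stage two: OLS refit of the selected columns
    on the *other* half. The two split roles are swapped and averaged."""
    rng = rng or np.random.default_rng()
    n = len(y)
    idx = rng.permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    est = []
    for a, b in (halves, halves[::-1]):
        # select on half a
        corr = np.abs((X[a] - X[a].mean(0)).T @ (y[a] - y[a].mean()))
        sel = np.argsort(corr)[::-1][:keep]
        # refit on half b with only the selected columns
        Xb = np.column_stack([np.ones(len(b)), X[b][:, sel]])
        resid = y[b] - Xb @ np.linalg.lstsq(Xb, y[b], rcond=None)[0]
        est.append(np.sum(resid ** 2) / (len(b) - Xb.shape[1]))
    return np.mean(est)

rng = np.random.default_rng(5)
n, p = 400, 2000                                       # ultrahigh dimension: p >> n
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # true sigma^2 = 1
print(rcv_variance(X, y, keep=10, rng=rng))            # close to 1
```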
Article
This paper presents a framework for comparing bivariate distributions according to their degree of regression dependence. We introduce the general concept of a regression dependence order (RDO). In addition, we define a new nonparametric measure of regression dependence and study its properties. Besides being monotone in the new RDOs, the measure takes its extreme values precisely at independence and at almost sure functional dependence, respectively. A consistent nonparametric estimator of the new measure is constructed and its asymptotic properties are investigated. Finally, the finite sample properties of the estimator are studied by means of a small simulation study.
Article
In many situations regression analysis is mostly concerned with inference about the conditional mean of the response given the predictors, and less concerned with other aspects of the conditional distribution. In this paper we develop dimension reduction methods that incorporate this consideration. We introduce the notion of the Central Mean Subspace (CMS), a natural inferential object for dimension reduction when the mean function is of interest. We study properties of the CMS and develop methods to estimate it. These methods include a new class of estimators which require fewer conditions than pHd and which display a clear advantage when one of the conditions for pHd is violated. The CMS also reveals a transparent distinction among existing methods for dimension reduction: OLS, pHd, SIR and SAVE. We apply the new methods to a data set involving recumbent cows.
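As a concrete special case, under a linearity condition on the predictor distribution (satisfied, for example, by normal predictors) the population OLS vector Σ⁻¹Cov(X, Y) lies in the CMS, so a first CMS direction can be estimated by ordinary least squares. A minimal sketch of this, with illustrative data of our own, is below; pHd, SIR and SAVE would be needed to recover symmetric dependence that OLS misses.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 2000, 6
b = np.array([1.0, -1.0, 0, 0, 0, 0]) / np.sqrt(2)  # true CMS direction
X = rng.normal(size=(n, p))                          # normal X: linearity condition holds
Y = np.exp(X @ b) + 0.5 * rng.normal(size=n)         # E(Y|X) depends on X only via b'X

# OLS direction: Sigma^{-1} Cov(X, Y); under the linearity condition the
# population version of this vector lies in the central mean subspace
Xc = X - X.mean(0)
ols_dir = np.linalg.solve(Xc.T @ Xc / n, Xc.T @ (Y - Y.mean()) / n)
ols_dir /= np.linalg.norm(ols_dir)
print(np.abs(ols_dir @ b))   # close to 1: the estimated direction aligns with b
```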