
A Reminder of the Fallibility of the Wald Statistic

Abstract

Computer programs often produce a parameter estimate $\hat{\theta}$ and estimated variance $\widehat{\mathrm{var}}(\hat{\theta})$. Thus it is easy to compute a Wald statistic $(\hat{\theta} - \theta_0)\{\widehat{\mathrm{var}}(\hat{\theta})\}^{-1/2}$ to test the null hypothesis $\theta = \theta_0$. Hauck and Donner and Vaeth have identified situations in which the Wald statistic has poor power. We consider another example that is not in the classes discussed by those authors. We present data from a balanced one-way random effects analysis of variance (ANOVA) that illustrate the poor power of the Wald statistic compared to the usual F test. In this example the parameter of interest is the variance of the random effect. The power of the Wald test depends on the parameterization used, however, and a whole family of Wald statistics with p values ranging from 0 to 1 can be generated with power transformations of the random effect parameter.
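The contrast between the Wald statistic and the F test can be sketched numerically. The Python below is an illustration only, not the article's own computation: it assumes the ANOVA (method-of-moments) estimator of the random-effect variance, (MSB - MSW)/n, with a plug-in variance estimate, and the function name `wald_and_f` and all numbers are hypothetical.

```python
import math


def wald_and_f(msb, msw, a, n):
    """Wald z and F for H0: sigma_a^2 = 0 in a balanced one-way random effects ANOVA.

    msb, msw: between- and within-group mean squares (hypothetical inputs)
    a: number of groups, n: observations per group
    Assumes the moment estimator (msb - msw) / n and a plug-in variance.
    """
    sigma_a2_hat = (msb - msw) / n
    # Var(MSB) = 2 E[MSB]^2 / (a - 1), Var(MSW) = 2 sigma^4 / (a (n - 1)),
    # with the unknown expectations replaced by the observed mean squares.
    var_hat = (2.0 / n**2) * (msb**2 / (a - 1) + msw**2 / (a * (n - 1)))
    z = sigma_a2_hat / math.sqrt(var_hat)
    f = msb / msw  # referred to an F(a - 1, a (n - 1)) distribution
    return z, f


if __name__ == "__main__":
    a, n, msw = 3, 10, 1.0
    for msb in (2.0, 10.0, 100.0, 10000.0):
        z, f = wald_and_f(msb, msw, a, n)
        print(f"MSB={msb:>8.1f}  F={f:>8.1f}  Wald z={z:.4f}")
```

Under these assumptions the Wald statistic is bounded above by $\sqrt{(a-1)/2}$, which equals 1 for three groups. It therefore can never reach the one-sided 5% normal critical value of 1.645, however large F becomes.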
... This means that for a larger total number of nonlinear model parameters to estimate, the cut-line of the profile function will be higher, and the resulting profile confidence interval will be wider (i.e., more conservative), thereby reflecting the penalty for estimating a larger number of parameters. The interested reader can visualize these results by examining Fig. 2. Other works (Fears et al. 1996; Pawitan 2000) have highlighted Wald confidence interval anomalies; however, their examples have been less clear-cut and have been based on the approximate likelihood test. Focusing on variance component estimation in the one-way random effects analysis of variance (ANOVA) model, these works underscore the inadequacy of the Wald method using both simulation and this approximate or large-sample likelihood approach. ...
Article
The use of nonlinear statistical methods and models is ubiquitous in scientific research. However, these methods may not be fully understood, and as demonstrated here, commonly reported parameter p-values and confidence intervals may be inaccurate. The gentle introduction to nonlinear regression modelling and comprehensive illustrations given here provide applied researchers with the needed overview and tools to appreciate the nuances and breadth of these important methods. Since these methods build upon topics covered in first and second courses in applied statistics and predictive modelling, the target audience includes practitioners and students alike. To guide practitioners, we summarize, illustrate, develop, and extend nonlinear modelling methods, underscore caveats of Wald statistics using basic illustrations, and give key reasons for preferring likelihood methods. Parameter profiling in multiparameter models and exact or near-exact versus approximate likelihood methods are discussed, and curvature measures are connected with the failure of the Wald approximations regularly used in statistical software. The discussion in the main paper has been kept at an introductory level and can be covered on a first reading; additional details given in the Appendices can be worked through upon further study. The associated online Supplementary Information also provides the data and R computer code, which can be easily adapted to help researchers fit nonlinear models to their data. Supplementary Information: The online version contains supplementary material available at 10.1007/s11538-024-01274-4.
... In multilevel model designs where the nested structure of a population is to be modeled, the allocation of ... In order to handle the missingness, a missing data mechanism should be used [101,129]. Moreover, a very small sample size leads to biased estimators and misleading statistical tests [52]. The other weakness of multilevel models is the power. ...
... The tendency of the Wald test to inflate small P values is known as the Hauck-Donner effect (HDE) (5,8,9). It was first studied in the context of logistic regression, and later revisited in the equivalent setting of 2 × 2 contingency tables (10). ...
Article
The Wald test is routinely used in case-control studies to test for association between a covariate and disease. However, when the evidence for association is high, the Wald test tends to inflate small p-values as a result of the Hauck-Donner Effect (HDE). Here, we investigate the HDE in the context of genetic burden, both with and without additional covariates. First, we examine the burden-based p-values in the absence of association using whole-exome sequence data from 1000 Genomes Project reference samples (n=54) and selected preterm infants with neonatal complications (n=74). Our careful analysis of the burden-based p-values shows that the HDE is present, and that the cause of the HDE in this setting is likely a natural extension of the well-known cause of the HDE in 2x2 contingency tables. Second, in a re-analysis of real data we find that the permutation test provides increased power over the Wald, Firth, and likelihood ratio tests, which agrees with our intuition since the permutation test is valid for any sample size, and since it does not suffer from the HDE. Therefore, we propose a powerful and computationally efficient permutation-based approach for the analysis and re-analysis of small case-control association studies.
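A permutation test of the general kind mentioned above can be sketched as follows. This is a generic two-sample permutation test on a difference in mean burden, not the authors' procedure; the function name `permutation_p_value`, the burden scores, and the default settings are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(cases, controls, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    cases, controls: per-sample burden scores (hypothetical inputs).
    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(cases) - fmean(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_cases]) - fmean(pooled[n_cases:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    cases = [5, 6, 7, 8, 9, 10]
    controls = [0, 1, 2, 3, 4, 4]
    print(permutation_p_value(cases, controls, n_perm=2000))
```

Because the reference distribution is built from the data themselves, a test of this form does not rely on the large-sample normal approximation behind the Wald statistic.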
... For each SNP, the Wald statistic (Buse 1982; Fears et al. 1996) was used to examine whether the SNP is associated with the trait. Then, the Bonferroni correction was applied to define the genome-wide significance threshold (P < 1.49e-6) and reduce the false-positive rate; the threshold was defined as 0.05/N, where N is the number of SNPs. ...
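The Bonferroni threshold described in this excerpt is simple arithmetic, sketched below in Python. The function name `bonferroni_threshold` and the example number of tests are hypothetical and are not the values used in the cited study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold alpha / n_tests (Bonferroni correction)."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical example: family-wise alpha of 0.05 spread over 50,000 tests.
    print(bonferroni_threshold(0.05, 50_000))
```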
Article
The dominance effect is a kind of non‐additive effect due to the interaction between alleles at the same locus. Quantitative traits such as growth traits in farm animals have been found to be influenced by dominance effects. However, dominance effects are usually ignored in the genome‐wide association study (GWAS) of complex traits for farm animals. In this study, we performed GWAS and genetic parameters estimation for the two traits age at 100 kg (AGE) and backfat thickness at 100 kg (BF) of 3572 Large White pigs. The pigs were from three breeding farms of China and were genotyped by an in‐house designed 50k SNP chip. Our results showed significant non‐zero variance for the dominance effect of AGE, while the dominance effect of BF was not significant. Using a GWAS model accounting for both additive and dominance effects, we identified three additive and two dominance significant SNPs for the trait AGE. For the trait BF, three genome‐wide significant additive SNPs were detected, but no significant SNP was found for the dominance effect. In total, six important functional genes (NPAS3, USP16, PARN, ARL15, GPC3, ABHD4) near significant SNPs were identified as candidate genes associated with AGE or BF. Notably, ARL15 and PARN were associated with AGE near the dominance association signals. Overall, the newly detected SNPs and newly identified candidate genes in our study added new information about the genetic architectures of growth and fatness traits in pigs, and have the potential to be applied to the pig breeding program in the future.
... Finally, the distribution of the different Kt/V measurement methods (double pool, spKt/V and ionic dialysance) were analysed and integrated into the logistic regression model and an interaction analysis using the Wald test was performed [28]. ...
Article
Background: The effect of dialysis dose on mortality remains unsettled. Current guidelines recommend targeting a spKt/V of 1.20 to 1.40 per thrice-weekly dialysis session. However, the optimal dialysis dose remains mostly disputed. Methods: In a nationwide registry of all incident patients receiving thrice-weekly hemodialysis, 32 283 patients had available data on dialysis dose, estimated by Kt/V and its variants Kt and Kt/A. Survival was analyzed with a multivariate Cox model and a concurrent risk model accounting for renal transplantation. A predictive model of Kt in the upper quartile was developed. Results: Regardless of the indicator, a higher dose of dialysis was consistently associated with better survival. The survival differential of Kt was the most discriminating, but marginally, compared to the survival differential according to Kt/V and Kt/A. Patient survival was higher in the upper quartile of Kt (> 69L/s), then deteriorated as the Kt decreased, with a difference in survival between the upper and lower quartile of 23.6% at five years. Survival differences across the Kt distribution were similar after accounting for kidney transplantation as a competing risk. Predictive factors for Kt in the upper quartile were arteriovenous fistula versus catheters and graft, hemodiafiltration versus hemodialysis, scheduled dialysis start versus emergency start, long weekly dialysis duration, and spKt/V measurement versus double pool eKt/V. Conclusion: Our data confirm the existence of a relationship between dialysis dose and survival, which persisted despite correcting for known confounders. A model for predicting a high dose of dialysis is proposed with practical relevance.
... To the best of our knowledge, the proposed method is the first approach that evaluates Our study is not without limitations. The proposed method is based on a Wald test that tends to have a lower power compared to a likelihood ratio test especially for alternatives sufficiently far from the null value or for additive risk models (47)(48)(49). In addition, the proposed method is not directly applicable to analyze imputed genotype data. ...
Article
Evaluating gene by environment (G$\times$E) interaction under an additive risk model (i.e. additive interaction) has gained wider attention. Recently, statistical tests have been proposed for detecting additive interaction that utilize an assumption on G-E independence to boost power, which do not rely on restrictive genetic models such as dominant or recessive models. However, a major limitation of these methods is a sharp increase in type I error when this assumption is violated. Our goal is to develop a robust test for additive G$\times$E interaction under the trend effect of genotype, applying an empirical Bayes-type shrinkage estimator of the relative excess risk due to interaction. The proposed method uses a set of constraints to impose the trend effect of genotype and builds an estimator that data-adaptively shrinks a RERI estimator obtained under a general model for G-E dependence using a retrospective likelihood framework. Numerical study under varying levels of departures from G-E independence shows that the proposed method is robust against the violation of the independence assumption while providing an adequate balance between bias and efficiency compared to existing methods. We applied the proposed method to the genetic data of Alzheimer's disease and lung cancer.
... This is not the case in general because the HDE has been observed in other regression models by various authors since. Some examples include Storer, Wacholder, and Breslow (1983) in conditional logistic regression with matched and stratified samples, Vaeth (1985) in one-sample problems for one-parameter exponential families and GLMs, Nelson and Savin (1990) in Tobit and nonlinear regression models, Fears, Benichou, and Gail (1996) in a balanced 1-way random effects ANOVA design, Therneau and Grambsch (2000, p. 60) in Cox proportional hazards models, and Kosmidis (2014) in cumulative link models. In general the Wald test can be expected to be valid only if a normal likelihood can be used to approximate the profile likelihood for the parameter well (Meeker and Escobar 1995) and the observed value of the sufficient statistic is away from ∂ . ...
Article
The Wald test remains ubiquitous in statistical practice despite shortcomings such as its inaccuracy in small samples and lack of invariance under reparameterization. This paper develops on another but lesser-known shortcoming called the Hauck–Donner effect (HDE) whereby a Wald test statistic is no longer monotone increasing as a function of increasing distance between the parameter estimate and the null value. Resulting in an upward biased p-value and loss of power, the aberration can lead to very damaging consequences such as in variable selection. The HDE afflicts many types of regression models and corresponds to estimates near the boundary of the parameter space. This article presents several new results, and its main contributions are to (i) propose a very general test for detecting the HDE in the class of vector generalized linear models (VGLMs), regardless of the underlying cause; (ii) fundamentally characterize the HDE by pairwise ratios of Wald and Rao score and likelihood ratio test statistics for 1-parameter distributions with large samples; (iii) show that the parameter space may be partitioned into an interior encased by at least 5 HDE severity measures (faint, weak, moderate, strong, extreme); (iv) prove that a necessary condition for the HDE in a 2 by 2 table is a log odds ratio of at least 2; (v) give some practical guidelines about HDE-free hypothesis testing. Overall, practical post-fit tests can now be conducted potentially to any model estimated by iteratively reweighted least squares, especially the GLM and VGLM classes, the latter which encompasses many popular regression models.
... The fact that the Wald power for logistic regression may fall for large alternatives was noticed by Hauck and Donner (1977); they termed it aberrant behavior. This negative property of the Wald test was later studied in the broader context of exponential families by Vaeth (1985), and for a linear model by Fears et al. (1996). For large alternatives the Wald power decreases to the size of the test (see Figure 1 for a geometrical illustration). ...
Article
Traditionally, asymptotic tests are studied and applied under local alternatives. There exists a widespread opinion that the Wald, likelihood ratio, and score tests are asymptotically equivalent. We dispel this myth by showing that these tests have different statistical power in the presence of nuisance parameters. The local properties of the tests are described in terms of the first and second derivatives evaluated at the null hypothesis. The comparison of the tests is illustrated with two popular regression models: linear regression with random predictor and logistic regression with binary covariate. We study the aberrant behavior of the tests when the distance between the null and alternative does not vanish with the sample size. We demonstrate that these tests have different asymptotic power. In particular, the score test is generally asymptotically biased but slightly superior for linear regression in a close neighborhood of the null. The power approximations are confirmed through simulations.
Article
In this paper, we consider the problem of estimating the reliability parameter of a mixed-type stress-strength model, i.e., the probability \(R=P\left( X<Y\right)\) where X and Y are a discrete and a continuous random variable, respectively. We focus on the specific case of Poisson stress and exponential strength, deriving the expression of R and its maximum likelihood estimator (MLE) and its uniformly minimum-variance unbiased estimator (UMVUE), based on simple random samples independently drawn from X and Y. For the MLE, we are able to provide an expression for the cumulative distribution function, which allows us to compute its expected value, bias, and variance. We derive asymptotic properties of the MLE, which we exploit for constructing approximate confidence intervals based on different approaches. A simulation study empirically compares such estimators and provides advice for their correct use, which is also illustrated through an application to real data.
Article
Chi-square type test statistics are widely used in assessing the goodness-of-fit of a theoretical model. The exact distributions of such statistics can be quite different from the nominal chi-square distribution due to violation of conditions encountered with real data. In such instances, the bootstrap or Monte Carlo methodology might be used to approximate the distribution of the statistic. However, the sample quantile may be a poor estimate of the population counterpart when either the sample size is small or the number of different values of the replicated statistic is limited. Using statistical learning, this article develops a method that yields more accurate quantiles for chi-square type test statistics. Formulas for smoothing the quantiles of chi-square type statistics are obtained. Combined with the bootstrap methodology, the smoothed quantiles are further used to conduct equivalence testing in mean and covariance structure analysis. Two real data examples illustrate the applications of the developed formulas in quantifying the size of model misspecification under equivalence testing. The idea developed in the article can also be used to develop formulas for smoothing the quantiles of other types of test statistics or parameter estimates.
Article
Hauck & Donner (1977) showed that Wald's test (the maximum likelihood test statistic) behaves in an aberrant manner when applied to hypotheses about a single parameter in a binomial logit model. In particular they have shown that the test statistic decreases to zero as the parameter estimate moves away from the null value. In the present work the behaviour of Wald's test when applied to hypothesis testing in exponential families is studied. The investigation is mainly restricted to the one-sample problem for one-parameter exponential families. Conditions under which Wald's test is well behaved and conditions under which Wald's test may be misleading are derived. It is shown that the problem occurs in connection with certain parameterizations of discrete probability distributions and also, in the continuous case, if the upper tail of the distribution function is approximately proportional to $t^{-1}e^{-\theta t}$ for some positive $\theta$. Finally, the use of Wald's test in the analysis of generalized linear models is discussed. /// In a recent article, Hauck and Donner (1977) show that Wald's test behaves in an aberrant manner when applied to hypotheses about a single parameter in a binomial logit model. In particular, they showed that the test statistic decreases to zero as the parameter estimate moves away from the null value. The present article studies the behaviour of Wald's test when applied to hypothesis tests in exponential families. It is shown that the behaviour of Wald's test depends on the choice of parameterization and that the problem can arise in discrete probability models and also, for absolutely continuous distributions, if the upper tail of the distribution function is approximately proportional to $t^{-1}e^{-\theta t}$ for a positive number $\theta$.
Article
For tests of a single parameter in the binomial logit model, Wald's test is shown to behave in an aberrant manner. In particular, the test statistic decreases to zero as the distance between the parameter estimate and null value increases, and the power of the test, based on its large-sample distribution, decreases to the significance level for alternatives sufficiently far from the null value.
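The non-monotone behaviour described here can be reproduced with a few lines of Python. The sketch below is a simplification, not the setting analysed in the paper: it assumes a one-sample binomial with the logit of the success probability as the parameter and a null value of zero (success probability 0.5). The function name `wald_logit` and the sample size are hypothetical.

```python
import math


def wald_logit(x, n):
    """Wald chi-square statistic for H0: logit(p) = 0 from x successes in n trials.

    Uses beta_hat = log(p_hat / (1 - p_hat)) with estimated variance
    1 / (n * p_hat * (1 - p_hat)), the usual large-sample formula.
    """
    if not 0 < x < n:
        raise ValueError("the logit estimate is infinite when x is 0 or n")
    p_hat = x / n
    beta_hat = math.log(p_hat / (1.0 - p_hat))
    return beta_hat**2 * n * p_hat * (1.0 - p_hat)


if __name__ == "__main__":
    n = 100
    for x in (50, 70, 90, 95, 99):
        print(f"x={x:>3}  p_hat={x / n:.2f}  Wald={wald_logit(x, n):.2f}")
```

In this sketch the statistic first rises as the estimate moves away from the null and then falls again as the estimated probability approaches 1, because the estimated variance of the logit grows faster than the squared estimate.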
Article
A classical result due to Wilks [1] on the distribution of the likelihood ratio $\lambda$ is the following. Under suitable regularity conditions, if the hypothesis that a parameter $\theta$ lies on an $r$-dimensional hyperplane of $k$-dimensional space is true, the distribution of $-2 \log \lambda$ is asymptotically that of $\chi^2$ with $k - r$ degrees of freedom. In many important problems it is desired to test hypotheses which are not quite of the above type. For example, one may wish to test whether $\theta$ is on one side of a hyperplane, or to test whether $\theta$ is in the positive quadrant of a two-dimensional space. The asymptotic distribution of $-2 \log \lambda$ is examined when the value of the parameter is a boundary point of both the set of $\theta$ corresponding to the hypothesis and the set of $\theta$ corresponding to the alternative. First the case of a single observation from a multivariate normal distribution, with mean $\theta$ and known covariance matrix, is treated. The general case is then shown to reduce to this special case where the covariance matrix is replaced by the inverse of the information matrix. In particular, if one tests whether $\theta$ is on one side or the other of a smooth $(k - 1)$-dimensional surface in $k$-dimensional space and $\theta$ lies on the surface, the asymptotic distribution of $\lambda$ is that of a chance variable which is zero half the time and which behaves like $\chi^2$ with one degree of freedom the other half of the time.
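The limiting distribution in the last sentence, a half-and-half mixture of a point mass at zero and a $\chi^2$ variable with one degree of freedom, gives a simple p-value rule. The Python sketch below assumes that mixture applies to the statistic $-2 \log \lambda$; the function name `boundary_lr_p_value` is hypothetical.

```python
import math


def boundary_lr_p_value(lr_stat):
    """P-value for -2 log(lambda) under a 50:50 mixture of 0 and chi-square(1).

    For a positive statistic the point mass at zero contributes nothing,
    so the p-value is half the chi-square(1) upper tail probability.
    """
    if lr_stat < 0:
        raise ValueError("the likelihood ratio statistic cannot be negative")
    if lr_stat == 0:
        return 1.0
    # For chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(lr_stat / 2.0))


if __name__ == "__main__":
    for stat in (0.0, 1.0, 2.705543, 3.841459):
        print(f"-2 log lambda = {stat:.3f}  p = {boundary_lr_p_value(stat):.4f}")
```

The practical consequence is that a 5% test at the boundary uses the 90th percentile of the $\chi^2_1$ distribution (about 2.71) as its critical value, not the 95th (about 3.84).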
Article
A distribution analogous to the canonical distribution used in testing the general linear hypothesis is developed for Model II analysis of variance for balanced classifications. As in the case of Model I analysis of variance, this standard distribution exhibits the sums of squares going into the analysis of variance table. By use of the standard form it is also shown that (i) all exact $F$-tests used in testing hypotheses based on balanced multiple classifications determine uniformly most powerful (u.m.p.) similar regions although they are not likelihood ratio (L.R.) tests, but (ii) in the balanced one-way classification, for all practical purposes, the test is an L.R. test, and is u.m.p. invariant. An exact $F$-test exists when we have a sum of squares, $S_1$ distributed as $(k + \sigma^2_0)$ times a chi-square variate, where $k > 0$, independently of $S_2$, which is distributed as $k$ times a chi-square variate. The test is then to reject the hypothesis that $\sigma^2_0 = 0$ whenever $S_1/S_2$ is greater than some suitably chosen number, $c$. As a corollary to property (i) it is shown that "of all invariant tests of $\sigma^2_0 = 0$ against $\sigma^2_0 > 0$ whose power is a function of $\sigma^2_0/(k + \sigma^2_0)$ only, the test $S_1/S_2 > c$ is most powerful, providing $S_1$ and $S_2$, as defined above can be found."
Rao, C. R. (1965), Linear Statistical Inference and Its Applications (2nd ed.), New York: John Wiley.
SAS Institute Inc. (1989), SAS/STAT User's Guide, Version 6 (4th ed.), Cary, NC: SAS Institute Inc.
SAS Institute Inc. (1992), SAS Technical Report P-229, Cary, NC: SAS Institute Inc.
Storer, B. E., Wacholder, S., and Breslow, N. E. (1983), "Maximum Likelihood Fitting of General Risk Models to Stratified Data," Journal of the Royal Statistical Society, Series C, 32, 172-181.