Biometrika

Published by Oxford University Press (OUP)
Online ISSN: 1464-3510
Publications
Article
The idea of basing tests on the sample distribution function is a natural one. The Kolmogorov-Smirnov tests are of this nature. Blum, Kiefer & Rosenblatt (1961) made use of this approach to construct distribution free tests of independence. In this paper this method is further applied to the common two-sample problems of location and dispersion, the k-sample problem of location, and to the problem of dependence in bivariate samples. The rationale of the method is the following. Write the departure from the null hypothesis (against which the test must be sensitive) in terms of the true distribution functions and regard the mean value of this departure as the parameter of interest. The sample estimate of this parameter, expressed in terms of sample distribution functions, is then proposed as test statistic. This statistic is generally only dependent on the ranks of the samples and is consequently distribution free. Specifically this approach leads to Wilcoxon's two-sample test, a k-sample extension of Wilcoxon's test which is slightly different from the Kruskal-Wallis (1952) extension, the test of Ansari & Bradley (1960) for differences in dispersion, which is also a special case of a procedure proposed by Barton & David (1958), and to a test of dependence in bivariate samples which comes out to be a linear function of the rank correlation coefficients of Spearman and Kendall. The alternative k-sample extension of Wilcoxon's test has the same asymptotic relative efficiency properties as the Kruskal-Wallis test; it is however consistent against a slightly wider class than the latter. As is to be expected, the alternative test of independence behaves for large samples in much the same way as the tests of Spearman and Kendall.
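As a rough illustration of the rationale, the sketch below (Python, with invented normal samples, an arbitrary location shift and seed) estimates the departure parameter θ = P(X < Y) by plugging in the empirical distribution functions and checks that the result coincides with the Mann-Whitney form of Wilcoxon's two-sample statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=30)    # first sample
y = rng.normal(0.5, 1.0, size=40)    # second sample, shifted in location

# Departure from the null written in terms of distribution functions: theta = P(X < Y).
# Plugging in the empirical distribution functions gives
# theta_hat = (1 / mn) * sum_i sum_j 1{x_i < y_j}.
theta_hat = np.mean(x[:, None] < y[None, :])

# The plug-in estimate is the Mann-Whitney form of Wilcoxon's two-sample statistic.
u, pval = stats.mannwhitneyu(y, x, alternative="two-sided")
print(theta_hat, u / (len(x) * len(y)), pval)
```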
 
Article
Born on 9 August 1880 in the parish of Shoreditch, Major Greenwood grew to adolescence in an environment which isolated his earliest years from daily social contact with contemporaries.
 
Article
The penalised least squares approach with smoothly clipped absolute deviation penalty has consistently been demonstrated to be an attractive regression shrinkage and selection method. It not only automatically and consistently selects the important variables, but also produces estimators that are as efficient as the oracle estimator. These attractive features, however, depend on appropriately choosing the tuning parameter. We show that the commonly used generalised cross-validation cannot select the tuning parameter satisfactorily, leading to a nonignorable overfitting effect in the resulting model. We therefore propose a bic tuning parameter selector, which is shown to identify the true model consistently. Simulation studies are presented to support the theoretical findings, and an empirical example illustrates the method's use with the Female Labor Supply data.
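A minimal sketch of a bic-type tuning parameter selector of the kind described, using the lasso from scikit-learn as a stand-in since a SCAD solver is not assumed to be available; the design, true coefficients and grid of tuning values are invented for illustration, and the degrees of freedom are taken as the number of nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])   # sparse truth
y = X @ beta + rng.normal(size=n)

def bic(y, X, coef):
    # BIC-type criterion with the number of nonzero coefficients as degrees of freedom.
    resid = y - X @ coef
    return n * np.log(np.mean(resid ** 2)) + np.log(n) * np.count_nonzero(coef)

lambdas = np.logspace(-3, 0.5, 50)
scores = [bic(y, X, Lasso(alpha=lam).fit(X, y).coef_) for lam in lambdas]
print("tuning parameter chosen by the bic-type selector:", lambdas[int(np.argmin(scores))])
```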
 
Article
The estimation of parameters in an absorbing Markov chain has been discussed by Gani (1956), Bhat & Gani (1960), Bhat (1961) and possibly by other authors. Asymptotic theory for the distribution of maximum-likelihood estimators applies if a large number of independent replicates of the chain is available. These replicates, however, could be considered as occurring sequentially in time, so that the chain has implicitly been altered in such a way that the absorbing state has been replaced by an ‘instantaneous return’ state; once it is reached, a new replicate is commenced starting from the original initial state. In this paper, we concern ourselves with the asymptotic theory of maximum-likelihood estimators from a single realization of truly absorbing chains. It is clear that if the number of states is kept fixed there can be no asymptotic theory, since with probability one, an absorbing state will be reached after a finite number of transitions, and no further information can be obtained by continuing observation. In § 2, we show by means of two simple examples that asymptotic theory may, at least in some cases, be available if the number of non-absorbing states is large. In the remaining sections, forming the main part of the paper, we discuss a more complex example, a population genetic model of Moran (1958a). We do not prove that the conjectured asymptotic theory holds for the estimation of the parameter, but produce numerical evidence from simulation studies to support this conjecture. Further research is needed to clarify the general problem of inference in absorbing Markov chains. The problem can be made to depend on the theory of positively regular chains but the latter is itself incomplete for this purpose.
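The following toy sketch (not the Moran model discussed in the paper) illustrates why a single realization of an absorbing chain can still support estimation when the chain starts far from the absorbing state: a downward-drifting random walk absorbed at zero accumulates many transitions before absorption, and the maximum-likelihood estimate of the step probability is simply the observed fraction of upward moves. The starting state and step probability are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

def run_to_absorption(p, start):
    """Random walk on the non-negative integers: up with probability p, down with 1 - p,
    absorbed at 0. Returns the number of upward moves and total moves of one realization."""
    state, ups, steps = start, 0, 0
    while state > 0:
        up = rng.random() < p
        state += 1 if up else -1
        ups += int(up)
        steps += 1
    return ups, steps

# Started far from the absorbing state, a single realization contains many transitions,
# so the MLE p_hat = (upward moves) / (total moves) can still be informative.
ups, steps = run_to_absorption(p=0.4, start=200)
print("p_hat =", ups / steps, "from", steps, "transitions of a single absorbing path")
```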
 
Article
This paper extends the induced smoothing procedure of Brown & Wang (2006) for the semiparametric accelerated failure time model to the case of clustered failure time data. The resulting procedure permits fast and accurate computation of regression parameter estimates and standard errors using simple and widely available numerical methods, such as the Newton-Raphson algorithm. The regression parameter estimates are shown to be strongly consistent and asymptotically normal; in addition, we prove that the asymptotic distribution of the smoothed estimator coincides with that obtained without the use of smoothing. This establishes a key claim of Brown & Wang (2006) for the case of independent failure time data and also extends such results to the case of clustered data. Simulation results show that these smoothed estimates perform as well as those obtained using the best available methods at a fraction of the computational cost.
 
Article
A bivariate correlated Poisson model for the number of accidents sustained by a set of individuals exposed to a risk situation in two successive periods of time is considered. It is shown that a selection of individuals free from accidents in the first period reduces the average number of accidents to be expected in the next period, provided that the accident proneness varies from individual to individual with an arbitrary non-degenerate distribution.
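A small simulation sketch of this selection effect, with a gamma mixing distribution standing in for the arbitrary non-degenerate proneness distribution (shape and scale chosen arbitrarily): individuals free of accidents in the first period have, on average, lower proneness, so their expected count in the second period falls below the overall mean.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
# Accident proneness varies across individuals; a gamma mixing distribution is used here
# purely for illustration (any non-degenerate distribution exhibits the same effect).
proneness = rng.gamma(shape=2.0, scale=0.5, size=n)
period1 = rng.poisson(proneness)
period2 = rng.poisson(proneness)

print("mean accidents in period 2, all individuals:          ", period2.mean())
print("mean accidents in period 2, accident-free in period 1:", period2[period1 == 0].mean())
```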
 
Article
Recent scientific and technological innovations have produced an abundance of potential markers that are being investigated for their use in disease screening and diagnosis. In evaluating these markers, it is often necessary to account for covariates associated with the marker of interest. Covariates may include subject characteristics, expertise of the test operator, test procedures or aspects of specimen handling. In this paper, we propose the covariate-adjusted receiver operating characteristic curve, a measure of covariate-adjusted classification accuracy. Nonparametric and semiparametric estimators are proposed, asymptotic distribution theory is provided and finite sample performance is investigated. For illustration we characterize the age-adjusted discriminatory accuracy of prostate-specific antigen as a biomarker for prostate cancer.
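A rough nonparametric sketch of the idea, not the estimators proposed in the paper: each case's marker value is referred to the empirical distribution of controls in the same age stratum (its placement value), and the covariate-adjusted ROC curve is the empirical distribution of those placement values. The simulated marker, the age effect and the matching band of two years are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_cases, n_controls = 300, 1000
# Hypothetical marker whose control distribution drifts upward with age.
age_cases = rng.integers(50, 80, n_cases)
age_controls = rng.integers(50, 80, n_controls)
marker_controls = 0.05 * age_controls + rng.normal(size=n_controls)
marker_cases = 0.05 * age_cases + 1.0 + rng.normal(size=n_cases)

def placement(value, age, band=2):
    # Placement value: survivor function of age-matched controls at the case's marker value.
    ref = marker_controls[np.abs(age_controls - age) <= band]
    return np.mean(ref >= value)

pv = np.array([placement(m, a) for m, a in zip(marker_cases, age_cases)])
t = np.linspace(0, 1, 101)
aroc = np.array([np.mean(pv <= s) for s in t])   # covariate-adjusted ROC at false positive rate t
print("covariate-adjusted AUC:", np.trapz(aroc, t))
```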
 
Article
We propose a graphical measure, the generalized negative predictive function, to quantify the predictive accuracy of covariates for survival time or recurrent event times. This new measure characterizes the event-free probabilities over time conditional on a thresholded linear combination of covariates and has direct clinical utility. We show that this function is maximized at the set of covariates truly related to event times and thus can be used to compare the predictive accuracy of different sets of covariates. We construct nonparametric estimators for this function under right censoring and prove that the proposed estimators, upon proper normalization, converge weakly to zero-mean Gaussian processes. To bypass the estimation of complex density functions involved in the asymptotic variances, we adopt the bootstrap approach and establish its validity. Simulation studies demonstrate that the proposed methods perform well in practical situations. Two clinical studies are presented.
 
Left: illustration of observational equivalence in directed graphs. Right: a simple directed graph.
True directed graph along with estimates from Gaussian observations; the gray scale represents the percentage of inclusion of edges.
MCC, FP and TP for estimation of the directed graph using the PC-algorithm (solid), lasso (dashes) and adaptive lasso (dot-dashes) with random ordering.
Known and estimated networks for human cell signalling data; true and false edges are marked with solid and dashed arrows, respectively.
Known and estimated transcription regulatory network of E. coli; large nodes indicate the transcription factors, and true and false edges are marked with solid and dashed arrows, respectively.
Article
Directed acyclic graphs are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical and biological systems where directed edges between nodes represent the influence of components of the system on each other. Estimation of directed graphs from observational data is computationally NP-hard. In addition, directed graphs with the same structure may be indistinguishable based on observations alone. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to the problem of estimating the structure of the network. In this paper, we propose an efficient penalized likelihood method for estimation of the adjacency matrix of directed acyclic graphs, when variables inherit a natural ordering. We study variable selection consistency of lasso and adaptive lasso penalties in high-dimensional sparse settings, and propose an error-based choice for selecting the tuning parameter. We show that although the lasso is only variable selection consistent under stringent conditions, the adaptive lasso can consistently estimate the true graph under the usual regularity assumptions.
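With the ordering known, estimation of the adjacency matrix decouples into a sequence of penalized regressions of each node on its predecessors. The sketch below illustrates this with a lasso at a fixed penalty level; the graph, its coefficients and the penalty value are invented, and the error-based tuning rule and adaptive weights proposed in the paper are not implemented.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
p, n = 6, 300
# Hypothetical lower-triangular adjacency matrix that respects the ordering 1, ..., p.
A = np.zeros((p, p))
A[2, 0], A[3, 1], A[4, 2], A[5, 3] = 0.8, -0.7, 0.6, 0.9

# Data from the linear structural equation model X_j = sum_{k < j} A[j, k] X_k + eps_j.
X = np.zeros((n, p))
for j in range(p):
    X[:, j] = X[:, :j] @ A[j, :j] + rng.normal(size=n)

# With the ordering known, row j of the adjacency matrix is estimated by a penalized
# regression of node j on its predecessors; a fixed lasso penalty is used for simplicity.
A_hat = np.zeros((p, p))
for j in range(1, p):
    A_hat[j, :j] = Lasso(alpha=0.1).fit(X[:, :j], X[:, j]).coef_
print(np.round(A_hat, 2))
```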
 
Article
We give a definition of a bounded edge within the causal directed acyclic graph framework. A bounded edge generalizes the notion of a signed edge and is defined in terms of bounds on a ratio of survivor probabilities. We derive rules concerning the propagation of bounds. Bounds on causal effects in the presence of unmeasured confounding are also derived using bounds related to specific edges on a graph. We illustrate the theory developed by an example concerning estimating the effect of antihistamine treatment on asthma in the presence of unmeasured confounding.
 
Table 4: Test errors and numbers of selected genes for four SVMs for the leukaemia data.
Article
Several sparseness penalties have been suggested for delivery of good predictive performance in automatic variable selection within the framework of regularization. All assume that the true model is sparse. We propose a penalty, a convex combination of the L1- and L∞-norms, that adapts to a variety of situations including sparseness and nonsparseness, grouping and nongrouping. The proposed penalty performs grouping and adaptive regularization. In addition, we introduce a novel homotopy algorithm utilizing subgradients for developing regularization solution surfaces involving multiple regularizers. This permits efficient computation and adaptive tuning. Numerical experiments are conducted using simulation. In simulated and real examples, the proposed penalty compares well against popular alternatives.
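A minimal sketch of the penalty itself, fitted with a generic convex solver (cvxpy) rather than the homotopy algorithm of the paper; the design, the grouped true coefficients and the weights λ = 1 and τ = 0.5 are arbitrary.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(8)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 2.0, 2.0, 0, 0, 0, 0, 0, 0, 0])   # grouped and sparse truth
y = X @ beta_true + rng.normal(size=n)

beta = cp.Variable(p)
lam, tau = 1.0, 0.5    # overall strength and L1 / L-infinity mixing weight (arbitrary)
penalty = tau * cp.norm(beta, 1) + (1 - tau) * cp.norm(beta, "inf")
objective = cp.sum_squares(y - X @ beta) / (2 * n) + lam * penalty
cp.Problem(cp.Minimize(objective)).solve()
print(np.round(beta.value, 2))
```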
 
Article
We propose a semiparametric additive rate model for modelling recurrent events in the presence of a terminal event. The dependence between recurrent events and terminal event is nonparametric. A general transformation model is used to model the terminal event. We construct an estimating equation for parameter estimation and derive the asymptotic distributions of the proposed estimators. Simulation studies demonstrate that the proposed inference procedure performs well in realistic settings. Application to a medical study is presented.
 
Article
Results are given concerning inferences that can be drawn about interaction when binary exposures are subject to certain forms of independent nondifferential misclassification. Tests for interaction, using the misclassified exposures, are valid provided the probability of misclassification satisfies certain bounds. Results are given for additive statistical interactions, for causal interactions corresponding to synergism in the sufficient cause framework and for so-called compositional epistasis. Both two-way and three-way interactions are considered. The results require only that the probability of misclassification be no larger than 1/2 or 1/4, depending on the test. For additive statistical interaction, a method to correct estimates and confidence intervals for misclassification is described. The consequences for power of interaction tests under exposure misclassification are explored through simulations.
 
Melanoma data. Plots of empirical Bayes survival curve estimates for males, dashed, and females, solid, for (a) α = 10 000, (b) α = 100, (c) α = 10 and (d) α = 1. 
Article
We develop a novel empirical Bayesian framework for the semiparametric additive hazards regression model. The integrated likelihood, obtained by integration over the unknown prior of the nonparametric baseline cumulative hazard, can be maximized using standard statistical software. Unlike the corresponding full Bayes method, our empirical Bayes estimators of regression parameters, survival curves and their corresponding standard errors have easily computed closed-form expressions and require no elicitation of hyperparameters of the prior. The method guarantees a monotone estimator of the survival function and accommodates time-varying regression coefficients and covariates. To facilitate frequentist-type inference based on large-sample approximation, we present the asymptotic properties of the semiparametric empirical Bayes estimates. We illustrate the implementation and advantages of our methodology with a reanalysis of a survival dataset and a simulation study.
 
Article
We consider statistical inference for additive partial linear models when the linear covariate is measured with error. We propose attenuation-to-correction and simulation-extrapolation, simex, estimators of the parameter of interest. It is shown that the first resulting estimator is asymptotically normal and requires no undersmoothing. This is an advantage of our estimator over existing backfitting-based estimators for semiparametric additive models which require undersmoothing of the nonparametric component in order for the estimator of the parametric component to be root-n consistent. This feature stems from a decrease of the bias of the resulting estimator, which is appropriately derived using a profile procedure. A similar characteristic in semiparametric partially linear models was obtained by Wang et al. (2005). We also discuss the asymptotics of the proposed simex approach. Finite-sample performance of the proposed estimators is assessed by simulation experiments. The proposed methods are applied to a dataset from a semen study.
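A toy illustration of the simex idea in a plain linear regression with a single error-prone covariate, not the additive partial linear model of the paper: extra simulated measurement error is added at increasing levels ζ, naive slope estimates are averaged at each level, and a quadratic trend in ζ is extrapolated back to ζ = -1. The error variance, the levels and the number of remeasurements are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)                      # true covariate (unobservable in practice)
sigma_u = 0.5                               # measurement error s.d., assumed known
w = x + rng.normal(scale=sigma_u, size=n)   # observed error-prone covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Add extra simulated measurement error at levels zeta, average the naive slope
# estimates at each level, then extrapolate the trend back to zeta = -1.
zetas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
mean_slopes = []
for z in zetas:
    slopes = []
    for _ in range(50):                     # B remeasurements per level
        w_z = w + rng.normal(scale=np.sqrt(z) * sigma_u, size=n)
        slopes.append(np.polyfit(w_z, y, 1)[0])
    mean_slopes.append(np.mean(slopes))

quad = np.polyfit(zetas, mean_slopes, 2)    # quadratic extrapolant in zeta
print("naive slope:", mean_slopes[0], "simex slope:", np.polyval(quad, -1.0))
```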
 
Article
Clustered survival data frequently arise in biomedical applications, where event times of interest are clustered into groups such as families. In this article we consider an accelerated failure time frailty model for clustered survival data and develop nonparametric maximum likelihood estimation for it via a kernel smoother-aided em algorithm. We show that the proposed estimator for the regression coefficients is consistent, asymptotically normal, and semiparametric efficient when the kernel bandwidth is properly chosen. An em-aided numerical differentiation method is derived for estimating its variance. Simulation studies evaluate the finite sample performance of the estimator, and it is applied to the diabetic retinopathy dataset.
 
Article
The conventional model selection criterion, the Akaike information criterion, aic, has been applied to choose among candidate models in mixed-effects settings by considering the marginal likelihood. Vaida & Blanchard (2005) demonstrated that such a marginal aic and its small-sample correction are inappropriate when the research focus is on clusters. Correspondingly, these authors suggested the use of conditional aic. Their conditional aic is derived under the assumption that the variance-covariance matrix or scaled variance-covariance matrix of random effects is known. This note provides a general conditional aic without these strong assumptions. Simulation studies show that the proposed method is promising.
 
Article
We study model selection for clustered data when the focus is on cluster-specific inference. Such data are often modelled using random effects, and conditional Akaike information was proposed by Vaida & Blanchard (2005) and used to derive an information criterion under linear mixed models. Here we extend the approach to generalized linear and proportional hazards mixed models. Outside normal linear mixed models, exact calculations are not available and we resort to asymptotic approximations. In the presence of nuisance parameters, a profile conditional Akaike information is proposed. Bootstrap methods are considered for their potential advantage in finite samples. Simulations show that the bootstrap and the analytic criteria perform comparably, with the bootstrap demonstrating some advantages for larger cluster sizes. The proposed criteria are applied to two cancer datasets to select models when cluster-specific inference is of interest.
 
Article
The logistic and integrated normal binary response curves are known to agree closely except in the tails. For experiments based on three dose levels, the power of a significance test is found for the null hypothesis that the response curve is logistic against the alternative that it is normal, and vice versa. From this an appropriate spacing of dose levels for discrimination is found. Approximately 1000 observations are necessary for even modest sensitivity.
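A quick numerical check of the stated closeness (the 1.702 rescaling is the standard one, not taken from the paper): after rescaling, the logistic and normal response curves differ by less than about 0.01 everywhere except far out in the tails, which is why so many observations are needed to discriminate between them.

```python
import numpy as np
from scipy.stats import logistic, norm

x = np.linspace(-4, 4, 801)
# After the standard 1.702 rescaling, the logistic and normal response curves differ
# by less than about 0.01 except in the extreme tails.
diff = np.abs(norm.cdf(x) - logistic.cdf(1.702 * x))
print("maximum absolute difference over [-4, 4]:", diff.max())
```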
 
Article
We give a method of analysing data from the negative binomial and other generalized Poisson distributions which is free of certain disadvantages that arise when the data are transformed and then subjected to an ordinary analysis of variance. The proposed analysis is based on a statistic constructed from estimators given by Hinz & Gurland (1967a). The technique can be used for testing a general linear hypothesis relating to the untransformed data, and it is illustrated here by means of an example based on a two-way layout.
 
Bladder tumour study. (a) Nonparametric estimation of cumulative rate function by treatment group; (b) Semiparametric estimation of baseline cumulative rate function with pointwise bootstrap 95% confidence intervals.  
Article
In this paper, we study panel count data with informative observation times. We assume nonparametric and semiparametric proportional rate models for the underlying event process, where the form of the baseline rate function is left unspecified and a subject-specific frailty variable inflates or deflates the rate function multiplicatively. The proposed models allow the event processes and observation times to be correlated through their connections with the unobserved frailty; moreover, the distributions of both the frailty variable and observation times are considered as nuisance parameters. The baseline rate function and the regression parameters are estimated by maximising a conditional likelihood function of observed event counts and solving estimation equations. Large-sample properties of the proposed estimators are studied. Numerical studies demonstrate that the proposed estimation procedures perform well for moderate sample sizes. An application to a bladder tumour study is presented.
 
Article
In biomedical studies, ordered bivariate survival data are frequently encountered when bivariate failure events are used as outcomes to identify the progression of a disease. In cancer studies, interest could be focused on bivariate failure times, for example, time from birth to cancer onset and time from cancer onset to death. This paper considers a sampling scheme, termed interval sampling, in which the first failure event is identified within a calendar time interval, the time of the initiating event can be retrospectively confirmed and the occurrence of the second failure event is observed subject to right censoring. In a cancer data application, the initiating, first and second events could correspond to birth, cancer onset and death. The fact that the data are collected conditional on the first failure event occurring within a time interval induces bias. Interval sampling is widely used for collection of disease registry data by governments and medical institutions, though the interval sampling bias is frequently overlooked by researchers. This paper develops statistical methods for analysing such data. Semiparametric methods are proposed under semi-stationarity and stationarity. Numerical studies demonstrate that the proposed estimation approaches perform well with moderate sample sizes. We apply the proposed methods to ovarian cancer registry data.
 
Article
The higher criticism test is effective for testing a joint null hypothesis against a sparse alternative, e.g., for testing the effect of a gene or genetic pathway that consists of d genetic markers. Accurate p-value calculations for the higher criticism test based on the asymptotic distribution require a very large d, which is not the case for the number of genetic variants in a gene or a pathway. In this paper we propose an analytical method for accurately computing the p-value of the higher criticism test for finite-d problems. Unlike previous treatments, this method does not rely on asymptotics in d or on simulation, and is exact for arbitrary d when the test statistics are normally distributed. The method is particularly computationally advantageous when d is not large. We illustrate the proposed method with a case-control genome-wide association study of lung cancer and compare its power with competing methods through simulations.
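The statistic itself is simple to compute from the ordered p-values; the sketch below gives one common form (maximizing over the smallest half of the p-values), with an invented example of d = 20 markers carrying a sparse signal. The exact finite-d p-value calculation proposed in the paper is not reproduced here.

```python
import numpy as np
from scipy import stats

def higher_criticism(pvals, alpha0=0.5):
    # Donoho-Jin form: maximize the standardized discrepancy between the ordered
    # p-values and their uniform expectations over the smallest alpha0 fraction.
    d = len(pvals)
    p_sorted = np.sort(pvals)
    i = np.arange(1, d + 1)
    hc = np.sqrt(d) * (i / d - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted))
    return np.max(hc[: int(alpha0 * d)])

rng = np.random.default_rng(4)
d = 20                                      # e.g. genetic markers in a gene or pathway
z = rng.normal(size=d)
z[:2] += 3.0                                # a sparse signal carried by two markers
pvals = 2 * stats.norm.sf(np.abs(z))
print("higher criticism statistic:", higher_criticism(pvals))
```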
 
Article
Efficient estimation of parameters is a major objective in analyzing longitudinal data. We propose two generalized empirical likelihood based methods that take into consideration within-subject correlations. A nonparametric version of the Wilks theorem for the limiting distributions of the empirical likelihood ratios is derived. It is shown that one of the proposed methods is locally efficient among a class of within-subject variance-covariance matrices. A simulation study is conducted to investigate the finite sample properties of the proposed methods and compare them with the block empirical likelihood method by You et al. (2006) and the normal approximation with a correctly estimated variance-covariance. The results suggest that the proposed methods are generally more efficient than existing methods which ignore the correlation structure, and better in coverage compared to the normal approximation with correctly specified within-subject correlation. An application illustrating our methods and supporting the simulation study results is also presented.
 
Article
The basal area per acre at 77 sample points, estimated with Bitterlich's mirror relascope and with a wedge prism, has been compared with estimates derived from the conventional method of calipering trees in plots with fixed boundaries. Bitterlich's mirror relascope produced a negative bias of 4.5%, but unbiased estimates were obtained with a wedge prism of basal area factor 10.26. The bias was likely to be associated with the incorrect evaluation of borderline trees; providing more contrast between the tree stems and the background did not improve the determination of the status of these doubtful trees and consequently did not serve a useful purpose. Random sampling in 5 stands showed that wedge sampling and calipering of trees in plots with a radius of 30 ft. both gave a coefficient of variation of 20%. Time studies indicated that the time expended on calipering trees in 30 and 50 ft. plots increases curvilinearly with the number of trees per plot; for a given number of trees calipered, the average walking time in the plots with a 50 ft. radius is longer and the relevant regression curve lies above that of the 30 ft. plots. The time expended on measuring with prisms is directly related to the number of stems counted rather than to the number of stems per unit area, and the elapsed times increase curvilinearly with increasing count. A cost-effectiveness study showed that the angle count method reduces the cost of field work by approximately 50% when compared with conventional calipering; conversely, for a given sampling cost per unit area, the angle count method yields a higher precision of the estimates. In addition, the application of this method has the advantage of reduced labour requirements. The angle count method can be recommended for application in South African forestry, but additional research is necessary for the construction of tables to convert basal area into volume.
 
Article
The asymptotic null distribution of a statistic for testing the uniformity of the distribution of points on the circumference of a circle is derived. Using Monte Carlo methods, this distribution was found to agree closely with the actual distributions for small sample sizes.
 
Top-cited authors
Donald B. Rubin
  • Harvard University
Scott Zeger
  • Johns Hopkins Bloomberg School of Public Health
Kanti Mardia
  • University of Leeds
David Cox
  • Nuffield College, Oxford
Robin Thompson
  • Rothamsted Research