Technical Report

A Robust Multivariate Estimator with Stepwise Covariate Selection and Inequality Constraints for Complex Sample Surveys: An Initial Concept

Abstract

This Technical Report introduces a new multivariate difference-estimator for complex sample-surveys. It is an alternative to conventional model-assisted estimators that use specific inference. Model-assisted estimators and the new difference-estimator both reduce variance in population estimates for M study-variables by using population statistics for J correlated auxiliary-variables, where M and J can number in the hundreds or thousands. Both are closely related to the difference-estimator offered by Särndal et al. (1992), although the new difference-estimator uses a different stochastic model. Both employ linear transformations of design-based estimators (e.g., Horvitz-Thompson). Both choose coefficients for an M×(M+J) transformation matrix that minimize variance of population estimates for each study-variable, where the degree of variance-reduction depends upon the specific correlation between each study-variable and each auxiliary-variable. Both estimators support expansion factors, which facilitate small-area estimators. The new difference-estimator introduces a novel approach to variance-reduction with auxiliary data. Unlike model-assisted estimators, which require known population parameters for the J auxiliary-variables, the new estimator accommodates sample-survey estimates of those population parameters. Therefore, the new difference-estimator can directly use population estimates for auxiliary-variables from more complex sample-surveys, including components such as multi-phase and multi-stage sampling-designs, cluster plots, interpenetrating panels, and supplemental surveys. The new difference-estimator introduces numerical advances. The model-assisted estimator with specific inference requires inversion of the J×J covariance matrix for population estimates of J auxiliary-variables; and that matrix inverse is infeasible or numerically unstable if the covariance matrix is rank-deficient or ill-conditioned.
The new difference-estimator incorporates a recursive method; it replaces that J×J matrix inverse with up to J scalar inverses. The jth step in the recursion minimizes variances of all M study-variables with the jth scalar auxiliary-residual; and it removes any collinearity between the jth auxiliary-variable and all remaining auxiliary-variables. The recursion ceases if the jth scalar inverse is numerically unstable (i.e., division by a very small number). This suggests the J×J covariance matrix has rank (j−1), and all auxiliary information is essentially exhausted after recursions with the first (j−1) auxiliary-variables. The recursive method used in the new difference-estimator simplifies nonlinear estimation procedures, such as inequality constraints on population estimates for each study-variable and protection from negative variance estimates. The recursive method easily implements procedures that mitigate risks from outliers and overfitting with numerous auxiliary-variables. The recursive method supports stepwise covariate-selection among the auxiliary-variables, which reduces variance for the most important study-variables as identified by the analyst. Internal consistency in statistical tables requires that the sum of population estimates in each row or column equals the population estimate for the corresponding margin in that table. Model-assisted estimators with a generic weight produce internal consistency, but at the cost of statistical efficiency. The new difference-estimator provides an alternative that does not compromise statistical efficiency; it uses recursive raking to sequentially impose equality constraints on each row and column of a statistical table.
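The recursion described in the abstract can be illustrated with a minimal sketch. It is not the report's algorithm: it is the generic sequential scalar update (of the Kalman-filter type), written under the assumption that each auxiliary residual has expected value zero. The function name `recursive_difference_update`, the argument names, and the relative tolerance `tol` are hypothetical, and the sketch omits inequality constraints, outlier mitigation, and covariate selection.

```python
def recursive_difference_update(z, V, m, tol=1e-10):
    """Condition on one auxiliary residual at a time (illustrative sketch).

    z   : list of length m + J. The first m entries are study-variable
          estimates; the last J are auxiliary residuals assumed to have
          expected value zero.
    V   : (m + J) x (m + J) covariance matrix of z, as a list of lists.
    m   : number of study variables.
    tol : hypothetical relative tolerance for an unstable scalar inverse.
    Returns updated copies of z and V, plus the auxiliary indices used.
    """
    z = list(z)
    V = [list(row) for row in V]
    n = len(z)
    v0 = [V[j][j] for j in range(n)]  # original variances, for the tolerance
    used = []
    for j in range(m, n):
        v_jj = V[j][j]
        # Stop when the scalar inverse would divide by a very small number:
        # the remaining auxiliary information is treated as exhausted.
        if v_jj <= tol * v0[j]:
            break
        r_j = z[j]
        row_j = list(V[j])
        gain = [V[i][j] / v_jj for i in range(n)]  # one scalar division per entry
        for i in range(n):
            z[i] -= gain[i] * r_j
            for k in range(n):
                V[i][k] -= gain[i] * row_j[k]
        used.append(j - m)
    return z, V, used


# Made-up example: one study variable and three auxiliary residuals,
# where the third residual duplicates the second (rank-deficient case).
z0 = [10.0, 1.0, 2.0, 2.0]
V0 = [
    [4.0, 1.0, 1.2, 1.2],
    [1.0, 1.0, 0.5, 0.5],
    [1.2, 0.5, 2.0, 2.0],
    [1.2, 0.5, 2.0, 2.0],
]
z1, V1, used = recursive_difference_update(z0, V0, m=1)
print(round(z1[0], 6), round(V1[0][0], 6), used)  # 8.4 2.72 [0, 1]
```

In this made-up example the study estimate moves from 10.0 to 8.4 and its variance falls from 4.0 to 2.72, which matches the batch result obtained by inverting the 2×2 covariance matrix of the first two residuals. The recursion stops at the third residual because its remaining variance is zero after the second step, consistent with a covariance matrix of rank 2.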
... INTRODUCTION Czaplewski (2020) introduced the Generalized Multivariate Difference (GMDe) estimator for complex sample surveys as a general case of the univariate difference estimator described by Hansen et al. (1953) and Särndal et al. (1992:239-242). GMDe is a straightforward linear transformation of a vector of design-based population estimates and its covariance matrix. ...
... In rare cases, GMDe numerical estimates of correlations are not so bounded. Czaplewski (2020) introduces inequality constraints in the recursive GMDe that assure estimated correlations are within feasible bounds. Without this constraint, GMDe estimates of variance statistics can be negative. ...
... The covariance matrix for GMDe in Equation (18) Czaplewski (2020) introduced the recursive GMDe to solve numerical challenges with the "batch" version GMDe in Sections 6 (page 16) and 7.1 (page 23). The primary challenge is the matrix inverse of the correlation matrix $\Lambda_{*}^{-1}$ in Equations (23) and (27) for the auxiliary residuals. ...
Technical Report
The Generalized Multivariate Difference estimator (GMDe) is a broad generalization of the univariate "difference estimator" described by Hansen et al. (1953:250-253) and Särndal et al. (1992:239-242). Difference estimators use population estimates of auxiliary variables to improve population estimates of correlated study variables. Examples of auxiliary variables include administrative records, remotely sensed measurements, and time-series of predictions from deterministic process models (e.g., econometric models, demographic models, forest-stand projection models). GMDe is a multivariate alternative to model-assisted GREG regression estimators for finite populations, such as post-stratification, ratio, regression, lasso, ridge, and elastic net estimators. GMDe does not require model-assisted predictions of a proxy variable, nor does GMDe require known population totals for all auxiliary variables. Much like the composite estimator, GMDe is a simple linear transformation of a vector of population estimates from a probability sample with a design-consistent multivariate Horvitz-Thompson "π-estimator". Therefore, GMDe does not directly use the data matrix of study variables and auxiliary variables for each population element included in the probability sample. This Technical Report derives the M×J matrix of minimum-variance coefficients for each of M study variables and each of J auxiliary variables for the linear transformation in GMDe. The degree of variance reduction with GMDe depends, in part, upon the strength of correlations between study variables and auxiliary variables. Substantial improvements of GMDe relative to the prior π-estimate require relatively strong correlations (e.g., ±0.70 and stronger). In a National Forest Inventory (NFI), remotely sensed auxiliary variables are sufficiently correlated with broad groupings of domains.
However, predictions from deterministic process models might provide auxiliary variables that are more strongly correlated with detailed study variables that change slowly or more predictably over time; and change-detection with remotely sensed data can post-stratify the population into undisturbed strata for which deterministic process models provide stronger predictors. This Technical Report includes a simple example of the recursive version of GMDe, which is a relatively simple estimator for complex sample surveys that include longitudinal surveys for time-series of population estimates; interpenetrating panels; multi-phase and multi-stage sampling; and multiple independent surveys. If a design-based π-estimate is feasible for a vector of study variables and correlated auxiliary variables, then GMDe can use those π-estimates for the population to reduce variances of study variables that are correlated with auxiliary variables. The recursive GMDe can also impose equality and inequality constraints on study variables and mitigate influence of outliers. The recursive GMDe replaces inversion of the J×J partition of the π-covariance matrix for auxiliary residuals with a sequence of J scalar divisions.
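The link between correlation strength and variance reduction can be made concrete with a short worked equation. This is the standard minimum-variance result for one study-variable estimate and one auxiliary residual with expected value zero; the symbols $\hat{y}$, $\hat{e}$, $k$, and $\rho$ are assumptions introduced here for illustration and are not the report's notation.

```latex
\begin{aligned}
\hat{y}_{d} &= \hat{y} - k\,\hat{e},
\qquad
k = \frac{\operatorname{Cov}(\hat{y},\hat{e})}{\operatorname{Var}(\hat{e})}, \\[4pt]
\operatorname{Var}(\hat{y}_{d}) &= \operatorname{Var}(\hat{y})\,\bigl(1-\rho^{2}\bigr),
\qquad
\rho = \frac{\operatorname{Cov}(\hat{y},\hat{e})}{\sqrt{\operatorname{Var}(\hat{y})\,\operatorname{Var}(\hat{e})}}, \\[4pt]
\rho = \pm 0.70 &\;\Rightarrow\; \operatorname{Var}(\hat{y}_{d}) = 0.51\,\operatorname{Var}(\hat{y}),
\qquad
\rho = \pm 0.30 \;\Rightarrow\; \operatorname{Var}(\hat{y}_{d}) = 0.91\,\operatorname{Var}(\hat{y}).
\end{aligned}
```

Under these assumptions, a correlation of ±0.70 roughly halves the variance, while a correlation of ±0.30 removes only about 9% of it.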
... In addition, population estimates from a probability sample and a design-consistent estimator, such as the Horvitz-Thompson estimator, are identified as "π-estimates" from a "π-estimator" respectively. Some derivations and other details are available in Czaplewski (2020a, 2020b, 2021) ...
Technical Report
The Generalized Multivariate Difference estimator (GMDe) is designed for complex longitudinal sample surveys with auxiliary variables. An example is a National Forest Inventory (NFI) that includes annual re‑measurements for panels of sample field plots, plus remotely sensed data for disturbance detection, plus time‑series of predictions from a deterministic ecosystem model for undisturbed plots. GMDe is a multivariate extension of an estimator published by Hansen, Hurwitz and Madow in 1953. GMDe is a robust alternative to model‑assisted and model‑based estimators. GMDe is closely related to the multivariate composite estimator and the Kalman filter update. GMDe is a simple linear transformation of a large vector that contains population estimates for study variables and auxiliary variables in either a finite or infinite population. The initial coefficient matrix in the transformation specifically minimizes the variance for each population estimate. GMDe modifies that matrix to impose inequality constraints and mitigate influence of outliers.
Article
National forest inventories in many countries combine expensive ground plot data with remotely-sensed information to improve precision in estimators of forest parameters. A simple post-stratified estimator is often the tool of choice because it has known statistical properties, is easy to implement, and is intuitive to the many users of inventory data. Because of the increased availability of remotely-sensed data with improved spatial, temporal, and thematic resolutions, there is a need to equip the inventory community with a more diverse array of statistical estimators. Focusing on generalized regression estimators, we step the reader through seven estimators including: Horvitz-Thompson, ratio, post-stratification, regression, lasso, ridge, and elastic net. Using forest inventory data from Daggett County in Utah, USA as an example, we illustrate how to construct, as well as compare the relative performance of, these estimators. Augmented by simulations, we also show how the standard variance estimator suffers from greater negative bias than the bootstrap variance estimator, especially as the size of the assisting model grows. Each estimator is made readily accessible through the new R package, mase. We conclude with guidelines in the form of a decision tree on when to use which estimator in forest inventory applications.
Technical Report
The modified Kalman filter algorithm introduced here is a multivariate model-assisted estimator for complex sample surveys. It uses population estimates for a large vector of auxiliary variables to reduce variance in population estimates for a large vector of study variables. The latter can be elements of detailed statistical tables. Auxiliary variables arise from different sampling phases, stages, panels, and surveys; and they can be census constants from administrative records, remote sensing, and certain types of 'Big Data'. The algorithm incorporates inequality constraints, such as nonnegative population statistics. Other constraints assure nonnegative variance estimates, correlation coefficients bounded by ±1, and statistical tables with additive margins. The algorithm reduces risk of overfitting and outliers with little need for analyst intervention. The case study starts with design-based estimates for a 10-by-10 table of population statistics; sample survey estimates for 500 auxiliary variables; and 50 auxiliary census variables. Distributions of 2,500 simulated vector estimates are compared to their true population parameters. In the case study, the algorithm was numerically accurate and computationally efficient with large vectors and a rank-deficient covariance matrix. There is no evidence for significant bias. The algorithm is relevant to any Kalman filter application that requires large vectors.
Technical Report
This Technical Report documents derivation of the KFz algorithm, which is a robust implementation of the multivariate Kalman Filter. The KFz algorithm is designed for a large state space in which the estimated state covariance matrix is rank deficient. The KFz algorithm includes minimum-maximum inequality constraints on the estimated state variables, mitigation for outliers in the innovation residuals, constraints that assure the estimated variances on the diagonal of the state covariance matrix are non-negative, and constraints that assure feasible Pearson correlation coefficients. This Technical Report includes an Appendix with the necessary R-code and an example of the input spreadsheet that configures the KFz algorithm to the analyst's application. However, this document is not intended as a user's manual for that R-code.
Article
Wall-to-wall remotely sensed data are increasingly available to monitor landscape dynamics over large geographic areas. However, statistical monitoring programs that use post-stratification cannot fully utilize those sensor data. The Kalman filter (KF) is an alternative statistical estimator. I develop a new KF algorithm that is numerically robust with large numbers of study variables and auxiliary sensor variables. A National Forest Inventory (NFI) illustrates application within an official statistics program. Practical recommendations regarding remote sensing and statistical issues are offered. This algorithm has the potential to increase the value of synoptic sensor data for statistical monitoring of large geographic areas.
Article
This working paper describes the potential of the proliferation of new sources of large volumes of data, sometimes also referred to as “big data”, for informing policy making in several areas. It also outlines the challenges that the proliferation of data raises for the production of official statistics and for statistical policies.
Article
Previous studies have utilized ground plots, airborne lidar scanning or profiling data, and space lidar profiling data to estimate biomass across large regions, but these studies have failed to take into account the variance components associated with multiple models because the proper variance equations were not available. Previous large-domain studies estimated the variances of their biomass density estimates as the sum of the GLAS sampling variability plus the model variability associated with the models that predict airborne lidar estimates of biomass density (Y) as a function of satellite lidar measurements (X). This approach ignores the additional variability associated with the predictive models used to estimate ground biomass density as a function of airborne lidar measurements. This paper addresses that shortcoming. Analytic variance expressions are provided that include sampling variability and model variability in situations where multiple models are employed to generate estimates of biomass. As an example, the forest biomass of the continental US is estimated, by forest stratum within state, using a space lidar system (ICESat/GLAS). An airborne laser system (ALS) is used as an intermediary to tie the GLAS measurements of forest height to a small subset of US Forest Service (USFS) ground plots by flying the ALS over the ground plots and, independently, over individual GLAS footprints. Two sets of models are employed to relate satellite measurements to the ground plots. The first set of equations relates USFS ground plot estimates of total aboveground dry biomass density (Y1) to spatially coincident ALS forest canopy measurements (X1). The second set of models predicts those ALS canopy height measurements (X1) used in the first set of models to GLAS waveform measurements (X2). The following important conclusions are noted. 
(1) The variability associated with estimation of the plot-ALS model coefficients is significant and should be included in the overall estimate of biomass density variance. In the continental US, the total variance of mean forest biomass density (98.06 t/ha) increases by a factor of 3.6, i.e., from 1.91 to 6.94 t²/ha², when plot-ALS model variance is included in the calculation of total variance. (2) State-level results are more variable, but on average, the percent model variance at the state level, i.e., (model variance / total variance) × 100, increases from 16% to 59% when plot-ALS model variance is included. (3) The overall model variance is driven in large part by the number of plots overflown by the ALS and the number of GLAS pulses overflown by the ALS. Given a choice of improving precision by either increasing the number of plot-ALS observations or increasing ALS-GLAS observations, there is no obvious benefit to selecting one over the other. However, typically the number of ground plots overflown is the limiting factor. (4) If heteroskedasticity is evident in either the ground-air or air-satellite models, it can be modeled using weighted regression techniques and incorporated into these model variance formulas in a straightforward fashion. The results are unambiguous; in a hybrid three-phase sampling framework, both the ground-air and air-satellite model variance components are significant and should be taken into account.
Article
Context: This study concerns model-based inference for estimating growing stock volume in large-area forest inventories, combining wall-to-wall Landsat data, a sample of laser data, and a sparse subsample of field data. Aims: We develop and evaluate novel estimators and variance estimators for the population mean volume, taking into account the uncertainty in two model steps. Methods: Estimators and variance estimators were derived for two main methodological approaches and evaluated through Monte Carlo simulation. The first approach is known as two-stage least squares regression, where Landsat data were used to predict laser predictor variables, thus emulating the use of wall-to-wall laser data. In the second approach laser data were used to predict field-recorded volumes, which were subsequently used as response variables in modelling the relationship between Landsat and field data. Results: The estimators and variance estimators are shown to be at least approximately unbiased. Under certain assumptions the two methods provide identical results with regard to estimators and similar results with regard to estimated variances. Conclusion: We show that ignoring the uncertainty due to one of the models leads to substantial underestimation of the variance, when two models are involved in the estimation procedure.
Article
A four-phase sampling method using Landsat, high-altitude color infrared photography, low-altitude color infrared photography, and ground samples was tested in interior and coastal Alaska from 1982 through 1986. Ratio and regression estimators were applied using variables at the remote sensing phases as covariates related to variables of interest measured on the ground. The four-phase sampling strategy yielded more efficient estimators in coastal Alaska's more heavily timbered area than in interior Alaska's highly heterogeneous vegetation complexes. Advantages and disadvantages associated with each of the phases of the study are presented, as well as a general evaluation of the entire system.