
When Prediction is Not Time Series Forecasting

Abstract

unpublished note on 'estimation': cross-sectional v time series - Abstract: Statistical terminology is often confusing or even misleading. Consider "ignorable nonresponse," which is not "ignorable" at all. Perhaps the worst offender is so-called "significance." Here we will discuss the problematic word, "prediction." It has been discussed on ResearchGate, and here we will concentrate on the incompatibility of time series forecasting for individual cases, with prediction for cross sectional surveys when estimating totals for continuous data from sampling (in surveys, econometrics applications, experiments, etc.). For prediction, regressor data are on the population, from one or more other sources. Unfortunately, "prediction," such as used in model-based survey estimation, is a term that is often subsumed under the term "forecasting," but here we show why it is important not to confuse these two terms. [Consider, in a sense, modeling an individual (forecasting), versus modeling a group reaction (for a cross-sectional survey).]
unpublished note When Prediction is not Forecasting
James R. Knaub, Jr … April 2015
Page1
When Prediction is Not Time Series Forecasting:
Note on Forecasting v Prediction in Samples for Continuous Data
James R. Knaub, Jr.
April 25, 2015
Abstract:
Statistical terminology is often confusing or even misleading. Consider "ignorable nonresponse,"
which is not "ignorable" at all. Perhaps the worst offender is so-called "significance." Here we
will discuss the problematic word, "prediction." It has been discussed on ResearchGate, and here
we will concentrate on the incompatibility of time series forecasting with prediction when
estimating totals for continuous data from sampling (in surveys, econometrics applications,
experiments, etc.). Here regressor data are on the population, from one or more other
sources. Unfortunately, "prediction," such as used in model-based survey estimation, is a term
that is often subsumed under the term "forecasting," but here we show why it is important not
to confuse these two terms.
Difference between
Prediction for Surveys/Econometrics/Experiments/etc.,
and Forecasting, particularly from a Time Series
Often, with continuous data, especially for official statistics, we may, for example, have repeated
establishment census surveys and more frequently repeated establishment sample
surveys. Here we may have regressor data on a population which may be related to a sample
survey, collected with or without randomization. (See Knaub(2015).) The regression used often
involves just one regressor, regression through the origin (a ratio model) - see Brewer (2002) -
and under perhaps quite reasonable conditions, the classical ratio estimator (CRE) can be "...hard
to beat..." (Cochran(1977), page 160).
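The model-based classical ratio estimator mentioned above can be illustrated with a minimal sketch in Python. The establishment values below are hypothetical, the function name cre_predict_total is an assumption made only for illustration, and the ratio model is assumed to hold for the whole group. This is not code from any of the cited references.

```python
def cre_predict_total(sample_x, sample_y, nonsample_x):
    """Model-based classical ratio estimator (regression through the origin).

    sample_x, sample_y: regressor and observed values for cases in the sample.
    nonsample_x: regressor values for cases with no observed y.
    """
    b = sum(sample_y) / sum(sample_x)           # ratio (slope) estimate
    predictions = [b * x for x in nonsample_x]  # one "prediction" per missing y
    total = sum(sample_y) + sum(predictions)    # observed plus predicted
    return b, predictions, total


# Hypothetical data: x from a previous census, y from the current sample.
sample_x = [100.0, 80.0, 50.0]
sample_y = [110.0, 84.0, 56.0]
nonsample_x = [20.0, 10.0, 16.0]

b, predictions, total = cre_predict_total(sample_x, sample_y, nonsample_x)
print(round(b, 4), round(total, 1))
```

Note that the estimated total is algebraically the same as b multiplied by the population total of x, which is the familiar form of the ratio estimator.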
When a regression is used to impute for missing data, whether due to nonresponse or mass
imputation for cases not in the sample, we call each estimate in place of an observed value a
"prediction." An idea of the error for these "predicted" values may be obtained through the
variance of the prediction error, noted, for example, in Maddala(2001), the square root of which
is generated, for example, as STDI in SAS PROC REG. Knaub(1999) shows the difference between
the estimated variance of the prediction error for an individual case as opposed to estimating
finite population totals, in the first several pages.
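The distinction between the prediction error for an individual case and for a total can be sketched as follows, assuming the ratio model y = bx + e with the variance of e proportional to x. The data and function name are hypothetical, and the formulas are the standard ones for this model rather than a transcription of Knaub(1999). The square root of each individual-case value is the analogue of the standard error of an individual predicted value described above.

```python
import math


def cre_prediction_variances(sample_x, sample_y, nonsample_x):
    """Estimated prediction-error variances under y_i = b*x_i + e_i,
    assuming Var(e_i) = sigma2 * x_i (the model behind the ratio estimator)."""
    n = len(sample_x)
    sum_sx = sum(sample_x)
    b = sum(sample_y) / sum_sx
    # Weighted residual mean square estimates sigma2 (n - 1 degrees of freedom).
    sigma2 = sum((y - b * x) ** 2 / x
                 for x, y in zip(sample_x, sample_y)) / (n - 1)
    # Individual case: its own error term plus the error from estimating b.
    individual = [sigma2 * (x + x * x / sum_sx) for x in nonsample_x]
    # Total over all missing cases: the error in b is shared by every case,
    # so it is multiplied by (sum of x)^2 rather than by the sum of x^2.
    sum_rx = sum(nonsample_x)
    total = sigma2 * (sum_rx + sum_rx ** 2 / sum_sx)
    return b, sigma2, individual, total


# Hypothetical data: x from a previous census, y from the current sample.
sample_x = [100.0, 80.0, 50.0]
sample_y = [110.0, 84.0, 56.0]
nonsample_x = [20.0, 10.0, 16.0]

b, sigma2, individual, total_var = cre_prediction_variances(
    sample_x, sample_y, nonsample_x)
estimated_total = sum(sample_y) + b * sum(nonsample_x)
rse_percent = 100.0 * math.sqrt(total_var) / estimated_total
print([round(math.sqrt(v), 3) for v in individual], round(rse_percent, 3))
```

Because the error in the estimated slope is common to all predicted cases, the variance for the total is larger than the sum of the individual-case variances, so the two should not be interchanged.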
For regression modeling, variance has been noted above, but bias comes from model misspecification. See
Shmueli(2010). Note also that Shmueli seems to use the term "prediction" in place of forecasting,
without noting work in survey prediction and econometrics where many authors (see
Brewer(1999), for example) use "prediction" as stated here. As in forecasting, the simpler the
model, the more general it is, and the less susceptible it is to overfitting to a given set of test data.
Both prediction (from a regression model-based sample survey estimator) and time series
forecasting involve regression, but forecast modeling is typically more complex (sometimes
overly so). A prediction regression model looks at a given sample and estimates/'predicts' for
data not observed in the sample. A forecasting regression model looks at a time series of
responses by a given respondent, and from the trend (sometimes accounting for phenomena
such as seasonality), forecasts the next response or series of responses, for that given
respondent, to that given question.
Thus, a prediction for a member of a current population that is not in the currently observed
sample, is based on the regression relationship between all members of the sample and some
given set of regressor data - or more sets if there are more regressors. It is therefore important
to stratify these data into groups for which a single simple model fits well. These groupings may
not technically be strata if stratification is defined only to enhance a more aggregate level
estimate, as each group might be published separately, and a small area estimation scheme could
be involved (see Knaub(1999)).
At any rate, subpopulation groups are formed, only one model is used per question within each
group, and one is interested in predicting for missing data based on regressor data, perhaps
from a previous census of the same data elements. This may happen in official statistics when
there is an annual census and monthly samples, or a monthly census and weekly samples.
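The idea of one model per group can be sketched as follows, with a separate ratio fitted within each group and used only for the missing cases in that group. The group labels, values, and function name are hypothetical, and a ratio model through the origin is assumed within every group.

```python
from collections import defaultdict


def groupwise_cre(records):
    """records: list of (group, x, y) tuples, with y = None when y is missing.

    Returns the ratio slope and the estimated total for each group.
    Assumes every group has at least one observed y; otherwise no slope
    can be estimated for that group and a KeyError is raised.
    """
    sum_x = defaultdict(float)
    sum_y = defaultdict(float)
    for group, x, y in records:
        if y is not None:
            sum_x[group] += x
            sum_y[group] += y
    slopes = {g: sum_y[g] / sum_x[g] for g in sum_x}

    totals = defaultdict(float)
    for group, x, y in records:
        # Observed values are kept; missing values are predicted from x.
        totals[group] += y if y is not None else slopes[group] * x
    return slopes, dict(totals)


# Hypothetical records: x from a previous census, y from the current sample.
records = [
    ("group_1", 10.0, 20.0), ("group_1", 20.0, 40.0), ("group_1", 5.0, None),
    ("group_2", 100.0, 90.0), ("group_2", 50.0, 45.0), ("group_2", 40.0, None),
]
slopes, totals = groupwise_cre(records)
print(slopes, totals)
```

In this toy case, a single pooled ratio would overpredict for one group and underpredict for the other, which is why the grouping matters.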
However, when forecasting, one is often only looking at a given single response from a given
respondent (though it could be at a more aggregate level), and at each point in a time series, the
respondent has a value that contributes to that series - or a missing value there as well - and a
trend is approximated, such that one might forecast a coming response. This, of course, means
that any unanticipated change that may impact the series will contribute unknown error, which
is most substantial if those unanticipated changes are of primary importance. Thus the further
forward a forecast, the worse it might become, not only because of variance, but also because
changes that could greatly impact the model could occur at any time.
Forecasting may help in planning, and if not too long term, may often be fairly accurate, as long
as a change point in the time series has not occurred. But if you are looking at a current sample
and want to estimate for missing data, prediction is what is needed, especially if you are
interested in a change, say a market change in an economic application, that is currently
occurring. A forecast cannot know about such a change.
To summarize, a prediction based on a relationship between a sample and regressor data is used
to basically impute for missing data, whatever the cause. Grouping data for modeling purposes
is very important, so the concept of stratification is important. However, a forecast for a missing
value in a sample is based only on the past history of that respondent, for that question. It is only
possible if there is a history (time series) for that respondent, so can only be used for imputation
for nonresponse or edit failed data. If there are missing values in the time series for that
respondent, that increases uncertainty associated with such an imputed value, and if there is a
break in the time series due to any current event, that may greatly decrease accuracy. This is not
a problem with prediction as described. It uses the current data in the regression.
So, although the prediction and forecasting described above both use regression, it is applied
very differently. Do you want to know a forecast, or are you looking at what is currently
occurring? Using the term "prediction" may be confusing for the latter, especially as it may be
confused with time series forecasting, but these are the terms with which we are 'stuck.'
Consider: "Prediction" does not mean the same thing in Brewer(1999), as in Shmueli(2010),
though from their titles, Design-based or prediction-based inference? and “To Explain or to
Predict?” respectively, one might expect that they did. But as will be seen below, we have to
distinguish between prediction as typically used in survey statistics for current data, and time
series forecasting. The error structures are very different.
Relevant online literature: Prediction v Forecasting
OECD Definition, and Prediction Examples from Michigan State University and Onlinestatbook
So, “prediction” can mean “forecasting,” but in statistics, it can have another meaning, as noted
in OECD(2005): It can be the value found for y in a regression equation, based on the regressor
data x, whether or not there is a “temporal element.” (Note
https://stats.oecd.org/glossary/detail.asp?ID=3792 in the references.)
In model-based estimation for survey sampling, one finds that it is the latter definition that is
used, and we should not confuse this survey definition of “prediction” with time series
“forecasting” for two reasons: (1) they involve very different types of regression application and
theory, and (2) the estimates for y and variances involve very different, incompatible
structures/mechanisms, with errors based on different manners of variation, estimated for
different purposes. Further, a time series forecast will ignore current changes, as it does not
use the current data for the variables of interest.
In an Internet search, one may also find examples showing how to interpret a prediction which
is not a time series forecast. One such example (for illustration, though it does not show the
heteroscedasticity that would realistically be present in general) is found at
Stocks(1999). There you will find a nice explanation and illustration of the “regression
(prediction) line.” (See https://www.msu.edu/user/sw/statrev/strv203.htm.)
Among other discussions/examples one may also find available on the internet, which may be of
interest, consider Onlinestatbook(2015). This resource shows an interesting reversal of usual
heteroscedasticity, comparing GPAs between high school and college (near the end of
http://onlinestatbook.com/2/regression/intro.html). One might think of this example as a kind
of forecast since you can use the high school data as regressor data and college data for y, and in
this case the regressor data occur first and might 'forecast' the college GPA, but it is definitely
not a time series forecast. Also, in other examples, both x regressor data and y sample data may
exist in the same time frame, as when electric plant capacity is used to predict for missing
generation data in a given, single time frame. But often, a previous census is used as regressor
data to be able to “predict” for missing data from a current sample, as in the electric sales data
illustrated in a scatterplot on the last page of Knaub(2013). This is prediction as found in survey
statistics, not a time series forecast. Here heteroscedasticity means an increase in the variance
of the prediction error for y, as x becomes larger. However, in the scatterplot on the last page of
Knaub(2014), the errors are small, and heteroscedasticity may be visually imperceptible, but still
mathematically present in that establishment survey example with real data. Note that
stratification or otherwise grouping of data is important, so that one model (regression) applies
per group. The scatterplot shown at the end of Knaub(2013) is for one such data group. Usually
for these cases, we have one very good regressor, often the same data element in a previous,
less frequently collected census, as stated, but not always. Often, for electric generation, the
same data element/variable from a previous census is the best regressor, but for wind-powered
generation, nameplate capacity can do much better. (Sometimes, multiple regression might
help, say for fuel switching: Knaub(2003).) Often the model-based classical ratio estimator (CRE)
is very robust for these data. (See Knaub(2005), Knaub(2013), Brewer(2002), and consider page
160 in Cochran(1977).)
So, to review, although using a high school GPA as a predictor for college GPA may sound like
forecasting something you will discover later, it is not time series forecasting. Here, and in the
survey sampling context, "prediction" is the correct term, even if you never obtain those
observations - and unless you are studying test data, you generally never will obtain observations
to compare to these “predicted” values. (However, as in design-based sampling and estimation,
testing is very important.)
Aside: The GPA example illustrates very well the idea of prediction that is not a time series
forecast. However, it also has some interesting peculiarities: Both x and y are limited in range
from 0 to 4. Generally, as x becomes larger, the standard error of the prediction error for y
becomes larger, so the coefficient of heteroscedasticity (Knaub(2011a)) is greater than 0, even
though the ratio of standard error of prediction error of y to y may generally become smaller
with larger x. Here, that coefficient of heteroscedasticity would be a negative number. (?) But
this may make sense, because the high school GPA values are from a special subpopulation of
high school students: those motivated and able to attend college, or pressured and able to attend
college. No high school GPAs from those who did not go to college can be used. (Thus one should
not expect prediction of college GPAs for those who did not attend to be very accurate! This
subgroup has no data.) The peculiar heteroscedasticity may be because those more motivated
in high school who go to college tend to be more motivated there also, and variance of the
prediction error for y is reduced at higher x, rather than the usual increase. This situation also
produces a peculiar intercept, if using linear regression with an intercept term, as done in the
online example. That intercept is very large here, and may be interpreted as saying that if one
had a GPA of zero in high school, but somehow went to college (?), her/his predicted college GPA
would be about 1.1. This is a bit nonsensical. The problem appears to be that logically, this
regression should go through the origin. It may be somewhat nonlinear, but with such large
residuals nearer to the origin that this cannot be readily shown. A ratio estimate through the
origin, with negative coefficient of heteroscedasticity (alternative ratio estimators are discussed
in Sarndal, Swensson, and Wretman(1992)), may really be best here. Comparing the estimated
variances of the prediction errors between such a heteroscedastic regression model through the
origin, and the simple regression model that was used, would be interesting. Also, in the model,
as used online, it might be instructive to know how the estimated standard error of the intercept
compared to the magnitude of the estimated intercept.
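The comparison suggested in this aside can be sketched as follows. The GPA pairs are hypothetical and are not the data from the online example. The through-origin slope uses weights proportional to x raised to the power -2*gamma, where gamma stands for the coefficient of heteroscedasticity, under the assumption that the standard deviation of the error is proportional to x raised to the power gamma. The function names are assumptions made only for illustration.

```python
def ols_with_intercept(x, y):
    """Ordinary least squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def origin_slope(x, y, gamma):
    """Weighted least squares slope through the origin.

    Assumes the error standard deviation is proportional to x**gamma,
    so each case gets weight x**(-2*gamma). Requires all x > 0.
    gamma = 0.5 reproduces the classical ratio estimator slope.
    """
    weights = [xi ** (-2.0 * gamma) for xi in x]
    numerator = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    denominator = sum(w * xi * xi for w, xi in zip(weights, x))
    return numerator / denominator


# Hypothetical high school (x) and college (y) GPAs.
hs_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
college_gpa = [2.4, 2.3, 2.9, 3.2, 3.8]

intercept, slope = ols_with_intercept(hs_gpa, college_gpa)
slope_cre = origin_slope(hs_gpa, college_gpa, 0.5)   # classical ratio estimator
slope_neg = origin_slope(hs_gpa, college_gpa, -0.5)  # variance shrinking with x
print(round(intercept, 3), round(slope, 3),
      round(slope_cre, 3), round(slope_neg, 3))
```

A negative gamma gives more weight to the cases with larger x, which matches the situation described above where the prediction error variance decreases at higher x.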
Example from US Energy Information Weekly Petroleum Sample Surveys
Weekly sampling of petroleum information occurs at the US Energy Information Administration
(EIA) in various surveys which basically represent strata for overall categories that could have
been part of a larger survey. Single regressor data are available for the same data elements (same
variables) from previous monthly census surveys. Thus a (robust) classical ratio estimator (CRE)
is used to relate monthly census and weekly sample data. However, referring to chapter 4 of
Lohr(2010), this has been documented at the EIA in the design-based CRE format, but applied in
the model-based CRE format, and the EIA also currently fails to supply relative standard error
(RSE) estimates, as could easily be done. (See Knaub(2011b).) The EIA also uses exponential
smoothing time series forecasts to impute for nonresponse, or replace edit rejected data. Such
forecasts are often very close to the predicted values for each case, but the forecasts must of
necessity (i.e., by definition) have generally inferior performance when the market changes in such
a way that will impact the variable of interest, say stock level for a given petroleum product.
Forecasting is completely incapable of discerning a sudden change in the current market - it does not use
the data relevant to that - so that in such cases, though forecasts and predictions may often have
nearly identical results, and time series forecasts can sometimes even perform better, this is not
what will happen, in general, when it most matters.
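A toy illustration of this point follows. It is a minimal sketch with hypothetical data, not the EIA procedure: the smoothing constant of 0.5 is arbitrary, the establishment names and values are invented, and the market shift is idealized so that every current respondent moves by the same ratio relative to the previous census.

```python
def ses_forecast(history, alpha=0.5):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1.0 - alpha) * level
    return level


# Hypothetical previous monthly census values (regressor x).
census_x = {"A": 100.0, "B": 80.0, "C": 50.0, "D": 40.0}

# Establishment D does not respond this week; its past weekly responses were stable.
history_d = [39.0, 41.0, 40.0, 40.0]

# A sudden market change raises current levels by 25 percent.
shift = 1.25
current_y = {name: shift * census_x[name] for name in ("A", "B", "C")}
true_d = shift * census_x["D"]  # known here only because the data are simulated

# Prediction: ratio of current respondents' y to their census x, applied to D.
b = sum(current_y.values()) / sum(census_x[name] for name in current_y)
prediction_d = b * census_x["D"]

# Forecast: based only on D's own history, so it cannot reflect the change.
forecast_d = ses_forecast(history_d)

print(true_d, prediction_d, forecast_d)
```

Had there been no shift, the forecast and the prediction would have been nearly identical here, which is consistent with the remark that the two often agree except when it most matters.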
Put another way, it is not logical to mix regression-model/prediction-based estimation for missing
current survey data (whether missing because it was not in the sample, or because of
nonresponse), with time series forecasting imputation for nonresponse. They have different
goals. The former has sampling error variance due to missing data from the current sample. The
latter has variance due to time series trends, and at crucial times would be substantially biased.
The variances are not compatible: the time series cases assume a trend change cannot happen,
so any attempt to bootstrap an overall variance could be greatly misleading.
If you have currently collected data for a current sample or partial census, you want imputed
data - whether for nonresponse or out-of-sample cases - to be compatible with the other current
data, not a forecast which makes no use of current data. To confuse these is a classic case of
"mixing apples and oranges."
Conclusions regarding difference between time series forecast and prediction
for missing data in a current survey:
A time series forecast is based on past trends for a single response data series. Autocorrelation
is an important consideration. Seasonality may be taken into account. Exponential smoothing can
be done to give more emphasis to more recent responses. The first forecast (after the end of the
response series to date) will be the best estimate, with forecasts deteriorating as you look further
into the future. The biggest problem will be break points in the coming series that are unknown
to the forecasting mechanism. A forecast is completely blind to that. Thus, for example, if a new
influence comes into play, say a new invention or a major merger within a market, a time series,
which can only be based on past history, cannot know this. Therefore, if one were to use a time
series forecast of any kind to impute for a nonresponse in a current sample survey or partial
census survey, it may often be acceptable, but not when it really matters. That is, a current
survey is used to learn what is happening ... well ... currently! If one is trying to be alert to a
current change in a market, collecting current data, one would wish to substitute predictions, for
all missing data, that are based on the current data collected as modeled by one or more sets of
regressor data. A time series forecast that looks at only the history of a nonrespondent will not
do this. Further, any variance estimate will be for forecasting, knowing that breaks in the series
are ignored, and such variance estimates are not compatible with variance estimates for the
current data. When you are collecting current data, and some are missing, you want to predict
(estimate really) for those missing data by taking advantage of the other current data that you
have, and how they relate to regressor data. If you are only interested in a forecast, you have
data for that before starting to collect the data of interest.
A further, unrelated problem with forecasting for imputation for nonresponse and edit-failed
data is that these are generally the worst time series data available. Those who fail to respond or
who provide low quality data are not likely to do so only once. If they fail to respond for more
than one cycle in a row, you probably will not even get a good forecast.
A prediction for missing data in a current survey is based on a group relationship - so good
grouping or stratification is important - between the sample data (or partial census), and one or
more regressors. This relationship, unlike a time series regression for individual respondent
history, is based instead on the current data that have been collected. (Better than PPS sampling,
you can customize size measures in estimation by question on a survey, not having to rely on one
size measure for all.)
So, a current sample survey is not compatible with a forecast. But it is compatible with prediction
as defined by model-based survey sampling inference. Further, because the model-based
classical ratio estimator (CRE) is quite robust, and good groupings/stratification can greatly
reduce any nonignorability in the predictions (i.e., we use one unique mechanism/regression
model per group), one may successfully use prediction/model-based estimation for all missing
data. Use of forecasting for nonrespondents is not compatible, ignores the goal of the survey to
determine the current circumstances, and is thus a needless, counterproductive
complication. Further, Knaub(2013) can be used to find effective sample sizes by independent
group, knowing that one may need to experiment iteratively with its application to obtain a good
estimate of overall sample size needs. For establishment surveys, a stratified cutoff sample using
the CRE may often be useful. See Karmel and Jain(1987), on stratification by size, and
Knaub(2014) regarding stratification by category (such as type of oil/gas well, with depth of well
as a proxy measure for grouping, when considering production, say from shale versus non-shale
[traditional] wells).
Summary:
A time series forecast is for an individual series over time, not including current data, with
possible autocorrelation, including possibly other regressors and complexities, and a generally
more complex error structure. However, a prediction of y by one or more regressors that is not
‘temporal’ in nature is based on observations by group/strata. The error structure is generally
less complex, but is virtually never homoscedastic. (See Knaub(2007).) This form of prediction is
found in survey statistics, econometrics, and other fields when we need to estimate for missing
data from a cross sectional data set, for any case where there are missing data from among a set
of respondents or experimental subjects. Many refer to prediction and forecasting
interchangeably, but it is best to maintain a distinction, as shown in this paper. These two types
of regression are not compatible, and it is not logical to mix them.
Consider survey statistics, for example: A (time series) forecast estimates what will happen if a
pattern remains the same. But what if the point is to detect when a change has just now
occurred? Then a ‘non-temporal’ prediction is needed, such as the CRE, not a forecast, such as
exponential smoothing.
References
Brewer, K.R.W. (1999). Design-based or prediction-based inference? stratified random vs
stratified balanced sampling. Int. Statist. Rev., 67(1), 35-47.
Brewer, KRW (2002), Combined survey sampling inference: Weighing Basu's elephants, Arnold:
London and Oxford University Press.
Cochran, W.G.(1977), Sampling Techniques, 3rd ed., John Wiley & Sons.
Karmel, T.S., and Jain, M. (1987), "Comparison of Purposive and Random Sampling Schemes for
Estimating Capital Expenditure," Journal of the American Statistical Association, Vol.82, pages 52-57.
Knaub, J.R., Jr. (1999), “Using Prediction‐Oriented Software for Survey Estimation,” InterStat,
August 1999,
http://interstat.statjournals.net/YEAR/1999/abstracts/9908001.php?Name=908001 Short
version: “Using Prediction‐Oriented Software for Model‐Based and Small Area Estimation,”
Proceedings of the Survey Research Methods Section, American Statistical Association,
http://www.amstat.org/sections/srms/proceedings/papers/1999_115.pdf
https://www.researchgate.net/publication/261586154_Using_Prediction-Oriented_Software_for_Survey_Estimation
Knaub, J.R., Jr. (2003), “Applied Multiple Regression for Surveys with Regressors of Changing
Relevance: Fuel Switching by Electric Power Producers,” InterStat, May 2003,
http://interstat.statjournals.net/YEAR/2003/abstracts/0305002.php?Name=305002
https://www.researchgate.net/publication/261586154_Using_Prediction-Oriented_Software_for_Survey_Estimation
Knaub, J.R., Jr. (2005), “‘Classical Ratio Estimator’ (Model-Based),” InterStat, October 2005,
http://interstat.statjournals.net/YEAR/2005/abstracts/0510004.php?Name=510004
https://www.researchgate.net/publication/261474011_The_Classical_Ratio_Estimator_%28Model-Based%29
Knaub, J.R., Jr. (2007), “Heteroscedasticity and Homoscedasticity” in Encyclopedia of
Measurement and Statistics, Editor: Neil J. Salkind, Sage Publications, Vol. 2, pp. 431-432.
https://www.researchgate.net/publication/262972023_HETEROSCEDASTICITY_AND_HOMOSCEDASTICITY
Knaub, J.R., Jr. (2011a), “Ken Brewer and the Coefficient of Heteroscedasticity as Used in Sample
Survey Inference,” Pakistan Journal of Statistics, Vol. 27(4), 2011, 397‐406, invited article for
special edition in honor of Ken Brewer’s 80th birthday, found at
http://www.pakjs.com/journals/27(4)/27(4)6.pdf .
https://www.researchgate.net/publication/261596397_KEN_BREWER_AND_THE_COEFFICIENT_OF_HETEROSCEDASTICITY_AS_USED_IN_SAMPLE_SURVEY_INFERENCE
Knaub, J.R., Jr.(2011b), “Some Proposed Optional Estimators for Totals and their Relative
Standard Errors for a set of Weekly Cutoff Sample Establishment Surveys,” InterStat, July 2011,
http://interstat.statjournals.net/YEAR/2011/abstracts/1107004.php?Name=107004 .
https://www.researchgate.net/publication/261474159_Some_Proposed_Optional_Estimators_for_Totals_and_their_Relative_Standard_Errors_for_a_set_of_Weekly_Quasi-Cutoff_Sample_Establishment_Surveys
Knaub, J.R., Jr. (2013), “Projected Variance for the Model‐Based Classical Ratio Estimator:
Estimating Sample Size Requirements,” to be published in the Proceedings of the Survey Research
Methods Section, American Statistical Association,
https://www.amstat.org/sections/SRMS/Proceedings/y2013/Files/309176_82260.pdf,
https://www.researchgate.net/publication/261947825_Projected_Variance_for_the_Model-based_Classical_Ratio_Estimator_Estimating_Sample_Size_Requirements
Knaub, J.R., Jr. (2014), “Efficacy of Quasi‐Cutoff Sampling and Model‐Based Estimation For
Establishment Surveys and Related Considerations,” InterStat, January 2014,
http://interstat.statjournals.net/YEAR/2014/abstracts/1401001.php
https://www.researchgate.net/publication/261472614_Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations
Knaub, J.R., Jr.(2015), “Short Note on Various Uses of Models to Assist Probability Sampling and
Estimation,” unpublished note on ResearchGate.
https://www.researchgate.net/publication/274704886_Short_Note_on_Various_Uses_of_Models_to_Assist_Probability_Sampling_and_Estimation
Lohr, S.L.(2010), Sampling: Design and Analysis, 2nd ed., Brooks/Cole.
Maddala, G.S. (2001), Introduction to Econometrics, 3rd ed., Wiley.
OECD(2005), Organization for Economic Cooperation and Development (OECD) Glossary of
Statistical Terms, https://stats.oecd.org/glossary/detail.asp?ID=3792, last updated August 11,
2005, citing this source: A Dictionary of Statistical Terms, 5th edition, prepared for the
International Statistical Institute by F.H.C. Marriott. Published for the International Statistical
Institute by Longman Scientific and Technical.
Onlinestatbook(2015), “Introduction to Linear Regression,” Online Statistics Education: An
Interactive Multimedia Course of Study, Downloaded April 24, 2015, Developed by Rice University
(Lead Developer), University of Houston Clear Lake, and Tufts University.
http://onlinestatbook.com/2/regression/intro.html,
home page: http://onlinestatbook.com/2/index.html
Särndal C.-E, Swensson B., and Wretman, J.(1992), Model Assisted Survey Sampling, Springer.
Shmueli, G.(2010), “To Explain or to Predict?” Statistical Science, Vol. 25, No. 3 (August 2010), pp.
289-310. Published by: Institute of Mathematical Statistics. Article Stable URL:
http://www.jstor.org/stable/41058949
Stocks, J.T.(1999), “Correlation and Regression: The Regression (Prediction) Line,” Basic Stats
Review, Michigan State University, https://www.msu.edu/user/sw/statrev/strv203.htm,
Home page: https://www.msu.edu/user/sw/statrev/strev.htm
Please note that there have been question and answer threads on ResearchGate regarding the
definitions of "forecasting," and "prediction" that may be of interest to the reader.
... However, as seen in the example in the appendix of Shmueli(2010), where she references other sources, it appears to be applicable to cross-sectional surveys of interest to this paper as well. (See Knaub(2015) for some clarity on "prediction" which is not "forecasting.") But it appears, from internet searches that most users, or at least many of them, think in terms of machine learning when the term "bias-variance tradeoff" is used. ...
Research
Full-text available
Cutoff sampling, and multiple-attribute quasi-cutoff sampling for multiple variables of interest, can often be the most accurate of alternatives for estimation of totals from finite populations, especially for use with highly skewed establishment surveys, when prediction (i.e., model-based estimation) is used for these cross-sectional surveys. Papers are referenced in a short review discussing why this is the case, and emphasis should be given to the interaction of model-based sampling and estimation (i.e., prediction), regarding variance and bias, considering both sampling, and nonsampling error. Conventional thinking puts too much emphasis on randomization in cases where such thinking is not to overall advantage: small samples and highly skewed populations with obvious linear regression through (actually up to) the origin, a strongly correlated regressor, and information necessary for stratification where needed. A total survey error approach should be considered. Good regressor data are important and often available for official statistics. It should also be noted that a model is often the key to overcoming an 'unrepresentative sample,' chosen at random, by employing model-assisted design-based methods, so one should not underestimate the power of regression modeling to improve 'representativeness,' under any useful definition of the word, 'representative.' The advantages and disadvantages of balanced sampling are also noted, again considering bias and variance. Prediction performs well with a version of small area estimation (SAE) noted here, and in references, and again variance and bias should be considered. Further, the somewhat inverse relationship of stratification and SAE is seen. The concept of a bias-variance decomposition and tradeoff for modeling is explored in a different context, and related to the choice of model-based sampling and estimation. Many times a decision can be some kind of " bias-variance tradeoff. 
" This is often found in the literature under the quite different heading of " statistical learning " for model selection, and this relationship is explored.
Presentation
Full-text available
Invited presentation on quasi-cutoff sampling and prediction, as used for multiple establishment surveys at the US Energy Information Administration (EIA). [This could be considered to be a kind of imputation when 'growth rates' for the smallest establishments are not so much different from others that they substantially change small area totals. - See graphics! - Improvements may be substantial if better data/model groupings are found.] This presentation provides background and introductory information, and explains the use of ratio estimation, particularly the classical ratio estimator, with quasi-cutoff sampling. This is especially useful for highly skewed establishment sample surveys. Applications to US Energy Information Administration production of Official Statistics for energy data are given in their historical context. ---- Note that quasi-cutoff sampling is used to accommodate multipurpose surveys. See, for example, Knaub J.R., Jr. (2011). Cutoff sampling and total survey error. Journal of Official Statistics, Letter to the Editor, 27(1), 135-138. https://www.researchgate.net/publication/261757962_JOS_Letter_-_Cutoff_Sampling_and_Total_Survey_Error. ... [Note that the US Energy Information Administration reports on very many small populations. Consider also that in Brewer, K.R.W. (2014), “Three controversies in the history of survey sampling,” Survey Methodology, (December 2013/January 2014), Vol 39, No 2, pp. 249-262. Statistics Canada, Catalogue No. 12-001-X. http://www.statcan.gc.ca/pub/12-001-x/2013002/article/11883-eng.htm, Ken Brewer proposes that model-based estimators might do well with small populations.]
Technical Report
(Really just "other RESEARCH" reported only to ResearchGate, but I was unable to change to that category without taking down and then uploading again.) Unpublished notes regarding the use of auxiliary data in survey sampling and estimation methodologies for continuous data: Sampling and estimation for finite populations may be accomplished using (1) strictly design-based probability methods, (2) model-assisted design-based methods, or (3) strictly model-based (regression/prediction) methods, regardless of selection process, where stratification/grouping is particularly important when randomized sampling is not used (Knaub (2014)). Under (2) we find that some of the 'assistance' may concentrate more directly on the estimation 'part,' and some on the sampling 'part,' thus influencing estimation less directly. Sampling and estimation are different processes that must be coordinated in that estimators have to be consistent with the sampling methodology. Of the model assistance that influences probability sampling, and thus estimation through the sampling, the assistance may influence the probability of selection, or perhaps the selection even more directly. This note is with regard to these methods of 'assistance' for (2) above, with emphasis on continuous data. Continuous data will be assumed throughout.
Conference Paper
From below: "Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator (CRE). [Substituting other models would require new derivations.] This has application to quasi-cutoff sampling (simply cutoff sampling when there is only one attribute), balanced sampling, econometrics applications, and perhaps others." -- Stratification can be very important, and it is highly recommended that it be considered. ... However ... October 29, 2018: Note that this is for one stratum, or subpopulation, or an unstratified population. -- Also note, on page 2886, that "confidence" there should be "prediction." Reference is to prediction intervals, not confidence intervals. Thank you. .......... March 4, 2016: Wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by some people that made me realize my language was technically incorrect here. (I also read elsewhere that this incorrect language may be fairly common.) At any rate, this is a fixed value to be estimated. It is a standard deviation. .................... Previous notes: - to be found at http://www.amstat.org/sections/SRMS/Proceedings/ Joint Statistical Meetings (JSM) 2013 - Session 89 Projected Variance for the Model-based Classical Ratio Estimator: Estimating Sample Size Requirements [for a finite population] Sponsor: Survey Research Methods Section Keywords: Model-based Estimation, Classical Ratio Estimator, Official Statistics, Resource Allocation Planning, Volume Coverage, Sample Size Requirements James Knaub U.S. Energy Information Administration. 
Also, please see: https://www.researchgate.net/publication/263235800_Example_-_Use_of_Appendix_A_Data_-_Projected_Variance_for_the_Model-Based_CRE_in_JSM2013_-_Spreadsheet_Tool Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator (CRE). This has application to quasi-cutoff sampling (simply cutoff sampling when there is only one attribute), balanced sampling, econometrics applications, and perhaps others. Multiple regression for a given attribute can occasionally be important, but is only considered briefly here. Nonsampling error always has an impact. Allocation of resources to given strata should be considered as well. Stratification can be extremely important. Here, however, we explore the projected variance for a given attribute in a given stratum, for resource planning at that base level. Typically one may consider the volume coverage for an attribute of interest, or related size data, say regressor data, to be important, but standard errors for estimated totals are needed to judge the adequacy of a sample. Thus the focus here is on a 'formula' for estimating sampling requirements for a model-based CRE, analogous to estimating the number of observations needed for simple random sampling. Balanced and cutoff sampling are considered. - (When estimating the WLS version of MSE (random factors of residuals only), the smallest observations in previous data test sets may sometimes best be ignored due to their sometimes relatively lower data quality in highly skewed establishment survey data, when samples are frequently collected, as noted by myself and other colleagues. - JRK - October 2014.) ----- For multiple attributes (variables of current interest), this may be applied iteratively, as a change in sample for one attribute impacts the sample size for another.
Please see the definitions section in https://www.researchgate.net/publication/261472614_Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations?ev=prf_pub.
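To make the idea of a projected variance for sample-size planning concrete, here is a minimal sketch. It assumes the usual ratio model y_i = b x_i + e_i with Var(e_i) = sigma^2 x_i, under which the model-based variance of the CRE prediction error is sigma^2 X_r X / X_s (X_s and X_r being the in-sample and out-of-sample x subtotals, and X their sum). This is a textbook form used here for illustration and is not claimed to be the exact 'formula' in the JSM 2013 paper. The inputs `sigma2` and `b` would come from previous data; the values below are hypothetical.

```python
import math


def projected_rse(sigma2, b, x_s_sum, x_r_sum):
    """Projected relative standard error of the CRE total (assumed model)."""
    x_total = x_s_sum + x_r_sum
    variance = sigma2 * x_r_sum * x_total / x_s_sum
    return math.sqrt(variance) / (b * x_total)


def smallest_cutoff_sample(x_all, sigma2, b, target_rse):
    """Smallest number of largest-x units whose projected RSE meets target."""
    xs = sorted(x_all, reverse=True)
    x_total = sum(xs)
    x_s_sum = 0.0
    rse = float("inf")
    for n, x in enumerate(xs, start=1):
        x_s_sum += x
        x_r_sum = max(x_total - x_s_sum, 0.0)  # guard against rounding
        rse = projected_rse(sigma2, b, x_s_sum, x_r_sum)
        if rse <= target_rse:
            return n, rse
    return len(xs), rse


# Hypothetical regressor data and prior estimates (illustration only).
x_all = [900.0, 650.0, 400.0, 120.0, 60.0, 40.0, 20.0, 10.0]
n_needed, rse_at_n = smallest_cutoff_sample(
    x_all, sigma2=2.0, b=1.04, target_rse=0.01
)
print(n_needed, round(rse_at_n, 4))
```

For several attributes, a search of this kind could be repeated attribute by attribute, since enlarging the sample for one attribute changes the coverage available for the others.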
Article
This article is from the Pakistan Journal of Statistics, http://www.pakjs.com/, in a special issue in honor of Ken Brewer. The URL for this article is http://www.pakjs.com/journals//27(4)/27(4)6.pdf . --- Here we will review some of the historical development of the use of the coefficient of heteroscedasticity for modeling survey data, particularly establishment survey data, and for inference at aggregate levels. Some of the work by Kenneth R. W. Brewer helped develop this concept. Dr. Brewer has worked to combine design-based and model-based inference. Here, however, we will concentrate on regression modeling, and particularly on some of his earlier work.
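As a small illustration of what the coefficient of heteroscedasticity does, the sketch below assumes the common form y_i = b x_i + e_i with the standard deviation of e_i proportional to x_i^gamma, so that the regression weights are w_i = x_i^(-2 gamma). The function name and data are hypothetical. Under this assumption, gamma = 0.5 reproduces the classical ratio estimator slope, gamma = 0 gives ordinary least squares through the origin, and gamma = 1 gives the mean of the individual ratios.

```python
def wls_slope_through_origin(y, x, gamma):
    """WLS slope for y = b*x + e with sd(e) proportional to x**gamma."""
    w = [xi ** (-2.0 * gamma) for xi in x]  # regression weights
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den


# Hypothetical sample (illustration only).
x = [900.0, 650.0, 400.0, 120.0]
y = [945.0, 670.0, 430.0, 115.0]

b_ols = wls_slope_through_origin(y, x, gamma=0.0)   # OLS through the origin
b_cre = wls_slope_through_origin(y, x, gamma=0.5)   # classical ratio estimator
b_mor = wls_slope_through_origin(y, x, gamma=1.0)   # mean of ratios y_i / x_i
print(b_ols, b_cre, b_mor)
```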
Article
Journal: InterStat ............ Renamed - Establishment surveys collected weekly are obvious candidates for cutoff sampling with prediction for estimation of the universe and estimation of relative standard errors for estimated totals, if there are good regressor data available. In the case investigated here, there are monthly census data that may be used for regressor data. The same data element/attribute collected in a previous census is often the best single regressor, but there may be others. For example, crude oil imports by country of origin sampled weekly may use the same data element from a monthly census as one regressor, but may also use imports from all other countries as a second regressor, as suppliers may change. This is analogous to fuel switching for electric power plants, Knaub (2003), where multiple regression estimation has been used. Small area estimation is used there, and may also be used here. A number of optional estimators for totals and their variances are suggested, and some of them are examined closely below for one regressor. This is preliminary work in an attempt to improve estimation, and to provide estimates of relative standard errors (RSEs) for better decision making by US Energy Information Administration (EIA) staff, regarding what data to collect, and for customers of EIA publications to make informed use of these results. Currently, forecasting is used for imputation for the weekly petroleum survey considered here, but prediction has been proven, see Knaub (1999, 2009), to perform well for this also, and includes the impact on the RSE estimates, and will cause results to better reflect the current status of the petroleum market. Methods developed for these surveys could easily apply to other establishment surveys. The simplest option here is also the newest method being used for a survey of natural gas. This is part of ongoing work at the US EIA.
- [Note that a random sample can easily be drawn which is substantially "unrepresentative" of the population, especially with continuous data that have a few outsized members of a finite population. That is why model-assisted design-based sampling can be so useful. (Sometimes the model may be more important.)] Apologies for any sloppy notation.
Article
InterStat - ----------------------------------- In the field of Statistical Learning, the bias-variance tradeoff tells us that when we add a regressor, this tends to increase variance. Here, however, we are increasing model complexity, and the variance is sometimes greatly reduced. See the Comment section for the reason, and a quote from Ken Brewer's book. ----------------------------------- March 17, 2016: Note that a special approximation for variance, regarding estimated totals, was used here, for purposes of various possible needs for production of Official Statistics in a potentially changing environment. Flexibility for possible changes in modeling, data storage, aggregate levels to be published, and avoidance of future data processing errors on old data made this attractive. Simplicity of application in a production environment was emphasized. ----------------------------------- This research concerns multiple regression for survey imputation, when correlation with a given regressor may vary radically over time, and emphasis may shift to other regressors. There may be many applications for this methodology, but here we will consider the imputation of generation and fuel consumption values for electric power producers in a monthly publication environment. When imputation is done by regression, a sufficient amount of good-quality observed data from the population of interest is required, as well as good-quality, related regressor data, for all cases. For this application, the concept of 'fuel switching' will be considered. That is, a given power producer may report using a given set of fuels for one time period, but for economic and/or other practical reasons, fuel usage may change dramatically in a subsequent time period. Testing has shown the usefulness of employing an additional regressor or regressors to represent alternative fuel sources. A performance measure found in Knaub (2002, ASA JSM CD) is used to compare results.
Also, the impact of regression weights and the formulation of those weights, due to multiple regression, are considered. ----- Jan 8, 2016: Note that this is not a time series technique. This is for cross-sectional surveys, and was designed for use on establishment surveys for official statistics. I have had some discussions on ResearchGate recently, regarding the notion of bias-variance tradeoffs in modeling, and that more complicated models (tend to?) decrease (conditional?) bias and increase variance. Here, however, variance for estimated totals, under the sampling conditions here, is decreased when there is fuel switching. (Acknowledgement: Thank you to those who discussed my questions on ResearchGate.)
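The fuel-switching point can be illustrated with a toy comparison of one regressor against two. This is only a sketch under simplifying assumptions: it uses unweighted least squares through the origin rather than the regression weights discussed in the paper, and the six data points are hypothetical, with the last two units standing in for producers that switched to the alternative fuel (small x1, large x2).

```python
def fit_one(y, x1):
    """Least squares through the origin with one regressor."""
    b = sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)
    resid = [c - b * a for a, c in zip(x1, y)]
    return b, sum(e * e for e in resid) / (len(y) - 1)


def fit_two(y, x1, x2):
    """Least squares through the origin with two regressors (2x2 normal eqs)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(d * d for d in x2)
    s12 = sum(a * d for a, d in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(d * c for d, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    resid = [c - b1 * a - b2 * d for a, d, c in zip(x1, x2, y)]
    return (b1, b2), sum(e * e for e in resid) / (len(y) - 2)


# Hypothetical data: x1 = prior use of the same fuel, x2 = prior use of an
# alternative fuel, y = current use of the fuel of interest.
x1 = [100.0, 80.0, 60.0, 50.0, 5.0, 0.0]
x2 = [0.0, 5.0, 0.0, 10.0, 70.0, 90.0]
y = [105.0, 88.0, 61.0, 63.0, 80.0, 95.0]

b_one, var_one = fit_one(y, x1)
b_two, var_two = fit_two(y, x1, x2)
print(round(var_one, 1), round(var_two, 2))
```

In this constructed example the residual variance drops sharply once the second regressor is included, which is the direction of the effect described above for estimated totals when fuel switching is present.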
Article
InterStat, October 2005 ---- Jan. 8, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation. .................... Previous notes: The classical ratio estimator (CRE) is very simple, has a long history, and has a stunningly broad range of application, especially with regard to econometrics, and to survey statistics, particularly establishment survey statistics. The CRE has a number of desirable properties, one of which is that the sum of its estimated residuals is always zero. It is easily extended to multiple regression, and a property shown in Sarndal, Swensson and Wretman (1992) may be used to indicate the desirability of this zero sum of estimated residuals feature when constructing regression weights for multiple regression. In the single regressor form, the zero sum of estimated residuals property is related to an interesting phenomenon expressed in Fox (1997). Finally, relationships of the CRE to some other statistics are also considered. -- Note added November 2014: As noted in other works I have done, and elsewhere, for this model, only the individual values of x corresponding to individual y values need to be known, as long as the sum of the remaining x (for out-of-sample cases) is known, and then one can still estimate variance. If we do not know the sum of those remaining N-n x-values (where n is sample size selected minus cases to be imputed), but we know a range for that subtotal of x's, then we know a range of estimated variances for the estimated y-totals to go with that range. 
--- Another possibility, when knowing all of the individual x-values for a population is not feasible, might be to work out a regression prediction model version of a double sampling approach using ratio estimation. In traditional double sampling (two-phase sampling), a larger probability sampling for x values is taken in a first sample, the first phase, either to help stratify and/or to help in regression or ratio estimation, and a smaller probability sub-sampling of y values is taken in a second sample, the second phase. This is for probability sampling and the traditional estimation depends upon that. But a prediction version would mean reliance on regression in the estimation following the second phase. But what about estimating the x-total from the first stage sample? Would that need to be a randomized sample of x? If so, that may not be very efficient for an establishment survey. Stratify? And if used in an estimator, how do you interpret a final variance estimate that is based partially on randomization (for the x-total), and partially on prediction? (This has come up in discussions with Samson Adeshiyan.) Perhaps instead, the estimated x-total could be based on a cutoff or other sampling of x, and another known variable, z, for which we have a census of z. Then something may be worked out in two phases. But if we have z, it might be more effective to just do the classical ratio estimation (single phase) for model-based estimation shown here in this paper.
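Two points from this abstract can be checked numerically: the zero sum of estimated residuals for the CRE, and the November 2014 note that a variance estimate needs only the subtotal of the out-of-sample x values. The sketch below assumes the ratio model with Var(e_i) = sigma^2 x_i and the corresponding textbook prediction variance sigma^2 X_r X / X_s; the data, function names, and the range used for the unknown x subtotal are hypothetical.

```python
def cre_fit(y_s, x_s):
    """CRE slope, estimated residuals, and estimated sigma^2 (assumed model)."""
    b = sum(y_s) / sum(x_s)
    resid = [y - b * x for x, y in zip(x_s, y_s)]
    # Weighted mean square of residuals, with weights 1/x under the model.
    sigma2 = sum(e * e / x for e, x in zip(resid, x_s)) / (len(y_s) - 1)
    return b, resid, sigma2


def cre_total_variance(sigma2, x_s_sum, x_r_sum):
    """Model-based variance of the prediction error for the estimated total."""
    return sigma2 * x_r_sum * (x_s_sum + x_r_sum) / x_s_sum


# Hypothetical sample (illustration only).
x_s = [900.0, 650.0, 400.0, 120.0]
y_s = [945.0, 670.0, 430.0, 115.0]

b, resid, sigma2 = cre_fit(y_s, x_s)
print(abs(sum(resid)) < 1e-9)  # residuals sum to zero, up to rounding

# If the out-of-sample x subtotal is only known to lie in a range, the
# estimated variance lies in a corresponding range.
var_lo = cre_total_variance(sigma2, sum(x_s), 100.0)
var_hi = cre_total_variance(sigma2, sum(x_s), 160.0)
print(var_lo, var_hi)
```

Because the variance expression increases with the out-of-sample subtotal, the endpoints of a known range for that subtotal give the endpoints of the range of variance estimates.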
Article
InterStat, January 2014 - Weighted least squares linear regression through the origin has many uses in statistical science. An important use is for the estimation of attribute totals from establishment survey samples, where we might use quasi-cutoff sampling. Two questions in particular will be explored here, with respect to survey statistics: (1) How do we know this is performing well? and (2) What if the smallest members of the finite population appear to behave differently? This review article contains a summary of conclusions from experimental findings, and explanations with numerous references. - Note: Models (say, for ratio estimation) should be applied separately by strata. Whether any strata should be combined, or a stratum split into other strata, may be considered by examining scatterplots. [Note also that a random sample can easily be drawn which is substantially "unrepresentative" of the population, especially with continuous data that have a few outsized members of a finite population. That is why model-assisted design-based sampling can be so useful. (Sometimes the model may be more important. Here, the models, applied by appropriate data groupings/strata, along with relatively high 'coverage,' are what is important.)] The grouping of data by model application was key to small area estimation in my series of papers starting in 1999, making use of prediction software. - - June 2015: My graphical methodology for determining categories to be modeled separately, for stratification or for small area estimation purposes, performs very well. Yet even though A. Hoegh suggested that my method be applied to modeling shale oil versus traditional well production separately, several years ago, and another source I heard about appears feasible for labeling these wells for this purpose, to make this practical, as noted in this paper, this application has still not been implemented. 
Convincing people of even the simplest of truths, so that anything new can be done, is often a daunting task. At any rate, I thank A. Hoegh for his support. And I thank all others for what we did accomplish, including J. Douglas, O. Yildiz, C. Hughes-Cromwick, S. Adeshiyan, and J. Worrall, and best wishes to S. Adeshiyan in continuing to try to overcome such obstacles.
Article
This article reports results of a large-scale study of various sampling strategies. Conventional sampling strategies are compared with model-based strategies on data from over 12,000 businesses included in the annual Manufacturing Census of the Australian Bureau of Statistics. The study has been designed to replicate the quarterly Survey of Capital Expenditure. The results show that for the given data a stratified sample consisting of units with the largest values of the auxiliary variable in each stratum and simple ratio estimation is by far the most efficient of the strategies considered. A meaningful estimate of sampling error can be derived.
Article
Early survey statisticians faced a puzzling choice between randomized sampling and purposive selection but, by the early 1950s, Neyman's design-based or randomization approach had become generally accepted as standard. It remained virtually unchallenged until the early 1970s, when Royall and his co-authors produced an alternative approach based on statistical modelling. This revived the old idea of purposive selection, under the new name of "balanced sampling". Suppose that the sampling strategy to be used for a particular survey is required to involve both a stratified sampling design and the classical ratio estimator, but that, within each stratum, a choice is allowed between simple random sampling and simple balanced sampling; then which should the survey statistician choose? The balanced sampling strategy appears preferable in terms of robustness and efficiency, but the randomized design has certain countervailing advantages. These include the simplicity of the selection process and an established public acceptance that randomization is "fair". It transpires that nearly all the advantages of both schemes can be secured if simple random samples are selected within each stratum and a generalized regression estimator is used instead of the classical ratio estimator.