CHAPTER 25
Does Pretesting Make a Difference?
An Experimental Test
Barbara Forsyth
Westat
Jennifer M. Rothgeb
U.S. Bureau of the Census
Gordon B. Willis
National Cancer Institute
25.1 INTRODUCTION
In this chapter we present results from research designed to determine (1) whether
questionnaire pretesting results predict actual problems encountered in survey
data collection, and (2) whether survey administration is facilitated or survey
outcomes are improved using revisions based on pretesting results. The research
reported here was conducted in two phases. In phase 1 we used several pretesting
techniques to test a set of survey items and to develop revised questions. In
phase 2 we conducted a telephone survey using a split-sample experiment to
administer both the original and the revised questions. We explore whether results
from the phase 1 pretesting research predict problems observed when the original
questions are administered in the phase 2 telephone survey. We also examine
whether question revisions developed based on pretest results produce improved
survey outcomes.
25.1.1 Background
Questionnaire pretesting is standard practice for several U.S. government statisti-
cal agencies and other organizations involved in designing or conducting national
surveys. Pretesting methods that are commonly used include expert review, cog-
nitive interviewing, behavior coding, and respondent debriefing. The ability to
make good, informed decisions about pretest standards and pretest practices is
enhanced by data that address methodological questions such as the following:
• Which pretesting methods are most effective for identifying questionnaire problems?
• Which pretesting methods are most useful for providing information to fix questionnaire problems?
• How does the set of effective methods differ depending on survey characteristics or pretest purposes?
• What is the most effective way to combine sets of pretesting methods to address particular pretest goals?
Researchers have taken different approaches to answering questions such as
these. Willis et al. (1999a) noted that one way to distinguish these research
approaches is according to the criteria for methods evaluation. Following Willis
et al. we identify three general approaches to methods evaluation.
• Exploratory research compares pretest methods in terms of their effectiveness for detecting unexpected questionnaire problems.
• Confirmatory research compares pretesting methods in terms of their effectiveness for confirming or disconfirming questionnaire problems that are suspected based on other results.
• Reparatory research compares pretesting methods in terms of their effectiveness for suggesting revisions that improve survey outcomes.
Exploratory and confirmatory research focus on how well pretest techniques
detect questionnaire problems. Reparatory research focuses on how effectively
pretest techniques identify ways to improve questionnaire items once problems
have been identified.
Exploratory research is relatively common. The designs typically make direct
comparisons between different pretesting methods in terms of the numbers and
types of problems identified when the methods are applied to a constant set of
survey materials (e.g., Campanelli, 1997; DeMaio et al., 1993; Oksenberg et al.,
1991; Presser and Blair, 1994). A small number of exploratory studies have
focused on comparing variants of a particular method — for example, alterna-
tive approaches for adding probe questions to cognitive interview protocols or
alternative behavior coding schemes (e.g., Conrad and Blair, 2001; Davis and
DeMaio, 1993; Edwards et al., 2002; Foddy, 1996a).
Confirmatory research is less common than exploratory research. The most
prevalent confirmatory design involves assessing relations between pretesting
results and survey results, especially survey measures assumed to be related to
data quality (e.g., Davis and DeMaio, 1993; Willis and Schechter, 1997). Gener-
ally, these studies focus on predictions from pretests using individual pretesting
methods. We have identified no confirmatory studies examining combinations of
pretesting methods that are typical of actual pretesting practice. One major pur-
pose of the research reported here is to extend the typical confirmatory design to
explore predictions of survey outcomes based on results obtained by applying a
sequence of pretesting methods similar to those used in actual practice.
Reparatory research is rare. We've identified two studies in which researchers used split-sample field test designs to compare survey results for questions that were revised based partly on pretesting with results for the unrevised questions (Lessler et al., 1989; Turner et al., 1992a). In both studies,
the reparatory results are difficult to interpret because questionnaire revisions
were based on additional information beyond the pretest results. A second major
purpose of the research reported here is to conduct reparatory research to deter-
mine whether questionnaire revisions made solely in response to pretesting results
improve survey administration and/or survey outcomes.
25.1.2 Objectives
The research reported here was conducted in two phases. In phase 1 we used
expert review, questionnaire appraisal, and cognitive interview methods to pretest
three sets of survey items. One goal of the phase 1 research was to compare
the three pretesting methods in terms of the numbers and types of potential
problems identified. In phase 1, three organizations used all three of the methods.
As a consequence, we were also able to examine agreement across organizations.
Those findings are reported in Rothgeb et al. (2001). A second goal of the phase 1
pretesting research was to develop indicators of questionnaire problems to use
for predicting phase 2 survey outcomes. We developed a qualitative problem
classification scheme for this purpose. The classification scheme is described
below (Section 25.2.2). A third goal of the phase 1 pretesting research was to
develop recommendations for questionnaire revisions expected to improve survey
outcomes in phase 2.
In phase 2 we conducted a split-sample field experiment using a random-digit-
dial (RDD) telephone survey. Household cases were assigned randomly to either a
control questionnaire or an experimental questionnaire. The control questionnaire
included the items pretested in phase 1. The experimental questionnaire included
comparable items that were revised based on the pretest evaluation.
We designed the telephone survey field experiment to answer two research
questions.
• Research question 1: Do pretesting results from phase 1 predict problems in the control condition of the phase 2 field experiment?
• Research question 2: Do questionnaire revisions based on phase 1 pretest findings improve survey outcomes in the experimental condition of the phase 2 field experiment?
Research question 1 is confirmatory. Results from the phase 2 field experiment
are used to confirm or disconfirm suspected questionnaire problems identified in
phase 1 pretesting. Research question 2 is reparatory. Results from the phase 2
field experiment are used to determine whether questionnaire revisions based on
phase 1 pretesting results produced more effective items.
25.2 DESIGN
25.2.1 Phase 1: Questionnaire Pretesting Research Design
In selecting a pretest design, we were interested in including both the experimen-
tal factors directly of interest and additional design factors that would enhance
the generalizability of our results. The design factors we incorporated were:
• Pretesting methods. We chose to focus on three pretesting methods: informal expert review, questionnaire forms appraisal, and cognitive interviewing.
• Survey organization. Researchers from each of three survey research organizations conducted pretest activities. In addition to enhancing generalizability, we included research organization as a design factor so we could explore organizational differences. (Those results are reported in Rothgeb et al., 2001.)
• Pretest experience. A senior methodologist with considerable pretest research experience led the pretesting team at each organization. Each team consisted of two additional pretest researchers. The organizations aimed to select one pretest researcher with moderate pretesting experience and one with relatively little pretesting experience. Our aim was to include a mix of experience levels within each organization to enhance generalizability.
• Questionnaire content. We pretested a total of 83 questionnaire items selected from three survey questionnaires. We selected survey topics that the pretest researchers had relatively little experience with, including questions about (1) household telephone expenses and vehicles owned, from the U.S. Bureau of the Census's 1998 Consumer Expenditure Survey; (2) use of alternative transportation modes, from the U.S. Department of Transportation's 1995 National Public Transportation Survey; and (3) attitudes toward environmental issues, from the U.S. Environmental Protection Agency's 1999 Urban Environmental Issues Survey.
• Pretest method sequence. Each pretest researcher used all three pretest methods. We selected a single pretest method order that seemed to reflect common practice and to minimize undesirable carryover effects. Each researcher completed an informal expert review first, followed by the questionnaire forms appraisal and then cognitive interviews.
We selected a Latin square design for conducting phase 1 pretest research
activities. Under this design, each researcher conducted an informal expert review
with one set of pretest items, a questionnaire forms appraisal with a second set of
pretest items, and three cognitive interviews with the third set of pretest items. The
Latin square design ensured that each item was tested under all three pretesting
techniques and by each organization. However, each individual staff member
reviewed an item under only one of the three pretesting methods.
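To make the rotation concrete, the following minimal sketch builds the kind of 3 × 3 assignment used within a single organization; the researcher labels and item-set names are illustrative placeholders, not those used in the study.

```python
# Minimal sketch of a 3 x 3 Latin square within one organization: each
# researcher applies each pretesting method to a different item set, and
# across researchers every item set meets every method exactly once.
# Researcher and item-set labels are hypothetical.

METHODS = ["expert review", "forms appraisal", "cognitive interviews"]
ITEM_SETS = ["telephone/vehicle items", "transportation items", "environment items"]
RESEARCHERS = ["Researcher A", "Researcher B", "Researcher C"]

def latin_square_assignments():
    """Return {(researcher, method): item_set} using a cyclic Latin square."""
    assignments = {}
    for r, researcher in enumerate(RESEARCHERS):
        for m, method in enumerate(METHODS):
            # The cyclic shift guarantees each (method, item set) pair occurs
            # once per organization, so every item is tested by every method.
            assignments[(researcher, method)] = ITEM_SETS[(r + m) % 3]
    return assignments

if __name__ == "__main__":
    for (researcher, method), item_set in latin_square_assignments().items():
        print(f"{researcher}: {method:20s} -> {item_set}")
```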
We analyzed the phase 1 results by comparing the number of problems detected
by each pretesting method and by each organization. The pretest “problem scores”
ranged from 0 to 9 (based on evaluation by three organizations × three pretesting
techniques). An item’s problem score was 0 when no organization identified a
problem with the item based on any pretest method. An item’s problem score
was 9 when all three organizations identified problems with the item under all
three pretest methods. Problem scores between these two extremes reflected dis-
agreements across organizations, across pretest methods, or both.
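A minimal sketch of how such a 0 to 9 problem score can be tallied follows; the item labels and detection flags are hypothetical.

```python
# Minimal sketch of the 0-9 problem score: for each item, count how many of
# the nine organization x method evaluations flagged at least one problem.
# The detection data below are hypothetical, for illustration only.

ORGS = ["Org 1", "Org 2", "Org 3"]
METHODS = ["expert review", "forms appraisal", "cognitive interviews"]

# flagged[item][(org, method)] = True if that evaluation identified a problem
flagged = {
    "Q07": {(o, m): True for o in ORGS for m in METHODS},                     # score 9
    "Q12": {(o, m): (m != "expert review") for o in ORGS for m in METHODS},   # score 6
}

def problem_score(item_flags):
    """Number of the nine evaluations (3 orgs x 3 methods) that flagged a problem."""
    return sum(bool(item_flags.get((o, m), False)) for o in ORGS for m in METHODS)

if __name__ == "__main__":
    for item, item_flags in flagged.items():
        print(item, problem_score(item_flags))
```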
Results from phase 1 indicated that there was little variation in the numbers
or types of problems identified across participating research organizations; the
organizations seemed to use similar criteria to identify and label questionnaire
problems encountered in pretesting. Further, although the pretesting techniques
varied in terms of the numbers of problems they identified, from a qualitative
perspective all three were found to focus mainly on problems related to question
comprehension and communication.
For purposes of the research reported here, we used the phase 1 problem
scores to select the most problem-prone items. From the 83 items tested, we
selected 12 items with problem scores of 8 or above to include in the phase 2
field experiment. The 12 items selected were very problematic, as is clear from
the following examples:
Example item 1: Is local bus service available in your town or city? (Include only
services that are available for use by the general public for local or commuter
travel, including dial-a-bus and senior citizen bus service. Do not include long-
distance buses or those chartered for specific trips.)
Example item 2: First, I’m going to read you a list of different issues that may or
may not occur in your community. ... I am going to read the list of issues and I
want you to tell me how high or low a priority each is in the community. Use a
scale of 1 to 10, with 1 meaning “very low priority” and 10 meaning “very high
priority.”
a. Depletion of the water table
25.2.2 Phase 2: Questionnaire Design
The control questionnaire included original versions of the 12 items selected,
and the experimental questionnaire included revised versions of the 12 items
selected. Our research purposes required that we revise questions based solely
on pretest results. To meet this goal, we reviewed all notes gathered through the
phase 1 pretest and analysis, using a problem classification coding scheme (CCS)
to document question problems identified during phase 1 pretesting.
Table 25.1 shows the CCS problem categories. The CCS consists of a hierarchy of 28 codes. At the highest level of the hierarchy, the codes are grouped under the familiar headings of the traditional four-stage cognitive response model: problems in comprehension and communication, retrieval from memory, judgment and evaluation, and response selection (e.g., Tourangeau, 1984). Within each of the four stages there are midlevel categories of problems, and the lowest-level codes provide the most detailed descriptions of question problems identified during phase 1 pretesting.

Table 25.1  Problem Classification Coding Scheme (CCS)

1. Comprehension and communication
   Interviewer difficulties
     1. Inaccurate instructions
     2. Complicated instruction
     3. Difficult to administer
   Question content
     4. Vague topic/term
     5. Complex topic
     6. Topic carried over from earlier question
     7. Undefined term(s)
   Question structure
     8. Transition needed
     9. Unclear respondent instruction
     10. Question too long
     11. Complex, awkward syntax
     12. Erroneous assumption
     13. Several questions
   Reference period
     14. Carried over from earlier question
     15. Undefined
     16. Unanchored or rolling
2. Memory retrieval
     17. Shortage of cues
     18. High detail required or information unavailable
     19. Long recall period
3. Judgment and evaluation
     20. Complex estimation
     21. Potentially sensitive or desirability bias
4. Response selection
   Response terminology
     22. Undefined term(s)
     23. Vague term(s)
   Response units
     24. Responses use wrong units
     25. Unclear what response options are
   Response structure
     26. Overlapping categories
     27. Missing categories
5. Other
     28. Something else
The three authors jointly assigned CCS codes to pretest results for the 12
items selected. We applied the CCS to each item a total of nine times: once
for each combination of pretest method and research organization. We selected
CCS codes collaboratively and assigned as many codes as we agreed applied to
the documented problems. A total of 257 problems were identified across the
nine separate evaluations of the 12 most problem-prone items. These problems
involved 28 unique problem codes. [Details of this analysis are provided in
Rothgeb et al. (2001).]
We used the CCS codes and testers’ notes to revise items to address the specific
problems identified by our pretest research activities. For example, if the CCS
codes indicated that pretest respondents had problems understanding a question
because it used “undefined terminology,” we used testers’ notes to identify terms
that caused problems for pretest respondents and developed a revised question
that addressed only the documented terminology problem(s). Pretesting typically
identified multiple problems with each problem-prone question. Consequently,
revisions generally addressed multiple design problems.
We faced four general challenges as we developed question revisions.
• Identifying question objectives. We didn't have specific objectives for most of the items tested. As a result, we frequently had to agree on preliminary assumptions about question objectives before we could develop item revisions. This is probably not typical of most questionnaire revision, where revision includes discussions between substantive and methodological experts to clarify and refine question objectives.
• Revising items with multiple problems. The CCS results identified multiple problems for all 12 items pretested. We chose to develop revisions addressing all problems identified. This feature of our design influences how we interpret analytic results. Differences in outcome measures between the control and experimental question versions cannot be linked to any one specific change. Rather, differences must be attributed to the combination of revisions selected.
• Simplifying complex items. Several of the original problem-prone items were identified as too complex. Effective revision depended on decomposing these items into two or more simpler items. As a consequence, we had to select analytic strategies that assess experimental effects when there is a many-to-one correspondence between the experimental and control question versions.
• Developing items for CATI interviews. Some of the original questions came from paper-and-pencil questionnaires. We had to design them for administration in a CATI instrument without introducing extraneous changes that would interfere with our pretest predictions and analytic conclusions.
Of course, we also had to contend with more common design issues related
to time constraints and allocated space within the questionnaire. Our interview
content represented just one of four experiments included in the survey design.
Our portion of the control questionnaire contained a total of 26 items. These
26 items included the selected 12 items that are the focus of our methodolog-
ical experiment, and 14 additional items and transitional instructions included
to establish and maintain interview flow. Our portion of the experimental ques-
tionnaire consisted of 44 items. These 44 items included 26 that were revised
versions of 12 pretested items, and 18 additional items and transitional passages
included for interview flow.
We included the original and the revised versions of the 12 items selected as
part of two versions of an omnibus survey questionnaire. The omnibus survey
was conducted by the Census Bureau in August and September 2000. Details of
the split-sample field test design and methodology are provided in Section 25.3.
25.3 METHODOLOGY
25.3.1 Phase 2 Data Collection
The Census Bureau’s omnibus Questionnaire Design Experimental Research Sur-
vey (QDERS) was conducted in August and September 2000. QDERS interviews
were conducted by telephone from one of the Bureau’s telephone interview-
ing facilities. The survey used RDD sampling procedures and computer-assisted
telephone interview (CATI) survey instruments. The RDD sample represented
households in the continental United States, and the sample consisted of 10,000
telephone numbers randomly assigned to one of the two questionnaire versions.
For each eligible sample household, interviewers identified one adult household
member to serve as the household respondent based on eligibility and willingness
to participate. With the respondent’s permission, the interviews were audiotaped.
The interviews lasted approximately 15 minutes. Both versions of the QDERS
questionnaire included seven sets of questions. The three sets of questions of
interest here are on the topics of telephone expenditures, transportation, and atti-
tudes about the environment. The other topics covered in the questionnaires were
health insurance, home mortgages, income, and basic household demographics.
Interviewers completed interviews over a four-week period. The interview
staff consisted of 24 experienced telephone interviewers, split randomly into two
groups. During the first two-week period, one group of interviewers trained on and
administered one version of the questionnaire. The other group of interviewers
trained on and administered the other questionnaire version. At the end of the first
two-week data collection period, interviewers were retrained on the alternative
version and conducted their remaining interviews with a second half-sample.
We selected this approach for staffing because we wanted all interviewers to
administer both questionnaires, and we wanted to minimize interference. This
approach produced one “pure” debriefing session for each questionnaire version.
Interviewers completed interviews in 1862 households. Using accepted
response rate calculation guidelines (American Association for Public Opinion
Research, 2000), the set of 1862 interviews represents a response rate between 42
and 55%. There were no differences between the two questionnaire conditions in
terms of household response, household nonresponse, or interview refusal rates.
25.3.2 Dependent Measures of Survey Outcomes
We selected three sets of measures as indicators of phase 2 survey administration
and as potential measures of data quality: item nonresponse rates under the two
questionnaire versions, behavior coding results for the two questionnaire versions,
and interviewer ratings collected as part of the study’s interviewer debriefing
activities. Behavior coding and interviewer ratings are not direct measures of
data quality. Instead, they are measures of questionnaire flow and interviewer
opinions that predict quality measures (e.g., Hess and Singer, 1995). Measures
of item nonresponse are traditionally accepted as indicators of survey data quality
(e.g., Groves, 1989; Hox et al., 1991; Turner et al., 1992a). All three outcome
measures are useful because methodologists would generally agree that decreased
item nonresponse, fewer problematic behavior codes, and improved interviewer
ratings are signs of a successfully revised questionnaire.
Item Nonresponse Rates We computed “don’t know” and refusal frequencies
separately for each item in each questionnaire version. Item refusal frequencies
were uniformly low, so we combined “don’t know” and refusal into a single
nonresponse frequency for each item. We computed item nonresponse rates by
dividing each item nonresponse frequency by the total number of respondents
expected to answer that item. In both questionnaire versions, a few items could
be skipped based on earlier responses. Respondents who skipped an item were
eliminated from the nonresponse computations for that item.
We also computed an index of nonresponse for each respondent for the purpose
of comparing the control and experimental questionnaire versions. This index
was computed by dividing the respondent’s total number of nonresponses by the
number of items administered to the respondent. We computed the mean subject
nonresponse index by averaging the individual indices across respondents.
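A minimal sketch of these two nonresponse measures follows, assuming a simple respondent-by-item layout of recorded answers; the data and labels are hypothetical.

```python
# Minimal sketch of the two nonresponse measures described above, assuming
# answers[respondent][item] holds the recorded response and that items a
# respondent legitimately skipped are left out entirely. "DK" and "REF" stand
# in for "don't know" and refusal; all data here are hypothetical.

NONRESPONSE = {"DK", "REF"}

answers = {
    "r1": {"q1": "yes", "q2": "DK", "q3": "2"},
    "r2": {"q1": "no",  "q2": "no", "q3": "REF"},
    "r3": {"q1": "REF", "q3": "1"},            # q2 skipped for r3
}

def item_nonresponse_rates(answers):
    """Nonresponse frequency / number of respondents expected to answer, per item."""
    rates = {}
    items = {item for resp in answers.values() for item in resp}
    for item in items:
        asked = [resp[item] for resp in answers.values() if item in resp]
        missing = sum(1 for value in asked if value in NONRESPONSE)
        rates[item] = missing / len(asked)
    return rates

def mean_subject_nonresponse_index(answers):
    """Average over respondents of (nonresponses / items administered)."""
    indices = [
        sum(1 for value in resp.values() if value in NONRESPONSE) / len(resp)
        for resp in answers.values()
    ]
    return sum(indices) / len(indices)

if __name__ == "__main__":
    print(item_nonresponse_rates(answers))
    print(round(mean_subject_nonresponse_index(answers), 3))
```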
Behavior Coding All of the telephone interviews were recorded on audiotape.
We selected a random sample of 98 cases from each questionnaire version
for behavior coding. The staff who conducted the behavior coding was not
involved in any other study activities. The behavior codes documented five inter-
viewer behaviors:
• Question read exactly or with a slight change in wording
• Major change in question wording
• Question worded as verification
• Question mistakenly omitted
• Other follow-up behavior (e.g., question repeated, probed for clarification, provided clarification)
The behavior codes documented five respondent behaviors:
• Adequate response
• Qualified response
• Inadequate response
• Interruption
• Request for clarification or repetition
Behavior coders assigned codes for up to two interviewer-respondent interactions. Analyses presented here focus on the first behaviors coded during the first interaction for each item in each coded interview.
We used the codes assigned to interviewer and respondent behaviors to develop
two behavior coding problem indicators for each item in each interview. The
interviewer problem indicator was 1 if an assigned code indicated that the inter-
viewer had difficulty administering the item; otherwise, the interviewer problem
indicator was 0. The respondent problem indicator was 1 if an assigned code
indicated that the respondent had difficulty understanding or answering the item,
or 0 if no code was assigned indicating a respondent problem. We computed
frequency distributions for the interviewer and respondent problem indicators by
questionnaire version.
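The following sketch illustrates the construction of these 0/1 indicators. The text states only generally which behavior codes count as interviewer or respondent problems, so the code sets below are illustrative assumptions rather than the study's exact mapping.

```python
# Minimal sketch of the two behavior-coding problem indicators. The code
# labels below are illustrative stand-ins, and only the first interviewer and
# respondent behaviors from the first interaction are considered, as in the
# analyses reported here. Which codes count as "problems" is an assumption.

INTERVIEWER_PROBLEM_CODES = {"major wording change", "question omitted"}
RESPONDENT_PROBLEM_CODES = {"qualified response", "inadequate response",
                            "interruption", "clarification request"}

def problem_indicators(first_interviewer_code, first_respondent_code):
    """Return (interviewer_problem, respondent_problem) as 0/1 indicators."""
    interviewer_problem = int(first_interviewer_code in INTERVIEWER_PROBLEM_CODES)
    respondent_problem = int(first_respondent_code in RESPONDENT_PROBLEM_CODES)
    return interviewer_problem, respondent_problem

if __name__ == "__main__":
    # One coded item in one interview: exact reading, but the respondent interrupted.
    print(problem_indicators("exact reading", "interruption"))  # (0, 1)
```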
Interviewer Ratings Interviewers participated in debriefing sessions at the end
of both two-week data collection periods, just after completing interviews with a
single questionnaire version. At both time points, interviewers administering the
control and experimental questionnaire versions were debriefed separately. At
each debriefing session, interviewers completed an item rating task for the ques-
tionnaire they had just finished administering. They reviewed the questionnaire
items independently and rated each in terms of how often it caused problems
for them as interviewers and also in terms of how often it caused problems
for respondents. Interviewers used a three-point scale to indicate that an item
(1) caused no problems, (2) caused some problems, or (3) caused a lot of prob-
lems. Also, interviewers wrote comments about the types of interviewer and
respondent problems they experienced with each item, when applicable.
We computed frequency distributions for the three rating categories by item
and questionnaire version, separately for the interviewer and respondent prob-
lem ratings. We combined frequencies for ratings of “some problems” and “a
lot of problems” to make presentations parallel across the three sets of survey
outcome variables.
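The tabulation itself is straightforward; the sketch below shows the kind of collapsing we applied, with hypothetical ratings coded 1 through 3.

```python
# Minimal sketch of collapsing the three-point interviewer ratings into the
# two categories reported later ("no problems" vs. "some or a lot of
# problems"), tallied by item and questionnaire version. Ratings are coded
# 1 = no problems, 2 = some problems, 3 = a lot of problems; data are hypothetical.
from collections import Counter

# (questionnaire_version, item, rating) triples, one per interviewer rating
ratings = [
    ("control", "q1", 1), ("control", "q1", 2), ("control", "q1", 3),
    ("experimental", "q1", 1), ("experimental", "q1", 1), ("experimental", "q1", 2),
]

def collapsed_counts(ratings):
    """Counter keyed by (version, item, 'no problems' | 'some or a lot')."""
    counts = Counter()
    for version, item, rating in ratings:
        label = "no problems" if rating == 1 else "some or a lot"
        counts[(version, item, label)] += 1
    return counts

if __name__ == "__main__":
    for key, n in sorted(collapsed_counts(ratings).items()):
        print(key, n)
```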
25.4 RESULTS
25.4.1 Research Question 1: Do Pretesting Results from Phase 1 Predict
Survey Outcomes in the Control Condition of the Phase 2 Field
Experiment?
We begin by describing the summary measures we developed to reflect the
phase 1 pretest results and to predict phase 2 survey outcomes. All analyses
in this section examine data collected using the control questionnaire version.
The CCS codes in Table 25.1 describe two general sets of pretesting problems:
interviewer problems (CCS codes 1 through 3) and respondent problems (CCS
codes 4 through 27). We developed two summary measures of pretest problem
severity for each item by counting the number of CCS codes assigned to the item
during the course of pretesting, separately for the codes describing interviewer
problems and for the codes describing respondent problems. We used these prob-
lem severity measures to classify each item in two ways: first, as low, moderate,
or high in interviewer problems, and second, as low, moderate, or high in respon-
dent problems. These classifications were the basis for predicting interviewer and
respondent problems observed during the phase 2 survey data collection.
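The following minimal sketch shows how such severity counts and low/moderate/high classifications can be derived from an item's assigned CCS codes. The example code assignments and the cut points are illustrative; in the study the cut points followed how the 12 items happened to cluster (as the table stubs below indicate).

```python
# Minimal sketch of the two pretest severity measures: counts of CCS codes 1-3
# (interviewer problems) and 4-27 (respondent problems) assigned to an item
# across the nine pretest evaluations. The example code assignments and the
# low/moderate/high cut points are illustrative (compare Tables 25.2-25.5).

def severity_counts(assigned_codes):
    """assigned_codes: list of CCS code numbers, with repeats across evaluations."""
    interviewer = sum(1 for c in assigned_codes if 1 <= c <= 3)
    respondent = sum(1 for c in assigned_codes if 4 <= c <= 27)
    return interviewer, respondent

def classify(count, moderate_at, high_at):
    """Map a code count to a low/moderate/high severity label."""
    if count >= high_at:
        return "high"
    if count >= moderate_at:
        return "moderate"
    return "low"

if __name__ == "__main__":
    # Hypothetical item: two interviewer-problem codes, twenty respondent-problem codes.
    codes = [2, 3] + [4, 7, 9, 11, 13, 17, 22, 23, 26, 27] * 2
    i_count, r_count = severity_counts(codes)
    print(classify(i_count, moderate_at=1, high_at=4),    # interviewer severity
          classify(r_count, moderate_at=20, high_at=29))  # respondent severity
```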
Recall that the 12 items of interest here were selected because they were the
most problem-prone. The least problematic item in this set had a total of 13
CCS codes assigned to it (combined across interviewer and respondent problem
codes). The most problem-prone item had a total of 36 CCS codes assigned
to it. Therefore, the low, medium, and high classifications of interviewer and
respondent problem severity actually represent a relatively narrow range of pretest
problem severity. Our results may not apply to less problem-prone items.
Interviewer Problems We gathered two outcome measures of interviewer prob-
lems with the control questionnaire version during the phase 2 survey data collec-
tion: behavior-coded interviewer problems observed for each item and interviewer
ratings of problems they experienced with each item. Table 25.2 shows the behav-
ior coding results, and Table 25.3 shows the interviewer rating results. The top
panels in Tables 25.2 and 25.3 show results for items expected to have low num-
bers of interviewer problems based on the phase 1 pretesting results. The middle
panels in both tables show results for items expected to have moderate numbers
of interviewer problems based on phase 1 results, and the bottom panels show
results for items expected to have high numbers of interviewer problems.
As expected based on the phase 1 pretest results, the proportion of interviewer
problems documented by behavior coders is small in the top panel of Table 25.2
and larger in the bottom panel of Table 25.2. Chi-square tests with directional
post hoc tests indicated no significant differences in interviewer problems between
items identified as low and items identified as moderate in interviewer problems
based on phase 1 pretest results. When these two sets of items are combined,
the behavior coding results show significantly more behavior-coded interviewer
problems for items with high pretest interviewer problems than for items with
low or moderate pretest interviewer problems.
Table 25.2  Interviewer Problems Identified in Phase 2 Behavior Coding by Pretest Interviewer Problem Severity^a

                                   Percent of Behavior-Coded Interviews
Pretest Interviewer                No Interviewer     One or More               Total
Problem Severity                   Problems           Interviewer Problems      n
Low (0 codes assigned)                  96.1                3.9                  180
Moderate (1–2 codes assigned)           97.2                2.8                  668
High (4–6 codes assigned)               89.5               10.5^b                277

^a Chi square = 24.94, p < 0.01.
^b One-tailed p < 0.001.
Table 25.3  Interviewer Problems Identified by Phase 2 Interviewer Ratings by Pretest Interviewer Problem Severity^a

                                   Percent of Rated Items
Pretest Interviewer                No Interviewer     Some or a Lot of          Total
Problem Severity                   Problems           Interviewer Problems      n
Low (0 codes assigned)                  93.2                6.8                   44
Moderate (1–2 codes assigned)           87.6               12.3                  154
High (4–6 codes assigned)               78.8               21.2^b                 66

^a Chi square = 5.174, p < 0.10.
^b One-tailed p < 0.05.
Interviewer ratings in Table 25.3 show a similar pattern. Chi-square analyses
with directional post hoc tests indicated no significant differences in interviewer
ratings of interviewer problems between items identified as low and items identi-
fied as moderate in interviewer problems based on phase 1 pretest results. When
these two sets of items are combined, the interviewer rating results show signif-
icantly more rated interviewer problems for items with high pretest interviewer
problems than for items with low or moderate pretest interviewer problems.
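The general analysis pattern, an overall chi-square test followed by a directional comparison of the high-severity items against the combined low and moderate items, can be sketched as follows. The cell counts are reconstructed only approximately from the percentages in Table 25.2, and the one-tailed two-proportion z-test is one plausible form of the directional post hoc test; the chapter does not specify the exact procedure used.

```python
# Minimal sketch of the analysis pattern used throughout Section 25.4: an
# overall chi-square test on the severity x problem table, followed by a
# directional (one-tailed) comparison of the high-severity items against the
# combined low/moderate items. Counts are approximate reconstructions from
# Table 25.2; the two-proportion z-test is one plausible directional test.
import math
from scipy.stats import chi2_contingency

# rows: low, moderate, high; columns: no interviewer problem, one or more problems
observed = [[173, 7], [649, 19], [248, 29]]

chi2, p, dof, _ = chi2_contingency(observed)
print(f"overall chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")

def one_tailed_two_proportion_test(x1, n1, x2, n2):
    """One-tailed z-test that group 1's problem rate exceeds group 2's."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z, 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value

high = observed[2]
low_mod = [sum(col) for col in zip(observed[0], observed[1])]
z, p_one = one_tailed_two_proportion_test(high[1], sum(high), low_mod[1], sum(low_mod))
print(f"high vs. low/moderate: z = {z:.2f}, one-tailed p = {p_one:.5f}")
```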
Respondent Problems We gathered two outcome measures of respondent prob-
lems with the control questionnaire version during the phase 2 survey data collec-
tion: behavior-coded respondent problems observed for each item and interviewer
ratings of apparent respondent problems for each item. Table 25.4 contains the
behavior coding results and Table 25.5 contains the interviewer rating results.
Table 25.4  Respondent Problems Identified in Phase 2 Behavior Coding by Pretest Respondent Problem Severity^a

                                    Percent of Behavior-Coded Interviews
Pretest Respondent                  No Respondent      One or More              Total
Problem Severity                    Problems           Respondent Problems      n
Low (12–16 codes assigned)               89.1               10.9                 341
Moderate (20–21 codes assigned)          76.8               23.2^b               379
High (29–30 codes assigned)              57.1               42.9^b               170

^a Analysis is based on 10 items for which explicit responses were required. Two items were excluded from this analysis because they were instructions that required no explicit response. Chi square = 67.9, p < 0.001.
^b One-tailed p < 0.001.
Table 25.5  Respondent Problems Identified by Phase 2 Interviewer Ratings by Pretest Respondent Problem Severity^a

                                    Percent of Rated Items
Pretest Respondent                  No Respondent      Some or a Lot of         Total
Problem Severity                    Problems           Respondent Problems      n
Low (12–16 codes assigned)               95.4                4.6                  88
Moderate (20–21 codes assigned)          90.8                9.2                 109
High (29–30 codes assigned)              69.2               30.8^b                65

^a Analysis is based on 12 items, including two sets of instructions that required no explicit response because interviewers were able to rate these items. Chi square = 25.156, p < 0.001.
^b One-tailed p < 0.001.
The top panels in Tables 25.4 and 25.5 show results for items expected to have
low numbers of respondent problems based on the phase 1 pretesting results. The
middle and lower panels in both tables show results for items expected to have
moderate and high numbers of respondent problems based on phase 1 results,
respectively.
The general patterns of results in Tables 25.4 and 25.5 are as expected based
on pretest results. In Table 25.4, the proportion of respondent problems docu-
mented by behavior coders increases consistently with increased pretest problem
severity. Chi-square tests with directional post hoc tests indicated significantly
more behavior-coded respondent problems for items with high pretest respondent
problems than for items with moderate respondent problems. Also, behavior-
coded respondent problems were significantly higher for items with moderate
pretest respondent problems than for items with low pretest respondent problems.
In Table 25.5, chi-square tests with directional post hoc tests indicated no
significant differences in rated respondent problems between items with moderate
and low pretest respondent problems. When these two sets of items are combined,
the interviewer rating results show significantly more rated respondent problems
for items with high pretest respondent problems than for items with moderate or
low pretest respondent problems.
Item Nonresponse We used a subset of CCS codes for predicting patterns of
nonresponse in the phase 2 survey. We hypothesized that item nonresponse would
be related to pretest problems with memory recall (CCS codes 17 through 19 in
Table 25.1) and item sensitivity (CCS code 21 in Table 25.1). We computed
a measure of pretest recall and sensitivity problem severity for each item by
counting the number of recall and/or sensitivity-oriented CCS codes assigned to
the item. Based on this measure, we classified items as low or high in pretest
recall and sensitivity problem severity. This classification was the basis for pre-
dicting item nonresponse rates. Because of the way the items clustered, we did
not identify a category of items with “moderate” recall and/or sensitivity prob-
lem severity.
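A minimal sketch of this recall and sensitivity severity measure follows, with hypothetical code assignments and the low/high split placed in the gap where the items clustered.

```python
# Minimal sketch of the recall/sensitivity severity measure: count CCS codes
# 17-19 (recall) and 21 (sensitivity) assigned to an item across the nine
# evaluations, then split items into low vs. high using the gap in which the
# items clustered (0-3 vs. 5-11 codes; see Table 25.6). Code lists are hypothetical.

RECALL_OR_SENSITIVITY = {17, 18, 19, 21}

def recall_sensitivity_severity(assigned_codes, high_at=5):
    """Return ('low' | 'high', count) for one item's assigned CCS codes."""
    count = sum(1 for c in assigned_codes if c in RECALL_OR_SENSITIVITY)
    return ("high" if count >= high_at else "low"), count

if __name__ == "__main__":
    print(recall_sensitivity_severity([4, 7, 17, 18, 19, 21, 21, 23]))  # ('high', 5)
    print(recall_sensitivity_severity([4, 7, 11, 17, 22]))              # ('low', 1)
```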
Table 25.6 contains overall rates of nonresponse by recall and sensitivity prob-
lem severity. In Table 25.6, overall item nonresponse is lower for items identified
in phase 1 pretesting as having relatively few recall and sensitivity problems. Item
nonresponse is significantly higher for items identified as having more problems
related to recall and sensitivity.
Table 25.6  Phase 2 Item Nonresponse for Control Questionnaire Items by Pretest Recall and Sensitivity Problem Severity^a

Pretest Recall and Sensitivity      Item Nonresponse                   Total
Problem Severity                    (Percent of Items Administered)    n
Low (0–3 codes assigned)            1.0                                4313
High (5–11 codes assigned)          8.8^b                              4303

^a Analysis is based on 10 items for which explicit responses were required. Two items were excluded from this analysis because they were instructions that required no explicit response.
^b One-tailed p < 0.001.

Summary Results from behavior coding, interviewer ratings, and nonresponse rates consistently indicated that problems observed during pretesting do predict
problems observed when the same items are administered in the field.
• Items with relatively many interviewer problems during pretesting also have relatively many behavior-coded and rated interviewer problems in the field.
• Items with relatively many respondent problems during pretesting also have relatively many behavior-coded and rated respondent problems in the field.
• Items with relatively many recall and sensitivity problems during pretesting also have relatively high nonresponse rates in the field.
Thus, information from phase 1 pretesting was consistently useful for detect-
ing problems that were confirmed by survey results. In the next section we present
results for assessing the repairs made based on the phase 1 pretest results.
25.4.2 Research Question 2: Do Questionnaire Revisions Made Based on
Phase 1 Pretest Findings Improve Survey Outcomes in the Experimental
Condition of the Phase 2 Field Experiment?
We address research question 2 by examining differences between the control and
the experimental questionnaire versions in the survey outcome measures selected.
These analyses focus on 10 of the 12 items discussed in Section 25.4.1. The
experimental versions for two of the control questionnaire items were incorrectly
programmed in the experimental version of the CATI instrument. We excluded
these items from our analyses because the data are not comparable.
Questionnaire revisions often involved decomposing one complicated item in
the control questionnaire version into several simpler items in the experimental
questionnaire version. These revisions influenced our analytic approach. Because
of skip instructions, sample sizes could vary across the experimental questionnaire
items that we compared to a single control questionnaire item. In most of these
comparisons, our analyses compare results from a single control questionnaire
item with results from multiple experimental questionnaire items, proportionally
weighted according to their sample sizes. We used equal weights to compute the
mean subject index of nonresponse.
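The following minimal sketch shows the kind of proportional weighting involved when several experimental items replace one control item; the rates and sample sizes are hypothetical.

```python
# Minimal sketch of the proportional weighting described above: when one
# control item was decomposed into several experimental items, the
# experimental nonresponse rates were pooled with weights proportional to
# each item's number of eligible respondents. All figures are hypothetical.

def pooled_rate(item_stats):
    """item_stats: list of (nonresponse_rate, n_expected_to_answer) pairs."""
    total_n = sum(n for _, n in item_stats)
    return sum(rate * n for rate, n in item_stats) / total_n

control_rate = 0.071                                        # one original, problem-prone item
experimental = [(0.050, 910), (0.032, 650), (0.044, 310)]   # its three revised versions

if __name__ == "__main__":
    print(f"control item nonresponse:      {control_rate:.3f}")
    print(f"experimental items (weighted): {pooled_rate(experimental):.3f}")
```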
Item Nonresponse Table 25.7 contains item nonresponse rates and mean sub-
ject indexes of nonresponse for the control and experimental questionnaire ver-
sions. For both measures, overall nonresponse was significantly lower for the
experimental treatment questionnaire than for the control questionnaire, although
the differences are small in magnitude.
Table 25.7  Phase 2 Measures of Nonresponse for Control and Experimental Questionnaire Versions

                           Control Questionnaire               Experimental Questionnaire
                           Percent of         Total Number     Percent of         Total Number
                           Administered       of Items or      Administered       of Items or
                           Items That Were    Number of        Items That Were    Number of
                           Unanswered         Respondents      Unanswered         Respondents
Item nonresponse               5.9^a             6898              4.7^a             10,520
Mean subject index             6.4^a              915              4.8^a                930

^a One-tailed p < 0.001.

Behavior Coding Table 25.8 contains results for behavior-coded interviewer problems and behavior-coded respondent problems for the control and experimental questionnaire versions. Directional and nondirectional tests of the proportions in Table 25.8 revealed no significant differences between the two questionnaires in terms of either behavior-coded interviewer problems or behavior-coded respondent problems.^1

Table 25.8  Interviewer and Respondent Problems Identified in Phase 2 Behavior Coding for Control and Experimental Questionnaire Versions^a

                                     Control Questionnaire        Experimental Questionnaire
                                     Percent of                   Percent of
                                     Behavior-Coded     Total     Behavior-Coded     Total
                                     Interviews         n         Interviews         n
One or more interviewer problems          4.4            939           6.9            1291
One or more respondent problems          23.6            705          25.9            1072

^a One- and two-tailed tests indicate no significant differences between questionnaire versions.
^1 Item-level analyses indicated that two items were largely responsible for the absolute increase in behavior-coded interviewer problems with the experimental questionnaire version. The original versions of both items were easy for interviewers to read but difficult for respondents to understand based on pretest results. Apparently, revisions made to enhance communication also increased interviewer reading problems. Behavior-coded respondent problems increased with the experimental questionnaire version for seven items and decreased for two. Once averaged over items, the difference was not significant.

Interviewer Ratings Table 25.9 contains results for rated interviewer problems and rated respondent problems for the control and experimental questionnaire versions. Interviewer ratings of interviewer problems yielded an unexpected result. Interviewer ratings identified more interviewer problems with the experimental questionnaire than with the control questionnaire. This difference was significant according to chi-square tests, but the direction was opposite that expected based on the pretest findings, so it was not significant under our directional post hoc tests. Interviewer ratings of respondent problems identified fewer respondent problems with the experimental questionnaire than with the control questionnaire. Chi-square analyses with directional post hoc tests indicated that this difference was significant.

Table 25.9  Interviewer and Respondent Problems Identified by Phase 2 Interviewer Ratings for Control and Experimental Questionnaire Versions

                                          Control Questionnaire       Experimental Questionnaire
                                          Percent of       Total      Percent of       Total
                                          Rated Items      n          Rated Items      n
Some or a lot of interviewer problems         11.0          218          24.8^a          314
Some or a lot of respondent problems          55.3          217          45.3^b          300

^a Chi square = 15.884, p < 0.001.
^b One-tailed p < 0.05.
Summary Results from analyses of item nonresponse, behavior coding, and
interviewer ratings indicate that the question revisions we made had mixed effects
on survey outcomes.
• Item revisions reduced item nonresponse, but the improvement was small.
• Item revisions had no effect on the types of interviewer and respondent problems identified by behavior coding.
• Item revisions repaired the types of respondent problems identified by interviewer ratings.
• Item revisions did not repair the types of interviewer problems identified by interviewer ratings and may have made these interviewer problems worse.
25.5 CONCLUSIONS
We close by summarizing our key findings, discussing their implications, and
then suggesting some directions for future research.
25.5.1 Findings Related to Problem Detection
Our initial research question asked whether pretesting results based on expert
review, questionnaire appraisal, and cognitive interviews predict actual problems
in field survey outcomes. Our findings suggest that they did. Questions that
pretesting identified as particularly problematic for interviewers elicited more
inappropriate behavior-coded interviewer behaviors than less problematic items.
Interviewers also rated these items as causing more problems for them than did
the less problematic items.
Questions that pretesting identified as posing particularly large problems for
respondents also elicited more uncodable responses, more respondent requests
for clarification, and/or more respondent interruptions compared with less prob-
lematic items, based on behavior coding analyses. Findings based on interviewer
ratings of respondent problems were consistent with this trend. Finally, items that pretesting identified as posing large problems related to memory recall and sensitivity elicited more nonresponse than did less problematic items.
25.5.2 Findings Related to Problem Repair
The second major research question we posed asked whether questionnaire revi-
sions based on pretest results yield improved survey outcomes. Based on our
results, the answer is unclear. We obtained some positive evidence, but it cer-
tainly was not pervasive. The revised items in the experimental questionnaire
produced a very small improvement in nonresponse and a larger improvement as
assessed through interviewer ratings of respondent problems. Thus, there is some
evidence that our pretest results served a reparative function, at least from the
respondent’s perspective. At the same time, there were no differences between
the two questionnaire versions in terms of behavior-coded respondent problems.
So the results are mixed.
We had little evidence that question modifications served to improve ques-
tions from the point of view of the interviewer. Our revisions had no effect on
interviewer problems reflected in behavior coding, although behavior coding also
suggested that these problems were generally not large to begin with. Based on
interviewer ratings, our question revisions may have made interviewer problems
worse. Initially, we chose to work with a very problematic set of items. Item-
level analyses suggested that solving respondent problems required steps that
made the revised items difficult for interviewers to administer. The increase in
interviewer ratings of interviewer problems with the experimental questionnaire
version makes it clear that there were limits to our ability to use the pretest
findings to make revisions that consistently improved all of the survey outcome
measures selected.
25.5.3 Study Strengths, Limitations, and Possible Explanations
Our study design had two important strengths. First, unlike previous confirmatory
studies, our design used a split-sample experiment to examine the confirmatory
power of a combination of pretesting methods. In this sense, our design mim-
icked an important feature of actual pretesting practice. Second, unlike previous
reparatory studies, we made revisions to the questionnaire based only on pretest
findings. As a result, any observed differences between items in the control
and experimental questionnaires can be attributed to decisions made based on
pretest results.
It is unfortunate that we did not strengthen the sensitivity of our basic design by
purposefully including less problematic items in the control questionnaire version.
Including less problem-prone items would allow us to assess the confirmatory
and reparatory effects of pretesting for “pretty good” items in addition to the
“pretty bad” and “very bad” items studied here. We believe that the contrast
would be informative.
We were not particularly successful in improving questions that pretesting
identified as problematic. We can think of a few possible reasons for our lack
of success.
Possibility 1: The identified problems were insurmountable We attempted to
eradicate flaws in the very worst questions that we found. Perhaps this goal was
simply not possible. In many cases, we attempted to decompose one difficult
question into several simpler ones. In these cases, concepts may have been too
complex or too undeveloped to fix with revised wordings. It may be necessary
to revisit measurement objectives or develop a new question-asking strategy to
fix these items.
Possibility 2: The modifications were flawed A major function of pretesting is
to point the way toward question improvement. However, this is not an auto-
matic process. It requires a proficient questionnaire designer. It may be that the
pretesting methods we employed were effective and uncovered real problems but
that the redesign phase was deficient because we selected ineffective revisions.
For reasons that are probably obvious, this is not our preferred conclusion. The
people involved in developing our revised items all have considerable experience
pretesting questionnaires in a variety of contexts and roles. It seems implausible
to us to maintain that we were very capable when using pretest results to find
problems, but incapable of identifying obvious solutions that were indicated by
the same pretest results.
Possibility 3: The survey outcome measures were flawed In the absence of objec-
tive measures of data quality, it can be difficult to assess question functioning,
data quality, or the effects of pretesting (Willis et al., 1999a). Although we
selected outcome measures that are generally thought to serve as indirect mea-
sures of data quality, there is no assurance that they act as reliable proxies of
validation measures. Nonresponse is generally recognized as a gross measure of
data quality. Interviewer ratings are subjective. In addition, interviewer ratings
are probably biased toward identifying items with obvious administration diffi-
culties and away from identifying items with more subtle problems that affect
analysis and interpretation. Behavior coding is most effective for detecting overt
problems that are easily observable.
We believe that the latter point is especially significant. Pretesting methods
such as cognitive interviewing are designed to investigate covert problems. For
example, intensive probing might be used to determine whether respondents inter-
pret a term as intended. Pretesting methods such as behavior coding can serve
as effective validation measures for cognitive interviewing only when we expect
that covert problems in comprehension, recall, or response selection will produce
overt, codable behavior. We can imagine question modifications that markedly
improve question understanding without improving behavior coding results.^2

^2 An example from our item-level analyses illustrates. Our behavior coding results indicated few respondent problems with a control questionnaire item on automobile use. ("Is it used for business?") We observed considerably more behavior-coded respondent problems for the revised item in the experimental questionnaire version. ["In the past 12 months, was (FILL CAR) used at all as part of a job or business? Do not include commuting to work."] Pretesting indicated that the original item had comprehension problems. We believe that the revision represents a reasonable move toward improvement. The trends for these two items led us to wonder about the types of problems detected by our behavior coding results.
It seems reasonable to hypothesize that in some cases, the pretest methods
studied here successfully confirmed problems and led to effective repairs, but
these successes were undetected in the survey outcome measures we studied.
Better measures of response accuracy would shed light on this possible explana-
tion for our lack of reparatory success. For example, in the current questionnaires
we could measure accuracy of reported telephone bill amounts either by access-
ing telephone company billing records or by asking respondents to send us copies
of their most recent telephone bills. Similarly, we could evaluate the accuracy of
reported access to public transportation by contacting local bus companies.
Possibility 4: Pretesting alone is ineffective Of course, it may be that we were
not able to repair question defects because the pretesting methods we used do
not always suggest “fixes” that actually alleviate the inherent problems. Less
radically, our findings may suggest that it is insufficient simply to pretest and
make modifications before fielding a survey questionnaire. Additional pretesting
with question revisions may be needed to ensure effective repairs. Pretesting
revisions in field settings may be particularly important. The common practice of
conducting cognitive interviewing as a series of iterative “rounds” clearly adheres
to this recommendation. Our already complicated pretest design would not allow
the added complexity of iterative pretesting.
A related thought is that our pretesting protocol may not have been entirely
effective. The selected combination of methods we used (expert review, followed
by appraisal, followed by cognitive interviewing) was effective for confirmatory
purposes, and at least somewhat effective for reparatory purposes. However, we
do not know how the selected pretest combination compares with others, nor
can we say anything about the contributions the individual pretesting methods
made to the overall effectiveness of the combination. Designs that would allow
comparisons with other combinations or with a specific pretest method would
require larger staffs and more resources than those available to us.
25.5.4 Recommendations for Further Research
First, we’d like to see reparatory studies that use more direct and more sen-
sitive measures of data quality, especially studies that undertake some type of
2An example from our item-level analyses illustrates. Our behavior coding results indicated few
respondent problems with a control questionnaire item on automobile use. (“Is it used for business?”)
We observed considerably more behavior-coded respondent problems for the revised item in the
experimental questionnaire version. [“In the past 12 months, was (FILL CAR) used at all as part
of a job or business? Do not include commuting to work.”] Pretesting indicated that the original
item had comprehension problems. We believe that the revision represents a reasonable move toward
improvement. The trends for these two items led us to wonder about the types of problems detected
by our behavior coding results.
CONCLUSIONS 545
validation. For example, one could develop and pretest questions about specific
health insurance coverages. A reparatory study of the pretest methods selected
could include procedures for gathering general health plan information to validate
eventual survey responses. In this way it should be possible to disentangle two
sets of issues:
• Administration quality: the degree to which the questionnaire flows well, avoids negative interviewer reactions, and reduces the frequencies of problematic behavior codes.
• Data quality: the degree to which the questionnaire elicits reliable, accurate, and unbiased information from respondents.
We hope that pretesting positively influences both sets of issues.
Second, we think it would be useful to develop reparatory designs that allow
more precise testing for effects of particular revisions on survey outcomes and
data quality. The problem-prone items we selected to study gave us a variety of
potential problems to repair. Early on, we chose to develop revisions that would
address as many of the documented problems as possible. We believe that this
approach reflects common pretesting practice. As a result, we think our findings
represent the types of effects to expect in practice, where the objective is to fix
as many problems as possible.
This approach to revision affords less experimental control. We cannot draw
specific conclusions about the reasons why particular revisions we made were
more or less effective. A more effective reparatory design might use a mix of
revision strategies. For example, one might include some experimental ques-
tions revised to fix multiple problems and others that reflect one incomplete but
unconfounded revision.
Third, we recommend additional research that includes a mix of more and
less problem-prone items in control and experimental questionnaire versions.
This type of mix is essential for determining that pretesting demonstrates both
sensitivity and specificity. In other words, we want to know that our methods
effectively identify items with true problems (sensitivity). We also want to know
that our methods effectively identify items with no problems (specificity). This
type of item mix would provide two additional advantages. A mix of problem
severities would allow us to assess whether different levels of problem severity
are particularly easy or difficult to repair. Also, a mix of problem severities would
allow us to study trade-offs between problem detection and repair. For example,
it may be that severe problems that are easy to detect are also especially difficult
to repair. Conversely, it may be easiest to fix items that are not obviously prob-
lematic. Thus, a simple wording change may fix an important problem uncovered
only by intensive pretesting methods.
Fourth, we recommend more research using iterative pretest designs where
pretesting for problem diagnosis is followed by preliminary repairs, further pre-
testing, and additional repairs before fielding. Because of the subjective nature of
questionnaire design, the “fixes” for a set of problems constitute a new stimulus.
This is particularly true for items identified as having severe problems. We believe
that additional rounds of pretesting are wise before fielding major revisions on a
large scale. Studies of the types of changes made across iterative pretests might
shed light on the apparently subjective questionnaire revision process.
We’ll close with one last observation. We believe that it’s difficult to find
problems in survey questions. It’s more difficult to fix them. Even more difficult
is demonstrating that a repair is in fact an improvement.