Too Fast, too Straight, too Weird:
Post-hoc Identification of Meaningless Data in Internet Surveys
Dominik J. Leiner1
– Working Paper from 02/2016 –
available at
This is a working paper from 2016. Please take note of the final,
much more elaborate publication stemming from this early version:
Leiner, Dominik J. (2019). Too Fast, too Straight, too Weird: Non-Reactive
Indicators for Meaningless Data in Internet Surveys.
Survey Research Methods, 13(3). doi: 10.18148/srm/2018.v13i3.7403.
Open Access URL:
1 Dominik J. Leiner, Ludwig-Maximilians University of Munich, Department of Communication Studies
and Media Research, Oettingenstr. 29, 80538 Munich, Germany,
Practitioners use various indicators to screen for meaningless, careless, or fraudulent
responses in Internet surveys. This study employs an experimental design to test whether
multiple post-hoc indicators are applicable to identify cases with low data quality. Findings
suggest that careless responses are most reliably identified by questionnaire completion
time, while a measure for within-subscale correlation structure is most indicative of fake
responses. This paper discusses the different indicators' benefits and drawbacks, explains
their computation, and proposes an index and a threshold value for completion speed. Given
the tested estimates for data quality, removal of suspicious cases is only suggested if a
significant amount of meaningless data is expected.
Keywords: Data Cleaning, Careless Responding, Fraud Detection, Meaningless Data,
Paradata, Web-based Surveys
1. Introduction
Respondent-administered Internet surveys are well-known as an efficient and cost-
minimizing method of collecting data. Since response behavior in web-based surveys was
found to be similar to that in pen-and-paper mail surveys (for a summary, see Couper, 2010), the survey
mode “Internet” became increasingly common in scholarly and commercial survey research.
Today, web-based surveys and discussions of their limitations and advantages are found in
most survey method textbooks (e.g., Bethlehem & Biffignandi, 2012; Fowler, 2009; Groves
et al., 2011; Marsden & Wright, 2010; Sue & Ritter, 2012).
One of the primary cost-savers in Internet surveys is automated response encoding, at least
for closed-ended questions. Every click is automatically interpreted and stored in the data set
– no matter if the respondent is earnestly completing the questionnaire, just having a look at
the questionnaire, or is filling in arbitrary answers to receive an incentive. The latter source
of invalid data is not specific to Internet surveys, but data quality on the Internet is more
likely to suffer: Firstly, an immaterial web page “may give respondents a sense of reduced
accountability” in comparison to a printed questionnaire (Johnson, 2005, p. 108). Secondly,
if no human being encodes the answers, no one takes notice of uncommon patterns, e.g.
zigzag patterns in matrix-style question batteries. Yet, additional metadata or paradata
(Kreuter, 2013) are available in web-based surveys, which therefore provide several advantages
over printed questionnaires regarding situational control. Common Internet survey software
stores page/survey completion times, a respondent's IP address, the browser identification,
information on the screen size and/or available plug-ins. More detailed paradata can
be collected with little effort (Callegaro, 2013; Olson & Parkhurst, 2013). Even the
commonly available paradata has proven helpful to identify multiple submissions by the
same respondent (Bowen, Daniel, Williams, & Baird, 2008; Johnson, 2005; Konstan,
Rosser, Ross, Stanton, & Edwards, 2005; Selm & Jankowski, 2006) and to supplement the
screening for careless responses (Barge & Gehlbach, 2012; Bauermeister et al., 2012; Meade
& Craig, 2012).
If respondents do a 15-minute questionnaire in 3 minutes, it is unlikely that they actually
read the questions and answers. Common sense proposes to remove such suspicious cases
from the data. These cases probably reduce overall accuracy (for other aspects of data
quality see Wang & Strong, 1996) and therefore may increase type II errors (not rejecting
wrong null-hypotheses; for details see Meade & Craig, 2012). Worse, if such data is
systematically different from valid data regarding response distributions (e.g., always
selected the first answer option, for a summary on response styles see van Vaerenbergh &
Thomas, 2012), researchers risk drawing wrong conclusions (type I errors, i.e., rejecting
correct null-hypotheses) due to measurement artifacts, and possibly make detrimental
recommendations (Bauermeister et al., 2012; Woods, 2006). In spite of the significant
threats that “bad data” poses to scholarly research, the identification of meaningless cases in
web-based surveys has mostly been left to practitioners (Bhaskaran & LeClaire, 2010;
Rogers & Richarme, 2009). Yet, if data cleaning is based on untested assumptions, removing
data may render new biases (Bauermeister et al., 2012; Harzing, Brown, Köster, & Zhao,
2012). A researcher may even face the accusation of manipulating data, if data cleaning is
not grounded in systematic research and the cleaned data fits the model and hypotheses better than
uncleaned data.
The aim of this paper is to increase understanding about bad or low quality survey data
(Schendera, 2007, p. 6). It starts with a summary of meaningless data and promising
indicators of data quality. Subsequently, multiple indicators are evaluated for their ability to
identify meaningless cases in self-administered Internet surveys. To do so, selected
respondents were asked for inaccurate answers to collect different types of meaningless data.
Finally, the paper draws conclusions on the application of quality indicators in field research.
2. Meaningless Data
In 1950, Payne published an influential paper “Thoughts About Meaningless Questions”.
This paper states that survey respondents try to make sense of survey questions, even if they
are meaningless when understood literally. Therefore, answers are not necessarily
meaningless if formally invalid – but meaningless responses are certainly invalid. The
defining elements of a meaningful response are the respondent's intention and ability (e.g.,
linguistic competence, Johnson, 2005, p. 105) to give a qualified answer to a question. This
answer may be biased or purposefully faked – it is meaningful as long as it is an expression
of the respondent's considerations on the question. The literature offers two perspectives on
meaningless data. The empirical perspective is content nonresponsivity (Meade & Craig,
2012, p. 437; Nichols, Greene, & Schmolck, 1989), i.e., a respondent's answer is
independent from what was asked. The respondent's perspective is satisficing, i.e., spending
limited or no cognitive effort on answering a question (Krosnick, 1991, 1999). The degree of
understanding and cognitive effort actually varies within a broad continuum. The term
meaningless data describes such data that was collected near the continuum's lower end.
The dimension of meaning is closely aligned to the established concept of validity and,
consequently, to data quality. Depending on the focus of research, “satisficing” (Krosnick,
1991), “inattentive or careless response” (Johnson, 2005; Meade & Craig, 2012, p. 438),
“response sets” (Jandura, Peter, & Küchenhoff, 2012), and “response styles” (van
Vaerenbergh & Thomas, 2012) may better describe the phenomenon with attributions to
behavior, cause, or data structure. Another term, “random response”, is often found in
literature and probably based on the notion that a respondent arbitrarily selects answer
options. It is somewhat misleading as meaningless answers typically follow clear, effortless
patterns (e.g., always selecting the first option, Meade & Craig, 2012, p. 438) and, therefore,
show minimal rather than maximal entropy.
The identification of meaningless data is part of the data cleaning process. Step 1 of this
process usually removes cases where important questions have not been answered. This
includes records in which respondents have not even answered one question. Incomplete
cases may, in some instances, be relevant to estimate self-selection biases and identify
problematic aspects of the instrument. In the bulk of studies, such cases simply do not
contribute information to statistical analyses and are therefore excluded from descriptive or
explanatory reasoning. Step 2 is the removal of multiple submissions by the same
respondents (Bauermeister et al., 2012; Bowen et al., 2008; Konstan et al., 2005) and data
doublets. Multiple submission is often considered a minor problem, because doing the same
survey twice is very unattractive in non-/low-incentivised, lengthy Internet surveys
(Birnbaum, 2003, p. 372; Göritz, 2004). If the study provokes multiple submissions, various
techniques may help with their identification (Musch & Reips, 2000). The problem of data
doublets and their removal is also bound to specific designs, e.g., when information about
organizations is collected and there are multiple contact persons per unit. Step 3 might
identify and remove cases with meaningless data (this paper's focus), followed by step 4
which removes ineligible cases where respondents are not part of the population under
research, and step 5 which removes extreme outliers that would disproportionally skew
statistical analysis and/or remove outlier responses from otherwise valid cases. Of course,
data cleaning may include further, less common steps and the steps' order may vary
throughout studies.
Meade and Craig (2012) distinguish two routes for detecting meaningless data. Under some
conditions, researchers expect a significant amount of meaningless data a-priori. In that
case, questions to identify meaningless data may be included in the questionnaire. This first
route is comprised of self-reports (direct questions whether to use the answers for analysis,
or scales for response behavior; also see Aust, Diedenhofen, Ullrich, & Musch, 2012) and
covered measures (scales designed to measure understanding or consistent responding,
bogus items, or instructed response items). With a focus on faking behavior, Burns and
Christiansen (2011) present a systematic framework and summary of such methods (also see
Allen, 1966; Lim & Butcher, 1996; Pine, 1995). Their summary also covers the second route
for detecting meaningless data: post-hoc analysis of the data collected in the survey. By and
large, this route identifies anomalies in the responses' means, variance, and correlation
structure. Paradata collected during the survey – especially completion times in an Internet
survey – might be considered a third path. It does not consist of questionnaire responses, but
is often available even if the researcher did not consider meaningless data a problem a-
priori. This paper's focus is the post-hoc analysis, including paradata. The primary research
question is:
Figure 2.1: Response patterns from the control group in a Likert-like scale on elaboration.
Overall scale consistency is α = .85 (N = 11,201). The item statements were placed left; the
rows originally varied in height and were aligned for illustration. * Items 2 and 9 are
particularly large.
[Figure: example response rows on a 6-point scale (options 1–6), grouped under the labels
"common response", "in-scale variation", and "clearly suspicious response patterns".]
RQ 1: Which data quality indicators, available post-hoc in an Internet survey, are the most
efficient ones in identifying cases of meaningless data?
Meaningless data describes only a symptom, not its causes or the form it takes.
Regarding causes, Nichols, Greene, and Schmolck (1989) suggest differentiating between
careless responding and faking. Therefore, a secondary research question is:
RQ 2: Which are the most efficient quality indicators used to identify specific types of
meaningless data?
Figure 2.2: Distribution of index values from a bi-polar attitude scale.
The peaks at the scale's middle and extremes are probably caused by straightliners. Data
from a 16-item scale (7-point, N = 11,032, M = 3.95, SD = 1.54; for details see the methods section).
There are two studies which empirically test indicators to identify careless cases in Internet
surveys. Both studies' questionnaires include one or more extensive personality inventories.
Johnson (2005) presents data collected with a 300-item-subset of the International
Personality Item Pool (IPIP, Goldberg, 1999). The reduced inventory was published online
to allow anonymous Internet users to complete a self-test. After removing duplicate records,
23,076 data sets remain for analysis. This study's focus is on the distributions of four quality
indicators. Using an elbow criterion, Johnson (2005) identifies clear cut-points in the
distributions of a straightlining index and the amount of missing responses. These indicators
suggest removing 6.3 % or 2.9 % of the cases. Two further indices for inter-item correlation
do not show clear cut points. Their interpretation requires a gold standard. By combining
two consistency measures, 10 % of the cases are identified as careless responses. Notably,
the two consistency measures identify mostly independent sets of cases (Johnson, 2005,
p. 118).
Meade and Craig (2012) survey a convenience sample of 438 undergraduate students with
about 400 items, including bogus items and self-reports on response quality. They compare
seventeen quality indicators and find them to correlate only weakly to moderately. Using a latent
cluster analysis, Meade and Craig (2012) identify three respondent classes of which two
(11 % of the sample) are considered careless responses. Respondents from the larger class
(9 %) are characterized by giving inconsistent answers, the others (2 %) show extensive
straightlining. According to these clusters, Meade and Craig (2012) find the correlation
measures, bogus items, and a diligence scale to be most efficient in identifying the clusters
of careless responses.
Meade and Craig (2012) and Johnson (2005) provide valuable insights into the distribution
of data quality indicators and the relations among different indicators. Yet, neither study
employs an external criterion for careless responses. The conclusion that a response is
careless is based on data structure and response anomalies – assuming that the indicators
actually predict careless responding. To test this assumption, and therefore, to determine
predictive validity of the indicators, this study employs an experimental design: A minority
of respondents is asked for careless and fraudulent responses, and indicators compete to
identify their records.
3. Post-hoc Data Quality Indicators
For the sake of clarity, five classes of post-hoc data quality indicators may be distinguished. The
first indicator class, (1) the percentage of missing data, is important for data cleaning in
general (Barge & Gehlbach, 2012; Börkan, 2010; Kwak & Radler, 2002; Shin, Johnson, &
Rao, 2012). Missing data, i.e., questions that have not been answered, are a significant
limitation for nearly any kind of data analysis and can render a record unusable. Internet
surveys can be configured to demand an answer before the respondent may continue to the
next questionnaire page. On the one hand, this ensures complete data sets. On the other hand,
such a filter may obfuscate cases in which a person just leafs through the questionnaire, e.g.,
to possibly complete it later. There is probably no better way to identify a problematic case
than if half of its answers are missing. The compromise lies in probing for an answer to
determine whether it was purposefully omitted (implicit “don't know”, Franzén, 2001;
Krosnick & Fabrigar, 2003; Schneider, 1985; Schuman & Presser, 1980). Probing enables
visitors to quickly screen the questionnaire while calling attention to answers unintentionally
missed. Regarding the quality of answers, missing data may be of ambivalent informative
value. Unmotivated respondents likely skip questions (Barge & Gehlbach, 2012), but highly
motivated respondents could as well express an “I do not feel qualified to answer this
question” by omitting the answer. Technically, open-ended text questions, single-choice
selections, and multiple choice selections may require different handling and/or weighting.
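As a minimal sketch of this first indicator class, the share of missing answers can be computed per case, optionally weighting answers by question type. The function, data layout, and weights below are illustrative assumptions, not taken from the study:

```python
# Sketch: per-case missing-data indicator (indicator class 1).
# A case is a list of items; None marks a missing answer. Weights per question
# type are an illustrative assumption, not values from the paper.

def missing_share(case, weights=None):
    """Return the (optionally weighted) share of missing answers in one case."""
    weights = weights or {}
    total = 0.0
    missing = 0.0
    for item in case:
        w = weights.get(item["type"], 1.0)  # unknown types get full weight
        total += w
        if item["answer"] is None:
            missing += w
    return missing / total if total else 0.0

case = [
    {"type": "choice", "answer": 3},
    {"type": "choice", "answer": None},
    {"type": "text", "answer": None},
    {"type": "choice", "answer": 1},
]
print(missing_share(case))                 # -> 0.5 (2 of 4 answers missing)
print(missing_share(case, {"text": 0.5}))  # open-ended text weighted less, -> ~0.43
```

A case would then be flagged when its share exceeds a cut-off such as the ones reported in Table 4.1.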
When data is encoded manually, (2) patterns in matrix-style questions (e.g., a Likert
question battery) are the most obvious indicator for suspicious data. Typical patterns painted
by annoyed respondents are straight vertical lines (the same response option is chosen for
each item of a scale, also known as straightlining, Schonlau & Toepoel, 2015), diagonal
lines, and a combination of both (Figure 2.1). Such patterns do not necessarily render the
response invalid, but it seems likely that the respondents had the pattern in mind rather than
the battery items. Technically, there are three indicator classes to describe patterns. To
compute (a) the number of straightlined scales, multiple short scales of about 5–25 items
each are required. (b) The longest sequence of the same answer (also known as longest
string, Johnson, 2005) requires a few extensive scales of about 50 items or more. (c)
Mathematical functions or algorithms calculate a pattern index from a sequence of answers
(Baumgartner & Steenkamp, 2001; Jong, Steenkamp, Fox, & Baumgartner, 2008; van
Vaerenbergh & Thomas, 2012, also see methods section). Such indices are usually more
sensitive to nearly-perfect patterns and allow more differentiation than the former measures.
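Two of the measures named above, the longest string (b) and a within-scale dispersion check for straightlining, can be sketched as follows (function names are illustrative; whether population or sample SD is used is an assumption here):

```python
# Sketch of two pattern indicators: the longest run of identical consecutive
# answers ("longest string") and the within-scale standard deviation, which
# drops to zero under perfect straightlining.

import statistics

def longest_string(answers):
    """Length of the longest sequence of identical consecutive answers."""
    best = run = 1
    for prev, cur in zip(answers, answers[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def within_scale_sd(answers):
    """Population SD of the answers within one scale; 0 = straightlining."""
    return statistics.pstdev(answers)

answers = [3, 3, 3, 3, 2, 5, 3, 3]
print(longest_string(answers))        # -> 4
print(within_scale_sd([4, 4, 4, 4]))  # -> 0.0 (perfect straight line)
```

A case would count as suspicious when, e.g., the longest string exceeds 13 or the within-scale SD falls below 0.13, as in Table 4.1.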
The (3) distance from the sample means is a straightforward measure to identify respondents
who answer in an atypical way. Removing outliers is a common preparation for many
statistical analyses, applied in other steps of data cleaning (above). The face validity of
removing outliers is highest if the sample shows a clean normal distribution and single
cases or small groups cause “peaks” and/or lie far outside the limits of three or four standard
deviations. Respondents who always click the first option in a scale, for example, can cause
such an outlier group (Figure 2.2). Outliers may indicate meaningless data, but Bhaskaran
and LeClaire (2010, p. 239) argue that outliers may as well be completely valid answers
from atypical respondents. Removing outliers will directly affect a sample's variance and
means. Technically, outliers can be computed for single variables or using multivariate
methods such as regression-adjusted scores (Burns & Christiansen, 2011) or the
Mahalanobis distance (Johnson, 2005).

Table 3.1: Experimental groups' descriptive statistics.

                                              Study 1 (S1)          Study 2 (S2)
                                              CG        EG          CG        EG
Records after manipulation check
  (percentage of sample)                      (97 %)    (3 %)       (56 %)    (44 %)
Average age in years                          –         –           –         –
Ratio female : male in percent                60 : 40   63 : 37     58 : 42   63 : 37
Respondents with matriculation standard       85 %      66 %        85 %      70 %

Notes. Effects of the group differences on the studies' results are discussed in the paper.
The paper generally displays percentage values without digits to account for sample sizes
of less than 1,000.
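As a minimal sketch of the Mahalanobis distance as an outlier indicator, worked out by hand for two variables so the quadratic form D² = (x − m)ᵀ S⁻¹ (x − m) stays visible (the data and function name are invented for illustration):

```python
# Sketch: squared Mahalanobis distance for the two-variable case,
# D^2 = (x - m)^T S^{-1} (x - m), with S the sample covariance matrix.
# The explicit 2x2 inverse of S is folded into the quadratic form.

def mahalanobis_sq_2d(x, data):
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    # sample covariance matrix S (denominator n - 1)
    sxx = sum((r[0] - mx) ** 2 for r in data) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in data) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mx, x[1] - my
    return (syy * dx ** 2 - 2 * sxy * dx * dy + sxx * dy ** 2) / det

data = [(1, 2), (2, 2), (2, 3), (3, 3), (2, 2.5), (8, 1)]  # last row atypical
d2 = [mahalanobis_sq_2d(row, data) for row in data]
print(max(d2) == d2[-1])  # -> True: the atypical case has the largest distance
```

In practice, a linear-algebra library would handle the general multivariate case; the point here is only the logic of flagging cases whose distance exceeds a cut-off.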
The (4) correlation structure within the answers (consistency) is a chimera. On the one
hand, answers about the same construct should be consistent and therefore highly correlated.
The same is expected for measures on related or dependent constructs, especially if such
correlations were shown in other studies. Respondents giving inconsistent answers,
therefore, are probably not giving valid answers. On the other hand, differentiation between
similar but non-identical items might rather indicate a respondent's cognitive effort
(Krosnick & Alwin, 1988). Vice versa, straightlining results in very high consistency if
scales do not contain reversed items. Kurtz and Parrish (2001) discuss that valid responses
also may seem inconsistent and Sniderman and Bullock (2004) argue that inconsistent
answers may simply indicate that the respondent is not familiar with the issue under
research. Last but not least, data cleaning based on correlations may interfere with
hypothesis testing. If only those respondents who show a correlation one seeks to test are
selected for analysis, this is clearly a violation of prudence. Yet, respondents generally vary
in answer consistency; therefore, choosing the consistent and possibly satisficing
respondents for analysis could result in the same fallacy. Technically, there are various
methods to compute a correlation index: (a) one or more correlations between related
constructs, (b) the correlation between scale indices calculated for half-split scales (known
as even-odd consistency index, Johnson, 2005; Meade & Craig, 2012, p. 443, for details see
methods section), suitable if the questionnaire contains multiple scales or subscales, or (c)
subscale consistency. Jandura, Peter, and Küchenhoff (2012) suggest employing regression
models and control for (d) the probability that the given answer is chosen, based on the
respondent's previous answers.
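One plausible reading of the even-odd consistency index (b) can be sketched as follows: each scale is split into its odd- and even-numbered items, the two half-scale means are computed per respondent, and the within-person correlation between the odd and even means across all scales serves as the consistency score. The scale layout and data below are invented:

```python
# Sketch of an even-odd consistency index for ONE respondent.
# Pure-Python Pearson correlation to keep the computation visible.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def even_odd_consistency(scales):
    """scales: list of per-scale answer lists for one respondent."""
    odd = [sum(s[0::2]) / len(s[0::2]) for s in scales]   # items 1, 3, 5, ...
    even = [sum(s[1::2]) / len(s[1::2]) for s in scales]  # items 2, 4, 6, ...
    return pearson(odd, even)

# A careful respondent answers each scale's items similarly:
careful = [[4, 5, 4, 4], [2, 1, 2, 2], [5, 5, 4, 5], [3, 3, 3, 2]]
print(round(even_odd_consistency(careful), 2))  # -> 0.99
```

A low or negative score would then point toward inconsistent, possibly careless responding, subject to the caveats discussed above.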
The innovation of web-based surveys, regarding data quality, is that (5) completion time is
routinely available – usually measured per questionnaire page. It is common sense that care,
consideration, or even reading suffers when a respondent completes the questionnaire
extremely fast (rushing). Malhotra (2008) presents evidence that survey completion time
correlates with measurement artifacts, and Furnham, Hyde, and Trickey (2013) show a positive
correlation between the personality trait reliability and completion time. Notably, research
on interviewer-administered surveys finds longer, not shorter, response latencies to indicate
low-quality data (Draisma & Dijkstra, 2004). A general downside of response times is their
extensive interpersonal variation (Fazio, 1990; Meade & Craig, 2012, p. 447; Yan &
Tourangeau, 2008). Possibly, interpersonal differences outweigh variation in the
respondents' effort. Leiner and Doedens (2010) point out that, beyond extreme cases,
response time does not predict test-retest reliability. Technically, page completion times can
be summarized as they are, or an index relative to some sample mean is created. A figurative
index is the relative speed, as discussed in the methods section. Such a speed index is much
closer to the normal distribution than the right-skewed absolute completion time, or put in
other words, the speed index accounts for the fact that the difference between completing a
questionnaire in two or three minutes is more important than the difference between twelve
and thirteen minutes. Further, a relative index is comparable throughout questionnaires of
different length and an index with page-wise normalization is more flexible regarding
optional (e.g., filtered) questionnaire pages.
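A page-wise relative speed index along these lines might be sketched as follows: each page time is normalized by the sample's median time for that page, and a respondent's speed is the reciprocal of their typical normalized time. This is one plausible formalization under stated assumptions; the paper's exact index may differ:

```python
# Sketch of a page-wise relative speed index. A value of 1.0 means a typical
# pace; values well above 1 mean rushing. Cut-offs and data are illustrative.

import statistics

def speed_index(page_times, typical_times):
    """page_times: one respondent's seconds per page;
    typical_times: the sample's median seconds per page."""
    ratios = [own / typ for own, typ in zip(page_times, typical_times) if typ > 0]
    return 1.0 / statistics.median(ratios)

typical = [20.0, 35.0, 30.0, 15.0]  # sample median seconds per page
rushed = [8.0, 12.0, 10.0, 6.0]     # much faster than typical
print(round(speed_index(rushed, typical), 2))  # -> 2.69
```

Using the per-page median (rather than the sum of times) makes the index robust to breaks on single pages and to optional, filtered pages.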
4. Method
To test the indicators' efficiency in identifying cases of meaningless data, two studies
employing an experimental design were conducted. Both studies used the same treatment
instructions and the same questionnaire – except for the political issue, which the questions
were about (details below). In order to represent heterogeneous data sets, study 1 (S1)
promoted large inter-individual variance by varying the questionnaire's issue between
respondents. Study 2 (S2) represents more homogeneous data sets and, therefore, the issue
was the same for all respondents. Within the experimental design, some liberties were taken
regarding group assignment: The control group (the reference group in both studies) and the
experimental groups were invited separately from slightly different populations. Likely side-effects are discussed below, where it is argued that these effects are not detrimental to the study's validity.
The control group (CG) received the questionnaire during a research project. The
instructions ensured anonymity and asked the respondents to give their personal opinions,
but included no appeal to answer the questions particularly carefully. No incentive was
offered to the control group. The experimental groups (EG) were invited at different times.
The instructions in the experimental conditions stated that it was a research project on poor
survey data, followed by a treatment instruction how to complete the questionnaire. The
instructions also announced one question at the questionnaire's end (the manipulation check)
that, other than the previous questions, would require an honest and careful answer. The
experimental design expects three causes of meaningless data: careless responding,
conscious faking (Nichols et al., 1989), and rushing. To cover these aspects, six treatment
instructions (experimental conditions) were designed and randomly assigned to the
respondents: (E1a) “Please complete this questionnaire as fast as possible”, (E1b, S2 only)
“Please try to reach the questionnaire's end as fast as possible”, (E2a) “Please take as little
care as possible in doing this questionnaire. Do this questionnaire deliberately carelessly”,
(E2b) “Please imagine that you have no interest at all in the questions, but your only interest
is to attend the lottery”, (E3a) like E2b but with the amendment “– yet try to make your
answers look authentic”, (E3b) “In the subsequent questionnaire, please disclose as little
about you and your opinion as possible”. The respondents in the experimental group were
offered entry into a lottery for 30 Euros in cash after finishing the questionnaire. Incentives
are known to encourage participation of disinterested respondents, who likely create
meaningless data (Göritz, 2006). The treatment instruction was colored red, inscribed in
large friendly letters on the first page. To ensure that every participant read the treatment
instructions, the second page had no content but a bold, red reminder explaining how to
complete the questionnaire. The experimental groups' questionnaire ended with a
manipulation check, consisting of six 6-point items on the respondent's completion behavior.
Again, an eye-catching instruction page preceded these items, announcing one page where
honest and careful answers are required.
All respondents were recruited from a volunteer online access panel. The return rate (click-
through rate 38 %) was much higher than in the access panel's other studies (M = 22 %). The
control group was randomly drawn from panelists registered to the panel before June 2011,
they received an e-mail invitation between June 2011 and October 2012. To avoid inviting
the same panelists to multiple conditions, the experimental groups were drawn from
panelists registered after June 2011 and invited in November 2012 (S1) and in June 2013
(S2). As the experimental treatment was obvious and required significant differences
regarding the instructions, using slightly different populations (registered before and after
June 2011) was considered acceptable. Finally, the experiments' aim was not to test a causal
relation, but to collect comparable data sets with high and low quality. As the panel does not
pay compensation for participating in surveys, the volunteers' primary motivation is
probably interest. Therefore, high motivation is likely and only a few records in the control
group should contain meaningless data.

Table 4.1: Indicator efficiency in identifying cases from the experimental group (study 1)

Data Quality Indicator                                  Cut Off    LR+   R²     Hit Rate
1.a Missing Data (absolute)                             ≥ 32 %     1.2   .00     8 %
1.b Missing Data (weighted)                             ≥ 10 %     1.1   .00     8 %
2.a Straightlining (no. of straightlined scales)        –          –     –      20 % A (35 %)
2.b Longest String (longest same-answer sequence)       ≥ 13       3.5   .02    23 %
2.c Straightlining (within scale SD)                    ≤ 0.13     4.1   .05    25 %
2.d Patterns (algorithmic)                              ≥ 0.65     3.9   .04    24 %
2.e Patterns (second derivation)                        ≤ 1.08     4.1   .03    25 %
3.a Average Item Distance from Sample Mean              ≥ 1.47     1.1   .00     6 %
3.b Mahalanobis Distance from Sample Mean               ≥ 82.5     3.5   .05    18 %
4.a Consistency (correlation of split-half scales)      ≤ .11      1.6   .01    10 %
4.b Unpredicted Answers (probability < 20 %)            ≥ 3        3.2   .03    19 % A
4.c Average Response Predictability (5 items)           ≤ .24      3.8   .04    24 %
4.d Unsystematic Answers (within-scale residual)        ≥ 1.01     3.5   .03    22 %
4.e Atypical Answers (residual's distance from avg.)    ≥ 0.31     4.9   .08    30 %
5.a Fast Completion (absolute time)                     ≤ 547 s    6.5   .04 B  38 %
5.b Fast Completion (abs. time w/o breaks)              ≤ 526 s    6.4   .13    38 %
5.c Fast Completion (relative speed index)              ≥ 1.77     6.4   .13    38 %

Notes. Efficiency is computed with a cut-off chosen to identify 783 cases. The diagnostic
likelihood ratio positive (LR+) is estimated conservatively, assuming that only cases from
the EG are truly meaningless data (positives). A binomial logistic regression is estimated to
give Nagelkerke's R². A These two indicators take very distinct values; depending on the
chosen cut-off, they identify too many or too few cases. The approximately corrected hit
rate for 4.b is 21 %. B Extreme outliers skew R², but not the identification rate.
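The diagnostic likelihood ratio positive (LR+) reported in Table 4.1 can be computed as sensitivity divided by the false-alarm rate. Under the table's conservative assumption, flagged EG cases count as hits and flagged CG cases as false alarms; the counts below are invented for illustration:

```python
# Sketch: diagnostic likelihood ratio positive, LR+ = sensitivity / (1 - specificity).
# flagged_* = cases an indicator marks as suspicious. Counts are illustrative,
# not taken from the study.

def lr_positive(flagged_eg, total_eg, flagged_cg, total_cg):
    sensitivity = flagged_eg / total_eg  # hit rate among (assumed) true positives
    false_alarm = flagged_cg / total_cg  # flag rate among control-group cases
    return sensitivity / false_alarm

print(round(lr_positive(30, 100, 50, 1000), 1))  # -> 6.0
```

An LR+ of 6 means a flagged case is six times more likely to come from the experimental group than from the control group.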
Multiple submissions from the same panelist were prevented by the access panel a-priori.
Cases from respondents who did not finish the questionnaire were removed. Each
respondent in study 1 (CG and EG) answered the questionnaire on 1 of 18 different political
issues. The control group of study 2 (n = 621) is a subset of the former control group, i.e.,
respondents that did the questionnaire on one specific issue. The respondents from study 2's
experimental group did the questionnaire on the same issue. Some respondents from the
experimental groups left comments, saying they are used to doing questionnaires carefully
and it was hard for them to follow the instructions.
Success of manipulation (rushing, carelessness, and fraud) is controlled by two manipulation
check items each. The answers to both items, each measured on a 6-point scale, are
averaged and participants who achieved a value of 2 or less are removed from analysis
(* inverted items): For EG1a/b rushing is controlled (fast completion, thinking about
questions*), for EG2a/b carelessness is controlled (correct answers*, careful answers*), for
EG3a/b fraud is controlled (try to answer consistently, make answers look authentic).
Consistently, answers to the manipulation check items suggested to remove 117 (26 %, S1)
and 139 cases (22 %, S2) from the experimental groups, who did the questionnaire too
slowly, carefully, or honestly, leaving 335 and 486 cases. Regarding demographics, the
panelists from the access panel resemble other convenience samples: They are rather young,
well educated, and include a majority of female respondents (Leiner, 2012). Table 3.1
summarizes the sample used for analysis.
Table 4.2: Identification efficiency per experimental condition (study 1)
Experimental Group                                     E1a    E2a    E2b    E3a    E3b    RG
(E1a: rushing; E2a/E2b: careless responding; E3a: plausible faking;
 E3b: no personal data; RG: simulated random data)
Experimental Group Size                                 82     68     69     65     51    75
1.b Missing Data (weighted)                            5 %    7 %    6 %    2 %    6 %   0 %
2.c Straightlining (within scale SD)                   7 %   25 %   46 %   25 %   12 %   0 %
2.e Patterns (second derivative)                      10 %   18 %   51 %   20 %   14 %   0 %
3.b Mahalanobis Distance from Sample Mean             11 %   21 %   19 %   14 %   24 %  16 %
4.c Average Response Predictability (5 items)         12 %   24 %   22 %   26 %   16 %  36 %
4.e Atypical Answers (residual's distance from av.)   16 %   22 %   35 %   32 %   37 %  76 %
5.a Fast Completion (abs. time w/o breaks)            23 %   40 %   57 %   31 %   14 %   A
5.c Fast Completion (relative speed index)            21 %   37 %   61 %   32 %   16 %   A
Notes. The most efficient indicators per condition are highlighted in bold. Note that the hit
rates refer to group sizes of less than 100; therefore, one case identified or not may account
for 2 %. A No hit rates were computed because completion times were not simulated in the
random group.
In addition to the control and experimental group, a series of random responses was
simulated in both studies. This pseudo-group allows a comparison between meaningless
Table 4.3: Indicator efficiency in identifying cases from the experimental group (study 2).
Data Quality Indicator                                  Cut Off    LR+    R²     Hit Rate
1.a Missing Data (absolute)                             ≥ 26 %     1.6    .17    72 %
1.b Missing Data (weighted)                             ≥ 2 %      2.0    .15    62 %
2.a Straightlining (no. of straightlined scales)        —         (1.7)   .09 C  15 % (32 %)
2.b Longest String (longest same-answer sequence)       ≥ 5        1.0    .03    59 %
2.c Straightlining (within scale SD)                    ≤ 0.19     1.3    .06    52 %
2.d Patterns (algorithmic)                              ≥ 0.46     1.2    .02    50 %
2.e Patterns (second derivative)                        ≤ 1.75     1.3    .02    51 %
3.a Average Item Distance from Sample Mean              ≥ 1.32     0.5    .11 C  22 %
3.b Mahalanobis Distance from Sample Mean               ≥ 44.7     1.0    .03 C  34 %
4.a Consistency (correlation of split-half scales)      ≤ .92      1.3    .07    51 %
4.b Unpredicted Answers (probability < 20 %)            ≥ 3        1.9    .01 C  9 %
4.c Average Response Predictability (5 items)           ≤ .40      1.3    .05    54 %
4.d Unsystematic Answers (within-scale residual)        ≥ 0.75     1.1    .01    47 %
4.e Atypical Answers (residual's distance from avg.)    ≥ 0.16     1.5    .12    56 %
5.a Fast Completion (absolute time)                     ≤ 826 s    2.5    .19    70 %
5.b Fast Completion (abs. time w/o breaks)              ≤ 720 s    2.8    .32    72 %
5.c Fast Completion (relative speed index)              ≥ 1.12     2.4    .28    68 %
Notes. The cut off was chosen to identify 511 cases. The procedure is the same as explained
in Table 4.1. C Four indicators identify the EG worse than chance; R² is misleading for these
indicators.
data, as found in surveys, and random data. 75 cases were simulated in both studies' random
groups (RG). Each variable was simulated independently by drawing a random sample with
replacement from the control group's answers to the same variable. This procedure results in
random samples that show the same response distribution as the control group for each
variable, while correlations between the variables are lost. The latter also means that there is
Table 4.4: Identification efficiency per experimental condition (study 2).
Experimental Group                                E1a    E1b    E2a    E2b    E3a    E3b    RG
(E1a/E1b: rushing; E2a/E2b: careless responding; E3a: plausible faking;
 E3b: no personal data; RG: simulated random data)
Experimental Group Size                            97    106     73     75     75     60    75
1.b Missing Data (weighted)                       36 %   51 %   44 %   51 %   48 %   53 %   0 %
2.c Straightlining (within scale SD)              21 %   22 %   40 %   51 %   43 %   40 %   0 %
3.b Mahalanobis Distance f. Sample Mean           20 %   18 %   27 %   25 %   15 %   22 %  13 %
4.e Atypical Answers (residual's dst. f. av.)     21 %   24 %   47 %   49 %   31 %   47 %  53 %
5.a Fast Completion (abs. time w/o breaks)        40 %   49 %   58 %   65 %   55 %   32 %   A
5.c Fast Completion (relative speed index)        36 %   44 %   60 %   64 %   56 %   32 %   A
Notes. Please refer to Table 4.2 for details. Chance for random identification is 14 to 18 %.
A No completion times simulated in RG.
no aggregation of missing data in single cases. Simulation was limited to responses;
paradata (particularly completion time) was not simulated.
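The column-wise bootstrap described above can be sketched as follows (a minimal illustration; the function name and arguments are not from the original analysis scripts):

```python
import numpy as np

def simulate_random_group(control: np.ndarray, n_cases: int, seed: int = 0) -> np.ndarray:
    """Simulate random respondents by sampling each variable (column)
    independently, with replacement, from the control group's answers.
    Marginal distributions are preserved; correlations between the
    variables are destroyed."""
    rng = np.random.default_rng(seed)
    n_vars = control.shape[1]
    simulated = np.empty((n_cases, n_vars), dtype=control.dtype)
    for j in range(n_vars):
        simulated[:, j] = rng.choice(control[:, j], size=n_cases, replace=True)
    return simulated
```

Because each column is drawn separately, every simulated value is an observed answer to the same variable, which is why the marginals match the control group while between-variable structure is lost.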
All respondents completed an online questionnaire about their attitudes towards public
issues. The first part resembles a public opinion poll, asking for the opinions on eighteen
issues, offering between two and four response options per issue. Answers to these eighteen
questions and to three further questions are forced unobtrusively: A technique is employed
that automatically continues with the next question as soon as a valid response or the “don't
know” button is clicked.
button is clicked. The second part asks for a detailed evaluation of the attitude towards one
issue, randomly drawn from the previously rated issues. To account for the respondents'
voluntary participation, the questionnaire abstains from extensive scale batteries in favor of
short scales. The evaluation in part two includes one 7-point bipolar scale consisting of 16
items to measure the emotional and cognitive aspects of the attitude, based on Crites,
Fabrigar, and Petty (1994). Furthermore, the questionnaire includes four 5-point scales on an
issue's value load (13 items, loosely based on Noelle-Neumann, 1989 and Fuchs, Gerhards,
& Neidhardt, 1991), subjective attitude ambivalence (6 items), elaboration (10 items, based
on Eveland, 2001 and Perse, 1990), and uncertainty (6 items, based on the dimensions of
uncertainty described by Kahneman & Tversky, 1982). All five short scales are presented in
matrix layout, i.e., the radio buttons (scale points) are arranged in a matrix with a column
for each scale point and a row for each item (Figure 2.1, p. 8); the items' texts are left (and
for the bi-polar scale left and right) of the radio buttons. Two open-ended questions ask for
arguments pro and contra the issue to measure the respondent's argument repertoire
(Cappella, Price, & Nir, 2002). Both questions offer up to 25 text inputs for single
arguments, technically accounting for 50 open-ended variables, of which typically 10 or
fewer were answered. In addition to the issue questions, the scale items, and the open-ended
questions, the questionnaire contains multiple single-item questions including the
demographics that are not discussed in detail. The questionnaire does not probe for missing
answers. Median completion time is 16.7 minutes (CG).
Figure 4.1: Gains and losses by threshold (relative speed index, study 1)
To quantify the (1) percentage of missing data, all automatically measured paradata (e.g.,
response time) is excluded from computation. All other responses, accounting for 182
variables, are analyzed in a very technical way, according to the codes stored by the software
SoSci Survey: Choosing the “don't know” option in the 21 response-forcing questions at the
[Figure 4.2 consists of six bar panels, each plotted against the relative speed index (0.25 to
1.75): [margin] Death Penalty, [margin] Legalize Marijuana, [mean] Political Interest,
[corr.] Certainty & Ambivalence, [corr.] Death Penalty & Marijuana, and [corr.] Death
Penalty & Polit. Interest.]
Figure 4.2: Margins and correlations by relative speed index
The sample was split into 20 subgroups by completion speed index, and each statistic was
computed for each subgroup. The middle line marks the overall sample's statistic. The
dashed lines symbolize the 95 % confidence interval for each subgroup. If a bar exceeds the
dashed lines, it is significantly different from the overall sample (p < .05).
beginning is considered a valid response; leaving any of the 50 open-ended text variables
void is considered a missing response. Checkboxes used in multiple choice questions are
always considered answered, as not clicking a checkbox is a valid choice. Two indices for
missing data are created: (1.a) the absolute percentage of missing answers and (1.b) an
index that weights each missing answer by the percentage of non-missing data on this
question or item (all reference values and subsequent statistics are based on a study's
complete sample including CG, EG, and RG). As 130 of the 132 closed questions have been
[Figure 4.3 consists of six panels, each plotting a statistic against the relative speed index
threshold (1.0 to 3.0): [margin] Death Penalty, [margin] Legalize Marijuana, [mean]
Political Interest, [corr.] Certainty & Ambivalence, [corr.] Death Penalty & Marijuana, and
[corr.] Death Penalty & Polit. Interest. An upper axis marks the percentage of the sample
retained (50 % to 100 %).]
Figure 4.3: Statistics by threshold
The threshold is varied between 1 and 3, and the resulting statistics are compared to those of
the reference group (middle line). The dashed and dotted lines are the confidence intervals
(1 and 2 standard deviations) of the reference group's statistics (NRG = 8,959). The axis at
the top shows the percentage of the sample retained at a specific threshold (bottom axis).
answered by 90 % or more of the respondents, the indices are highly correlated (r = .985).
The weighting procedure foremost accounts for the voluntary open-ended questions;
removing those from the calculation would cause the two indices to converge (r' > .999).
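The two missing-data indices can be sketched as follows (a minimal illustration under the definitions above; the function name is invented):

```python
import numpy as np

def missing_data_indices(data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-respondent missing-data indices for a cases x variables matrix
    in which np.nan marks a missing answer.
    1.a: absolute share of missing answers per case.
    1.b: each missing answer weighted by the share of respondents who
         did answer that variable (reference values from the full sample)."""
    missing = np.isnan(data)
    absolute = missing.mean(axis=1)                  # index 1.a
    answered_share = 1.0 - missing.mean(axis=0)      # per-variable weight
    weighted = (missing * answered_share).sum(axis=1) / data.shape[1]  # index 1.b
    return absolute, weighted
```

Skipping a rarely answered (voluntary) question thus contributes little to index 1.b, while skipping a question that nearly everyone answered contributes almost its full weight.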
The questionnaire's five short scales are analyzed to find (2) visual response patterns.
Various measures are tested: (2.a) The number of straightlined scales, i.e., scales where each
item received the same response. Given only 5 scales, this measure does not allow for much
differentiation. (2.b) The longest string, i.e., the maximum number of subsequent items that
receive the same response (within any of the scales). This measure is limited by the longest
scale's length of 16 items. (2.c) A small standard deviation of responses within a scale also
indicates straightlining (Barge & Gehlbach, 2012). The deviation index is computed by
averaging the SD within each of the five scales. As the SD within the 7-point scale is not
systematically larger than within the 5-point scales, no further correction is applied. In
contrast to the previous measures, the SD is less sensitive to single exceptions from the
straight line: If a respondent always clicks the first response option but chooses the second
option for the item in the scale's middle, there is still little deviation. (2.d) To detect patterns
beyond straightlining, an algorithmic measure is employed. The algorithm gives one point if
two subsequent items receive the same answer (detecting straightlining), one point if the
change between subsequent items is the same as the previous change (detecting diagonal
lines), and half a point if the change is the same as the next-to-previous change (detecting
left-right clicking). No more than one point is given per item, and the point sum is divided
by the number of
items minus one, resulting in a value between 0 and 1 per scale. Again, the index is
computed by averaging the five scales' values. (2.e) Pretests with manufactured patterns
show that the absolute second derivative of the response values is sensitive to straight,
diagonal, and zigzag lines. A test value v is computed for each scale (r_i is the response to
item i; k is the number of the scale's items, reduced by 2 because the second discrete
derivative is undefined for the first two items). Small test values indicate patterns. The
index, again, is the average of the five scales' test values.
v = 1/(k − 2) · Σ |r_i''|  (summing over i = 3 … k, with r_i'' = r_i − 2·r_{i−1} + r_{i−2})
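Indicators 2.b, 2.c, and 2.e can be sketched as follows; the second-difference index follows one plausible reading of the test value v, and all function names are illustrative rather than taken from the original analysis:

```python
import numpy as np

def longest_string(responses):                       # indicator 2.b
    """Longest run of identical subsequent answers within one scale."""
    best = run = 1
    for a, b in zip(responses, responses[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

def within_scale_sd(scales):                         # indicator 2.c
    """Average of the standard deviations of answers within each scale."""
    return float(np.mean([np.std(s, ddof=0) for s in scales]))

def second_difference_index(scales):                 # indicator 2.e (sketch)
    """Average absolute second discrete difference per scale, averaged
    over scales; straight and diagonal lines yield 0."""
    values = []
    for s in scales:
        r = np.asarray(s, dtype=float)
        values.append(np.mean(np.abs(np.diff(r, n=2))))  # k - 2 terms
    return float(np.mean(values))
```

For a straight line the first differences are all zero, and for a diagonal line they are constant, so in both cases the second differences vanish and the index is minimal.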
The distance from the sample means (3) is measured in two ways. (3.a) A simple measure is
the absolute z-scored response per item, averaged over all 51 scale items. Inter-case
standardization automatically results in a distance from the sample mean. Further, the
average of z-scored responses is robust against the items' different standard deviations and
against missing responses. (3.b) The Mahalanobis distance of all 51 scale items from the
sample's means is employed as a multivariate distance measure.
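Both distance measures can be sketched in a few lines (an illustrative implementation on complete data; the original analysis additionally had to handle missing responses):

```python
import numpy as np

def avg_abs_zscore(data: np.ndarray) -> np.ndarray:  # indicator 3.a
    """Mean absolute z-scored response per case (cases x items)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)
    return np.abs(z).mean(axis=1)

def mahalanobis_from_mean(data: np.ndarray) -> np.ndarray:  # indicator 3.b
    """Mahalanobis distance of every case from the sample mean vector.
    Undefined if the covariance matrix is singular (np.linalg.inv raises)."""
    centered = data - data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)
    return np.sqrt(d2)
```

The Mahalanobis distance weights deviations by the inverse covariance matrix, so an answer pattern that violates the items' usual correlations yields a large distance even if every single answer is unremarkable.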
Multiple measures based on (4) the correlation structure are tested. (4.a) The even-odd
consistency index is computed for the five scales: Each scale is half-split into even and odd
items, and an index value is computed for each half-scale. Responses to negative items are
inverted before calculating the scale indices. To account for different half-scale lengths and
missing data, item responses are averaged, not summed. This procedure results in two series
(even and odd) of five index values per respondent. The correlation coefficient between
these series is a combined measure of consistent responding within the scales and
differentiating between the scales. (4.b) Regression models for five single responses (test
items) near the questionnaire's end are computed: two single-item questions on political
interest and issue obtrusiveness, and the last item from each of the last three scales are
predicted by all previous responses. The later questions are explained by more variables
than the earlier ones (47 to 75 variables). For each test item, five binomial logistic
regressions predict the individual probability that a respondent will choose the response 1,
2, 3, 4, or 5. To keep complexity manageable, missing values are replaced by the sample's
rounded average per item. Jandura, Peter, and Küchenhoff (2012) suggest removing those
cases in which at least half of the answers are unlikely, i.e., their probability is less than one
fifth for 5-point scales. For this study, an index is created which gives one point for each
answer that is chosen with a probability of less than one fifth, resulting in an index from 0 to
5. (4.c) The average probability, which Jandura, Peter, and Küchenhoff (2012) found to be
less indicative, is also computed as an index. Both indices indicate overall response
consistency, using the full sample's (CG and EG) correlation structure as gold
standard. (4.d) A preliminary measure for intra-scale consistency is inspired by the idea of
using regressions. Within each scale, linear regression models predict every scale item's
response from the other scale items' responses. The absolute residuals for each item are
averaged per scale. To create an index of scale inconsistency, the average scale residuals are
again averaged (again, no systematic difference is observed between the 5- and 7-point
scales). This index indicates within-scale response consistency while controlling for the
items' specifics. (4.e) There is currently no literature on the interpretation of such a
consistency measure. While 4.d implies that inconsistent respondents are suspicious,
super-consistent respondents are outliers as well: minimal scale residuals could indicate
straightlining and other forms of item nondifferentiation. The index value's distance from
the sample's index average is therefore included as another measure. This measure is
sensitive to inconsistent and super-consistent respondents at the same time; in other words,
it is sensitive to an atypical within-scale correlation structure. To cover the whole range,
identification by small residuals (indicating super-consistent respondents only) was also
tested, and immediately rejected due to its negligible efficiency.
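Indicators 4.d and 4.e can be sketched with ordinary least squares (a minimal illustration on complete data; the function names are invented and the original analysis scripts may differ in detail):

```python
import numpy as np

def scale_residual_index(scales):                    # indicator 4.d (sketch)
    """For each scale (cases x items), predict every item from the other
    items by ordinary least squares and average the absolute residuals;
    the per-scale values are then averaged into one index per case."""
    per_scale = []
    for s in scales:
        s = np.asarray(s, dtype=float)
        resid = np.empty_like(s)
        for j in range(s.shape[1]):
            x = np.delete(s, j, axis=1)
            x = np.column_stack([np.ones(len(s)), x])    # add intercept
            beta, *_ = np.linalg.lstsq(x, s[:, j], rcond=None)
            resid[:, j] = np.abs(s[:, j] - x @ beta)
        per_scale.append(resid.mean(axis=1))
    return np.mean(per_scale, axis=0)

def atypical_residual(scales):                       # indicator 4.e (sketch)
    """Distance of each case's residual index from the sample average;
    flags inconsistent and super-consistent respondents alike."""
    idx = scale_residual_index(scales)
    return np.abs(idx - idx.mean())
```

Taking the absolute distance from the average residual in 4.e is what makes the measure two-sided: both unusually large and unusually small residuals are scored as atypical.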
The completion time (5) of a questionnaire comprises time to read and think, time to
respond (click or type), time for technical processing (Internet transmission, server
processing; also see Stieger & Reips, 2010), and possibly break times that are not actively
spent on the questionnaire, but on leaving the room, checking e-mails, reading news, etc.
The technical processing time is usually in the range of seconds. Given a dozen or more
items or questions that are answered, this small artifact only accounts for a negligible error.
The simplest measure of completion time is (5.a) the absolute duration between starting
the questionnaire and finishing the last page. In this study, control and experimental groups
completed different questionnaires (CG: 13 pages, EG: 16 pages incl. manipulation check)
that have 12 pages in common. The absolute completion time is, therefore, computed
between starting the first common page and finishing the last common page. (5.b) While
doing a web-based survey, respondents can pause as long and as often as they like. The
absolute 12-page completion time ranges between 57 seconds and 4.4 hours (CG). An
extremely long completion time on a single page most likely indicates that the participant
suspended answering the questionnaire to resume later. To avoid such breaks
disproportionately skewing the completion time index, page-wise outliers are replaced by
the sample's median page completion time. The median is chosen as a robust measure, as
the distribution of completion times is heavily skewed. For the same reason, an outlier is
defined as taking more than 3/1.34 times the interquartile range (IQR) longer than the
sample's median response time (equal to 3 SD if the distribution were normal). After
removing outlying page completion times, the 12-page completion time ranges from
57 seconds to 20 minutes (Mdn = 14 min.). Note that removing outliers does not only
change the absolute relations, but also the rank order that is relevant for data cleaning:
Spearman's rank correlation coefficient between the two measures of completion time is
only r = .727. (5.c) The absolute completion times always depend on the concrete
questionnaire's length and complexity. To create a measure that is independent of the
questionnaire, a third index is computed: the relative completion speed. For each page, the
sample's median completion time is divided by the individual completion time, resulting in
a speed factor. A factor of 2 means that the respondent completed the page twice as fast as
the typical respondent. Before an average speed factor is computed, the page factors are
clipped to the interval [0, 3]. This avoids disqualifying respondents who rush through a
single page or incidentally skip it. While limits smaller than 3 blur the distinction between
quick responders and people rushing a page, limits above 3 cause small differences in very
short completion times to account for too much variance. This optimization to a limit of 3 is
based on the test data; therefore, this measure's efficiency benchmark presented below is
possibly overestimated.
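The outlier replacement (5.b) and the relative speed index (5.c) can be sketched as follows (an illustrative implementation under the definitions above; names and matrix layout are assumptions):

```python
import numpy as np

def trim_page_outliers(page_times: np.ndarray) -> np.ndarray:
    """Replace page-wise outliers (longer than median + 3/1.34 x IQR,
    roughly 3 SD under normality) by the page's median time (for 5.b)."""
    times = page_times.astype(float).copy()
    medians = np.median(times, axis=0)
    iqr = np.percentile(times, 75, axis=0) - np.percentile(times, 25, axis=0)
    limit = medians + (3.0 / 1.34) * iqr
    mask = times > limit
    times[mask] = np.broadcast_to(medians, times.shape)[mask]
    return times

def relative_speed_index(page_times: np.ndarray, clip: float = 3.0) -> np.ndarray:
    """Relative completion speed (indicator 5.c) for a cases x pages
    matrix of page completion times in seconds: the sample's median time
    on a page divided by the individual time, clipped to [0, clip] and
    averaged per case. A factor of 2 means twice the typical speed."""
    medians = np.median(page_times, axis=0)
    factors = medians / page_times
    return np.clip(factors, 0.0, clip).mean(axis=1)
```

Because the index is built from page-wise ratios against the sample median, it is independent of the questionnaire's absolute length, which is what allows one threshold to be proposed across questionnaires.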
5. Results
Indicator Efficiency
Research question 1 asks for the data quality indicator that is most efficient in identifying
meaningless data. Identification usually works with a cut off value: all cases which score
higher (or lower) on an indicator than a given cut off are considered “bad data”. While
Johnson (2005) uses score distribution anomalies to identify cut offs and Jandura, Peter, and
Küchenhoff (2012) compare statistical methods to find an appropriate cut off, this study
uses distribution quantiles: if a given percentage p of the sample is known to contain
meaningless data, an indicator's 1-p quantile is considered a suitable cut off value.
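The quantile rule can be sketched in a few lines (an illustrative helper; the name and signature are not from the original analysis):

```python
import numpy as np

def quantile_cutoff(indicator: np.ndarray, p_bad: float,
                    higher_is_worse: bool = True) -> float:
    """Choose a cut off as a distribution quantile: if a share p_bad of
    the sample is expected to be meaningless, the (1 - p_bad) quantile
    of the indicator marks the suspicious tail (the p_bad quantile if
    low values are suspicious, e.g. completion time)."""
    q = 1.0 - p_bad if higher_is_worse else p_bad
    return float(np.quantile(indicator, q))
```

Cases beyond the returned value are then flagged, so the rule identifies exactly the expected share of suspicious records without assuming any particular indicator distribution.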
Due to this study's design, the records from the experimental group are known to contain
meaningless data, namely nEG = 335 (S1) and 486 cases (S2). However, the amount of
meaningless data in the control group is unknown and needs to be estimated. As the only
incentives for participation were goodwill and personal interest, the respondents are
probably highly motivated. A conservative estimate for the percentage of meaningless cases
is the percentage of (mostly) useless cases, identified by traditional measures: records with
more than 20 % missing answers (without voluntary open-ended text answers, nU1 = 399
cases of NCG = 11,201) and with minimal variance (straightlining in 4 of the 5 scales,
nU2 = 42 cases, no overlap with the missing-answer subgroup) account for about 4 % of the
control group. This is a conservative estimate and probably inaccurate. Yet, the estimate is
easily reproducible and allows a reliable comparison of the different indicators. The
estimate and subsequent analysis particularly do not rely upon unproven assumptions about
data quality; the research question disallows doing so to avoid the risk of circular
explanations. Assuming that 4 % of the control group have completed the questionnaire
carelessly does not change the relative efficiency of the different indicators. It only affects
the absolute values and is intended to give a more realistic estimate of the indicators'
efficiency.
Based on the above assumption, the cut off value for each indicator is the quantile where
4 % NCG + nEG = 783 (S1) and 511 cases (S2) are identified as suspicious data. Given this
cut off, and given that the estimate of 4 % meaningless data in the control group is valid, a
perfect indicator of meaningless data would identify 100 % of the experimental group and
4 % of the control group. The more cases from the experimental group remain below the cut
off, the less accurately the indicator identifies meaningless data. The probability of correctly
identifying a record from the experimental group by chance is 7 % (S1) and 46 % (S2).
Therefore, identification rates must not be compared between the studies but only within
each study.
The large difference is due to the small control group in study 2 (Table 3.1). This first
analysis step excludes the random group (RG) because random data is expected to be
different from both meaningful and meaningless data.
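The hit rate and LR+ reported in the tables follow from simple counts of flagged cases; a minimal sketch (the function name and the counts in the usage note are illustrative):

```python
def identification_metrics(flag_eg: int, n_eg: int,
                           flag_cg: int, n_cg: int) -> tuple[float, float]:
    """Hit rate = share of experimental-group (truly meaningless) cases
    flagged by the indicator; LR+ = hit rate divided by the share of
    control-group cases flagged (a conservative estimate, treating all
    flagged CG cases as false positives)."""
    hit_rate = flag_eg / n_eg          # sensitivity
    false_alarm = flag_cg / n_cg       # 1 - specificity
    return hit_rate, hit_rate / false_alarm
```

For example, an indicator flagging 50 of 100 EG cases and 20 of 100 CG cases yields a hit rate of .50 and an LR+ of 2.5, i.e., a flagged case is 2.5 times as likely to stem from the experimental group.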
Study 1
Table 4.1 presents a summary of indicator efficiency in study 1. Missing data (1) is a
surprisingly weak indicator for meaningless data from the experimental group. An
identification rate of 8 % is hardly better than chance (7 %). Patterns in matrix-like
questions (2) correctly identify 20 to 25 % of the experimental group. At least for the short
scales applied in the study, a small standard deviation per scale is a slightly better indicator
(25 %) than the established longest string indicator (23 %). The number of straightlined
scales does not allow a direct comparison due to the distinct character of this indicator: only
457 cases are identified by 2 or more straightlined scales, but they contain 20 % of the
experimental group, which results in a very good likelihood ratio positive (LR+ = 5.7). If
the cut off is set to 1 straightlined scale, 35 % of the experimental group are identified, but
also 2,081 cases overall (LR+ = 2.0). Notably, indicators which are sensitive not only to
straight lines but also to diagonal lines (2.d and 2.e) do not perform better than the
straightlining indicators. Regarding the distance from the sample means (3), the
multivariate Mahalanobis distance performs far better (18 %) than a simple average distance
(6 %), consistent with the findings of Meade and Craig (2012). The multivariate distance
indicator would have performed even better (approximately 21 %) were it not undefined for
nearly 20 % of
the sample due to missing data. Within the group of correlation measures (4), the simple
half-(sub)scale correlation (4.a) is a weak indicator in the present study. The regression-
based indicators consistently perform better, particularly the indicator of atypical residuals
(4.e, 30 %), which identifies not only those cases with unpredictable answers, but also those
with highly predictable answers. The best identification rates, however, are achieved with
the completion time indicators (5). These indicators, independent of their specific
computation, identify 38 % of the experimental group.
Indicating Different Types of Meaningless Data
Research question 2 asks for the efficiency of the quality indicators regarding different types
of meaningless data. The experimental conditions are designed to promote rushing (E1a),
careless responding (E2a, E2b), and faking (E3a, E3b). The simulated random group (RG)
complements the set by random responding. To answer RQ 2, each quality indicator must
identify a subcondition's cases within a sample composed of the control group
(NCG = 11,201) and the respective subcondition group. The subcondition groups comprise
between 51 and 82 cases, so a varying cut off value is chosen to identify 499 to 530 cases
(experimental subgroup plus 4 % CG). Note that, compared to the benchmark above, a
smaller percentage of cases is identified, the cut off is stricter, and the hit rate by chance is
smaller (4 to 5 % instead of 7 %). In the interest of clarity, this chapter presents only eight of
the seventeen quality indicators: those performing best in the previous analysis, and at least
one per indicator class.
Table 4.2 summarizes the results. Notably, the respondents asked to complete the
questionnaire as fast as possible (E1a) are the hardest to identify. One explanation is that the
manipulation in E1a did not work as expected and that, although they could simply have
omitted all answers, most respondents from this group did the questionnaire as quickly as
possible without reducing response quality. Therefore, another instruction (E1b) was
employed in study 2. The rushing group (E1a) as well as the groups who were asked to
respond carelessly (E2a, E2b) are best identified by completion time. Those asked for
careless but plausible answers (E3a) apparently were slower and were generally harder to
identify. Completion times as well as an atypical within-scale residual identify at least 32 %
of this faking group. In-depth analysis shows that these respondents answer either very
consistently or very inconsistently (20 % are identified by a large residual on 4.d, 17 % by a
very small residual, if this were tested as an index). The second faking group, asked not to
disclose personal data and opinions (E3b), is the only experimental subgroup much better
identified by a correlation measure (atypical within-scale residual, 4.e) than by completion
time. The group of simulated random responses (RG) expectedly scores low on missing
data, which is not accumulated in single records, and on straightlining. Regarding the best
indicator to identify the random responses, the random group (RG) is somewhat similar to
the second faking group (E3b): both are best identified by an atypical within-scale residual
(4.e) or, in this case, by predominantly inconsistent answers within the scales.
Study 2
The data set of study 2 represents more homogeneous data than that of study 1. As the
control group in study 2 (nCG = 621) is only slightly larger than the experimental group
(nEG = 486), 44 % of the overall experimental group would already be identified by chance.
In this homogeneous data set, the indicators based on straightlining (2), distance (3), and
correlation structure (4) fail to clearly exceed chance. Some indicators even score worse
than chance. In contrast, the percentage of missing data, among the worst indicators in
study 1, is one of the best indicators in study 2. Only completion times are equally efficient
in both studies. The findings from study 2 strongly contrast with those from study 1;
therefore, Table 4.3 lists all results in detail.
Table 4.4 lists the results for the different subgroups. Here, experimental subgroups (60 to
106 respondents) are compared to the control group (nCG = 621), resulting in by-chance hit
rates between 14 % (E2a) and 18 % (E1b). A pattern indicator (2.e) and the average
response predictability (4.c) are not listed as in Table 4.2 (p. 21) because they score equally
or worse than the other indicators within the same class (2.c and 4.e). The amount of
missing data scores among the best indicators for one rushing (E1b) and one faking group
(E3b). The indicator most suitable for identifying the meaningless data from the rushing and
careless groups, and from the other faking group (E3a), again, is completion time. Atypical
answers primarily identify the random data (RG).
6. Conclusion
An innovation of Internet surveys is that questionnaire completion time is automatically
available. This study uses a design where cases of meaningless data are known a priori.
Multiple post-hoc indicators contest to identify these cases in a larger data set. Completion
time is found to be the most reliable indicator to identify cases of meaningless data, as long
as respondents do not consciously fake data. Conscious faking apparently costs some effort
and levels completion time. Cases with fake data are best identified by anomalies in the
within-scale correlation structure that are detectable, for example, using linear regression.
Regarding suspicious patterns in matrix-like questions, the findings suggest concentrating
on respondents who always select the same point throughout a scale (straightliners).
Peculiar diagonal patterns have strong face validity to identify meaningless data, but they
are rare. Only 43 respondents from the experimental groups (5 %, n = 821) and 21
respondents from the control group (2 ‰, N = 11,201) show such patterns in at least one of
five scales. Only five respondents in the experimental groups fill two scales with zigzag
patterns (identified by an algorithm and validated manually), and none fill more. Notably,
most zigzag answers of the experimental groups are found in two longer scales (10 and 13
items), while such answers of the control group are exclusively found in two shorter scales
(6 items, both). These may still be valid answers that incidentally have a zigzag shape.
One group of cases was simulated in each study, using random data. Findings on this group
suggest that meaningless data collected from respondents is very different from random
data. While random data is, by construction, distinguished by large in-scale variation,
meaningless data typically shows little variation. The only experimental group that shows
similarly unsystematic responses as the random group – in the sense of within-scale
correlation structure – are those respondents who were asked not to disclose any personal
data or opinion.
7. Choosing a Threshold
To remove meaningless data, it is necessary to choose a threshold that separates “good”
from “bad” data. As the presented indicators can only moderately predict the group (i.e., the
data quality), any choice of threshold will entail an error: meaningless data will still remain
in the data set, while meaningful cases are removed. Note that Figure 4.1 shows percentages
of the control and experimental groups and is therefore somewhat misleading: 1 % of the
control group comprises about 100 cases, while 3½ cases make up 1 % of the experimental
group. Lowering the threshold always means removing more “good” cases from the control
group than “bad” cases from the experimental group. Generally, the probability of losing
“good” data increases with the percentage of “good” data. Therefore, it is advisable to
choose an indulgent threshold if a small amount of meaningless data is expected, while a
stricter threshold will be necessary if a significant amount of “bad” data has to be expected.
Under this premise, we suggest an optimal threshold for cleaning data based on completion
time (the relative speed index, 5.c). While the three completion time indices were similarly
efficient, the relative speed index is standardized on the sample's median completion time,
is bounded to a range of 0 to 3, and controls for individual outliers among page completion
times.
We do not recommend the threshold that yields the maximum statistical discrimination
between the control and experimental groups (Figure 4.1), both because of methodological
issues (limited manipulation success, meaningless data in the control group) and because
the goal of data cleaning is not to identify experimental groups but to reduce the biases
caused by “bad” data. Therefore, using the data set from study 1 (S1), we correlate
completion speed against biases. Three descriptive margins and three correlations were
computed, arbitrarily selected to cover a range of different effect strengths. We split the
overall sample from study 1 (N = 11,201 + 335) by completion speed into 20 similarly
large groups. The first group comprises the 5% of respondents (n1 = 577) with the slowest
completion speed, and so on, up to group 20, the 5% of respondents (n20 = 577) who spent
the least time on the questionnaire. Figure 4.2 shows the margins and correlations computed
for each of these groups individually. Contrary to the assumptions stated earlier in this
paper, not all variables are independent of response time: Political interest, a plausible
example, is weakly correlated (r = −.08, p < .001, CG) with the completion speed index.
While some statistics show clear outliers for fast respondents, others do not. A first
suggestion, therefore, is to check the relevant measures for correlations with response speed
(especially in the subsample with an index above 1.25). If there is no correlation, fast
respondents do not introduce bias, and there is no gain in removing them from the sample.
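The per-group inspection described above can be sketched as a simple quantile split; the helper name and interface are illustrative assumptions:

```python
def speed_quantile_groups(cases, speed_of, n_groups=20):
    """Sort cases by a speed key and split them into n_groups similarly
    large groups (here: 20 groups of 5% each), so that a statistic can
    be computed per group and inspected for outliers among the fastest
    respondents."""
    ordered = sorted(cases, key=speed_of)
    size = len(ordered) // n_groups
    groups = [ordered[i * size:(i + 1) * size] for i in range(n_groups)]
    groups[-1].extend(ordered[n_groups * size:])  # remainder joins last group
    return groups
```

Computing, say, a margin or correlation within each group and plotting it against group rank reproduces the kind of diagnostic shown in Figure 4.2.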
Another analysis is conducted to quantify possible biases: A subsample of the control group
(CG) serves as reference group (NRG = 8,959 of NCG = 11,201) to obtain (mostly) unbiased
statistics. The reference group comprises those respondents of the control group with a
completion speed index between the 10th and 90th percentile of the sample, i.e., the 10% of
cases with the fastest and the 10% with the slowest completion speed were removed from
the control group. The statistics' 95% confidence intervals are estimated by bootstrapping
(2,500 repetitions) from the reference group. Different thresholds between 1 (retaining 45%
of the full sample) and 3 (the possible maximum, i.e., the full sample) are applied, and the
statistics are compared to those of the reference group (Figure 4.3). As Figure 4.2 already
suggests, only few statistics (e.g., the margin in favor of the death penalty) become less
biased when cases are removed by completion speed. For other statistics, a bias against the
reference group arises from including respondents who completed the questionnaire rather
slowly (who are not in the reference group), so a larger threshold decreases rather than
increases the bias (e.g., the correlation between the positions pro death penalty and pro
legalizing marijuana). Looking at the absolute values also reveals that gains are limited:
The difference in support for the death penalty between 12.1% (reference group) and 12.7%
(all respondents, including min. 3% poor data from the experimental group) lies within the
sampling error of most studies.
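The reference-group confidence intervals can be estimated with a percentile bootstrap along the following lines; this is a sketch under the assumption of a plain percentile bootstrap, as the exact variant is not specified above:

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, reps=2500, alpha=0.05):
    """Percentile-bootstrap confidence interval for a statistic, mirroring
    the 2,500-repetition bootstrap on the reference group: resample the
    data with replacement, recompute the statistic each time, and take the
    alpha/2 and 1-alpha/2 percentiles of the resulting distribution."""
    n = len(values)
    estimates = sorted(
        stat([random.choice(values) for _ in range(n)]) for _ in range(reps)
    )
    lower = estimates[int(reps * alpha / 2)]
    upper = estimates[int(reps * (1 - alpha / 2)) - 1]
    return lower, upper
```

A sample statistic falling outside this interval would then indicate a bias relative to the reference group.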
Based on the previous analyses, a sensible threshold for the relative speed index lies between
1.5 and 2.0. The few records with an index above 2.0 (questionnaire completed more than
twice as fast as the typical respondent) most likely contain meaningless data; the same is
unlikely for records with an index below 1.5. In large samples, it is useful to compare the
statistics computed on the full sample with those computed only on records with an index
below 1.5. If limiting the sample has no significant effect on the results, removing cases
above 2.0, or not removing any records at all, will be sufficient. If there are differences, or
if a small sample does not allow such comparisons, we suggest a threshold of 1.75 as a
compromise between retaining meaningful data and removing meaningless data.
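The recommended full-versus-trimmed comparison can be sketched in a few lines; the helper name is an illustrative assumption:

```python
import statistics

def compare_with_trimmed(values, speed_index, stat=statistics.mean, cutoff=1.5):
    """Compute a statistic on the full sample and on the subsample of cases
    with a relative speed index below the cutoff, to check whether fast
    respondents noticeably shift the result."""
    kept = [v for v, s in zip(values, speed_index) if s < cutoff]
    return stat(values), stat(kept)
```

If the two estimates differ only negligibly, removing fast respondents offers little benefit for that statistic.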
8. Discussion
Methodologically, it is a very welcome result that completion time scores best in the
identification of meaningless data. Such paradata is probably uncorrelated with most
constructs relevant to a study's research question – at least if the field of research does not
concern age, reading, computer literacy, or (as seen above) interest and involvement in the
questionnaire's topic. While cleaning data based on atypical responses is a common target
of criticism, removing cases by completion time is unlikely to cause systematic bias. Yet,
ethical concerns may oppose such cleaning: Completion time is collected without the
respondent's knowledge, and there is large inter-individual variation. Finally, the researcher
might discard a respondent's valuable contribution due to a considerable statistical error.
When cleaning data by completion time, we suggest a speed index that controls for outlier
pages: For each questionnaire page, a quotient is computed between the median page
completion time and the respondent's completion time on that page. Pages that show (very)
different content due to filter questions shall be omitted. Quotients above 3 (individual
outliers) are replaced by 3. The average across all pages then constitutes the speed index.
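The index computation just described can be sketched as follows (the function name is an illustrative assumption; pages with filter-dependent content are assumed to be excluded before the call):

```python
import statistics

def relative_speed_index(page_times, median_page_times, cap=3.0):
    """Relative speed index as described above: for each questionnaire
    page, divide the sample's median completion time by this respondent's
    completion time, replace quotients above 3 (individual outliers) by 3,
    and average the quotients across all pages."""
    quotients = [
        min(median / own, cap)
        for own, median in zip(page_times, median_page_times)
    ]
    return statistics.mean(quotients)
```

A respondent who is exactly as fast as the median on every page receives an index of 1.0; a respondent twice as fast receives 2.0.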
Unless many respondents (about one third or more) complete the questionnaire suspiciously
fast, this index yields comparable values. Removing cases with an index above 1.75 will be
suitable for most studies.
Regarding the differences between studies 1 and 2, two findings merit attention. Firstly, the
amount of missing data identifies cases in a homogeneous (S2), but not in a heterogeneous
data set (S1). The most plausible explanation is the large variance in familiarity with the
questionnaire's topic in study 1: Implicit expressions of “don't know” likely confound with
answers omitted due to careless responding. This makes the amount of missing data a rather
unreliable indicator for meaningless data unless familiarity with the topic is controlled.
Secondly, the distance indicators (e.g., the multivariate Mahalanobis distance from the
sample mean) performed unexpectedly poorly in identifying the experimental group in
study 2. A plausible explanation is the atypically large percentage of meaningless data,
which influences the overall distribution; in such cases, the average becomes an
inappropriate reference for “typical” data. In combination with the critique that removing
data based on distance necessarily influences the sample's variance, the use of distance
measures for data cleaning is limited.
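For reference, the distance indicator in question can be sketched with NumPy (the function name is an illustrative assumption); note that it inherits the caveat discussed above, since mean and covariance are themselves distorted when meaningless data is frequent:

```python
import numpy as np

def mahalanobis_from_sample_mean(X):
    """Mahalanobis distance of each case (row of X) from the sample mean:
    distances account for the scales and correlations of the items, so
    multivariately atypical cases receive large values."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - mu
    # d_i = sqrt( (x_i - mu)' * Cov^-1 * (x_i - mu) ) for every row i
    return np.sqrt(np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))
```

In a mostly clean sample a single extreme case stands out clearly; with many contaminated cases, the reference distribution itself shifts and the indicator loses power.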
On the one hand, the results are far from encouraging. Even the best indicators for
meaningless data identify less than half of the records known to contain careless data. Their
efficiency drops further if careless respondents try to make their answers look “good”. The
benefit of removing suspicious cases demands a critical discussion for each individual
survey, considering that half (or more) of the meaningless cases remain in the data set while
meaningful data is deleted. Furthermore, it is important to note that removing meaningless
records is not necessarily the same as removing outliers. The latter may be sensible anyway
if a few cases disproportionately skew analyses due to extreme responses – for example,
respondents who try to sabotage a study (Konstan et al., 2005), or who always select the
first or last response option.
On the other hand, the study's estimates of identification efficiency are predominantly
conservative. Firstly, comments from the experimental group suggest that the treatment's
success in creating meaningless data was limited: It was necessary to remove a significant
number of records that failed the manipulation check – and the removal criteria were still
chosen mildly. Secondly, the absolute efficiency figures are based on conservative estimates
of the control group's data quality. Thirdly, the large variance in study 1 makes it especially
challenging to identify atypical cases. While the latter provided a true challenge for the
indicators, the first two issues suggest devising a better treatment – a design that can more
precisely yield meaningful and meaningless data. Such a treatment might even allow
including self-reported measures of data quality without confounding a manipulation check.
This study includes neither self-reports nor scales designed to measure inconsistent or faked
responding. This shortcoming prevents a comparison of a-priori and post-hoc measures of
data quality. Another shortcoming regarding comparability is the choice of short scales in
the questionnaire: Although short scales often meet the needs of applied social research,
other empirical work on the removal of meaningless cases in Internet surveys (Johnson,
2005; Meade & Craig, 2012) employs extensive psychological scales. A one-hour
personality inventory requires much more concentration and cognitive stamina than a
diversified twenty-minute questionnaire, so the effects of carelessness on response behavior
might be quite different.
Another shortcoming might lead to overestimating the identification rates: The control and
experimental groups are drawn from slightly different populations (recruited for the access
panel before/after June 2011). Indicators measuring the distance from the sample mean most
likely “benefit” from this difference. On average, respondents in the experimental group are
older than those in the control group. As the literature reports age to correlate negatively
with computer literacy, the completion time indicators likely “suffer” from this difference.
Finally, this study strengthens the notion that meaningless data is a complex issue. (1)
Completion time scores best despite its limitations, while most other data quality indicators
show rather weak identification rates. One interpretation is that meaningless response
behavior is time-saving but diverse in other respects. Some participants apparently reduce
differentiation, which is visible as (near-)straightlining. Yet, the study's results do not paint
a clear picture of what the others do – van Vaerenbergh and Thomas (2012) present a
typology of possible response styles. A combination of different indicators may prove
fruitful. (2) In the interest of clarity, this study does not conceptualize data quality as the
continuum that it most likely is. Galesic and Bosnjak (2009), for example, show that a
respondent's carefulness may decline throughout the questionnaire. If such a decline results
in only the last few scales being straightlined, a case-wise indicator still measures moderate
data quality. Further, if data quality is a continuum, there can be no genuine threshold value
to distinguish “good” from “bad” data.
The practical summary of this study is that completion time is quite useful for identifying
cases of meaningless data. If, for example, an attractive incentive provokes large amounts of
meaningless data, removing suspicious cases can seriously increase data quality. In contrast,
if the percentage of meaningless data is small, a substantial amount of meaningful data may
be lost. Not removing suspicious records at all, and risking type II errors instead, may be
reasonable in such a setting. Avoiding type I errors by excluding outliers from statistical
analyses should be considered independently.
Table: Cross-correlation between different data quality indicators
Indicator 1.a 1.b 2.c 2.e 3.a 3.b 4.c 4.e 5.a 5.c
1.a Missing Data (absolute) .95 .18 −.18 −.11 −.07 −.02 .09 .32 .37
1.b Missing Data (weighted) .95 .16 −.18 −.08 −.06 −.03 .09 .28 .35
2.c Straightlining (within scale SD) .18 .16 −.77 −.33 −.71 .21 .02 .14 .17
2.e Patterns (second derivation) −.18 −.18 −.77 .18 .62 −.05 −.04 −.16 −.20
3.a Avg. Item Dst. f. Sample Mean .11 −.08 −.33 .18 .39 −.35 −.01 −.04 −.06
3.b Mahalanobis Dst. f. Spl. Mean −.07 −.06 −.71 .62 .39 −.12 −.04 −.07 −.10
4.c Average Response Predictability −.02 −.03 .21 −.05 −.35 −.12 .06 −.02 −.03
4.e Atypical Answers .09 .09 .02 −.04 −.01 −.04 .06 .05 .07
5.a Fast Completion (absolute time) .32 .28 .14 −.16 −.04 −.07 −.02 .05 .85
5.c Fast Completion (index) .37 .35 .17 −.20 −.06 −.10 −.03 .07 .85
Notes. The table gives rank correlations (Spearman), as only an indicator's rank order – not
its absolute values or distribution – is relevant for removal by threshold.
Allen, I. L. (1966). Detecting respondents who fake and confuse information about
questions areas on surveys. Journal of Applied Psychology, 50(6), 523–528.
Aust, F., Diedenhofen, B., Ullrich, S., & Musch, J. (2012). Seriousness checks are useful to
improve data validity in online research. Behavior Research Methods. Advance online
publication.
Barge, S., & Gehlbach, H. (2012). Using the Theory of Satisficing to Evaluate the Quality of
Survey Data. Research in Higher Education, 53(2), 182–200.
Bauermeister, J. A., Pingel, E., Zimmerman, M., Couper, M. P., Carballo-Diéguez, A., &
Strecher, V. J. (2012). Data Quality in HIV/AIDS Web-Based Surveys: Handling
Invalid and Suspicious Data. Field Methods, 24(3), 272–291.
Baumgartner, H., & Steenkamp, J.-B. E.M. (2001). Response Styles in Marketing Research:
A Cross-National Investigation. Journal of Marketing Research, 38(2), 143–156.
Bethlehem, J., & Biffignandi, S. (2012). Handbook of web surveys. Hoboken, NJ: Wiley.
Bhaskaran, V., & LeClaire, J. (2010). Online surveys for dummies. For dummies. Hoboken,
N.J: Wiley.
Birnbaum, M. H. (2003). Methodological and Ethical Issues in Conducting Social
Psychology Research via the Internet. In C. Sansone (Ed.), The Sage handbook of
methods in social psychology (pp. 359–382). Thousand Oaks, CA: Sage.
Börkan, B. (2010). The Mode Effect in Mixed-Mode Surveys. Social Science Computer
Review, 28(3), 371–380.
Bowen, A. M., Daniel, C. M., Williams, M. L., & Baird, G. L. (2008). Identifying Multiple
Submissions in Internet Research: Preserving Data Integrity. AIDS and Behavior,
12(6), 964–973.
Burns, G. N., & Christiansen, N. D. (2011). Methods of Measuring Faking Behavior.
Human Performance, 24(4), 358–372.
Cappella, J. N., Price, V., & Nir, L. (2002). Argument Repertoire as a Reliable and Valid
Measure of Opinion Quality: Electronic Dialogue During Campaign 2000. Political
Communication, 19(1), 73–93.
Callegaro, M. (2013). Paradata in web surveys. In F. Kreuter (Ed.), Improving surveys with
paradata. Analytic use of process information (pp. 261–279). Hoboken, NJ: John
Wiley & Sons.
Couper, M. P. (2010). Internet Surveys. In P. V. Marsden & J. D. Wright (Eds.), Handbook of
survey research (2nd ed., pp. 527–550). Bingley: Emerald.
Crites, S. L., Fabrigar, L. R., & Petty, R. E. (1994). Measuring the Affective and Cognitive
Properties of Attitudes: Conceptual and Methodological Issues. Personality and Social
Psychology Bulletin, 20(6), 619–634.
Draisma, S., & Dijkstra, W. (2004). Response Latency and (Para)Linguistic Expressions as
Indicators of Response Error. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler,
E. Martin, J. Martin,. . . C. Skinner (Eds.), Wiley Series in Survey Methodology.
Methods for Testing and Evaluating Survey Questionnaires. Hoboken, NJ: John Wiley
& Sons.
Eveland, W. P. (2001). The Cognitive Mediation Model of Learning From the News:
Evidence From Nonelection, Off-Year Election, and Presidential Election Contexts.
Communication Research, 28(5), 571–601.
Fazio, R. H. (1990). A practical guide to the use of response latency in social psychological
research. In C. Hendrick & M. S. Clark (Eds.), Review of personality and social
psychology: Vol. 11. Research methods in personality and social psychology (pp. 74–
97). Newbury Park: Sage.
Fowler, F. J. (2009). Survey research methods (4th ed.). Applied social research methods
series: Vol. 1. Los Angeles: Sage.
Franzén, M. (2001). Nonattitudes/Pseudo-Opinions: Definitional Problems, Critical
Variables, Cognitive Components and Solutions. Retrieved from
Fuchs, D., Gerhards, J., & Neidhardt, F. (1991). Öffentliche Kommunikationsbereitschaft:
Ein Test zentraler Bestandteile der Theorie der Schweigespirale
(Veröffentlichungsreihe der Abteilung Öffentlichkeit und soziale Bewegungen des
Forschungsschwerpunkts Sozialer Wandel, Institutionen und Vermittlungsprozesse des
Wissenschaftszentrums Berlin für Sozialforschung). Berlin. Retrieved from
Wissenschaftszentrum Berlin für Sozialforschung gGmbH (WZB) website:
Furnham, A., Hyde, G., & Trickey, G. (2013). On-line questionnaire completion time and
personality test scores. Personality and Individual Differences, 54(6), 716–720.
Galesic, M., & Bosnjak, M. (2009). Effects of questionnaire length on participation and
indicators of response quality in a web survey. Public Opinion Quarterly, 73(2), 349–
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring
the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, Fruyt F.
de, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28).
Tilburg, NL: Tilburg Univ. Press.
Göritz, A. S. (2004). The impact of material incentives on response quantity, response
quality, sample composition, survey outcome, and cost in online access panels.
International Journal of Market Research, 46(3), 327–345.
Göritz, A. S. (2006). Incentives in Web Studies: Methodological Issues and a Review.
International Journal of Internet Science, 1(1), 58–70. Retrieved from
Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R.
(2011). Survey Methodology (2nd ed.). Hoboken, NJ: John Wiley & Sons.
Harzing, A.-W., Brown, M., Köster, K., & Zhao, S. (2012). Response Style Differences in
Cross-National Research. Management International Review, 52(3), 341–363.
Jandura, O., Peter, C., & Küchenhoff, H. (2012). Die Guten ins Töpfchen, doch wer sind die
Schlechten? Ein Vergleich verschiedener Strategien der Datenbereinigung [Picking
the good ones, but which are the bad ones? Comparing different strategies of cleaning
data]. 14th annual conference of the DGPuK methods group, Sept. 27.-29., Zürich,
Johnson, J. A. (2005). Ascertaining the validity of individual protocols from Web-based
personality inventories. Journal of Research in Personality, 39(1), 103–129.
Jong, M. G. de, Steenkamp, J.-B. E.M., Fox, J.-P., & Baumgartner, H. (2008). Using Item
Response Theory to Measure Extreme Response Style in Marketing Research: A
Global Investigation. Journal of Marketing Research, 45(1), 104–115.
Kahneman, D., & Tversky, A. (1982). Variants of uncertainty. In D. Kahneman, P. Slovic, &
A. Tversky (Eds.), Judgement under uncertainty. Heuristics and biases (pp. 509–520).
Cambridge, London: Cambridge Univ. Press.
Konstan, J. A., Rosser, B. R. S., Ross, M. W., Stanton, J., & Edwards, W. M. (2005). The
story of subject naught: A cautionary but optimistic tale of Internet survey research.
Journal of Computer-Mediated Communication, 10(2). Retrieved from
Kreuter, F. (2013). Improving surveys with paradata: Introduction. In F. Kreuter (Ed.),
Improving surveys with paradata. Analytic use of process information (pp. 1–9).
Hoboken, NJ: John Wiley & Sons.
Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of
attitude measures in surveys. Applied Cognitive Psychology, 5(3), 213–236.
Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567.
Retrieved from
Krosnick, J. A., & Alwin, D. F. (1988). A Test of the Form-Resistant Correlation Hypothesis:
Ratings, Rankings, and the Measurement of Values. The Public Opinion Quarterly,
52(4), 526–538.
Krosnick, J. A., & Fabrigar, L. R. (2003). “Don’t Know” and “No Opinion” Responses:
What They Mean, Why They Occur, and How to Discourage Them: Paper presented to
the Workshop on Item Non-response and Data Quality in Large Social Surveys, Basel,
Switzerland, 10/2003.
Kurtz, J. E., & Parrish, C. L. (2001). Semantic Response Consistency and Protocol Validity
in Structured Personality Assessment: The Case of the NEO-PI-R. Journal of
Personality Assessment, 76(2), 315–332.
Kwak, N., & Radler, B. (2002). A Comparison Between Mail and Web Surveys: Response
Pattern, Respondent Profile, and Data Quality. Journal of Official Statistics, 18(2).
Retrieved from
Leiner, D. J. (2012, March). SoSci Panel: The Noncommercial Online Access Panel. Poster
presented at the GOR 2012, Mannheim. Retrieved from
Leiner, D. J., & Doedens, S. (2010). Test-Retest-Reliabilität in der Forschungspraxis der
Online-Befragung. In N. Jackob, T. Zerback, O. Jandura, & M. Maurer (Eds.),
Methoden und Forschungslogik der Kommunikationswissenschaft: Vol. 6. Das
Internet als Forschungsinstrument und -gegenstand in der
Kommunikationswissenschaft (pp. 316–331). Köln: Halem.
Lim, J., & Butcher, J. N. (1996). Detection of Faking on the MMPI--2: Differentiation
Among Faking-Bad, Denial, and Claiming Extreme Virtue. Journal of Personality
Assessment, 67(1), 1–25.
Malhotra, N. (2008). Completion time and response order effects in web surveys. Public
Opinion Quarterly, 72(5), 914–934.
Marsden, P. V., & Wright, J. D. (Eds.). (2010). Handbook of survey research (2nd ed.).
Bingley: Emerald.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.
Psychological Methods, 17(3), 437–455.
Musch, J., & Reips, U.-D. (2000). A Brief History of Web Experimenting. In M. H.
Birnbaum (Ed.), Psychological experiments on the Internet (pp. 61–88). San Diego,
CA: Academic Press.
Nichols, D. S., Greene, R. L., & Schmolck, P. (1989). Criteria for assessing inconsistent
patterns of item endorsement on the MMPI: Rationale, development, and empirical
trials. Journal of Clinical Psychology, 45(2), 239–250.
Noelle-Neumann, E. (1989). Die Theorie der Schweigespirale als Instrument der
Medienwirkungsforschung. In M. Kaase & W. Schulz (Eds.), Kölner Zeitschrift für
Soziologie und Sozialpsychologie, Sonderhefte: Vol. 30. Massenkommunikation.
Theorien, Methoden, Befunde (pp. 418–440). Opladen: Westdt. Verl.
Olson, K., & Parkhurst, B. (2013). Collecting paradata for measurement error evaluations. In
F. Kreuter (Ed.), Improving surveys with paradata. Analytic use of process information
(pp. 43–72). Hoboken, NJ: John Wiley & Sons.
Payne, S. L. (1950). Thoughts About Meaningless Questions. Public Opinion Quarterly,
14(4), 687–696.
Perse, E. M. (1990). Media Involvement and Local News Effects. Journal of Broadcasting
& Electronic Media, 34(1), 17–36.
Pine, D. E. (1995). Assessing the validity of job ratings: An empirical study of false
reporting in task inventories. Public Personnel Management, 24(4), 451.
Rogers, F., & Richarme, M. (2009). The Honesty of Online Survey Respondents: Lessons
Learned and Prescriptive Remedies. Retrieved from
Schendera, C. F. G. (2007). Datenqualität mit SPSS. München et al.: Oldenbourg.
Schneider, K. C. (1985). Uninformed response rates in survey research: New evidence.
Journal of Business Research, 13(2), 153–162.
Schonlau, M., & Toepoel, V. (2015). Straightlining in Web survey panels over time. Survey
Research Methods, 9(2).
Schuman, H., & Presser, S. (1980). Public Opinion and Public Ignorance: The Fine Line
between Attitudes and Nonattitudes. American Journal of Sociology, 85(5), 1214–
Selm, M. van, & Jankowski, N. W. (2006). Conducting Online Surveys. Quality & Quantity,
40(3), 435–456.
Shin, E., Johnson, T. P., & Rao, K. (2012). Survey Mode Effects on Data Quality:
Comparison of Web and Mail Modes in a U.S. National Panel Survey. Social Science
Computer Review, 30(2), 212–228.
Sniderman, P. M., & Bullock, J. (2004). A Consistency Theory of Public Opinion and
Political Choice: The Hypothesis of Menu Dependance. In W. E. Saris & P. M.
Sniderman (Eds.), Studies in public opinion. Attitudes, nonattitudes, measurement
error, and change (pp. 337–357). Princeton: Princeton Univ. Press.
Stieger, S., & Reips, U.-D. (2010). What are participants doing while filling in an online
questionnaire: A paradata collection tool and an empirical study. Computers in Human
Behavior, 26(6), 1488–1495.
Sue, V. M., & Ritter, L. A. (2012). Conducting online surveys (2nd ed.). Los Angeles: Sage.
van Vaerenbergh, Y., & Thomas, T. D. (2012). Response Styles in Survey Research: A
Literature Review of Antecedents, Consequences, and Remedies. International
Journal of Public Opinion Research, 25(2), 195–217.
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data
consumers. Journal of Management Information Systems, 12(4), 5–34. Retrieved from
Woods, C. M. (2006). Careless Responding to Reverse-Worded Items: Implications for
Confirmatory Factor Analysis. Journal of Psychopathology and Behavioral
Assessment, 28(3), 186–191.
Yan, T., & Tourangeau, R. (2008). Fast times and easy questions: the effects of age,
experience and question complexity on web survey response times. Applied Cognitive
Psychology, 22(1), 51–68.
... To ensure high data quality that allows valid inferences, it is recommended to exclude low-quality data prior to the analysis 57,58 . We followed recommendations to identify meaningless data by inspecting implausible values and response patterns, such as straight-lining 58 . We aimed to be as liberal as possible. ...
... For example, we excluded any wave where the sum of all daily use was above 64 hours. Applying all exclusion criteria removed about 10% of observations and 9% of participantsboth relatively low estimates of meaningless data 57,58 . Of the 2,159 remaining participants with data for at least three waves, 52% identified as women (Mage = 47, SDage = 15). ...
Full-text available
It is often assumed that traditional forms of media such as books enhance well-being, whereas new media do not. However, we lack evidence for such claims and media research is mainly focused on how much time people spend with a medium, but not whether someone used a medium or not. We explored the effect of media use during one week on well-being at the end of the week, differentiating time spent with a medium and use versus nonuse, over a wide range of different media types: music, TV, films, video games, (e-)books, (digital) magazines, and audiobooks. Results from a six-week longitudinal study representative of the UK population 16 years and older (N = 2159) showed that effects were generally small; between-person relations but rarely within-person effects; mostly for use versus nonuse and not time spent with a medium; and on affective well-being, not life satisfaction.
... In the first step, respondents, who showed no variance in questions with Likert scales [147], p. 210) or who gave "don't know" answers too frequently were discarded. As research results on speeding as a reliable indicator for low response quality are ambiguous (e.g., [146,[148][149][150]), we combined the time required for the survey with another indicator of poor response behavior [151,149] in step two. Following Orme [152], we compared individuals' root likelihood (RLH) estimates with estimates from a simulated sample with random responses. ...
... In the first step, respondents, who showed no variance in questions with Likert scales [147], p. 210) or who gave "don't know" answers too frequently were discarded. As research results on speeding as a reliable indicator for low response quality are ambiguous (e.g., [146,[148][149][150]), we combined the time required for the survey with another indicator of poor response behavior [151,149] in step two. Following Orme [152], we compared individuals' root likelihood (RLH) estimates with estimates from a simulated sample with random responses. ...
Electricity from renewable and/or nuclear sources (“carbon–neutral electricity”) can be used for charging plug-in electric vehicles in order to significantly reduce their lifecycle carbon emissions. However, little is known about how carbon–neutral charging services (CNCS) for electric vehicles must be designed to attract consumers’ interest. Therefore, we conducted a survey including an Adaptive Choice-based Conjoint Analysis to investigate private consumers’ awareness and preferences for different service attributes of CNCS, specifically the energy source, regionality of generation, additionality, balancing period, charging flexibility, and price. The online survey in Germany of 510 private consumers who were interested in electric vehicles shows that awareness is highest for the energy source and regionality. Behind price, these are also the second and third most important attributes when choosing a CNCS. Even though, on average, there seems to be a link between the awareness for an attribute, its importance and consumers’ willingness to pay, this link proves to be non-significant for individual respondents. In conclusion, providers of CNCS could prominently advertise the attributes consumers are most aware of (i.e., the energy source and regionality of generation). Explaining the less-known, and therefore deemed unimportant, attributes (i.e., additionality, balancing period, and charging flexibility) to consumers could be difficult and costly for providers. From a system perspective, however, more convincing explanation efforts could lead to an increased use of renewables and enhanced system reliability. This raises the challenge for policymakers to create a framework which is both advantageous for the energy system and attractive to consumers. Government-set standards for charging tariffs may be a viable option.
... In the end, 568 people started the online questionnaire, 275 (48.4% of those who started) completed the entire questionnaire, and 279 completed all tests and questions except for the Headphones and Loudspeaker Test (HALT; Wycisk et al., 2022;Wycisk et al., 2018). As recommended by Leiner (2019), participants with a high relative speed index (1.75 or greater) were excluded. That was the case for one person. ...
According to Feldman (1993), musical prodigies are expected to perform at the same high level as professional adult musicians and, therefore, are indistinguishable from adults. This widespread definition was the basis for the study by Comeau et al. (2017), which investigated if participants could determine whether an audio sample was played by a professional pianist or a child prodigy. Our paper is a replication of this previous study under more controlled conditions. Our main findings partly confirmed the previous findings: Comparable to Comeau et al.’s (2017) study (N = 51), the participants in our study (N = 278) were able to discriminate between prodigies and adult professionals by listening to music recordings of the same pieces. The overall discrimination performance was slightly above chance (correct responses: 53.7%; sensitivity d’ = 0.20), which was similar to Comeau et al.’s (2017) results of the identification task with prodigies aged between 11 and 14 years (approximately 54.6% correct responses; sensitivity approximately d’ = 0.13). Contrary to the original study, musicians and pianists in our study did not perform significantly better than other participants. Nevertheless, it is generally possible for listeners to differentiate prodigies from adult performers—although this is a demanding task.
... Also, despite rigorous post-hoc checks of the data on a case-by-case basis, further procedures could have been put in place during the data collection to verify the quality of the responses. Such procedures include adding trap items, instructional manipulation checks, collecting more metadata (browser information in addition to IPs and geolocation data), collecting paradata (timing, mouse clicks), implementing URL control (creating individual links, utilizing cookies) and using response pattern detection algorithms (Leiner, 2013; Oppenheimer et al., 2009; Teitcher et al., 2015). ...
Developmental crisis is a construct that is central to many theories of psychosocial adult development, yet there is currently no validated psychometric measure of adult developmental crisis that can be used across adult age groups. To address this gap in the literature, we developed and validated an age-independent measure of adult developmental crisis for research and applied purposes, entitled the Developmental Crisis Questionnaire (DCQ-12). Exploratory and confirmatory factor analyses were conducted separately on different samples. A three-factor structure emerged as the best fit with the data: (1) Disconnection and Distress; (2) Lack of Clarity and Control and (3) Transition and Turning Point. The DCQ-12 showed predictive validity with measures of self-esteem, locus of control, authentic living, optimism, presence of and search for meaning, turning points and a related crisis measure. Four-week test–retest reliability ranged from 0.78 to 0.89 across subscales. As well as research uses, the DCQ-12 measure has potential application in practice, given that assessment of developmental crisis has relevance to professionals working in clinical and non-clinical roles to support and coach adults through periods of transition.
... These included control questions on the content of the information provided, as well as the processing time of the questionnaire. Subjects who completed the study in an implausibly short time (i.e., more than two times faster than the average respondent, as proposed by Leiner (2013)) were excluded. ...
Background To slow down the spread of COVID-19, the observance of basic hygiene measures, and physical distancing is recommended. Initial findings suggest that physical distancing in particular can prevent the spread of COVID-19. Objectives To investigate how information to prevent the spread of infectious diseases should be presented to increase willingness to comply with preventive measures. Methods In a preregistered online experiment, 817 subjects were presented with either interactively controllable graphics on the spread of COVID-19 and information that enable them to recognize how much the spread of COVID-19 is reduced by physical distancing (experimental group) or text-based information about quantitative evidence (control group). It was hypothesized that participants receiving interactive information on the prevention of COVID-19 infections show a significantly higher willingness to comply with future containment measures than participants reading the text-based information. Explorative analyses were conducted to examine whether other factors influence compliance. Results As predicted, we found a small effect (d = 0.22, 95% CI: 0.11; 0.23, p < .001) for the tested intervention. The exploratory analysis suggests a decline in compliance later in the study (r = −0.10, 95% CI: −0.15; −0.07). Another significant predictor of change in compliance was health-related anxiety, but the effect was trivial. Conclusions When presented interactively, information on how the own behavior can help prevent infectious diseases can lead to slightly stronger changes in attitude towards behavioral prevention measures than just text-based information. Given the scalability of this simple internet-based intervention, it could play a role in fostering compliance during a pandemic within universal prevention strategies. Future work on the predictive validity of self-reported compliance and the real-world effects on the intervention is needed.
... Straightlining, where respondents repeatedly select the same answer for matrix-style questions, is common in online surveys [89,90]. Therefore, returned questionnaires in which more than 90% of the 16 TAM-based questionnaire items shared the same answer were filtered out and excluded from the analysis, aiming to remove possible careless respondents. ...
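The filter described in this snippet can be sketched as follows. This is an illustrative reconstruction, not code from the cited study; the function name and the default share are ours (the 0.9 default mirrors the ">90% of 16 items" rule above).

```python
from collections import Counter

def is_straightliner(answers, share=0.9):
    """Flag a respondent whose most frequent answer accounts for
    more than `share` of the matrix items (e.g., >90% of 16 items)."""
    if not answers:
        return False
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) > share
```

With 16 items, a respondent giving the same answer on 15 of them (15/16 ≈ 0.94) would be flagged, while one giving it on 14 (14/16 = 0.875) would not.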
VR technology has demonstrated great potential as a training platform for construction training & education (T&E). Although its effectiveness has been widely validated, low acceptance and usage are still observed, reflecting a lack of understanding of users' attitudes towards using VR technology. Therefore, this research aims to investigate the factors behind the low acceptance of VR technology through an extended TAM. Perceived price value, self-efficacy, and perceived playfulness were adopted as external motivators integrated with the TAM framework. Results revealed that perceived usefulness (PU) and perceived ease of use (PEU) are significant direct predictors of attitude. Moreover, PU does not directly affect intention to use (IU), but a significant influence of perceived playfulness on IU was observed. These findings provide theoretical support for predicting users' acceptance of VR technology in construction T&E and empirical implications for guiding decisions in the design and development of VR systems.
... Resilience questionnaires were filled out during an informational preparation session for the fMRI measurements and administered online using Unipark software (EFS Survey, Questback GmbH). As completion time of online questionnaires has been identified as the most reliable indicator of data being meaningful or meaningless (Leiner, 2013), we evaluated the quality of questionnaire data using a quality index provided by the Unipark system that compares the completion time of each participant with the average completion time of our sample. As preregistered ( ...
This study aimed at replicating a previously reported negative correlation between node flexibility and psychological resilience, i.e., the ability to retain mental health in the face of stress and adversity. To this end, we used multiband resting-state BOLD fMRI (TR = .675 sec) from 52 participants who had filled out three psychological questionnaires assessing resilience. Time-resolved functional connectivity was calculated by performing a sliding window approach on averaged time series parcellated according to different established atlases. Multilayer modularity detection was performed to track network reconfigurations over time, and node flexibility was calculated as the number of times a node changes community assignment. In addition, node promiscuity (the fraction of communities a node participates in) and node degree (as a proxy for time-varying connectivity) were calculated to extend previous work. We found no substantial correlations between resilience and node flexibility. We observed a small number of correlations between the two other brain measures and resilience scores, which were, however, very inconsistently distributed across brain measures, differences in temporal sampling, and parcellation schemes. This heterogeneity calls into question the existence of previously postulated associations between resilience and brain network flexibility and highlights how results may be influenced by specific analysis choices.
In this study, we investigate how the impact of an influencer's attractiveness and gender on receivers' reactions depends on the users' own attractiveness and gender. In social media, these variables may play different roles for individuals in varying contexts. To analyse these issues, a survey including 374 observations was conducted and analysed through structural equation modelling in SmartPLS. The results of our quantitative investigation were partially counter-intuitive. In most cases, a highly attractive influencer is more advantageous than an influencer of low attractiveness. More surprisingly, for male fashion, a female influencer appears to be more advantageous. Explanations are provided, and implications for practitioners and influencers are proposed based on the findings.
To successfully introduce blockchain-enabled booking platforms in the tourism and hospitality industry, providers need to understand their target audiences. We present the results of a survey of 505 US consumers who, in a simulated hotel booking scenario for a leisure trip, picked between traditional Online Travel Agencies (OTA) and a blockchain-enabled booking app with varying degrees of services, discounts, and brand recognition. We find that blockchain-enabled booking apps that meet the following three conditions could attract up to half of the market: (1) offer discounts over OTAs, (2) provide services which go beyond mere booking, and (3) have well-known brand names. In a series of three nested logistic regressions, we investigate the impact of demographic, psychographic, and service-related traveler characteristics. We find that early adopters of blockchain-enabled hotel booking platforms will be young and highly educated. Potential cost savings over OTAs will also attract travelers with lower incomes and from larger households. Other traveler characteristics that facilitate adoption include a high preparedness to take risks, high IT innovativeness, prior familiarity with blockchain technology, and, mediated through IT innovativeness, a high Generalized Sense of Power. Male travelers are more likely than female travelers to be early adopters due to their higher familiarity with blockchain technology.
The Internet is a new medium of communication, and as such it may create new types of social relationships, communication styles, and social behaviors. Social psychology may contribute to understanding characteristics and dynamics of Internet use. There are now several reviews of the psychology of the Internet, a topic that will not be treated in this chapter (see Joinson, 2002; McKenna & Bargh, 2000; Wallace, 2001). Instead, this chapter reviews the critical methodological and ethical issues in this new approach to psychological research. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Response styles are a source of contamination in questionnaire ratings, and therefore they threaten the validity of conclusions drawn from marketing research data. In this article, the authors examine five forms of stylistic responding (acquiescence and disacquiescence response styles, extreme response style/response range, midpoint responding, and noncontingent responding) and discuss their biasing effects on scale scores and correlations between scales. Using data from large, representative samples of consumers from 11 countries of the European Union, the authors find systematic effects of response styles on scale scores as a function of two scale characteristics (the proportion of reverse-scored items and the extent of deviation of the scale mean from the midpoint of the response scale) and show that correlations between scales can be biased upward or downward depending on the correlation between the response style components. In combination with the apparent lack of concern with response styles evidenced in a secondary analysis of commonly used marketing scales, these findings suggest that marketing researchers should pay greater attention to the phenomenon of stylistic responding when constructing and using measurement instruments.
I devoted my inaugural lecture at the University of Mainz in the fall of 1965 to the topic "Public Opinion and Social Control." Its starting point was the confusion over what public opinion actually is, which Hermann Oncken had already expressed in 1904: "What wavers and flows is not grasped by confining it to a formula... In the end, everyone, when asked, will know exactly what public opinion means" (Oncken 1914, p. 236, cf. pp. 224f.). Jürgen Habermas complained in 1962: "not only everyday language... holds on to it; the sciences as well, above all jurisprudence, political science, and sociology, are evidently incapable of replacing traditional categories such as... 'public opinion' with more precise definitions" (Habermas 1962, p. 13). Harwood L. Childs, professor of political science at Princeton, listed fifty definitions of "public opinion" in one of the introductory chapters of his 1965 book "Public Opinion" (Childs 1965). I was convinced that matters could not be left there: "How can we investigate what influence the mass media have on the formation of public opinion, and how, conversely, public opinion is expressed in journalism, if we are so completely in the dark about public opinion itself?" (Noelle 1966, pp. 4f.).
Despite the widespread use of task inventories in job analysis, little is known about the validity of the obtained task ratings. One approach for examining the validity of such ratings is the use of a “false reporting” index to identify invalid responding. The purpose of this field experiment was to examine the effects of the type of frequency rating scale and method of task inventory administration on the degree of false reporting in task inventory ratings. A total of 177 Correctional Officers from a state correctional system responded to a 68 item task inventory using frequency and importance rating scales. Five of the items in the task inventory were bogus tasks not performed by the target job and formed a false reporting index. In a 2 × 2 design, the type of frequency rating scale (Relative-Time-Spent vs. Actual-Time-Spent) and method of task inventory administration (anonymous vs. identified) were manipulated. Analysis of variance results showed a significantly greater degree of false reporting in Relative-Time-Spent ratings. No significant differences in false reporting were found for method of task inventory administration or scale × method interactions. Overall, 45% of respondents indicated that they performed tasks that were not part of the job, which raises concerns about whether job incumbents are capable of providing accurate and complete task rating data.
The analytic use of paradata offers an additional tool in the survey researcher's toolbox to study survey errors and survey costs. Paradata capture information about the data collection process at a more micro level. Paradata that record the minutes needed to interview each respondent, or even the seconds it took to administer a single question within the survey, aggregate into metadata on the average time it took to administer the survey. Paradata are not the only source of additional data used in survey research to enrich final datasets and estimates. Researchers also use what they call 'auxiliary data'. The term 'auxiliary data' is used to encompass all data outside of the actual survey data itself, which makes all paradata auxiliary data as well. Paradata are available during data collection and can be used to monitor and inform the collection process in (almost) real time.
Straightlining, an indicator of satisficing, refers to giving the same answer in a series of questions arranged on a grid. We investigated whether straightlining changes with respondents’ panel experience in the LISS panel in the Netherlands. Specifically, we considered straightlining on 10 grid questions in LISS core modules (7 waves) and on a grid of evaluation questions in the LISS panel (150+ waves). For both core modules and evaluation questions we found that straightlining increases with respondents’ panel experience for at least three years. Straightlining is also associated with younger age. Where straightlining corresponded to a plausible set of answers, prevalence of straightlining was much larger (15-40%) than where straightlining was implausible (<2% in wave 1).
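A per-respondent straightlining rate as studied in panel waves like the one above can be sketched in a few lines. This is an illustrative measure under our own naming, not code from the cited study: each grid counts as straightlined when every item in it received the identical answer.

```python
def straightlining_rate(grids):
    """Share of grid questions a respondent straightlined,
    i.e., answered with one identical value on every item.

    `grids` is a list of answer lists, one per grid question."""
    straight = sum(1 for grid in grids if len(set(grid)) == 1)
    return straight / len(grids)
```

For a respondent who answered three grids as [1, 1, 1], [2, 3, 2], and [5, 5, 5, 5], the rate is 2/3. Whether such a rate indicates satisficing still depends on plausibility: as the snippet above notes, straightlining is far more prevalent where identical answers form a plausible response pattern.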
Exclusively combining design and sampling issues, Handbook of Web Surveys presents a theoretical yet practical approach to creating and conducting web surveys. From the history of web surveys to various modes of data collection to tips for detecting error, this book thoroughly introduces readers to this cutting-edge technique and offers tips for creating successful web surveys. The authors provide a history of web surveys and go on to explore the advantages and disadvantages of this mode of data collection. Common challenges involving under-coverage, self-selection, and measurement errors are discussed, as well as topics including: Sampling designs and estimation procedures Comparing web surveys to face-to-face, telephone, and mail surveys Errors in web surveys Mixed-mode surveys Weighting techniques including post-stratification, generalized regression estimation, and raking ratio estimation Use of propensity scores to correct bias Web panels Real-world examples illustrate the discussed concepts, methods, and techniques, with related data freely available on the book's Website. Handbook of Web Surveys is an essential reference for researchers in the fields of government, business, economics, and the social sciences who utilize technology to gather, analyze, and draw results from data. It is also a suitable supplement for survey methods courses at the upper-undergraduate and graduate levels.
Web surveys can suffer from their nonrandom nature (coverage error) and low response rate (nonresponse error). Therefore, web surveys should be supported by mail surveys to mitigate these problems. However, using different survey methods together may introduce another problem: the mode effect. This experimental study investigated the mode effect between two survey modes. A randomly selected group of 1,500 teachers was assigned to two experimental groups, one of which received mail surveys, while the other received web surveys. Nonrespondents in both groups were followed up with the opposite mode. Overall, results show that there is no mode effect between mail surveys and web surveys on the psychometric quality of the rating scales and data quality (item nonresponse rate) of the survey, except regarding respondents' age and unit-response rate. Our findings indicate that web surveys had a substantially lower unit-response rate than mail surveys and that web survey respondents are significantly younger than mail survey respondents.