Editorial
The World Beyond Rating Scales
Why We Should Think More Carefully About the Response
Format in Questionnaires
Eunike Wetzel¹ and Samuel Greiff²
¹ Department of Psychology, University of Mannheim, Germany
² Cognitive Science and Assessment, University of Luxembourg, Luxembourg
Researchers constructing a new questionnaire think very
carefully about a lot of things: the construct definition,
the target population, the wording of the items, the item
selection, and so on. These are all explicit steps in the test
construction process as it is described by textbooks and
research articles (Clark & Watson, 1995; Simms, 2008;
Ziegler, 2014). One aspect that appears to receive less
attention is the choice of response format. When the ques-
tionnaire is a self-report (or other-report) measure of one or
more psychological constructs, test constructors appear to
automatically implement a rating scale such as strongly dis-
agree to strongly agree as the response format. The reason
for this is that rating scales have served us well in the past
and continue to do so.
However, in this editorial we will argue that choosing the
response format should be an explicit step in the test con-
struction process that deserves special attention and consid-
erable thought. In fact, the response format should be
chosen to fit the construct best. We will also argue that
we need a greater diversity of response formats and more
research on them.
Rating Scales: The Default Response
Format in Questionnaires
The most common response format in self-report or other-
report questionnaires assessing personality traits, interests,
motivations, or other psychological constructs is the rating
scale response format. With rating scales, each item is pre-
sented individually and respondents rate their endorsement
of the item on a scale with multiple response categories (see
example in Figure 1). Common rating scales include rating
scales on agreement (strongly disagree to strongly agree),
rating scales on degree or extent (not at all to very much),
and frequency scales (never to always). There are a number
of issues to consider when deciding which type of rating
scale to use (for a comprehensive list see Saris & Gallhofer,
2014): Should it be unipolar or bipolar? How many response
categories should the rating scale have? Should there be a
middle (neutral) category? How should the categories be
labeled (numerically, verbally, with symbols, or combina-
tions of these)? Are the categories equidistant? Should
there be a separate NA/do not want to respond category?
There is a large amount of research addressing these ques-
tions (e.g., Hernández, Drasgow, & González-Romá, 2004;
Krosnick & Berent, 1993; Revilla, Saris, & Krosnick, 2014;
Schwarz, Knauper, Hippler, Noelle-Neumann, & Clark,
1991). This research overwhelmingly shows that choices
on these issues matter. For example, Schwarz et al. (1991)
found that average ratings differed between the presentation of an 11-point rating scale with values from 0 to 10 versus values from –5 to +5 with identical verbal endpoint labels. Krosnick (1999), Clark and Watson (1995),
and DeCastellarnau (2017) review some of the research
and provide specific recommendations on how to deal with
these matters.
Despite the consistent finding that choices on how the
rating scale is set up do matter, overall rating scales work
well. They allow reliable and valid assessments of a great
diversity of psychological constructs and many people find
them easy to use. Nevertheless, they also have a number of
important drawbacks:
(1) They are susceptible to response biases such as
response styles and socially desirable responding.
This can have detrimental consequences such as distorting correlations between traits (Moors, 2012) or be rather inconsequential, especially when the trait and response style are unrelated or weakly correlated
(Plieninger, 2017; Wetzel, Böhnke, & Rose, 2016).
Either way, from a measurement perspective, any
additional influence on item responses that is not
the construct of interest is problematic.¹
(2) There are interindividual differences in the interpre-
tation of response category labels. For example, going
out “often” might mean once a month to one person
and three times a week to another person.
(3) Different subgroups (high vs. low education level, dif-
ferent cultures) use rating scales differently (Johnson,
Kulesa, Cho, & Shavitt, 2005; Rammstedt & Farmer,
2013).
(4) Especially with long questionnaires, rating scales may be tiresome and invite careless responding (Meade & Craig, 2012); a simple descriptive screen for such response patterns is sketched below.
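Both response styles (point 1) and careless responding (point 4) can be screened descriptively before any modeling. The following is only a minimal sketch under our own assumptions, not a method taken from this editorial: the 1–5 response coding, the example data, and the function name are illustrative.

```python
# Minimal descriptive screen for response styles and careless responding.
# Assumptions (not from the editorial): responses are coded 1-5 on a
# 5-point rating scale and stored row-wise, one row per respondent.

import numpy as np

def response_pattern_indices(responses, low=1, high=5):
    """Return simple per-respondent indices: proportion of extreme
    categories, proportion of midpoint use, and the longest run of
    identical consecutive answers (a rough careless-responding flag)."""
    responses = np.asarray(responses)
    extreme = np.isin(responses, [low, high]).mean(axis=1)
    midpoint = (responses == (low + high) / 2).mean(axis=1)

    longest_runs = []
    for row in responses:
        run, best = 1, 1
        for prev, cur in zip(row[:-1], row[1:]):
            run = run + 1 if cur == prev else 1
            best = max(best, run)
        longest_runs.append(best)

    return extreme, midpoint, np.array(longest_runs)

# Illustrative data: two respondents, ten 5-point items each.
data = [[5, 5, 1, 5, 1, 5, 5, 1, 5, 5],   # mostly extreme categories
        [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]   # one long identical run
ext, mid, runs = response_pattern_indices(data)
print(ext, mid, runs)
```

Such descriptive flags are, of course, much cruder than the model-based treatments of response styles discussed in the literature cited above; they merely illustrate how easily these patterns can be made visible in rating scale data.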
Variations of Rating Scales
The simplest response format consists of just two response
options such as true and false or yes and no. The rationale
behind extending this dichotomous format to a rating scale
with more than two response categories was that more
information could be obtained by allowing greater differen-
tiation between individual responses (Masters, 1988).
A number of variations or attempted advancements of
rating scales have been suggested, sometimes explicitly to
overcome one or more of the disadvantages of existing
rating scales. In some questionnaires, rather than using
the same rating scale for all items, variations of the item
stem that represent different degrees of the trait are used.
For example, one item in the Beck Depression Inventory-II
(BDI-II; Beck, Steer, & Brown, 1996) consists of the response options 0 (= I do not feel sad), 1 (= I feel sad much of the time), 2 (= I am sad all the time), and 3 (= I am so sad or unhappy that I can’t stand it). This format might be able
to reduce some problems with regular rating scales such
as reducing the ambiguity of rating scale labels, but it
is challenging to construct because specific behaviors
need to be found that capture different trait levels while
being exhaustive at the same time. In the example above,
there is no response option for people who feel sad
occasionally.
Another example of an attempt to improve rating scales
is Visual Analogue Scales (VAS; Hayes & Patterson, 1921),
which are often presented as a slider scale ranging from
0 to 100. The idea behind VAS is that, since the underlying
construct is continuous, its measurement should also be
continuous. However, trying to make the measurement
continuous is unnecessary because methods exist for trans-
forming a discrete (e.g., rating scale) measurement onto a
continuous trait level scale (item response models, aggre-
gating across items). In addition, VAS assume that
participants can actually make such fine-grained differenti-
ations. However, when rating scales with many (e.g., 100)
numbered categories are presented to people, they tend
to choose ones that are divisible by 10 (or 5), indicating that
Figure 1. Example for the rating scale response format with items from the Big Five Triplets (Wetzel & Frick, 2017).
¹ For more information on response biases see a previous editorial by Ziegler (2015) and an overview chapter by Wetzel, Böhnke, and Brown (2016).
the potentially overtaxing differentiation is broken down
into coarser segments by participants and not all available
options are actually used (Henss, 1989). In line with this,
research shows that adding more response categories
beyond around seven does not improve measurement
notably (Preston & Colman, 2000). VAS thus offer a differ-
entiation that is more fine-grained than participants’ judg-
ments. In consequence, these additional gradations are
meaningless and should not be interpreted.
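To illustrate the point made above, that a continuous trait scale does not require a continuous response format, here is a minimal scoring sketch that aggregates discrete rating scale responses across items. The data, the item keying, and the function are hypothetical assumptions of ours; an item response model (e.g., a graded response model) would be the more principled way to obtain continuous trait estimates.

```python
# Minimal sketch: mapping discrete 5-point ratings onto a continuous
# scale score by aggregating across items. The keying and data are
# hypothetical; an item response model would yield model-based
# continuous trait estimates instead of this simple average.

import numpy as np

def scale_score(responses, reverse_keyed, low=1, high=5):
    """Average across items after reverse-scoring negatively keyed items.

    responses     : (n_persons, n_items) array of category codes
    reverse_keyed : boolean sequence of length n_items
    """
    responses = np.asarray(responses, dtype=float)
    rev = np.asarray(reverse_keyed)
    responses[:, rev] = (low + high) - responses[:, rev]
    return responses.mean(axis=1)          # continuous score in [low, high]

# Three hypothetical respondents, four items, item 4 reverse-keyed.
X = [[4, 5, 3, 2],
     [2, 1, 2, 5],
     [3, 3, 3, 3]]
print(scale_score(X, reverse_keyed=[False, False, False, True]))
# -> [4.0, 1.5, 3.0]
```

The resulting scores are already continuous for all practical purposes, which is why the added granularity of a VAS slider buys little at the level of the trait estimate.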
In sum, modifications of rating scales have not been able
to eliminate the problems inherent to rating scales (e.g.,
interindividual differences in using the rating scale). Thus,
replacing rating scales with alternative response formats
might be a more sensible course of action than trying to
improve rating scales.
Alternatives to Rating Scales
There appear to be few viable alternative response formats
that are not just variations of the rating scale format. In the
1960s, Behaviorally Anchored Rating Scales (BARS) were
proposed (Smith & Kendall, 1963). With BARS, the response
options consist of behavioral anchors that represent differ-
ent trait levels (see example in Figure 2). Ideally, one or
two BARS would be sufficient for assessing a trait. BARS
are used in particular for performance ratings in industrial
and organizational psychology (Grote, 1996), but they are
not widespread because their psychometric properties
(e.g., reliability) were not superior to other assessment
methods and their construction is complex and costly
(Bernardin & Smith, 1981; Schwab, Heneman, & DeCotiis,
1975). In addition to this, finding specific behaviors that
adequately represent moderate trait levels appears to be
particularly challenging (Hauenstein, Brown, & Sinclair,
2010). However, with more sophisticated (e.g., item
response theory) methods, it might be possible to construct
BARS that provide reliable and valid assessments while
simultaneously being efficient for test users. It may then
be feasible to apply BARS for the assessment of other con-
structs such as personality traits. Attempts have been made
to assess personality with concrete behavioral indicators (for
an example with conscientiousness see Jackson et al., 2010)
and these could be transformed into anchors on a BARS.
Another alternative to rating scales is the multidimen-
sional forced-choice (MFC) format, which has been around
for a long time, but has recently gained traction with the
development of item response models that allow obtaining
normative trait estimates (as opposed to ipsative trait esti-
mates) from MFC data (Brown & Maydeu-Olivares, 2011,
2013). In the MFC format, several items are presented
simultaneously to respondents in an item block and they
have to rank the items according to their preference (e.g.,
with activities or products) or according to how well they
describe the respondent (e.g., with personality items; see
example in Figure 3).²
There are several variations of the
MFC format that depend on the size of the item block
(pairs, triplets, quads, and so forth) and the instruction to
participants (full ranking vs. partial ranking). The process
of responding to MFC item blocks differs from the response
process to rating scale items in that the items within the
block have to be weighed against each other (Sass, Frick,
Reips, & Wetzel, in press). Despite the potentially higher
cognitive effort involved in responding to MFC item blocks,
test motivation does not appear to differ between the MFC
and rating scale format (Sass et al., in press). The MFC
format eliminates response styles such as extreme response
style. Other response biases such as faking can still occur to
some extent, though the ranking task puts a limit on the amount of faking that can occur because not all traits can be faked at the same time and equally strongly (in the
triplet in Figure 3, a participant would have to decide
whether to focus on faking extraversion, emotional stabil-
ity, or conscientiousness by placing it on rank 1). Studies
have shown that the MFC format is less susceptible to
faking than the rating scale format (Christiansen, Burns,
& Montgomery, 2005). Constructing an MFC question-
naire, however, is more complex than constructing a rating
scale questionnaire because additional considerations play a role, such as which items are presented together in a block, and more research is needed on these basic test
construction issues.
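To make the modeling route mentioned above more concrete, the sketch below shows the typical recoding step for MFC data, in which the ranking of the items within a block is rewritten as binary outcomes of all pairwise comparisons, the representation used in the Thurstonian IRT approach of Brown and Maydeu-Olivares (2011). The item labels, the example ranking, and the helper function are our own illustrative assumptions.

```python
# Minimal sketch of the recoding step used when modeling MFC data with
# Thurstonian IRT: the ranking of the items in a block is rewritten as
# binary outcomes of all pairwise comparisons. Labels and the example
# ranking are hypothetical.

from itertools import combinations

def ranking_to_pairwise(ranking):
    """ranking maps item label -> rank (1 = most like me).
    Returns {(item_a, item_b): 1 if item_a is ranked above item_b, else 0}."""
    outcomes = {}
    for a, b in combinations(sorted(ranking), 2):
        outcomes[(a, b)] = int(ranking[a] < ranking[b])
    return outcomes

# A triplet block with one item per trait (labels are illustrative).
block_ranking = {"extraversion": 1, "emotional_stability": 3, "conscientiousness": 2}
print(ranking_to_pairwise(block_ranking))
```

A triplet thus yields three binary comparisons, and a block of n items yields n(n-1)/2; these binary outcomes are what a model of the Brown and Maydeu-Olivares type is then fitted to in order to recover normative trait estimates.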
Figure 2. Example for a behaviorally anchored rating scale assessing the construct orderliness.
² The MFC format therefore is both an item format and a response format at the same time.
Concluding Remarks
When writing this editorial, we realized that there are
surprisingly few real alternatives to the rating scale format.
In contrast, in achievement testing, there are numerous
creative assessment methods such as computer-simulated
microworlds (Wüstenberg, Greiff, & Funke, 2012) or innovatively constructed response formats (Thissen, Koch,
Becker, & Spinath, 2016) that complement classical open-
ended questions and multiple-choice formats.
In a way, one could ask, why is assessing noncognitive
traits so boring and why do we always use rating scales?
One reason appears to be that rating scales overall do the
job they are supposed to do, despite their disadvantages,
especially when we are only looking at homogeneous student
samples from one country. With more heterogeneous
samples and in particular in cross-cultural research, this
might not be the case. Another reason for the popularity
of rating scales might be that they are very convenient
and easy to construct. Thus, we as test constructors have
not found it necessary to look beyond them. In addition,
some of the attempts to introduce alternatives have run into
severe problems (see BARS or formerly the ipsativity of
MFC trait scores). However, some of these problems can
be solved with more sophisticated methods such as item
response theory as in the case of MFC data.
This editorial is not a call to stop using rating scales, but
we believe that the choice of response format should be an
explicit step in the process of constructing a questionnaire.
Therefore, we would like to encourage researchers and test
constructors to explicitly consider alternatives to rating
scales instead of automatically using a rating scale and to
make a well-thought-out decision that is based on weighing
the pros and cons. Some feasible alternatives such as the
MFC format already exist, but more are needed. We
encourage research on response formats, both research
on existing response formats and especially research
exploring alternative response formats. Right now, many
researchers including the authors of this editorial assume
that the default of rating scales is appropriate for virtually
any construct we assess with self-report questionnaires
and the only aspect that needs to be considered is the
specifics of the rating scale (e.g., number of response cate-
gories). However, this may not be the case. Different
response formats may be appropriate for different con-
structs, as in achievement testing. For example, there are
a number of constructs that often show low response vari-
ability when assessed with rating scales (test motivation,
self-esteem). In these cases, a different response format
(perhaps with specific behavioral anchors) might be more
informative. In psychological assessment, it is our goal to measure the constructs we are interested in as well as we can and to draw valid inferences regarding people’s trait levels from our questionnaire results. Thus, we need to make sure we use the response format that will allow us to achieve these goals. Test constructors think so carefully
about many details of the questionnaire they are construct-
ing (e.g., the wording of individual items or which items to
select for the final version). They should think equally care-
fully about which response format to use.
Acknowledgment
The authors thank Susanne Frick for her comments on a
draft of this editorial.
References
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression
Inventory-II. San Antonio, TX: The Psychological Corporation.
Bernardin, H. J., & Smith, P. C. (1981). A clarification of
some issues regarding the development and use of behav-
iorally anchored rating-scales (BARS). Journal of Applied
Psychology, 66, 458–463. https://doi.org/10.1037/0021-9010.
66.4.458
Brown, A., & Maydeu-Olivares, A. (2011). Item response
modeling of forced-choice questionnaires. Educational and
Psychological Measurement, 71, 460–502. https://doi.org/
10.1177/0013164410375112
Brown, A., & Maydeu-Olivares, A. (2013). How IRT can solve
problems of ipsative data in forced-choice questionnaires.
Psychological Methods, 18, 36–52. https://doi.org/10.1037/
a0030641
Christiansen, N. D., Burns, G. N., & Montgomery, G. E. (2005).
Reconsidering forced-choice item formats for applicant per-
sonality assessment. Human Performance, 18, 267–307.
https://doi.org/10.1207/s15327043hup1803_4
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic
issues in objective scale development. Psychological Assess-
ment, 7, 309–319. https://doi.org/10.1037/1040-3590.7.3.309
DeCastellarnau, A. (2017). A classification of response scale
characteristics that affect data quality: A literature review.
Quality & Quantity. Advance online publication. https://doi.org/
10.1007/s11135-017-0533-4
Figure 3. Example for the multidimensional forced-choice format from the Big Five Triplets (Wetzel & Frick, 2017).
Grote, D. (1996). The complete guide to performance appraisal.
New York, NY: American Management Association.
Hauenstein, N. M. A., Brown, R. D., & Sinclair, A. L. (2010). BARS
and those mysterious, missing middle anchors. Journal of
Business and Psychology, 25, 663–672. https://doi.org/
10.1007/s10869-010-9180-7
Hayes, M. H. S., & Patterson, D. G. (1921). Experimental develop-
ment of the graphic rating method. Psychological Bulletin, 18,
98–99.
Henss, R. (1989). Zur Vergleichbarkeit von Ratingskalen unter-
schiedlicher Kategorienzahl [On the comparability of rating
scales with different numbers of categories]. Psychologische
Beiträge, 31, 264–284.
Hernández, A., Drasgow, F., & González-Romá, V. (2004). Inves-
tigating the functioning of a middle category by means of a
mixed-measurement model. Journal of Applied Psychology, 89,
687–699. https://doi.org/10.1037/0021-9010.89.4.687
Jackson, J. J., Wood, D., Bogg, T., Walton, K. E., Harms, P. D., &
Roberts, B. W. (2010). What do conscientious people do?
Development and validation of the Behavioral Indicators of
Conscientiousness (BIC). Journal of Research in Personality, 44,
501–511. https://doi.org/10.1016/j.jrp.2010.06.005
Johnson, T., Kulesa, P., Cho, Y. I., & Shavitt, S. (2005). The relation
between culture and response styles – Evidence from 19
countries. Journal of Cross-Cultural Psychology, 36, 264–277.
https://doi.org/10.1177/0022022104272905
Krosnick, J. A. (1999). Survey research. Annual Review of
Psychology, 50, 537–567. https://doi.org/10.1146/annurev.
psych.50.1.537
Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party
identification and policy preferences – The impact of survey
question format. American Journal of Political Science, 37,
941–964. https://doi.org/10.2307/2111580
Masters, G. N. (1988). Measurement models for ordered response
categories. In R. Langeheine & J. Rost (Eds.), Latent trait and
latent class models (pp. 11–29). New York, NY: Plenum Press.
Meade, A. W., & Craig, S. B. (2012). Identifying careless
responses in survey data. Psychological Methods, 17,
437–455. https://doi.org/10.1037/A0028085
Moors, G. (2012). The effect of response style bias on the
measurement of transformational, transactional, and laissez-
faire leadership. European Journal of Work and Organizational
Psychology, 21, 271–298. https://doi.org/10.1080/1359432x.
2010.550680
Plieninger, H. (2017). Mountain or molehill? A simulation study
on the impact of response styles. Educational and Psy-
chological Measurement, 77, 32–53. https://doi.org/10.1177/
0013164416636655
Preston, C. C., & Colman, A. M. (2000). Optimal number of
response categories in rating scales: Reliability, validity,
discriminating power, and respondent preferences. Acta
Psychologica, 104, 1–15. https://doi.org/10.1016/S0001-6918(99)00050-5
Rammstedt, B., & Farmer, R. F. (2013). The impact of acquies-
cence on the evaluation of personality structure. Psychologi-
cal Assessment, 25, 1137–1145. https://doi.org/10.1037/
a0033323
Revilla, M. A., Saris, W. E., & Krosnick, J. A. (2014). Choosing the
number of categories in agree-disagree scales. Sociological
Methods & Research, 43, 73–97. https://doi.org/10.1177/
0049124113509605
Saris, W. E., & Gallhofer, I. N. (2014). Design, evaluation, and
analysis of questionnaires for survey research. Hoboken, NJ:
Wiley.
Sass, R., Frick, S., Reips, U.-D., & Wetzel, E. (in press). Taking the
test-taker’s perspective: Response process and test motivation
in multidimensional forced-choice vs. rating scale instruments.
Assessment.
Schwab, D. P., Heneman, H. G., & DeCotiis, T. A. (1975). Behav-
iorally anchored rating scales: A review of the literature.
Personnel Psychology, 28, 549–562. https://doi.org/10.1111/j.1744-6570.1975.tb01392.x
Schwarz, N., Knauper, B., Hippler, H. J., Noelle-Neumann, E., &
Clark, L. (1991). Rating scales – Numeric values may change
the meaning of scale labels. Public Opinion Quarterly, 55,
570–582. https://doi.org/10.1086/269282
Simms, L. J. (2008). Classical and modern methods of psycho-
logical scale construction. Social and Personality Psychology
Compass, 2, 414–433. https://doi.org/10.1111/j.1751-9004.
2007.00044.x
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expecta-
tions: An approach to the construction of unambiguous
anchors for rating scales. Journal of Applied Psychology, 47,
149–155. https://doi.org/10.1037/H0047060
Thissen, A., Koch, M., Becker, N., & Spinath, F. M. (2016).
Construct your own response: The cube construction task as
a novel format for the assessment of spatial ability. European
Journal of Psychological Assessment. Advance online publica-
tion. https://doi.org/10.1027/1015-5759/a000342
Wetzel, E., Böhnke, J. R., & Brown, A. (2016). Response biases. In
F. R. L. Leong, B. Bartram, F. Cheung, K. F. Geisinger, &
D. Iliescu (Eds.), The ITC international handbook of testing and
assessment (pp. 349–363). New York, NY: Oxford University
Press.
Wetzel, E., Böhnke, J. R., & Rose, N. (2016). A simulation study on
methods of correcting for the effects of extreme response style.
Educational and Psychological Measurement, 76, 304–324.
https://doi.org/10.1177/0013164415591848
Wetzel, E., & Frick, S. (2017). The Big Five Triplets – Development
of a multidimensional forced-choice questionnaire. Manuscript
in preparation.
Wüstenberg, S., Greiff, S., & Funke, J. (2012). Complex problem
solving – More than reasoning? Intelligence, 40, 1–14. https://
doi.org/10.1016/j.intell.2011.11.003
Ziegler, M. (2014). Stop and state your intentions! Let’s not forget
the ABC of test construction. European Journal of Psychological
Assessment, 30, 239–242. https://doi.org/10.1027/1015-5759/
a000228
Ziegler, M. (2015). “F*** you, I won’t do what you told me!” –
Response biases as threats to psychological assessment.
European Journal of Psychological Assessment, 31, 153–158.
https://doi.org/10.1027/1015-5759/a000292
Eunike Wetzel
Department of Psychology
University of Mannheim
L13, 15
68161 Mannheim
Germany
eunike.wetzel@uni-mannheim.de
Samuel Greiff
Cognitive Science and Assessment
University of Luxembourg
6, rue Richard Coudenhove-Kalergi
4366 Esch-sur-Alzette
Luxembourg
samuel.greiff@uni.lu