Editorial
The World Beyond Rating Scales: Why We Should Think More Carefully About the Response Format in Questionnaires

Eunike Wetzel¹ and Samuel Greiff²

¹ Department of Psychology, University of Mannheim, Germany
² Cognitive Science and Assessment, University of Luxembourg, Luxembourg
Researchers constructing a new questionnaire think very
carefully about a lot of things: the construct definition,
the target population, the wording of the items, the item
selection, and so on. These are all explicit steps in the test
construction process as it is described by textbooks and
research articles (Clark & Watson, 1995; Simms, 2008;
Ziegler, 2014). One aspect that appears to receive less
attention is the choice of response format. When the ques-
tionnaire is a self-report (or other-report) measure of one or
more psychological constructs, test constructors appear to
automatically implement a rating scale such as strongly dis-
agree to strongly agree as the response format. The reason
for this is that rating scales have served us well in the past
and continue to do so.
However, in this editorial we will argue that choosing the
response format should be an explicit step in the test con-
struction process that deserves special attention and consid-
erable thought. In fact, the response format should be
chosen to fit the construct best. We will also argue that
we need a greater diversity of response formats and more
research on them.
Rating Scales: The Default Response Format in Questionnaires
The most common response format in self-report or other-
report questionnaires assessing personality traits, interests,
motivations, or other psychological constructs is the rating
scale response format. With rating scales, each item is pre-
sented individually and respondents rate their endorsement
of the item on a scale with multiple response categories (see
example in Figure 1). Common rating scales include rating
scales on agreement (strongly disagree to strongly agree),
rating scales on degree or extent (not at all to very much),
and frequency scales (never to always). There are a number
of issues to consider when deciding which type of rating
scale to use (for a comprehensive list see Saris & Gallhofer,
2014): Should it be unipolar or bipolar? How many response
categories should the rating scale have? Should there be a
middle (neutral) category? How should the categories be
labeled (numerically, verbally, with symbols, or combina-
tions of these)? Are the categories equidistant? Should
there be a separate NA/do not want to respond category?
There is a large amount of research addressing these ques-
tions (e.g., Hernández, Drasgow, & González-Romá, 2004;
Krosnick & Berent, 1993; Revilla, Saris, & Krosnick, 2014;
Schwarz, Knauper, Hippler, Noelle-Neumann, & Clark,
1991). This research overwhelmingly shows that choices
on these issues matter. For example, Schwarz et al. (1991)
found that average ratings differed between the pre-
sentation of an 11-point rating scale with values from 0 to
10 versus values from -5 to +5 with identical verbal end-
point labels. Krosnick (1999), Clark and Watson (1995),
and DeCastellarnau (2017) review some of the research
and provide specific recommendations on how to deal with
these matters.
Despite the consistent finding that choices on how the
rating scale is set up do matter, overall rating scales work
well. They allow reliable and valid assessments of a great
diversity of psychological constructs and many people find
them easy to use. Nevertheless, they also have a number of
important drawbacks:
(1) They are susceptible to response biases such as
response styles and socially desirable responding.
This can have detrimental consequences such as dis-
torting correlations between traits (Moors, 2012) or be
rather inconsequential, especially when the trait and
response style are unrelated or weakly correlated
(Plieninger, 2017; Wetzel, Böhnke, & Rose, 2016).
Either way, from a measurement perspective, any
additional influence on item responses that is not
the construct of interest is problematic.¹ (See the descriptive sketch following this list.)
(2) There are interindividual differences in the interpre-
tation of response category labels. For example, going
out often might mean once a month to one person
and three times a week to another person.
(3) Different subgroups (high vs. low education level, dif-
ferent cultures) use rating scales differently (Johnson,
Kulesa, Cho, & Shavitt, 2005; Rammstedt & Farmer,
2013).
(4) Especially with long questionnaires, rating scales may
be tiresome and invite careless responding (Meade &
Craig, 2012).
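To make drawback (1) more concrete, the following minimal sketch (our illustration, not material from the editorial; all data are invented) shows how two common response styles can be quantified descriptively from rating-scale data: extreme response style as the share of answers in the outermost categories, and midpoint responding as the share of answers in the neutral category.

```python
import numpy as np

def response_style_indices(responses, n_categories=5):
    """responses: (n_persons, n_items) array with ratings coded 1..n_categories."""
    responses = np.asarray(responses)
    # Extreme response style: share of answers in the two outermost categories.
    ers = np.mean((responses == 1) | (responses == n_categories), axis=1)
    # Midpoint responding: share of answers in the neutral category (odd-numbered scales).
    mrs = np.mean(responses == (n_categories + 1) // 2, axis=1)
    return ers, mrs

# Two hypothetical respondents answering six items on a 5-point scale.
data = [[1, 5, 5, 1, 5, 1],   # endorses mostly the outermost categories
        [3, 3, 2, 3, 4, 3]]   # endorses mostly the middle category
ers, mrs = response_style_indices(data)
print(ers)  # approximately [1.0, 0.0]
print(mrs)  # approximately [0.0, 0.67]
```

Descriptive indices like these can then be inspected or related to substantive scores to gauge whether response tendencies are likely to distort the measurement in a given data set.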
Variations of Rating Scales
The simplest response format consists of just two response
options such as true and false or yes and no. The rationale
behind extending this dichotomous format to a rating scale
with more than two response categories was that more
information could be obtained by allowing greater differen-
tiation between individual responses (Masters, 1988).
A number of variations or attempted advancements of
rating scales have been suggested, sometimes explicitly to
overcome one or more of the disadvantages of existing
rating scales. In some questionnaires, rather than using
the same rating scale for all items, variations of the item
stem that represent different degrees of the trait are used.
For example, one item in the Beck Depression Inventory-II
(BDI-II; Beck, Steer, & Brown, 1996) consists of the
response options 0 (= I do not feel sad), 1 (= I feel sad much
of the time), 2 (= I am sad all the time), and 3 (= I am so sad or
unhappy that I can't stand it). This format might be able
to reduce some problems with regular rating scales, such
as the ambiguity of rating scale labels, but it
as reducing the ambiguity of rating scale labels, but it
is challenging to construct because specific behaviors
need to be found that capture different trait levels while
being exhaustive at the same time. In the example above,
there is no response option for people who feel sad
occasionally.
Another example of an attempt to improve rating scales
is Visual Analogue Scales (VAS; Hayes & Patterson, 1921),
which are often presented as a slider scale ranging from
0 to 100. The idea behind VAS is that, since the underlying
construct is continuous, its measurement should also be
continuous. However, trying to make the measurement
continuous is unnecessary because methods exist for trans-
forming a discrete (e.g., rating scale) measurement onto a
continuous trait level scale (item response models, aggre-
gating across items). In addition, VAS assume that
participants can actually make such fine-grained differenti-
ations. However, when rating scales with many (e.g., 100)
numbered categories are presented to people, they tend
to choose ones that are divisible by 10 (or 5), indicating that
Figure 1. Example of the rating scale response format with items from the Big Five Triplets (Wetzel & Frick, 2017).
¹ For more information on response biases see a previous editorial by Ziegler (2015) and an overview chapter by Wetzel, Böhnke, and Brown (2016).
the potentially overtaxing differentiation is broken down
into coarser segments by participants and not all available
options are actually used (Henss, 1989). In line with this,
research shows that adding more response categories
beyond around seven does not improve measurement
notably (Preston & Colman, 2000). VAS thus offer a differ-
entiation that is more fine-grained than participants' judgments. In consequence, these additional gradations are
meaningless and should not be interpreted.
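The point that discrete rating-scale responses can be mapped onto a continuous trait scale by an item response model can be illustrated with a small sketch. The graded response model used below and its item parameters are merely assumed for illustration (in practice they would be estimated from data); the sketch computes an expected a posteriori (EAP) trait estimate from one respondent's discrete answers.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical discriminations (a) and ordered thresholds (b) for four items
# answered on a 5-point rating scale (categories coded 0..4).
a = np.array([1.2, 0.8, 1.5, 1.0])
b = np.array([
    [-2.0, -1.0, 0.5, 1.5],
    [-1.5, -0.5, 0.8, 2.0],
    [-2.2, -0.8, 0.3, 1.2],
    [-1.8, -0.6, 0.6, 1.7],
])

def category_probs(theta, a_i, b_i):
    """Graded response model: P(X >= k) = logistic(a * (theta - b_k))."""
    p_ge = 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))   # P(X >= 1), ..., P(X >= 4)
    p_ge = np.concatenate(([1.0], p_ge, [0.0]))          # P(X >= 0) = 1, P(X >= 5) = 0
    return p_ge[:-1] - p_ge[1:]                           # P(X = k) for k = 0..4

def eap_theta(responses, grid=np.linspace(-4, 4, 161)):
    """Expected a posteriori trait estimate with a standard normal prior."""
    prior = norm.pdf(grid)
    like = np.ones_like(grid)
    for i, x in enumerate(responses):
        like *= np.array([category_probs(t, a[i], b[i])[x] for t in grid])
    post = like * prior
    post /= post.sum()
    return float(np.sum(grid * post))

# One respondent's discrete answers on the 5-point scale ...
print(eap_theta([3, 4, 2, 3]))   # ... yield a single continuous trait estimate
```

Aggregation across items works similarly in spirit: several coarse, discrete judgments jointly pin down a position on a continuous scale, which is why forcing continuity into the response format itself is not required.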
In sum, modifications of rating scales have not been able
to eliminate the problems inherent to rating scales (e.g.,
interindividual differences in using the rating scale). Thus,
replacing rating scales with alternative response formats
might be a more sensible course of action than trying to
improve rating scales.
Alternatives to Rating Scales
There appear to be few viable alternative response formats
that are not just variations of the rating scale format. In the
1960s, Behaviorally Anchored Rating Scales (BARS) were
proposed (Smith & Kendall, 1963). With BARS, the response
options consist of behavioral anchors that represent differ-
ent trait levels (see example in Figure 2). Ideally, one or
two BARS would be sufficient for assessing a trait. BARS
are used in particular for performance ratings in industrial
and organizational psychology (Grote, 1996), but they are
not widespread because their psychometric properties
(e.g., reliability) were not superior to other assessment
methods and their construction is complex and costly
(Bernardin & Smith, 1981; Schwab, Heneman, & DeCotiis,
1975). In addition to this, finding specific behaviors that
adequately represent moderate trait levels appears to be
particularly challenging (Hauenstein, Brown, & Sinclair,
2010). However, with more sophisticated (e.g., item
response theory) methods, it might be possible to construct
BARS that provide reliable and valid assessments while
simultaneously being efficient for test users. It may then
be feasible to apply BARS for the assessment of other con-
structs such as personality traits. Attempts have been made
to assess personality with concrete behavioral indicators (for
an example with conscientiousness see Jackson et al., 2010)
and these could be transformed into anchors on a BARS.
Another alternative to rating scales is the multidimen-
sional forced-choice (MFC) format, which has been around
for a long time, but has recently gained traction with the
development of item response models that allow obtaining
normative trait estimates (as opposed to ipsative trait esti-
mates) from MFC data (Brown & Maydeu-Olivares, 2011,
2013). In the MFC format, several items are presented
simultaneously to respondents in an item block and they
have to rank the items according to their preference (e.g.,
with activities or products) or according to how well they
describe the respondent (e.g., with personality items; see
example in Figure 3).² There are several variations of the
MFC format that depend on the size of the item block
(pairs, triplets, quads, and so forth) and the instruction to
participants (full ranking vs. partial ranking). The process
of responding to MFC item blocks differs from the response
process to rating scale items in that the items within the
block have to be weighed against each other (Sass, Frick,
Reips, & Wetzel, in press). Despite the potentially higher
cognitive effort involved in responding to MFC item blocks,
test motivation does not appear to differ between the MFC
and rating scale format (Sass et al., in press). The MFC
format eliminates response styles such as extreme response
style. Other response biases such as faking can still occur to
some extent, though the ranking task puts a limit on the
amount of faking that can occur because not all traits
can be faked at the same time and equally strongly (in the
triplet in Figure 3, a participant would have to decide
whether to focus on faking extraversion, emotional stabil-
ity, or conscientiousness by placing it at rank 1). Studies
have shown that the MFC format is less susceptible to
faking than the rating scale format (Christiansen, Burns,
& Montgomery, 2005). Constructing an MFC question-
naire, however, is more complex than constructing a rating
scale questionnaire because a lot of additional considera-
tions play a role such as which items are presented in a
block, and more research is needed on these basic test
construction issues.
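As a concrete illustration of how MFC responses are typically prepared for such models, the short sketch below (our assumption of a typical preprocessing step, not code from the cited work) recodes a respondent's ranking of an item block into binary pairwise-comparison outcomes, the data format analyzed by Thurstonian IRT models (cf. Brown & Maydeu-Olivares, 2011). The item names are hypothetical.

```python
from itertools import combinations

def ranks_to_pairwise(block_items, ranks):
    """ranks[i] = rank given to block_items[i] (1 = describes me best).
    Returns {(item_i, item_k): 1 if item_i was ranked above item_k, else 0}."""
    outcomes = {}
    for i, k in combinations(range(len(block_items)), 2):
        outcomes[(block_items[i], block_items[k])] = int(ranks[i] < ranks[k])
    return outcomes

# Hypothetical triplet measuring three different traits, as in Figure 3.
triplet = ["extraversion_item", "emotional_stability_item", "conscientiousness_item"]
print(ranks_to_pairwise(triplet, ranks=[2, 1, 3]))
# {('extraversion_item', 'emotional_stability_item'): 0,
#  ('extraversion_item', 'conscientiousness_item'): 1,
#  ('emotional_stability_item', 'conscientiousness_item'): 1}
```

For a block of n items, a full ranking yields n(n-1)/2 binary outcomes per respondent; the Thurstonian IRT model is then fitted to these binary variables to recover normative rather than ipsative trait estimates.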
Figure 2. Example of a behaviorally anchored rating scale assessing the construct orderliness.
² The MFC format therefore is both an item format and a response format at the same time.
Concluding Remarks
When writing this editorial, we realized that there are
surprisingly few real alternatives to the rating scale format.
In contrast, in achievement testing, there are numerous
creative assessment methods such as computer-simulated
microworlds (Wüstenberg, Greiff, & Funke, 2012) or inno-
vatively constructed response formats (Thissen, Koch,
Becker, & Spinath, 2016) that complement classical open-
ended questions and multiple-choice formats.
In a way, one could ask, why is assessing noncognitive
traits so boring and why do we always use rating scales?
One reason appears to be that rating scales overall do the
job they are supposed to do, despite their disadvantages,
especially when we are only looking at homogeneous student
samples from one country. With more heterogeneous
samples and in particular in cross-cultural research, this
might not be the case. Another reason for the popularity
of rating scales might be that they are very convenient
and easy to construct. Thus, we as test constructors have
not found it necessary to look beyond them. In addition,
some of the attempts to introduce alternatives have run into
severe problems (see BARS or formerly the ipsativity of
MFC trait scores). However, some of these problems can
be solved with more sophisticated methods such as item
response theory as in the case of MFC data.
This editorial is not a call to stop using rating scales, but
we believe that the choice of response format should be an
explicit step in the process of constructing a questionnaire.
Therefore, we would like to encourage researchers and test
constructors to explicitly consider alternatives to rating
scales instead of automatically using a rating scale and to
make a well-thought-out decision that is based on weighing
the pros and cons. Some feasible alternatives such as the
MFC format already exist, but more are needed. We
encourage research on response formats, both research
on existing response formats and especially research
exploring alternative response formats. Right now, many
researchers including the authors of this editorial assume
that the default of rating scales is appropriate for virtually
any construct we assess with self-report questionnaires
and the only aspect that needs to be considered is the
specifics of the rating scale (e.g., number of response cate-
gories). However, this may not be the case. Different
response formats may be appropriate for different con-
structs, as in achievement testing. For example, there are
a number of constructs that often show low response vari-
ability when assessed with rating scales (test motivation,
self-esteem). In these cases, a different response format
(perhaps with specific behavioral anchors) might be more
informative. In psychological assessment, it is our goal to
measure the constructs we are interested in as well as
we can and to draw valid inferences regarding people's trait
levels from our questionnaire results. Thus, we need to
make sure we use the response format that will allow
achieving these goals. Test constructors think so carefully
about many details of the questionnaire they are construct-
ing (e.g., the wording of individual items or which items to
select for the final version). They should think equally care-
fully about which response format to use.
Acknowledgment
The authors thank Susanne Frick for her comments on a
draft of this editorial.
References
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression Inventory-II. San Antonio, TX: The Psychological Corporation.
Bernardin, H. J., & Smith, P. C. (1981). A clarification of some issues regarding the development and use of behaviorally anchored rating scales (BARS). Journal of Applied Psychology, 66, 458–463. https://doi.org/10.1037/0021-9010.66.4.458
Brown, A., & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71, 460–502. https://doi.org/10.1177/0013164410375112
Brown, A., & Maydeu-Olivares, A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18, 36–52. https://doi.org/10.1037/a0030641
Christiansen, N. D., Burns, G. N., & Montgomery, G. E. (2005). Reconsidering forced-choice item formats for applicant personality assessment. Human Performance, 18, 267–307. https://doi.org/10.1207/s15327043hup1803_4
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. https://doi.org/10.1037/1040-3590.7.3.309
DeCastellarnau, A. (2017). A classification of response scale characteristics that affect data quality: A literature review. Quality & Quantity. Advance online publication. https://doi.org/10.1007/s11135-017-0533-4
Figure 3. Example of the multidimensional forced-choice format from the Big Five Triplets (Wetzel & Frick, 2017).
Grote, D. (1996). The complete guide to performance appraisal. New York, NY: American Management Association.
Hauenstein, N. M. A., Brown, R. D., & Sinclair, A. L. (2010). BARS and those mysterious, missing middle anchors. Journal of Business and Psychology, 25, 663–672. https://doi.org/10.1007/s10869-010-9180-7
Hayes, M. H. S., & Patterson, D. G. (1921). Experimental development of the graphic rating method. Psychological Bulletin, 18, 98–99.
Henss, R. (1989). Zur Vergleichbarkeit von Ratingskalen unterschiedlicher Kategorienzahl [On the comparability of rating scales with different numbers of categories]. Psychologische Beiträge, 31, 264–284.
Hernández, A., Drasgow, F., & González-Romá, V. (2004). Investigating the functioning of a middle category by means of a mixed-measurement model. Journal of Applied Psychology, 89, 687–699. https://doi.org/10.1037/0021-9010.89.4.687
Jackson, J. J., Wood, D., Bogg, T., Walton, K. E., Harms, P. D., & Roberts, B. W. (2010). What do conscientious people do? Development and validation of the Behavioral Indicators of Conscientiousness (BIC). Journal of Research in Personality, 44, 501–511. https://doi.org/10.1016/j.jrp.2010.06.005
Johnson, T., Kulesa, P., Cho, Y. I., & Shavitt, S. (2005). The relation between culture and response styles – Evidence from 19 countries. Journal of Cross-Cultural Psychology, 36, 264–277. https://doi.org/10.1177/0022022104272905
Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567. https://doi.org/10.1146/annurev.psych.50.1.537
Krosnick, J. A., & Berent, M. K. (1993). Comparisons of party identification and policy preferences – The impact of survey question format. American Journal of Political Science, 37, 941–964. https://doi.org/10.2307/2111580
Masters, G. N. (1988). Measurement models for ordered response categories. In R. Langeheine & J. Rost (Eds.), Latent trait and latent class models (pp. 11–29). New York, NY: Plenum Press.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17, 437–455. https://doi.org/10.1037/A0028085
Moors, G. (2012). The effect of response style bias on the measurement of transformational, transactional, and laissez-faire leadership. European Journal of Work and Organizational Psychology, 21, 271–298. https://doi.org/10.1080/1359432x.2010.550680
Plieninger, H. (2017). Mountain or molehill? A simulation study on the impact of response styles. Educational and Psychological Measurement, 77, 32–53. https://doi.org/10.1177/0013164416636655
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1–15. https://doi.org/10.1016/S0001-6918(99)00050-5
Rammstedt, B., & Farmer, R. F. (2013). The impact of acquiescence on the evaluation of personality structure. Psychological Assessment, 25, 1137–1145. https://doi.org/10.1037/a0033323
Revilla, M. A., Saris, W. E., & Krosnick, J. A. (2014). Choosing the number of categories in agree-disagree scales. Sociological Methods & Research, 43, 73–97. https://doi.org/10.1177/0049124113509605
Saris, W. E., & Gallhofer, I. N. (2014). Design, evaluation, and analysis of questionnaires for survey research. Hoboken, NJ: Wiley.
Sass, R., Frick, S., Reips, U.-D., & Wetzel, E. (in press). Taking the test-taker's perspective: Response process and test motivation in multidimensional forced-choice vs. rating scale instruments. Assessment.
Schwab, D. P., Heneman, H. G., & DeCotiis, T. A. (1975). Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 28, 549–562. https://doi.org/10.1111/J.1744-6570.1975.Tb01392.X
Schwarz, N., Knauper, B., Hippler, H. J., Noelle-Neumann, E., & Clark, L. (1991). Rating scales – Numeric values may change the meaning of scale labels. Public Opinion Quarterly, 55, 570–582. https://doi.org/10.1086/269282
Simms, L. J. (2008). Classical and modern methods of psychological scale construction. Social and Personality Psychology Compass, 2, 414–433. https://doi.org/10.1111/j.1751-9004.2007.00044.x
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155. https://doi.org/10.1037/H0047060
Thissen, A., Koch, M., Becker, N., & Spinath, F. M. (2016). Construct your own response: The cube construction task as a novel format for the assessment of spatial ability. European Journal of Psychological Assessment. Advance online publication. https://doi.org/10.1027/1015-5759/a000342
Wetzel, E., Böhnke, J. R., & Brown, A. (2016). Response biases. In F. R. L. Leong, B. Bartram, F. Cheung, K. F. Geisinger, & D. Iliescu (Eds.), The ITC international handbook of testing and assessment (pp. 349–363). New York, NY: Oxford University Press.
Wetzel, E., Böhnke, J. R., & Rose, N. (2016). A simulation study on methods of correcting for the effects of extreme response style. Educational and Psychological Measurement, 76, 304–324. https://doi.org/10.1177/0013164415591848
Wetzel, E., & Frick, S. (2017). The Big Five Triplets – Development of a multidimensional forced-choice questionnaire. Manuscript in preparation.
Wüstenberg, S., Greiff, S., & Funke, J. (2012). Complex problem solving – More than reasoning? Intelligence, 40, 1–14. https://doi.org/10.1016/j.intell.2011.11.003
Ziegler, M. (2014). Stop and state your intentions! Let's not forget the ABC of test construction. European Journal of Psychological Assessment, 30, 239–242. https://doi.org/10.1027/1015-5759/a000228
Ziegler, M. (2015). "F*** you, I won't do what you told me!" – Response biases as threats to psychological assessment. European Journal of Psychological Assessment, 31, 153–158. https://doi.org/10.1027/1015-5759/a000292
Eunike Wetzel
Department of Psychology
University of Mannheim
L13, 15
68161 Mannheim
Germany
eunike.wetzel@uni-mannheim.de
Samuel Greiff
Cognitive Science and Assessment
University of Luxembourg
6, rue Richard Coudenhove-Kalergi
4366 Esch-sur-Alzette
Luxembourg
samuel.greiff@uni.lu