Original Research Article
Educational and Psychological Measurement, 1–26
© The Author(s) 2023
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/00131644231155838
journals.sagepub.com/home/epm
Correcting for Extreme Response Style: Model Choice Matters

Martijn Schoenmakers1, Jesper Tijmstra1, Jeroen Vermunt1, and Maria Bolsinova1
Abstract
Extreme response style (ERS), the tendency of participants to select extreme item
categories regardless of the item content, has frequently been found to decrease the
validity of Likert-type questionnaire results. For this reason, various item response
theory (IRT) models have been proposed to model ERS and correct for it.
Comparisons of these models are however rare in the literature, especially in the
context of cross-cultural comparisons, where ERS is even more relevant due to cul-
tural differences between groups. To remedy this issue, the current article examines
two frequently used IRT models that can be estimated using standard software: a
multidimensional nominal response model (MNRM) and an IRTree model. Studying
conceptual differences between these models reveals that they differ substantially in
their conceptualization of ERS. These differences result in different category prob-
abilities between the models. To evaluate the impact of these differences in a multi-
group context, a simulation study is conducted. Our results show that when the
groups differ in their average ERS, the IRTree model and MNRM can drastically differ
in their conclusions about the size and presence of differences in the substantive trait
between these groups. An empirical example is given and implications for the future
use of both models and the conceptualization of ERS are discussed.
Keywords
item response theory, multidimensional nominal response model, IRTree model,
response styles, extreme responding
1 Tilburg University, The Netherlands

Corresponding Author:
Martijn Schoenmakers, Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands.
Email: M.Schoenmakers@tilburguniversity.edu
Likert-type scales are frequently used in social science questionnaires to measure a
wide array of constructs (Nemoto & Beglar, 2014; Van Vaerenbergh & Thomas,
2013; Willits et al., 2016). While the use of Likert-type scales is widespread, several
threats to their valid use exist. Response style, the tendency of participants to provide
a particular response to a question regardless of the question's content, is one of these
threats (Falk & Cai, 2016; Van Vaerenbergh & Thomas, 2013). Response styles
affect means and variances of Likert-type scales, thus potentially leading to unwar-
ranted conclusions being drawn. For example, Moors (2012) found that the correla-
tion between gender and leadership preference disappeared when response styles
were taken into account. Due to the effects of response styles, it is important to detect
and correct for response styles when conducting questionnaire research.
Many response styles exist, for example, acquiescent response style (ARS; ten-
dency to agree with items regardless of content), mild response style (MLRS; ten-
dency to avoid the scale endpoints), and extreme response style (ERS; tendency to
respond using scale endpoints; Moors, 2012; Van Vaerenbergh & Thomas, 2013).
While all of these response styles have the potential to reduce the validity of results
based on Likert-type scales by affecting means and variances, response styles
become especially relevant in contexts where different groups who may not deal with
items in the exact same way are compared, such as in cross-cultural research. For this
reason, response styles and their relation to culture have received extensive attention.
In particular, many studies have shown ERS to differ substantially across cultures
(Batchelor & Miao, 2016; Chun et al., 1974; Clarke, 2001; Hui & Triandis, 1989;
Morren et al., 2011; Van Vaerenbergh & Thomas, 2013). The following section
briefly summarizes antecedents and consequences of ERS.
ERS is defined as the tendency of participants to prefer responding extremely on
item scales, independent of item content (Greenleaf, 1992; Van Vaerenbergh &
Thomas, 2013). A variety of potential causes for ERS have been identified in the lit-
erature. These causes can be divided into questionnaire properties and person proper-
ties. Concerning the questionnaire, the use of bipolar scales and the way scale
endpoints are labeled have been found to influence ERS in participants (Lau, 2007;
Moors et al., 2014). Furthermore, the visual distance between scale options and
whether options are presented horizontally or vertically has been shown to relate to
ERS (Weijters et al., 2021). On the personal level, personality constructs such as
neuroticism (Iwawaki & Zax, 1969), extraversion, and conscientiousness (Austin
et al., 2006) influence ERS. In addition, race, gender, intelligence (Batchelor &
Miao, 2016), and culture (Batchelor & Miao, 2016; Hui & Triandis, 1989; Morren
et al., 2011) have been found to correlate with extreme responding.
As a response style, ERS has the potential to bias results obtained via question-
naires. Generally, ERS can reduce the magnitude of effects obtained, as extreme
responding increases the variance of questionnaire responses (Van Vaerenbergh &
Thomas, 2013). As an example, one study found that failing to account for ERS
reduced explained variance from 69.5% to 53.5% (Lau, 2007). Given this potential
distortion of results caused by ERS, various item response theory (IRT) models have
been developed to detect and correct for ERS. The way these various models func-
tion and the assumptions they make regarding response styles differ substantially.
One of the most notable differences between these models is whether ERS is conceptualized as a categorical or a continuous trait.
When conceptualizing ERS as a categorical latent trait, mixture IRT models can
be used. Mixture IRT models are a combination of IRT modeling and latent class
analysis, creating several latent classes based on observed responses (Rost, 1991).
For every latent class, person and item parameters are estimated separately. Thus,
there can be between-class differences in parameters, but within-class homogeneity is
assumed. For example, a two-class model may result in one ordinary responding class
and one extreme responding class (Austin et al., 2006; Bockenholt & Meiser, 2017).
For modeling ERS, the mixture partial credit model is often applied (Cho, 2013;
Huang, 2016; Sen & Cohen, 2019). While the categorical view of ERS is sometimes
utilized in modeling, assuming a lack of within-class variation is not theoretically jus-
tified (Huang, 2016). Individuals within a class may very well differ in the presence
and strength of their ERS tendency. As categorical ERS models are not able to model
this within-class variation, this article will focus on models that model response styles
continuously rather than categorically.
A variety of models that conceptualize ERS as a continuous variable have been
developed. For example, extensions of the rating scale model (Jin & Wang, 2014),
unfolding models (Javaras & Ripley, 2007), recent applications of multidimensional
nominal response models (Bolt et al., 2014; Bolt & Johnson, 2009; Falk & Cai,
2016), item response tree (IRTree) models (Bockenholt, 2012; Bockenholt & Meiser,
2017; Meiser et al., 2019; Thissen-Roe & Thissen, 2013) and heterogeneous thresh-
old models (Johnson, 2003) have been proposed. All of these models introduce an
additional continuous latent variable for ERS in addition to the substantive latent
variable that the questionnaire is intended to measure, but the models differ in the
exact way in which the ERS dimension is included in the model.
While all these models allow one to correct for ERS, many of them do not have
implementations in standard software packages. The lack of implementation in stan-
dard software packages makes models less likely to be used, especially by applied
researchers. In addition, not all of these models have extensions beyond ERS to other
response styles, and not all models are able to estimate a correlation between ERS
and the substantive trait, making them less flexible. Two notable exceptions are the
multidimensional nominal response model (MNRM) and the IRTree models, which
can be implemented in many standard packages for multidimensional IRT, for exam-
ple, the R package mirt (Chalmers, 2012), are widely used (Zhang & Wang, 2020),
and can be applied to a variety of response styles while estimating a correlation
between the substantive trait and the response style(s) (Falk & Cai, 2016; Meiser
et al., 2019; Zhang & Wang, 2020).
While both the MNRM and IRTree can flexibly be used to correct for ERS, the lit-
erature comparing these models in their ability to deal with response styles and their
assumptions is sparse. Two notable exceptions that do compare MNRM and IRTree
models should be discussed (Leventhal, 2019; Zhang & Wang, 2020). While the two
aforementioned articles do compare MNRM and IRTree models, their parametriza-
tion and outcome measures differ substantially from the current article. Zhang and
Wang (2020) focus on empirical rather than simulated data, limiting their conclusions
somewhat, as true parameter values for the substantive and response style factors are
unknown. In addition, both Leventhal (2019) and Zhang and Wang (2020) use a
somewhat different operationalization of the MNRM and a substantially different
IRTree model. Specifically, Zhang and Wang (2020) do not utilize a multidimen-
sional node IRTree model. While Leventhal (2019) does utilize a multidimensional
node IRTree model, they use a restrictive parameterization which reduces the validity
of conclusions for less restrictive models. This issue will be discussed further in the
Models section. Finally, neither article considers a multigroup setting. Multigroup
settings are relevant in response style modeling, as there is an abundance of research
linking response style to group characteristics such as culture (Batchelor & Miao,
2016; Chun et al., 1974; Clarke, 2001; Hui & Triandis, 1989; Morren et al., 2011;
Van Vaerenbergh & Thomas, 2013). For this reason, the current article aims to
expand upon previous research by comparing the MNRM and a multidimensional
node IRTree model with minimal restrictions on their assumptions and ways of mod-
eling ERS in a multigroup setting.
The rest of this article is organized as follows. First, the "Models" section presents the MNRM and IRTree models used in this article in detail and discusses the conceptual differences between these models. Second, the "Methods" section discusses the conditions of a simulation study that examines the practical impact of these conceptual differences in a multigroup context. Third, the "Results" section contains the results of the simulation study. Fourth, the "Empirical Example" section shows
that similar differences between the models as we observe in the simulation study can
occur when the models are fit to real data. Finally, the article ends with a discussion
where the conceptual and practical implications of the results for the future use of
both models and the conceptualization of ERS are discussed.
Models
The first model we discuss is the MNRM (Takane & de Leeuw, 1987). In the
MNRM, the probability of endorsing an item category is the following:
\[
P(Y_i = k \mid \boldsymbol{\theta}) = \frac{\exp\left(\tilde{\boldsymbol{a}}_{ik}^{T}\boldsymbol{\theta} + c_{ik}\right)}{\sum_{j=1}^{K}\exp\left(\tilde{\boldsymbol{a}}_{ij}^{T}\boldsymbol{\theta} + c_{ij}\right)}, \tag{1}
\]

where $\tilde{\boldsymbol{a}}_{ik}$ is a vector of slope parameters, $\boldsymbol{\theta}$ is a vector containing the participant's scores on the latent traits, with the first element being the participant's score on the substantive trait and the second element being the participant's score on the ERS trait, and $c_{ik}$ is a category intercept. Subscript $i$ refers to items, and subscript $k$ refers to categories (ranging from 1 to the number of categories). The MNRM has been
used extensively to model response styles as continuous latent traits (Bolt et al.,
2014; Bolt & Johnson, 2009; Bolt & Newton, 2011). Building on earlier work by
Bolt and Johnson (2009), Falk and Cai (2016) developed a version of the MNRM
that splits the vector of slope parameters $\tilde{\boldsymbol{a}}_{ik}$ into an estimated item slope and a pre-specified scoring matrix reflecting the loading of the response style(s) on categories:

\[
P(Y_i = k \mid \boldsymbol{\theta}) = \frac{\exp\left([\boldsymbol{a}_{i} \circ \boldsymbol{s}_{k}]^{T}\boldsymbol{\theta} + c_{ik}\right)}{\sum_{j=1}^{K}\exp\left([\boldsymbol{a}_{i} \circ \boldsymbol{s}_{j}]^{T}\boldsymbol{\theta} + c_{ij}\right)}, \tag{2}
\]

where $\boldsymbol{a}_{i}$ is a vector of item slope parameters, $\circ$ denotes Schur/Hadamard multiplication, and $\boldsymbol{s}_{k}$ is the $k$th column vector of the scoring matrix $\boldsymbol{S}$. Due to splitting the slopes into a scoring
matrix and an item slope, this model is able to flexibly model a variety of response
styles while maintaining item-specific discrimination, unlike the model utilized in
both Zhang and Wang (2020) and Leventhal (2019). Note that the models illustrated
in this article will be limited to four-category items for simplicity and to reduce
simulation time, although fewer or more categories are also possible under both
models. While the formulation mentioned above uses item intercepts, these can eas-
ily be converted to item thresholds. For example, in the unidimensional case with a
four-category item with three thresholds, the thresholds can be calculated as
\[
t_{g} = \frac{c_{ig} - c_{i(g+1)}}{a_{i}}, \tag{3}
\]

where $t_{g}$ denotes the $g$th threshold, with $g$ ranging from one to three. The conversion of intercepts to thresholds results in a somewhat more intuitive interpretation of parameters (the point on the substantive dimension where the $k$th category is exactly as likely as the $(k+1)$th category when the ERS is equal to 0). For this reason, the current paper will largely discuss item thresholds rather than item intercepts.
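As a brief worked example (a hypothetical item using the slope of 1.5 and the thresholds of -1, 0, and 1 that also appear later in this article), an item with $a_i = 1.5$ and intercepts $c_i = (0, 1.5, 1.5, 0)$ has thresholds

\[
t_1 = \frac{0 - 1.5}{1.5} = -1, \qquad t_2 = \frac{1.5 - 1.5}{1.5} = 0, \qquad t_3 = \frac{1.5 - 0}{1.5} = 1.
\]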
To illustrate the MNRM adaptation further, two examples of scoring matrices for
a four-category item will be given. To specify the generalized partial credit model
(i.e., no response styles) for a 4-category item, we could specify the scoring matrix
as follows:
\[
\begin{bmatrix} 0 & 1 & 2 & 3 \end{bmatrix}. \tag{4}
\]
Adding ERS, the scoring matrix would be
\[
\begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 0 & 1 \end{bmatrix}. \tag{5}
\]
Note that many response styles could potentially be added to this model, but the
current article will be limited to ERS.
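To make the divide-by-total structure and the role of the scoring matrix concrete, the following R sketch (base R only) computes the Equation 2 category probabilities for a single four-category item. The function name and the parameter values (slopes of 1.5 and intercepts corresponding to thresholds -1, 0, 1) are illustrative choices for this sketch, not code or estimates from the article.

```r
# Category probabilities for one item under the MNRM of Equation 2.
# theta: c(substantive, ERS); a: slopes per dimension;
# S: scoring matrix (dimensions x categories); c_int: category intercepts.
mnrm_probs <- function(theta, a, S, c_int) {
  kernel <- exp(colSums((a * S) * theta) + c_int)  # numerator for each category
  kernel / sum(kernel)                             # divide-by-total
}

a     <- c(1.5, 1.5)                # slopes on the substantive and ERS dimensions
S     <- rbind(c(0, 1, 2, 3),       # substantive scoring (Equation 4)
               c(1, 0, 0, 1))       # ERS scoring: loads on the extreme categories (Equation 5)
c_int <- c(0, 1.5, 1.5, 0)          # intercepts implied by thresholds (-1, 0, 1) via Equation 3

mnrm_probs(theta = c(0, 0), a, S, c_int)  # average trait level, average ERS
mnrm_probs(theta = c(0, 2), a, S, c_int)  # same trait level, strong ERS: mass shifts to categories 1 and 4
```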
Another class of widely used models for response styles that can be easily imple-
mented in standard software are IRTree models (Bockenholt, 2012; Bockenholt &
Meiser, 2017; Thissen-Roe & Thissen, 2013; Zhang & Wang, 2020). IRTree models
are sequential decision models, where the response of a participant to a question is
divided into multiple steps. An example of an IRTree four-category item decision
process is illustrated in Figure 1.
The IRTree model depicted in Figure 1 splits the decision process up into multiple
binary decision nodes, which can be modeled as pseudo-items. In effect, 1 four-
category item is thus split up into 3 two-category pseudo-items. In the first node, a
participant decides whether they agree or disagree with the four-category item. After
this, depending on their decision in Node 1, they proceed to Node 2 or Node 3,
where it is established if they agree/disagree moderately or extremely. Note that the
sequential nature of this model does not necessarily imply a chronological order
(Bockenholt & Meiser, 2017). While this type of IRTree is often used when model-
ing the effect of ERS on four-category items, a wide variety of IRTree models can
be constructed based on a priori beliefs about a participant's response process, the
number of categories, the presence or absence of response styles, and so on
(Bockenholt & Meiser, 2017).
Various parametrizations of the IRTree nodes have been proposed. Early exam-
ples of IRTree models modeled ERS by having the substantive dimension load on
Figure 1. Example of an Item Response Tree Decision Process for a 4-Category Item.
the first node and having the latent ERS dimension load on Node 2 and Node 3
(Bockenholt, 2012; Bockenholt & Meiser, 2017; Meiser et al., 2019). More recently,
multidimensional IRTree models have emerged, where the substantive dimension
loads on all three nodes and the latent ERS dimension loads on Node 2 and Node 3
(Meiser et al., 2019; Thissen-Roe & Thissen, 2013). This is unlike the IRTree models
utilized in Zhang and Wang (2020), and the IRTree models mentioned before
(Bockenholt, 2012; Bockenholt & Meiser, 2017), which only have the substantive
dimension load on the first node. Having the substantive dimension load on all nodes
rather than just the first node is desirable, as the substantive trait should conceptually
also have a role to play in choosing between 1 or 2 and 3 or 4, rather than just influ-
encing general agreement or disagreement. The desired multidimensional node
IRTree model can be achieved by utilizing a unidimensional two-parameter logistic
model for the first node and multidimensional IRT models for the second and the
third node. To illustrate the desired multidimensional node model, let Yim denote the response on the pseudo-item in node m. Table 1 depicts the relationship between the observed response on item i and the corresponding pseudo-items. Note
that Nodes 2 and 3 are coded such that scores of 1 on these nodes correspond to
higher scores for the observed response.
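As a small illustration of this recoding (a sketch, with a function name made up here, not the authors' code), the mapping in Table 1 can be applied to observed responses as follows:

```r
# Recode observed 4-category responses into the three pseudo-item responses of
# Table 1: Yi1 = agreement, Yi2 = extreme vs. non-extreme disagreement,
# Yi3 = extreme vs. non-extreme agreement (NA where a node is not reached).
to_pseudo_items <- function(y) {
  cbind(Y1 = as.integer(y >= 3),
        Y2 = ifelse(y <= 2, as.integer(y == 2), NA),  # 1 -> 0 (extreme), 2 -> 1
        Y3 = ifelse(y >= 3, as.integer(y == 4), NA))  # 3 -> 0, 4 -> 1 (extreme)
}

to_pseudo_items(c(1, 2, 3, 4))  # each row gives the pseudo-item coding of one observed response
```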
Equation 6 depicts the general parameterization of the IRTree nodes:
\[
P(Y_{im} = 1 \mid \boldsymbol{\theta}) = \frac{\exp\left(\sum_{v=1}^{2} a_{imv}\theta_{v} + d_{im}\right)}{1 + \exp\left(\sum_{v=1}^{2} a_{imv}\theta_{v} + d_{im}\right)}, \tag{6}
\]

where $a_{imv}$ denotes the slope parameter of item $i$ in node $m$ for dimension $v$, $\theta_{v}$ denotes the $v$th latent trait, with the first dimension being the substantive trait and the second dimension being the ERS trait, and $d_{im}$ denotes the intercept of item $i$ in node $m$. In addition, several node-specific restrictions are present. In the first node, ERS
has no effect, as participants are merely choosing whether they agree with an item or
not, a process conceptually unrelated to ERS. For this reason, the slope ai12 is con-
strained to be equal to 0. In practice, a correlation between ERS and various substan-
tive traits, such as personality and intelligence, is often found (Austin et al., 2006;
Table 1. Relationship Between the Observed Responses Yi and the Pseudo-Item Responses Yi1, Yi2, and Yi3.

                                               Observed responses
Pseudo-items                                   1     2     3     4
Yi1: Agreement                                 0     0     1     1
Yi2: Extreme/nonextreme disagreement           0     1     NA    NA
Yi3: Extreme/nonextreme agreement              NA    NA    0     1
Batchelor & Miao, 2016). However, without imposing any constraints on the item
slopes, the correlation between the two dimensions is not identified. To allow for a
correlation between ERS and the substantive trait to be modeled, ai21 is constrained
to be equal to ai31, and ai22 is constrained to be equal to ai32.
Note that the constraints that we use on the slope parameters result in a less
restrictive IRTree model than the models used by Meiser et al. (2019) and Leventhal
(2019). In Meiser et al. (2019), the slope parameters are constrained to be equal
across items, which allows dropping the restriction of the slope in Node 2 being
equal to minus the slope in Node 3 but still requires restricting the common slope in
Node 2 to be of the opposite sign to the common slope in Node 3. In Leventhal
(2019), a model originally proposed in Thissen-Roe and Thissen (2013) is used. This
model has five rather than six item-specific parameters. In addition to constraining
ai21 = ai31 and ai22 = ai32, a constraint di3 − di2 = 2di1ai21/ai11 is imposed. As such, the
difference in the intercepts in the third and second nodes is regressed on the intercept
in the first node. This constraint is typically not used in other IRTree models, where
the intercepts in all nodes are instead freely estimated (e.g., Bockenholt, 2012;
Bockenholt & Meiser, 2017; Meiser et al., 2019). In our article, we chose a parame-
trization with item-specific slope parameters (i.e., more general than the constraints
of Meiser et al., 2019) and node-specific intercepts (i.e., more general than the con-
straints of Leventhal, 2019).
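To make the node parameterization concrete, the following R sketch (base R only) computes the four category probabilities implied by the tree in Figure 1, with each node following Equation 6. All parameter values, the function name, and the specific signs chosen for the ERS slopes are illustrative assumptions for this sketch (picked so that higher ERS makes the extreme categories 1 and 4 more likely); they are not the article's estimates or exact constraints.

```r
# Category probabilities for one item under the IRTree of Figure 1, with logistic
# node models as in Equation 6. Node 1 has no ERS loading; nodes 2 and 3 share the
# substantive slope, and the ERS slopes are set so that higher ERS favors 1 and 4.
irtree_probs <- function(theta_sub, theta_ers,
                         a1 = 1.5, d1 = 0,                          # node 1: agree vs. disagree
                         a2_sub = 0.75, a2_ers = -1.5, d2 = 0.5,    # node 2: 2 vs. 1, given disagreement
                         a3_sub = 0.75, a3_ers =  1.5, d3 = -0.5) { # node 3: 4 vs. 3, given agreement
  p1 <- plogis(a1 * theta_sub + d1)                                  # P(agreement)
  p2 <- plogis(a2_sub * theta_sub + a2_ers * theta_ers + d2)         # P(category 2 | disagreement)
  p3 <- plogis(a3_sub * theta_sub + a3_ers * theta_ers + d3)         # P(category 4 | agreement)
  c("1" = (1 - p1) * (1 - p2), "2" = (1 - p1) * p2,
    "3" = p1 * (1 - p3),       "4" = p1 * p3)
}

irtree_probs(theta_sub = 0, theta_ers = 0)  # average respondent
irtree_probs(theta_sub = 0, theta_ers = 2)  # same trait level, strong ERS: mass moves to 1 and 4
```

In line with the comparison that follows, the agreement probability in this sketch depends only on the substantive trait and not on ERS.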
While the MNRM and IRTree models are distinct in the way response styles are
modeled, both utilize continuous latent traits for both the substantive and the response
style dimensions. As the conceptualization of the latent traits seems quite similar, one
could be led to believe the models would have close or identical results when utilized.
However, the models also have some important differences. IRTree models split any
single item up into multiple pseudo-items, while the MNRM utilizes a divide-by-total
approach (the probability of a category is the numerator of that category divided by
the summed numerators of all categories) to model all item categories in a single step.
This seemingly small divergence results in major differences in the impact of ERS.
The IRTree model’s separation of the item into pseudoitems allows for a very precise
effect of ERS to be modeled. Specifically, as ERS loads on Nodes 2 and 3, but not
Node 1, ERS only affects the probability of a response being extreme, without affect-
ing the probability of general agreement with an item. At first glance, the scoring
matrix of ERS in the MNRM, depicted in Equation 5, appears to achieve the same.
However, the divide-by-total nature of the MNRM does not allow ERS to be modeled
without affecting the agreement probability of items. This effect is most prominent
when modeling ERS under the MNRM for items whose thresholds are not symmetric around the substantive trait level of a participant: thresholds of [-1, 0, 1], for example, are symmetric for a participant with θ1 = 0 but asymmetric for a participant with θ1 = 1.
To illustrate the difference between the models described above, several figures
are provided. Data for the figures displayed here were generated using the MNRM,
with a sample size of 50,000 participants with θ1 ~ N(0, 1), θ2 ~ N(0, 1), an a of 1.5 for both dimensions, the scoring matrix as depicted in Formula 5, and thresholds of -1, 0, 1. Note that these parameter values were chosen to match the simulation study
described in the Methods section. The reasons for choosing these values are dis-
cussed in the Methods section and in Supplemental Appendix B. Item parameters for
both the MNRM and the IRTree were estimated from the data using the mirt R pack-
age (Chalmers, 2012) and averaged over 10 identical items after which category
probabilities were calculated. With the large sample size of 50,000 participants, para-
meter estimates showed very little uncertainty and were nearly identical to the true
values. As figures using IRTree-generated data are very similar, these are displayed
in Supplemental Appendix A. Figure 2 illustrates that the agreement probability
depends on ERS under the MNRM but not under the IRTree, especially when the
item thresholds are not symmetric around θ1.
As can be seen in Figure 2, the probability of agreeing with an item under the
IRTree never depends on the extent of ERS. Generally, the probability of agreeing
under the MNRM does depend on ERS. The size of this effect varies depending on
how symmetric the thresholds are around the substantive trait level.
While the MNRM does not result in invariant agreement probability in the pres-
ence of ERS in combination with asymmetric item thresholds, it does offer a property
not present in the IRTree model. Specifically, the probability of agreement given that
the response is extreme (selecting a 4 versus selecting a 1) remains identical in the
MNRM, regardless of the extent of ERS. This is not the case for the IRTree model,
as is displayed in Figure 3.
Figure 2. Probability of Agreement for an Item With Thresholds [-1, 0, 1] Given Various Levels of the Substantive Trait (θ1) as a Function of ERS (θ2) Under the MNRM (Solid Line) and IRTree Models (Dotted Line).
Note. Colors indicate various θ1 values. ERS = extreme response style; MNRM = multidimensional
nominal response model; IRTree = item response tree.
As can be seen in Figure 3, the probability of endorsing Category 4 given an
extreme response generally depends on the extent of ERS present in the data for the
IRTree model but not for the MNRM model. Note that the same holds for the prob-
ability of agreement conditional on the response not being extreme (selecting a 3 vs.
selecting a 2).
Overall, the two differences between the models discussed above result in differ-
ent category probabilities for each model. Figure 4 shows the probabilities of differ-
ent categories as a function of ERS when the substantive trait is kept constant at 0
(i.e., the item thresholds are symmetric around the participant's θ1). Solid lines indicate
an MNRM model, while dotted lines indicate an IRTree model.
As can be seen in Figure 4, the IRTree and MNRM models are quite close together
when the item thresholds are symmetric around θ1. As ERS increases, the probability
of extreme responses (1 or 4) increases, and the probability of moderate responses (2
or 3) decreases, as one would expect. Figure 5 depicts the same scenario, but this time
with an item with asymmetric thresholds (thresholds 0, 1, 2).
As we can see in Figure 5, the IRTree and MNRM models start to diverge more
substantially when the item thresholds are not symmetric around θ1 and ERS is pres-
ent. In this instance, we see the probability of endorsing category 1 increasing more
rapidly for positive ERS values, and decreasing more rapidly for negative ERS val-
ues, under the MNRM than under the IRTree model. The probability of Category 2
Figure 3. Probability of Endorsing Category Four of a Four-Category Item With Thresholds
[-1, 0, 1] Given an Extreme Response and Various Levels of the Substantive Trait (θ1) as a Function of ERS (θ2) Under the MNRM (Solid Line) and IRTree (Dotted Line) Models.
Note. Colors indicate various θ1 values. ERS = extreme response style; MNRM = multidimensional
nominal response model; IRTree = item response tree.
decreases more for positive ERS values and decreases less for negative ERS values
under the MNRM than under the IRTree. The difference between the two models is
especially striking for Category 3. For the MNRM, the probability of Category 3
Figure 5. Category Probabilities (1 = Red, 2 = Orange, 3 = Black, 4 = Green) Under the
MNRM (Solid Line) and IRTree (Dotted Line) Models for an Item With Asymmetric
Thresholds [0, 1, 2] Given θ1 = 0 and Various θ2 Values.
Note. MNRM = multidimensional nominal response model; IRTree = item response tree; ERS = Extreme
response style.
Figure 4. Category Probabilities (1 = Red, 2 = Orange, 3 = Black, 4 = Green) Under the
MNRM (Solid Line) and IRTree (Dotted Line) Models for an Item With Symmetric Thresholds
[-1, 0, 1] Given θ1 = 0 and Varying θ2 Values.
Note. MNRM = multidimensional nominal response model; IRTree = item response tree; ERS = Extreme
response style.
goes from ~0.2 at -3 ERS to ~0 at +3 ERS, while the IRTree probability remains
almost constant at ~0.1, only shifting slightly downward at the higher ERS values.
Finally, the probability of Category 4 remains close to 0 for the MNRM, regardless
of ERS extent, while this probability starts to increase as ERS increases for the
IRTree model.
While it is clear the MNRM and IRTree differ substantially in their modeling of
response probabilities given ERS and items with asymmetric thresholds, the practical
impact of these differences remains unclear. The differences between models may be
especially relevant in a cross-cultural multigroup context, as culture has frequently
been found to influence the extent to which individuals engage in ERS (Clarke, 2001;
Johnson et al., 2005; Morren et al., 2011). To clarify the practical impact of using dif-
ferent models in a multigroup setting, a simulation study was conducted. The follow-
ing section describes and explains the method chosen for this simulation.
Method
To examine the practical impact of the conceptual differences between the MNRM
and IRTree models discussed in the previous section, a simulation study was con-
ducted. The bias in the substantive trait group mean and variance was of primary
interest here. To generate the data required for the simulation study, R version 4.1.2
was used (R Core Team, 2017). First, MNRM data were generated. Second, data were
generated using an IRTree model. Finally, a control condition (i.e., no ERS is pres-
ent) was generated using a generalized partial credit model (GPCM). All code used
to generate data and the generated data itself can be found at https://osf.io/nrgmy/.
Multidimensional Nominal Response Model
To simulate a cross-cultural comparison, participants were divided into two groups. Participant substantive trait scores for both groups were drawn from N(0, 1), and ERS trait scores were drawn from N(0, 1) for Group 1 and N(μ, 1) for Group 2, where μ was 1, 0, or -1 depending on the condition. After careful consideration, which is further described in Supplemental Appendix B, we chose to use alphas of 1.5 for both the substantive and the ERS dimension, with item thresholds of [-1 + m, 0 + m, 1 + m] or [0 + m, 1 + m, 2 + m] depending on the condition, where m was added to all item thresholds to create varying item thresholds for different items, as one would see in a test in practice. The m values were equally spaced between -0.5 and 0.5.
Together, these item loadings and item thresholds resulted in category probabil-
ities that were not too low for any category (>.05), reasonable correlations between
the latent traits and the item responses, and noticeable bias in the substantive trait
mean and variance of group 2 when ignoring ERS (i.e., estimating a GPCM). The
number of items (10 or 20) was also varied in the simulation. In total, this resulted in
2 (threshold sets) × 2 (number of items) × 3 (Group 2 ERS means) = 12 data-generating
conditions, with 500 replications per condition. All three models were applied to this data.
Outcome measures were the bias in the substantive trait mean and variance in Group 2.
For identification purposes, the substantive trait mean of Group 1 was fixed to 0, and the
substantive trait variance in Group 1 was fixed to 1. As such, the mean of group 2 is the
difference between the mean of Group 1 and Group 2, and the variance of Group 2 is the
ratio of the variance in Group 2 to the variance in Group 1. Note that based on reviewer
feedback, we also included sample size and distribution of the m-values as additional fac-
tors of the simulation. However, since the results were very similar to the main condi-
tions, these additional results are presented in Supplemental Appendix E.
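As an illustration of this design, the following R sketch (base R only) generates responses for one condition (10 items, thresholds [-1, 0, 1] plus shifts, Group 2 ERS mean of 1). It is a simplified sketch under the parameter values stated above, not the authors' simulation code, which is available at the OSF repository linked earlier.

```r
set.seed(1)
n_per_group <- 1000
mu_ers_g2   <- 1                                  # Group 2 ERS mean: -1, 0, or 1 depending on condition
a   <- c(1.5, 1.5)                                # slopes on the substantive and ERS dimensions
S   <- rbind(c(0, 1, 2, 3), c(1, 0, 0, 1))        # scoring matrix (Equations 4 and 5)
m   <- seq(-0.5, 0.5, length.out = 10)            # item-specific threshold shifts
thresholds <- lapply(m, function(shift) c(-1, 0, 1) + shift)

group <- rep(1:2, each = n_per_group)
theta <- cbind(rnorm(2 * n_per_group, 0, 1),                                # substantive trait
               rnorm(2 * n_per_group, ifelse(group == 2, mu_ers_g2, 0), 1)) # ERS trait

# Thresholds to intercepts (inverting Equation 3): c_1 = 0, c_{g+1} = c_g - a1 * t_g.
to_intercepts <- function(t, a1) cumsum(c(0, -a1 * t))

responses <- sapply(thresholds, function(t) {
  c_int  <- to_intercepts(t, a[1])
  kernel <- exp(theta %*% (a * S) + matrix(c_int, nrow(theta), 4, byrow = TRUE))
  probs  <- kernel / rowSums(kernel)
  apply(probs, 1, function(p) sample(1:4, size = 1, prob = p))
})
dim(responses)  # 2,000 persons by 10 items
```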
Item Response Tree
For the IRTree model, person parameters were generated identically to the MNRM
model. Item parameters were obtained by generating data under an MNRM with
500,000 participants per group, item parameters as described in the MNRM section
and applying an IRTree model to these data to obtain IRTree item parameters esti-
mates. These estimates were used as the true values for the IRTree item parameters
when generating data to ensure maximum comparability between the IRTree and
MNRM models. The same 12 conditions and the same outcome measures were used
as for the MNRM.
Control
In the control condition, no response style was present. To generate data with no
response style present, the MNRM was used, with every participant scoring a 0 on
the ERS dimension, equivalent to setting the item slope to 0 for the ERS dimension.
In this case, the MNRM simplifies to a GPCM (Falk & Cai, 2016). Other item para-
meters were the same as in the MNRM. In the control condition, we varied the num-
ber of items in the test and which threshold was used, resulting in 2 × 2 = 4 conditions
with the GPCM as the generating model. Outcome measures were the same as in the
MNRM condition.
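To make this reduction explicit: with the ERS slope fixed to 0 (equivalently, with every participant at θ2 = 0), the ERS row of the scoring matrix drops out of Equation 2, and the category probabilities reduce to

\[
P(Y_i = k \mid \theta_1) = \frac{\exp\left(a_{i1}(k-1)\theta_1 + c_{ik}\right)}{\sum_{j=1}^{K}\exp\left(a_{i1}(j-1)\theta_1 + c_{ij}\right)},
\]

where $(k-1)$ is the substantive scoring of Equation 4; this is a generalized partial credit model.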
Results
Results for the control condition with the GPCM as the data-generating model are
displayed in Table 2. As the data were generated under the GPCM, no response style
is present.
As can be seen in Table 2, little bias occurs, regardless of which model is applied
to the GPCM data. As there is almost no bias in any condition, the effect of factors is
difficult to discern and is of little practical significance if present at all. Results seem
to indicate that applying the MNRM or the IRTree model to data that has no response
style present seems to have a negligible impact on the estimated substantive trait
mean and variance. Table 3 displays the results when the MNRM is the data-
generating model.
Results for the MNRM-generated data show substantially more bias than the
GPCM-generated data. Several interesting trends emerge in these results. To start,
the MNRM has no problems recovering the substantive trait mean and variance when
it is the data-generating model. Both the GPCM and the IRTree do run into problems
estimating these outcomes when data are generated under the MNRM, but only when
the groups differ in ERS.
When the difference between the two groups' mean ERS is -1, the variance in the
substantive trait is underestimated by both the IRTree and GPCM, although for the
GPCM underestimation is more severe than for the IRTree. Problems estimating the
substantive trait mean only occur when an item threshold shift is introduced, creating
asymmetry of the item thresholds around the mean participant substantive trait level.
In this case, both the GPCM and IRTree overestimate the substantive trait mean,
although the GPCM again does so more severely than the IRTree model.
As one may expect, these trends are reversed when the difference between the two
groups' ERS means is 1. In this case, both the IRTree and the GPCM overestimate the substantive trait variance when an ERS mean difference between groups is present and underestimate the substantive trait mean when an ERS mean difference between groups and item threshold shifts are present. In essence, the IRTree model seems to correct
too little for the ERS present in the data when the MNRM is the generating model,
reducing the bias compared with ignoring the response style completely but not elim-
inating it. Note that the extent to which the models show bias in both the substantive
trait variance and the mean increases when the ERS mean in Group 2 is 1 compared
to when the ERS mean in Group 2 was -1. This difference is small but noticeable
for the IRTree but quite substantial in size for the GPCM.
Table 4 shows the same conditions as Table 3, but this time with the IRTree gener-
ating data. When the IRTree model generates data, it does not have problems estimat-
ing the substantive trait mean and variance in any condition. Just as before, the two
models that did not generate data run into problems estimating substantive trait var-
iance when the mean ERS difference between the two groups is not zero. This lack of
group-level bias in the absence of ERS mean differences between the groups may
Table 2. Results for the Control Condition.
                                       MNRM                 IRTree               GPCM
Factors   N items   t              μθ bias   σ² bias    μθ bias   σ² bias    μθ bias   σ² bias
GPCM      10        [-1, 0, 1]     -0.002     0.005     -0.002     0.006     -0.002     0.006
                    [0, 1, 2]       0.003     0.017      0.004     0.014      0.004     0.017
          20        [-1, 0, 1]      0.003     0.007      0.002     0.008      0.002     0.006
                    [0, 1, 2]       0.001     0.006      0.002     0.004      0.002     0.005

Note. μθ bias refers to bias in the substantive trait mean in Group 2, σ² bias refers to bias in the substantive trait variance in Group 2, N items refers to the number of items in the condition, and t refers to the mean item thresholds. MNRM = multidimensional nominal response model; IRTree = item response tree; GPCM = generalized partial credit model.
Table 3. Results for the MNRM Condition.
                                               MNRM                 IRTree               GPCM
Factors   ΔERS   N items   t              μθ bias   σ² bias    μθ bias   σ² bias    μθ bias   σ² bias
MNRM      -1     10        [-1, 0, 1]      0.000     0.015     -0.001    -0.238     -0.001    -0.585
                           [0, 1, 2]       0.000     0.013      0.147    -0.250      0.324    -0.541
                 20        [-1, 0, 1]     -0.003     0.006     -0.005    -0.268     -0.004    -0.577
                           [0, 1, 2]      -0.005     0.022      0.145    -0.265      0.317    -0.521
           0     10        [-1, 0, 1]      0.002     0.008      0.000     0.007      0.000     0.005
                           [0, 1, 2]       0.001     0.015      0.001     0.014      0.002     0.010
                 20        [-1, 0, 1]     -0.001     0.022     -0.006     0.015     -0.003     0.009
                           [0, 1, 2]       0.001     0.022     -0.001     0.016     -0.001     0.011
           1     10        [-1, 0, 1]      0.003    -0.002      0.002     0.359      0.003     1.414
                           [0, 1, 2]      -0.002     0.019     -0.192     0.401     -0.507     1.251
                 20        [-1, 0, 1]      0.001     0.040     -0.005     0.419     -0.006     1.330
                           [0, 1, 2]       0.005     0.036     -0.196     0.420     -0.471     1.108

Note. μθ bias refers to bias in the substantive trait mean in Group 2, σ² bias refers to bias in the substantive trait variance in Group 2, ΔERS refers to the difference in the ERS mean between Group 1 (constant ERS mean at 0) and Group 2 (-1, 0, or 1 ERS mean), N items refers to the number of items in the condition, and t refers to the mean item thresholds. Values substantially differing from zero are marked in bold. MNRM = multidimensional nominal response model; IRTree = item response tree; GPCM = generalized partial credit model; ERS = extreme response style.
Table 4. Results for the IRTree Condition.
                                               MNRM                 IRTree               GPCM
Factors   ΔERS   N items   t              μθ bias   σ² bias    μθ bias   σ² bias    μθ bias   σ² bias
IRTree    -1     10        [-1, 0, 1]      0.003     0.354      0.001     0.012      0.002    -0.491
                           [0, 1, 2]      -0.185     0.372     -0.004     0.005      0.259    -0.464
                 20        [-1, 0, 1]     -0.003     0.386     -0.004     0.011     -0.002    -0.474
                           [0, 1, 2]      -0.197     0.401     -0.010     0.008      0.248    -0.436
           0     10        [-1, 0, 1]      0.001     0.017      0.000     0.015      0.002     0.015
                           [0, 1, 2]       0.000     0.013     -0.001     0.011     -0.001     0.012
                 20        [-1, 0, 1]      0.002     0.020     -0.002     0.012      0.001     0.009
                           [0, 1, 2]       0.001     0.018     -0.002     0.012     -0.004     0.015
           1     10        [-1, 0, 1]     -0.003    -0.271     -0.002     0.008     -0.001     0.899
                           [0, 1, 2]       0.182    -0.295      0.001     0.008     -0.343     0.847
                 20        [-1, 0, 1]     -0.002    -0.276      0.000     0.008     -0.001     0.799
                           [0, 1, 2]       0.199    -0.285     -0.001     0.010     -0.323     0.744

Note. Notation is as described above for the MNRM table. IRTree = item response tree; MNRM = multidimensional nominal response model; GPCM = generalized partial credit model.
lead one to believe that using a model to estimate the substantive trait other than the
data-generating model does not lead to any substantive trait bias if there is no ERS
mean difference between the groups. While this is true at the group level, individual-
level results show bias occurring at the individual level even when groups do not dif-
fer in mean ERS (see Supplemental Appendices C and F).
Although the IRTree and GPCM showed bias in the same direction when the
MNRM generated the data, this is not the case for the MNRM and GPCM when the IRTree generated the data. When the ERS difference between the two groups is -1,
the MNRM overestimates the substantive trait variance, while the GPCM underesti-
mates it. After an item threshold shift is added, the MNRM underestimates the sub-
stantive trait mean, while the GPCM overestimates it. In both cases, the GPCM
shows more bias in an absolute sense than the MNRM. Interestingly, the IRTree
model appears to have less bias when estimating MNRM data than the MNRM has
when estimating on IRTree data in these conditions. The reverse is true for the
GPCM, which has less bias on MNRM data than on IRTree data.
When the ERS difference between the two groups is 1, the trends are again
reversed. Now, the MNRM underestimates the substantive trait variance, while the
GPCM overestimates it. After the threshold shift is introduced, the MNRM overesti-
mates the substantive trait mean, while the GPCM underestimates it. Again, the
MNRM has less bias in an absolute sense than the GPCM. While the GPCM suffers
more bias when the difference between the groups’ mean ERS is positive than when
it was negative, this is not unequivocally the case for the MNRM. In fact, the bias for
the substantive trait variance seems to have decreased somewhat when the ERS mean
difference between groups changed from negative to positive. The advantage the
IRTree showed in estimating on MNRM data compared with the MNRM estimating
on IRTree data in the negative mean ERS condition is also diminished, with the
IRTree showing less bias for 20 items but more for 10 items. Essentially, the MNRM
appears to overcorrect for the ERS present in the IRTree-generated data, reducing the
bias somewhat compared with ignoring the response style but overshooting the mark
and switching the sign of the bias in the process.
Summarizing, the importance of correcting for ERS using the true model becomes
clear from the results. Ignoring the ERS present in the data by using a GPCM to esti-
mate the substantive trait mean and variance leads to notable bias in the substantive
trait variance when groups differ in mean ERS and leads to bias in the substantive
trait mean when groups differ in ERS and asymmetrical thresholds are present.
While the IRTree and the MNRM both do well in correcting ERS when they are the
generating and estimating model, group differences in mean ERS and item threshold
shifts again lead to bias when data are generated by the other model. Specifically,
the IRTree model undercorrects for ERS in these conditions when the MNRM gener-
ates the data, reducing the bias in magnitude but failing to eliminate it. On the other
hand, the MNRM overcorrects for ERS in these conditions when data is generated
with an IRTree model, reducing the bias in magnitude but switching the sign of the
bias. Thus, neither model achieves its goal of eliminating the effect of ERS on trait inference given group differences in mean ERS and item threshold shifts. As the
results show the importance of selecting the right model to correct for ERS, tools for
selecting the right model become very relevant. For this reason, an exploratory inves-
tigation into the use of model fit indices to select the right model for estimation is
detailed in Supplemental Appendix D. While the results of this investigation natu-
rally depend on the conditions considered, the use of the model fit indices for this
purpose seems promising under the conditions studied in this paper.
Empirical Example
To show that the differences between models found in the simulation study can also
be found in real data, an empirical example is provided. To make this empirical
example as similar as possible to the simulation study conducted in the paper, we
looked for a multigroup dataset with items with four categories. As the Programme
for International Student Assessment (PISA) is a well-known and publicly accessible
source for multigroup data, we chose to look for this type of questionnaire here. We
used the mathematics work ethic scale from PISA 2012 (Organisation for Economic
Co-Operation and Development, 2013). The scale has 9 four-category items. Some
examples of items in this scale were "I work hard on my mathematics homework" and "I pay attention in mathematics class." The response options were "Strongly disagree," "Disagree," "Agree," and "Strongly agree," and they were scored with higher scores indicating higher levels of mathematics work ethic.1 The mean test
score across all countries was 2.86, indicating that the test scores are left skewed,
which is important in illustrating the differences between the ERS models.
For the purposes of this example, we chose to examine Costa Rica (N= 2,863) and
Malaysia (N= 3,389). These groups were preferred over other countries for several
reasons. First of all, the countries substantially differed in mean ERS estimated by the
MNRM (0.549 ERS mean difference between countries). Second, the countries were
a good illustration of possible differences in conclusion between the IRT models, as
models reached different conclusions regarding the substantive trait mean difference
between Costa Rica and Malaysia. It is important to note here that we merely chose
this example to show that it is possible to obtain different conclusions when utilizing
different models; we do not claim that this will always (or often) happen in practice.
As a first step in the analysis, we estimated the GPCM, MNRM, and IRTree mod-
els with the mean and variance of the latent variables in Costa Rica fixed to 0 and 1
for identification purposes. The average item slopes under the MNRM were 1.235 for
the substantive trait and 2.547 for the ERS trait, with average item thresholds equal to
-3.263, -1.467, and 0.603, indicating asymmetric average item thresholds. Note that the ERS loading on the items and the asymmetry in item thresholds are stronger in
the empirical example than in the simulation study, while the ERS mean difference
between groups (0.549) is weaker than the ERS differences simulated.
Group-level results of Malaysia are displayed in Table 5. As can be seen in the
table, conclusions regarding the substantive trait difference between Costa Ricans
and Malaysians differ depending on the model used. Under the GPCM, the
Malaysian group shows a significantly lower mean mathematics work ethic com-
pared with the group from Costa Rica (95% confidence interval [95% CI]: [-0.417, -0.307]). When modeling ERS using the IRTree, the difference in means between the countries decreases (95% CI: [-0.214, -0.077]) but remains significant. Under the MNRM, the countries do not differ significantly in mathematics work ethic (95% CI: [-0.107, 0.057]).
Note that the differences in the estimated substantive trait mean between the models are of similar size to the differences found between models in the simula-
tion study. It thus seems the increased asymmetry in item thresholds in combination
with the higher ERS loading on items offsets the lower ERS mean difference between
groups in the empirical example. A final noteworthy result is that the MNRM correc-
tion for ERS also appears to be stronger here than the IRTree correction for ERS,
which was also found in the simulation study. Overall, the results of the empirical
example indicate substantive conclusions can differ not only depending on whether a
correction for ERS is used but also which correction for ERS is used.
Table 6 contains the fit indices for each model. These results indicate that the
GPCM is not the data-generating model, given the detected ERS by both models and
the improved fit of both ERS models over the GPCM. Furthermore, the IRTree seems
to exhibit the best fit. While one could choose to prefer the IRTree model on this
basis, it may be more insightful to think about the conceptualization of ERS under
Table 5. Estimated Group Parameters in the Focal Group for the Various Models.
Model     μθ             μERS           σ²θ            σθ,ERS         σ²ERS
GPCM      -0.36 (0.03)   NA             0.81 (0.04)    NA             NA
MNRM      -0.03 (0.04)   0.55 (0.04)    1.67 (0.12)    0.01 (0.04)    1.08 (0.08)
IRTree    -0.15 (0.04)   0.52 (0.04)    1.35 (0.08)    0.09 (0.04)    1.11 (0.08)

Note. μθ denotes the substantive trait mean, μERS denotes the mean ERS, σ²θ is the substantive trait variance, σθ,ERS is the covariance between the substantive trait and ERS, and σ²ERS is the ERS trait variance. GPCM = generalized partial credit model; MNRM = multidimensional nominal response model; IRTree = item response tree; ERS = extreme response style.
Table 6. Model Fit Indices for the Various Models.
Model     LogLik      Parameters   AIC          BIC          SABIC        HQ
GPCM      -51,014.0   38           102,104.0    102,360.2    102,239.4    102,192.8
MNRM      -47,846.6   51            95,795.3     96,139.1     95,977.0     95,915.4
IRTree    -47,783.9   60            95,687.8     96,092.2     95,901.5     95,827.9

Note. LogLik = log-likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion; SABIC = sample-size adjusted BIC; HQ = Hannan-Quinn information criterion; GPCM = generalized partial credit model; MNRM = multidimensional nominal response model; IRTree = item response tree.
different models when choosing which model to prefer as we elaborate on further in
the discussion.
Discussion
The present study set out to compare two widely used and flexible IRT models in
their modeling of ERS under a variety of conditions. First, conceptual differences
between the two models were compared. Second, the practical implications of these
differences were examined by means of a multigroup simulation study with the bias
of the group mean and variance as outcome measures. The results will be discussed
in this order.
Conceptually, the IRTree and MNRM models appear to be very different in their
modeling of ERS. Beyond the obvious differences in the IRTree modeling the
response process as a multistep process and the MNRM using a divide-by-all
approach, two major less obvious differences between the models were found. First
of all, it was revealed that under the MNRM, the extent of ERS influences the prob-
ability of agreeing with an item. This effect is particularly noticeable when item
thresholds are not symmetric around the participant’s substantive trait level. In the
IRTree model, the probability of agreeing (responding with 3 or 4) with the item is
independent of ERS. As ERS is often conceptualized as influencing only the prob-
ability of an extreme response, not the probability of agreeing with an item, the
IRTree model seems to fit this conceptualization better. However, under the IRTree
model, the probability of agreeing with an item given an extreme response (i.e.,
responding with 1 vs. 4 on a four-category item) depends on the extent of ERS pres-
ent. The same holds for the probability of agreeing with an item given a non-extreme
response (i.e., responding with 2 vs. 3 on a four-category item). In contrast, under
the MNRM these probabilities are independent of ERS. While both models thus
technically have a property the other model does not have, we find it difficult to
imagine a situation where the property of agreeing with an item given an extreme
response is of primary concern over the unconditional probability of agreeing with
an item. Of course, it is up to researchers to consider their conceptualization of ERS
and which of these two properties they value more before deciding which model to
use. If both properties are deemed to be of importance, other models not presented in
this paper may be of interest, although these models may have other shortcomings in
modeling ERS.
The practical impact of the differences between the two models was examined
using a multigroup simulation study. Results indicate that completely ignoring the
response style by using a GPCM leads to substantial bias in the estimated variance
of Group 2 when groups differ in their mean ERS levels. If thresholds are not sym-
metric around the average substantive trait level and groups differ in mean ERS, the
estimation of the mean substantive trait of Group 2 is also biased.
These results are somewhat surprising, as some previous research on the conse-
quences of ignoring ERS found minimal effects on individual trait estimates,
reliability, and the correlation between substantive traits as long as the response style
and the substantive trait are uncorrelated (Plieninger, 2017; Wetzel et al., 2016). The
fact that an effect of ERS is found here but not in previous research may be due to
several reasons. First, the current article considers a multigroup context, whereas
both Plieninger (2017) and Wetzel et al. (2016) considered only a single group.
Second, the current article examines items with locations not centered around the
mean of the substantive trait. This is an important scenario to consider, as many psy-
chological questionnaires do not have symmetric thresholds (i.e., items have
expected values not exactly at the middle of the scale; consequently, the mean of the
test is not exactly in the middle of the theoretical test range). As one example of this,
Big Five personality scores are often found to be quite far removed from the middle
point of the scale (Soliño & Farizo, 2014; Weisberg et al., 2011). Another example
of a test likely to be severely skewed is mental health tests such as the Beck depres-
sion inventory, especially when applied in a nonclinical population (Beck, 1961;
Gorenstein et al., 2005). If groups with a different mean ERS were to be compared
on their personality or mental health status using the wrong ERS model (or no ERS
model at all), we could thus expect results to be biased both in mean and variance.
A second main result is that using an MNRM to estimate the substantive trait mean
and variance of IRTree-generated data, or using an IRTree model to estimate the sub-
stantive trait mean and variance of MNRM-generated data, runs into problems in the
same conditions as when the GPCM is used for estimation. While the bias resulting
from using the "wrong" ERS model is smaller than the bias that results from ignoring
the response style altogether, both the substantive trait variance and the substantive trait
mean can be substantially biased if the model used for estimation is not the data-
generating model. The size of the bias that can be expected when using the "wrong"
model to correct for ERS, or when ignoring ERS completely, is mainly based on two
factors. First, the presence of an ERS mean difference between the groups leads to dif-
ferences between the IRTree and MNRM model in the substantive trait variance. While
the present study only formally examined ERS mean differences of -1, 0, and 1, larger
ERS mean differences between the groups are expected to lead to larger differences
between models. Second, the combination of ERS mean differences between groups
and asymmetric item thresholds leads to differences between the models in both the
substantive trait mean and variance. While we again only formally examined two levels
of item threshold asymmetry, we expect that the differences between the models will
increase in size as item threshold asymmetry increases. One valuable insight here may
be that asymmetric item thresholds necessarily lead to skewed test scores (i.e., mean test
scores that are not exactly in the middle of the theoretical range of test scores). We thus
recommend researchers who are interested in the possible bias resulting from applying
the wrong model to the data to check the skewness of the test scores empirically.
A second note when comparing the models is that the IRTree model seems to
undercorrect for ERS when the data-generating model is the MNRM, while the
MNRM overcorrects for ERS when the data-generating model is the IRTree model.
This is caused by the MNRM generating data in which the extent of ERS influences the probability of agreement, which the IRTree cannot model, and by the IRTree generating data in which the probability of agreement given an extreme response depends on the extent of ERS, which the MNRM cannot model. These results point to the impor-
tance of choosing a model to estimate the substantive trait that is compatible with
the model that created the data.
While the importance of picking the right model is clear, it is less clear how this can
best be achieved. From one perspective, the difference between models presented here
can be seen as a fundamental difference between the conceptualization of ERS between
the MNRM and IRTree models. In the IRTree model, the ERS trait only becomes rele-
vant after an initial decision between agreeing or not agreeing with the item is made in
Node 1, while the MNRM assumes no such steps in the response process. From this per-
spective, it follows that the choice between models should be based on conceptual views
of how ERS should function. Note that these conceptual views on ERS are not limited to
choices between the models but also include choices on how to specify the models. For
example, the scoring matrix in the MNRM used in this simulation study assumes sym-
metry in the ERS effect, which does not need to be the case, and other scoring matrices
with asymmetric ERS effects could be specified. Similarly, the constraint of equal ERS
loadings across Nodes 2 and 3 in the IRTree model assumes symmetry in the ERS
effect across nodes and can potentially be relaxed by imposing alternative constraints.
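As a brief illustration of such specification choices, the sketch below (base R, hypothetical values) contrasts a symmetric with an asymmetric ERS scoring matrix for the MNRM and an equal-loading with an unequal-loading extremity specification for the IRTree model; the object names and numbers are illustrative only and do not reproduce the specifications used in the simulation study.

# MNRM: rows give the scoring functions for a four-category item.
# The ERS row need not weight both extreme categories equally.
scoring_symmetric  <- rbind(substantive = c(0, 1, 2, 3),
                            ERS         = c(1, 0, 0, 1))
scoring_asymmetric <- rbind(substantive = c(0, 1, 2, 3),
                            ERS         = c(1.5, 0, 0, 0.5))  # stronger pull toward "strongly disagree"

# IRTree: the equal-loading constraint ties the ERS discriminations of
# Nodes 2 and 3 together; relaxing it lets ERS act more strongly on one
# side of the scale than on the other.
loadings_constrained   <- c(node1_theta = 1.5, node2_ERS = 1.5, node3_ERS = 1.5)
loadings_unconstrained <- c(node1_theta = 1.5, node2_ERS = 2.0, node3_ERS = 1.0)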
A second perspective on choosing the right model is that the model that empiri-
cally has the best fit to the dataset the researcher works with should be preferred. For
this reason, an exploratory analysis of using fit indices for model selection was con-
ducted (see Supplemental Appendix D). Results indicate that the use of the Akaike
information criterion or log-likelihood (due to the low cost of preferring a more com-
plex model over the GPCM), Bayesian information criterion (BIC), sample-size
adjusted BIC, or Hannan-Quinn information criterion for model selection is promis-
ing. One avenue of further research could be the development of a tool that examines
model fit based on the fundamental conceptual differences between the two models
outlined earlier, rather than general model fit. Beyond picking the best model of the
two models presented here, one may also question whether other models that are not
presented here, or perhaps even models that do not yet exist, created the data they
are currently examining. Future research would do well to further compare existing
models, both conceptually and practically, test if current fit indices can be used to
select the right models when other models are compared and develop new models
with alternative conceptualizations of the response process. Finally, future research
could work on making IRTree models more accessible for the applied researcher, as
setting up a properly specified IRTree model is currently not the easiest of tasks.
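As a reference for the fit indices discussed above, the following base R sketch computes them from a fitted model's log-likelihood, number of estimated parameters, and sample size. The log-likelihoods and parameter counts shown are hypothetical; in practice these values would be taken from the fitted GPCM, MNRM, and IRTree models.

# Standard information criteria; lower values indicate a better trade-off
# between fit and complexity.
info_criteria <- function(logL, k, n) {
  c(AIC   = -2 * logL + 2 * k,
    BIC   = -2 * logL + k * log(n),
    SABIC = -2 * logL + k * log((n + 2) / 24),   # sample-size adjusted BIC
    HQC   = -2 * logL + 2 * k * log(log(n)))
}

# Hypothetical log-likelihoods and parameter counts for three candidate models
# fitted to the same data set of n = 1,000 respondents.
rbind(GPCM   = info_criteria(logL = -10450, k = 40, n = 1000),
      MNRM   = info_criteria(logL = -10310, k = 46, n = 1000),
      IRTree = info_criteria(logL = -10335, k = 46, n = 1000))
# The model with the lowest values would be preferred for estimating the
# substantive trait.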
Despite the current findings, this article also has some limitations. First of all, the
paper is limited to ERS and only compares two models. As indicated earlier, the litera-
ture on ERS models alone is quite substantial, and more conceptual and practical differ-
ences may be uncovered by studying other models more in detail. On top of this, ERS is
only one of many response styles. As large conceptual differences between ERS models
were uncovered in this paper, it would not be surprising to see similar differences for
other response styles such as midpoint responding, acquiescent responding, and so on.
For these reasons, future research should aim to expand the framework presented here to
other models and response styles. As a second limitation, the current article is based on
a simulation study, which is naturally limited in how many conditions can be examined
and how realistic the conditions are compared to real data. For example, all item slopes
were fixed to 1.5 in this study, and all items had equidistant thresholds, which is unlikely
to occur in real data. Future research should examine if these results are the same when
other parameters are used. Third, only items with four categories were included in this
article. It is thus not guaranteed that results generalize to items with more categories,
especially if a middle category is also present. Future studies should examine the effects
discussed here for items with differing numbers of categories and with a middle category.
Finally, the substantive trait mean was not varied between groups, making it impossible
to infer what happens when groups are not identical in their substantive trait.
From this article, several practical recommendations can be made. First of all, a
researcher should take great care in considering which model to use when modeling
ERS. In this consideration, a conceptual underpinning of the expected response pro-
cess and the effects of ERS should take center stage. In addition, model fit indices
for the various models should be consulted to pick a model for estimation that is as
close to the data-generating model as possible.
Second, the importance of considering ERS in a context where groups may differ
in their ERS propensity is illustrated. Using a GPCM on data that contains ERS can
result in substantial bias in the substantive trait mean and variance. While using the
"wrong" ERS model on the data does not completely resolve these issues, it at least
seems to reduce the bias. While further research is needed to confirm that this holds
for conditions and models not considered in this study, the preliminary results presented
here suggest that, when ERS is present, using an ERS model that is not the data-generating
model is preferable to not acknowledging the presence of ERS at all.
Overall, this article reveals that the MNRM and IRTree models cannot be used
interchangeably to correct for ERS. Conceptual differences between the models were
examined, and the practical impact of these differences on both the group and the
individual level was illustrated using a simulation study. Researchers would do well
to consider these differences between the models and their impact on future research
when attempting to correct for ERS.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of
this article.
ORCID iD
Martijn Schoenmakers https://orcid.org/0000-0003-3338-3565
Supplemental Material
Supplemental material for this article is available online.
Note
1. Note that in the original datafile, items were coded as 1 (Strongly agree), 2 (Agree), 3
(Disagree), and 4 (Strongly disagree). Items were recoded for the purposes of this analysis
so that the latent variable can be interpreted as mathematics work ethic rather than a lack
of mathematics work ethic.
References
Austin, E. J., Deary, I. J., & Egan, V. (2006). Individual differences in response scale use:
Mixed Rasch modelling of responses to NEO-FFI items. Personality and Individual
Differences,40(6), 1235–1245. https://doi.org/10.1016/j.paid.2005.10.018
Batchelor, J. H., & Miao, C. (2016). Extreme response style: A meta-analysis. Journal of
Organizational Psychology,16(2), 51–62.
Beck, A. T. (1961). An inventory for measuring depression. Archives of General Psychiatry,
4(6), 561–571. https://doi.org/10.1001/archpsyc.1961.01710120031004
Böckenholt, U. (2012). Modeling multiple response processes in judgment and choice.
Psychological Methods,17, 665–678. https://doi.org/10.1037/a0028111
Böckenholt, U., & Meiser, T. (2017). Response style analysis with threshold and multi-process
IRT models: A review and tutorial. British Journal of Mathematical and Statistical
Psychology,70, 159–181. https://doi.org/10.1111/bmsp.12086
Bolt, D. M., & Johnson, T. R. (2009). Addressing score bias and differential item functioning
due to individual differences in response style. Applied Psychological Measurement,33(5),
335–352. https://doi.org/10.1177/0146621608329891
Bolt, D. M., Lu, Y., & Kim, J.-S. (2014). Measurement and control of response styles using
anchoring vignettes: A model-based approach. Psychological Methods,19(4), 528–541.
https://doi.org/10.1037/met0000016
Bolt, D. M., & Newton, J. R. (2011). Multiscale measurement of extreme response style.
Educational and Psychological Measurement,71(5), 814–833. https://doi.org/10.1177/
0013164410388411
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R
environment. Journal of Statistical Software,48, 1–29. https://doi.org/10.18637/jss.v048.i06
Cho, Y. (2013). The mixture distribution polytomous Rasch model used to account for response
styles on rating scales: A simulation study of parameter recovery and classification
accuracy [Doctoral dissertation, University of Maryland].
Chun, K.-T., Campbell, J. B., & Yoo, J. H. (1974). Extreme response style in cross-cultural
research: A reminder. Journal of Cross-Cultural Psychology,5(4), 465–480. https://doi.org/
10.1177/002202217400500407
Clarke, I. (2001). Extreme response style in cross-cultural research. International Marketing
Review,18(3), 301–324. https://doi.org/10.1108/02651330110396488
Falk, C. F., & Cai, L. (2016). A flexible full-information approach to the modeling of response
styles. Psychological Methods,21(3), 328–347. https://doi.org/10.1037/met0000059
Gorenstein, C., Andrade, L., Zanolo, E., & Artes, R. (2005). Expression of depressive
symptoms in a nonclinical Brazilian adolescent sample. The Canadian Journal of
Psychiatry,50(3), 129–136. https://doi.org/10.1177/070674370505000301
Greenleaf, E. A. (1992). Measuring extreme response style. Public Opinion Quarterly,56(3),
328–351. https://doi.org/10.1086/269326
Huang, H.-Y. (2016). Mixture random-effect IRT models for controlling extreme response
style on rating scales. Frontiers in Psychology,7, Article 1706. https://www.frontiersin
.org/article/10.3389/fpsyg.2016.01706
Hui, C. H., & Triandis, H. C. (1989). Effects of culture and response format on extreme
response style. Journal of Cross-Cultural Psychology,20(3), 296–309. https://doi.org/10
.1177/0022022189203004
Iwawaki, S., & Zax, M. (1969). Personality dimensions and extreme response tendency.
Psychological Reports,25(1), 31–34. https://doi.org/10.2466/pr0.1969.25.1.31
Javaras, K. N., & Ripley, B. D. (2007). An "unfolding" latent variable model for Likert
attitude data: Drawing inferences adjusted for response style. Journal of the American
Statistical Association,102(478), 454–463. https://doi.org/10.1198/016214506000000960
Jin, K.-Y., & Wang, W.-C. (2014). Generalized IRT models for extreme response style.
Educational and Psychological Measurement,74(1), 116–138. https://doi.org/10.1177/
0013164413498876
Johnson, T. R. (2003). On the use of heterogeneous thresholds ordinal regression models to
account for individual differences in response style. Psychometrika,68(4), 563–583. https://
doi.org/10.1007/BF02295612
Johnson, T. R., Kulesa, P., Cho, Y. I., & Shavitt, S. (2005). The relation between culture and
response styles: Evidence from 19 countries. Journal of Cross-Cultural Psychology,36(2),
264–277. https://doi.org/10.1177/0022022104272905
Lau, M. Y. (2007). Extreme response style: An empirical investigation of the effects of scale
response format and fatigue [Doctoral dissertation, University of Notre Dame].
Leventhal, B. C. (2019). Extreme response style: A simulation study comparison of three
multidimensional item response models. Applied Psychological Measurement,43(4),
322–335. https://doi.org/10.1177/0146621618789392
Meiser, T., Plieninger, H., & Henninger, M. (2019). IRTree models with ordinal and
multidimensional decision nodes for response styles and trait-based rating responses.
British Journal of Mathematical and Statistical Psychology,72, 501–516. https://doi.org/
10.1111/bmsp.12158
Moors, G. (2012). The effect of response style bias on the measurement of transformational,
transactional, and laissez-faire leadership. European Journal of Work and Organizational
Psychology,21(2), 271–298. https://doi.org/10.1080/1359432X.2010.550680
Moors, G., Kieruj, N. D., & Vermunt, J. K. (2014). The effect of labeling and numbering of
response scales on the likelihood of response bias. Sociological Methodology,44(1),
369–399. https://doi.org/10.1177/0081175013516114
Morren, M., Gelissen, J. P. T. M., & Vermunt, J. K. (2011). Dealing with extreme response
style in cross-cultural research: A restricted latent class factor analysis approach.
Sociological Methodology,41(1), 13–47. https://doi.org/10.1111/j.1467-9531.2011.01238.x
Nemoto, T., & Beglar, D. (2014). Developing Likert-scale questionnaires. In N. Sonda &
A. Krause (Eds.), JALT2013 conference proceedings. JALT.
Organisation for Economic Co-operation and Development. (2013). PISA 2012 assessment
and analytical framework: Mathematics, reading, science, problem solving and financial
literacy. https://doi.org/10.1787/9789264190511-en
Plieninger, H. (2017). Mountain or molehill? A simulation study on the impact of response
styles. Educational and Psychological Measurement,77(1), 32–53. https://doi.org/10.1177/
0013164416636655
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation
for Statistical Computing.
Rost, J. (1991). A logistic mixture distribution model for polychotomous item responses.
British Journal of Mathematical and Statistical Psychology,44(1), 75–92. https://doi.org/
10.1111/j.2044-8317.1991.tb00951.x
Sen, S., & Cohen, A. S. (2019). Applications of mixture IRT models: A literature review.
Measurement: Interdisciplinary Research and Perspectives,17(4), 177–191. https://doi
.org/10.1080/15366367.2019.1583506
Soliño, M., & Farizo, B. (2014). Personal traits underlying environmental preferences: A
discrete choice experiment. PLOS ONE,9, Article e89603. https://doi.org/10.1371/journal
.pone.0089603
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and
factor analysis of discretized variables. Psychometrika,52(3), 393–408. https://doi.org/10.
1007/BF02294363
Thissen-Roe, A., & Thissen, D. (2013). A two-decision model for responses to Likert-type
items. Journal of Educational and Behavioral Statistics,38(5), 522–547. https://doi.org/10
.3102/1076998613481500
Van Vaerenbergh, Y., & Thomas, T. D. (2013). Response styles in survey research: A
literature review of antecedents, consequences, and remedies. International Journal of
Public Opinion Research,25(2), 195–217. https://doi.org/10.1093/ijpor/eds021
Weijters, B., Millet, K., & Cabooter, E. (2021). Extremity in horizontal and vertical Likert
scale format responses. Some evidence on how visual distance between response categories
influences extreme responding. International Journal of Research in Marketing,38(1),
85–103. https://doi.org/10.1016/j.ijresmar.2020.04.002
Weisberg, Y. J., DeYoung, C. G., & Hirsh, J. B. (2011). Gender differences in personality
across the ten aspects of the Big Five. Frontiers in Psychology,2, Article 178. https://doi.
org/10.3389/fpsyg.2011.00178
Wetzel, E., Böhnke, J. R., & Rose, N. (2016). A simulation study on methods of correcting for
the effects of extreme response style. Educational and Psychological Measurement,76(2),
304–324. https://doi.org/10.1177/0013164415591848
Willits, F. K., Theodori, G. L., & Luloff, A. E. (2016). Another look at Likert scales. Journal
of Rural Social Sciences,31(3), 126–139.
Zhang, Y., & Wang, Y. (2020). Validity of three IRT models for measuring and controlling
extreme and midpoint response styles. Frontiers in Psychology,11, Article 271. https://doi.
org/10.3389/fpsyg.2020.00271