Article

Taking the Test Taker’s Perspective: Response Process and Test Motivation in Multidimensional Forced-Choice Versus Rating Scale Instruments

Abstract

The multidimensional forced-choice (MFC) format has been proposed as an alternative to the rating scale (RS) response format. However, it is unclear how changing the response format may affect the response process and test motivation of participants. In Study 1, we investigated the MFC response process using the think-aloud technique. In Study 2, we compared test motivation between the RS format and different versions of the MFC format (presenting 2, 3, 4, and 5 items simultaneously). The response process to MFC item blocks was similar to the RS response process but involved an additional step of weighing the items within a block against each other. The RS and MFC response format groups did not differ in their test motivation. Thus, from the test taker’s perspective, the MFC format is somewhat more demanding to respond to, but this does not appear to decrease test motivation.

... Aspects of item design include the number of scales or dimensions (many vs. few; Schulte et al., 2019), item block size (two to multiple items; Brown & Maydeu-Olivares, 2011), item context (single statement vs. contextualized; Shaffer & Postlethwaite, 2012), permutations within blocks, the keyed direction of items (positive, negative, or mixed; Bürkner et al., 2019; Walton et al., 2020), and the question of full or partial ranking given the binary response options "most like me" and "least like me" (Brown & Maydeu-Olivares, 2011). These aspects affect overall questionnaire length and the strain induced in test takers (Sass et al., 2020), the reliability and validity of trait score estimates (e.g., convergence with RS-based trait measures or external validation criteria; Walton et al., 2020), and ultimately the MFC questionnaire's potential to reduce susceptibility to SDR (in low-stakes assessments) and faking (in high-stakes assessments) (Cao & Drasgow, 2021). ...
... 30.5% (n = 166) of participants held a higher education entrance qualification, while 45.3% (n = 247) of the sample had a university degree (Bachelor, Master, PhD). This restriction was imposed to account for the higher cognitive demand expected for "solving" MFC questionnaires, as stated, for example, by Sass et al. (2020). Additionally, because the questionnaire is contextualized in organizational settings, prior work experience was required of all participants. ...
... Processing several binary comparisons in an MFC block is a more challenging task than the single-statement ratings known from rating scale response formats. Echoing an observation already made by Brown (2012) and Sass et al. (2020), many test takers in this study reported increased cognitive strain. Consequently, this should be taken into account when designing MFC questionnaires, especially with regard to overall questionnaire length, the item count within a block, and the complexity of the item content itself. ...
Preprint
Full-text available
The present paper features the adaptation of an existing Big Five questionnaire with a rating scale (RS) response format into a measure using a multidimensional forced-choice (MFC) response format. Rating scale response formats have been criticized for their proneness to intentional and unintentional response distortions. Multidimensional forced-choice response formats have been suggested as a solution to mitigate several types of response sets and response styles by design. The “Big Five Inventory of Personality in Occupational Situations” (B5PS) is a situation-based questionnaire designed for personnel selection and development purposes that would benefit from “fake-proof” response formats. MFC response formats require special effort during test construction and calibration, which is laid out here. Changing the response format has severe consequences for item design and scoring. An inherent issue with MFC formats derives from their inability to yield interpersonally comparable results from standard (sum) scoring. This issue can be solved with IRT-based calibration (TIRT) during test construction. Aspects of MFC item design and TIRT calibration are explored in this paper, and evidence on structural and construct validity is presented. Results support the feasibility of the concept and test construction process in a contextualized/situation-based item format.
... Whereas negations should be avoided in any questionnaire format, negatively keyed items might increase cognitive load in MFC questionnaires. In one study examining the response process to MFC items, participants sometimes reported difficulties in responding to blocks of mixed keyed items (Sass et al., 2020). ...
... The goal of our empirical study was to examine the differentiation of judgments in the MFC and the true-false format by evaluating the reliability and validity of person scores. The MFC format elicits relative judgments, as incorporated in choice models for ranking tasks (Brown, 2016a) and indicated by a think-aloud study (Sass et al., 2020). In contrast, single-stimulus formats should elicit absolute judgments. ...
... However, this might not happen in all cases. Participants sometimes report that multiple items describe them equally well or equally poorly, i.e., their utility is subjectively identical (Bartram & Brown, 2003; Sass et al., 2020). This could either foster deeper retrieval or facilitate random responding, thereby diminishing validity. ...
Article
Full-text available
The Thurstonian item response model (Thurstonian IRT model) allows deriving normative trait estimates from multidimensional forced-choice (MFC) data. In the MFC format, persons must rank-order items that measure different attributes according to how well the items describe them. This study evaluated the normativity of Thurstonian IRT trait estimates both in a simulation and empirically. The simulation investigated normativity and compared Thurstonian IRT trait estimates to those obtained using classical partially ipsative scoring, from dichotomous true-false (TF) data, and from rating scale data. The results showed that, with blocks of opposite-keyed items, Thurstonian IRT trait estimates were normative, in contrast to classical partially ipsative estimates. Unbalanced numbers of items per trait, few opposite-keyed items, positively correlated traits, or assessing fewer traits did not markedly decrease measurement precision. Measurement precision was lower than that of rating scale data. The empirical study investigated whether relative MFC responses provide a better differentiation of behaviors within persons than absolute TF responses. However, criterion validity was equal and construct validity (with constructs measured by rating scales) lower in MFC. Thus, Thurstonian IRT modeling of MFC data overcomes the drawbacks of classical scoring, but gains in validity may depend on eliminating common method biases from the comparison.
... At the same time, several studies have investigated different aspects of the model's validity. In a think-aloud task when completing FC items, respondents indeed reported making pairwise comparisons between all items of a triplet, that is, a block of three items (Sass, Frick, Reips, & Wetzel, 2018). Thus, one of the key components of T-IRT models, the transformation of rankings into pairwise comparisons, seems to describe respondents' underlying decision-making processes reasonably well. ...
... However, this study also draws attention to a limitation of this modeling technique: Only 76% of the participants reported having no difficulty in keeping the information for all statements in mind to appraise the utility of an item relative to the utilities of all other items in the block. Furthermore, as the Sass et al. (2018) study only used triplets, this limitation likely leads to severe violations of the model assumptions when blocks with more items are employed. ...
... While they are known to be problematic in single-stimulus items already, the issue becomes more severe in FC questionnaires due to the increased cognitive load they provoke: Assuming that the Law of Comparative Judgment describes the response processes in FC questionnaires reasonably well, a block of n items requires the respondent to make ñ = n(n − 1)/2 comparisons when answering a single block of items. Recall that only 76% of respondents reported having no difficulties in processing the three comparisons necessary for blocks with three items (Sass et al., 2018); but blocks with four items require participants to process six comparisons. The use of negatively keyed items can further increase this already high cognitive demand. ...
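To make the combinatorics concrete, here is a minimal Python sketch (our illustration, not code from the cited studies) of the number of pairwise comparisons implied by a full ranking of an n-item block:

```python
def comparisons_per_block(n: int) -> int:
    """A full ranking of an n-item block decomposes into n(n - 1)/2 binary comparisons."""
    return n * (n - 1) // 2

for n in (2, 3, 4, 5):
    print(f"{n}-item block -> {comparisons_per_block(n)} pairwise comparisons")
# 2-item block -> 1 pairwise comparisons
# 3-item block -> 3 pairwise comparisons
# 4-item block -> 6 pairwise comparisons
# 5-item block -> 10 pairwise comparisons
```

The quadratic growth is the point: moving from triplets to quads doubles the number of comparisons a respondent must process per block.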
Article
Full-text available
Forced-choice questionnaires have been proposed to avoid common response biases typically associated with rating scale questionnaires. To overcome ipsativity issues of trait scores obtained from classical scoring approaches of forced-choice items, advanced methods from item response theory (IRT) such as the Thurstonian IRT model have been proposed. For convenient model specification, we introduce the thurstonianIRT R package, which uses Mplus, lavaan, and Stan for model estimation. Based on practical considerations, we establish that items within one block need to be equally keyed to achieve similar social desirability, which is essential for creating forced-choice questionnaires that have the potential to resist faking intentions. According to extensive simulations, measuring up to five traits using blocks of only equally keyed items does not yield sufficiently accurate trait scores and inter-trait correlation estimates, for either frequentist or Bayesian estimation methods. As a result, persons’ trait scores remain partially ipsative and, thus, do not allow for valid comparisons between persons. However, we demonstrate that trait scores based on only equally keyed blocks can be improved substantially by measuring a sizable number of traits. More specifically, in our simulations of 30 traits, scores based on only equally keyed blocks were non-ipsative and highly accurate. We conclude that in high-stakes situations where persons are motivated to give fake answers, Thurstonian IRT models should only be applied to tests measuring a sizable number of traits.
... Faking on personality assessments remains an unsolved issue, raising major concerns regarding their validity and fairness. Although there is a large body of quantitative research investigating the response process of faking on personality assessments, for both rating scales (RS) and multidimensional forced choice (MFC), only a few studies have yet qualitatively investigated the faking cognitions when responding to MFC in a high-stakes context (e.g., Sass et al., 2020). Yet, it could be argued that only when we have a process model that adequately describes the response decisions in high stakes, can we begin to extract valid and useful information from assessments. ...
... Although there is a large body of quantitative research investigating the response process of faking on personality assessments, for both RS and MFC, only a few studies have yet qualitatively investigated the underlying processing mechanisms of faking when responding to MFC in a high-stakes context (e.g., Sass et al., 2020). Therefore, this qualitative study aims to explore faking behavior when completing a MFC personality assessment. ...
Article
Full-text available
Faking on personality assessments remains an unsolved issue, raising major concerns regarding their validity and fairness. Although there is a large body of quantitative research investigating the response process of faking on personality assessments, for both rating scales (RS) and multidimensional forced choice (MFC), only a few studies have so far qualitatively investigated the faking cognitions when responding to MFC in a high-stakes context (e.g., Sass, Frick, Reips, & Wetzel, 2020). Yet, it could be argued that only when we have a process model that adequately describes the response decisions in high stakes can we begin to extract valid and useful information from assessments. Thus, this qualitative study investigated the faking cognitions when responding to an MFC personality assessment in a high-stakes context. Through cognitive interviews with N = 32 participants, we explored and identified factors influencing the test-takers’ decisions regarding specific items and blocks, and factors influencing the willingness to engage in faking in general. Based on these findings, we propose a new response process model of faking forced-choice items, the Activate-Rank-Edit-Submit (A-R-E-S) model. We also make five recommendations for the practice of high-stakes assessments using MFC.
... Second, this study focused on the recovery of trait scores with homopolar FC blocks composed of only two items. In this sense, several new studies have incorporated blocks of more than two items (e.g., Lee & Joo, 2021; Sass et al., 2020; Wetzel et al., 2021), as they provide more bits of information per block. As pointed out in previous literature (e.g., Brown & Maydeu-Olivares, 2011), increasing the number of items per block should further benefit the normativity of the trait scores. ...
... For instance, due to the dependencies between the different pairwise comparisons in each block (e.g., between items 1 and 2, 1 and 3, and 2 and 3), score reliability was found to be overestimated with blocks of more than two items (Lin, 2021). Additionally, as found by Sass et al. (2020), respondents report performing pairwise comparisons when responding to FC questionnaires regardless of the number of items per block, making the number of pairwise comparisons an indicator of cognitive effort while responding. In this sense, FC questionnaires of pairs and triplets were found to provide similar reliabilities when the number of pairwise comparisons was the same (i.e., 20 triplets versus 60 pairs, both with 60 pairwise comparisons; Frick et al., 2021). ...
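The arithmetic behind that equivalence is easy to verify; a small sketch (ours, using the block counts from the passage above):

```python
def total_pairwise_comparisons(n_blocks: int, block_size: int) -> int:
    """Total binary comparisons implied by a questionnaire of equally sized blocks."""
    return n_blocks * block_size * (block_size - 1) // 2

print(total_pairwise_comparisons(20, 3))  # 20 triplets -> 60 pairwise comparisons
print(total_pairwise_comparisons(60, 2))  # 60 pairs    -> 60 pairwise comparisons
```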
Article
Full-text available
Multidimensional forced-choice (FC) questionnaires have been consistently found to reduce the effects of socially desirable responding and faking in non-cognitive assessments. Although FC has been considered problematic for providing ipsative scores under classical test theory, IRT models enable the estimation of non-ipsative scores from FC responses. However, while some authors indicate that blocks composed of opposite-keyed items are necessary to retrieve normative scores, others suggest that these blocks may be less robust to faking, thus impairing the assessment validity. Accordingly, this article presents a simulation study to investigate whether it is possible to retrieve normative scores using only positively keyed items in pairwise FC computerized adaptive testing (CAT). Specifically, the simulation study addressed the effects of 1) different bank assembly methods (a randomly assembled bank, an optimally assembled bank, and blocks assembled on the fly considering every possible pair of items) and 2) block selection rules (i.e., the T-rule and the Bayesian D- and A-rules) on estimate accuracy, ipsativity, and overlap rates. Moreover, different questionnaire lengths (30 and 60) and trait structures (independent or positively correlated) were studied, and a non-adaptive questionnaire was included as a baseline in each condition. In general, very good trait estimates were retrieved, despite using only positively keyed items. Although the best trait accuracy and lowest ipsativity were found using the Bayesian A-rule with questionnaires assembled on the fly, the T-rule under this method led to the worst results. This points to the importance of considering both aspects when designing FC CAT.
... Triplets with full ranking are more informative than pairs because the ranks can be broken down into three pairwise comparisons (Item A vs. Item B, Item A vs. Item C, Item B vs. Item C). This is how MFC data are analyzed in the Thurstonian item response model (Brown & Maydeu-Olivares, 2011), which appears to correspond to the underlying response process (Sass, Frick, Reips, & Wetzel, 2018). Thus, with triplets, three bits of binary information on participants' trait levels are obtained with each item block whereas only one bit of binary information is obtained with a pair. ...
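As an illustration of this decomposition, here is a sketch (ours) of the rank-to-pairwise recoding that Thurstonian IRT modeling applies to MFC responses; item labels and ranks are hypothetical:

```python
from itertools import combinations

def ranks_to_pairwise(ranks: dict) -> dict:
    """Recode a full ranking (1 = most like me) into binary outcomes:
    1 if the first item of the pair is preferred over the second."""
    return {(i, j): int(ranks[i] < ranks[j])
            for i, j in combinations(sorted(ranks), 2)}

# A triplet ranked B (most like me), A, C (least like me):
print(ranks_to_pairwise({"A": 2, "B": 1, "C": 3}))
# {('A', 'B'): 0, ('A', 'C'): 1, ('B', 'C'): 1}  -> three bits per triplet
print(ranks_to_pairwise({"A": 1, "B": 2}))
# {('A', 'B'): 1}                                -> one bit per pair
```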
... As Feldman and Corah (1960, p. 480) put it: "single statements may acquire contextual meaning when paired; hence their SD [social desirability] values may be somewhat altered." The process of responding to MFC triplets involves weighing the items in the triplet against each other before assigning them ranks (Sass et al., 2018), and fine-grained differentiations regarding desirability could be a part of this comparison (Kahneman, 2011). As Lin and Brown (2017) showed, differences in perceived social desirability can occur with different arrangements of items to blocks and this can influence the items' psychometric properties. ...
Preprint
Full-text available
A common concern with self-reports of personality traits in selection contexts is faking. The multidimensional forced-choice (MFC) format has been proposed as an alternative to rating scales (RS) that could prevent faking. The goal of this study was to compare the susceptibility of the MFC and RS format to faking in a simulated high-stakes setting. Participants were randomly assigned to three groups (total N = 1,867) and filled out the Big Five Triplets once under an honest instruction and once under a fake-good instruction. Latent mean differences between the honest and fake-good administrations indicated that the Big Five domains were faked in the expected direction. Faking effects for all traits were larger for RS compared to MFC. Faking effects were also larger for the MFC version with mixed triplets compared to the MFC version with triplets that were fully matched regarding their social desirability. The MFC format does not prevent faking completely, but it reduces faking substantially. Faking can be further reduced in the MFC format by matching the items presented in a block regarding their social desirability.
... However, as implied by its name, the forced-choice format also introduces constraints on the respondent that may have negative side effects. In studies of participants completing forced-choice and single-stimulus formatted questionnaires for low-stakes research purposes, Sass et al. (2018) and Zhang et al. (2020) found that test-taking motivation and affect were not affected by the different formats. Zhang et al. (2020), however, found that the forced-choice questionnaire was perceived as more difficult by respondents than its single-stimulus counterpart. ...
Article
Full-text available
There is a wealth of evidence justifying the use of personality assessments for selection. Nonetheless, some reluctance to use these assessments stems from their perceived vulnerability to response distortion (i.e., faking) and the somewhat negative applicant reactions they elicit, when compared to other assessments. Adopting a forced-choice personality assessment format appears to alleviate the former problem but exacerbates the latter. In this study, we introduce basic psychological needs as a theoretical foundation to develop interventions to improve reactions to forced-choice personality assessments. We propose that the forced-choice format impedes respondents’ desire to respond to items in a preferred way, interfering with autonomy need satisfaction, and constrains respondents’ opportunity to show their capabilities, interfering with competence need satisfaction. In this pre-registered between-subjects experiment (N = 1565), we investigated two modifications to a ranked forced-choice personality questionnaire and compared these to traditional forced-choice and single-stimulus (Likert) formatted questionnaires. One modification, where participants could write a free-text response following the assessment, did not show significant effects on reactions. The second modification allowed participants to view all items they had ranked last (first) and then identify any the participant believed in fact described them well (poorly). That modification positively affected perceived autonomy- and competence-support, and fairness perceptions, bridging approximately half of the gap between reactions to forced-choice and single-stimulus assessment formats. This study suggests that a modification to forced-choice personality questionnaires may improve applicant reactions and that basic psychological needs theory may be a fruitful lens through which to further understand reactions to assessments.
... Forced-choice (FC) scales are widely used as an alternative to Likert scales for non-cognitive tests in high-stakes situations because they are effective in preventing faking (Jackson et al., 2000; Saville & Willson, 1991; Hurtz & Donovan, 2001) and controlling for various response biases associated with Likert methods, such as halo effects, extremity/midpoint bias, leniency/severity tendencies, and acquiescence. Meanwhile, FC scales are effective in reducing score inflation due to social desirability effects (Cao & Drasgow, 2019), and they neither significantly reduce test motivation nor have adverse emotional or cognitive effects on individuals (Sass et al., 2020; Zhang et al., 2020). ...
... Forced-choice scales do have their own special considerations and assumptions (cf. Sass et al., 2020). Practitioners may instead want to focus on other means to reduce rater bias, such as enhancing their rater error training and emphasizing to employees that their ratings are for developmental as opposed to evaluative purposes (Aguinis, 2019). ...
Article
Climate strength is often included in organizational climate models, however, its role in such models remains unclear. We propose that the inconsistent findings regarding the effects of climate strength are due in part to its complicated relationship with climate level. Specifically, we propose that the relationship between level and strength is heteroscedastic and nonlinear due to restricted variance (RV) and potential leniency bias in climate ratings. We examine how this relationship between level and strength affects relations between climate strength and work-related outcomes, as well as the implications that this has for bilinear interactions between level and strength. In this meta-analysis, we analyzed 81 independent samples from 77 articles and find support for a heteroscedastic, curvilinear relationship between climate level and climate strength, consistent with the notion that variance compression and leniency bias are present in climate ratings. With regard to the three proposed roles of climate strength in organizational models, we find some support for an additive effect of strength on outcomes, but only at high levels of climate level, and little support for strength as a bilinear moderator of level–outcome relations or for strength as a nonlinear predictor of outcomes. We do find, however, some support for nonlinear interaction effects between level and strength. We discuss implications of our findings for the role of climate strength in future research and for multilevel theory in general.
... Another important factor is the size of the block. Increasing block size (e.g., using triplets) may reduce ipsativity, but it also increases cognitive load by requiring more comparisons per block (Sass et al., 2020). In fact, Frick et al. (2021) found that pairs and triplets yielded similar reliabilities when the number of pairwise comparisons was held constant. ...
Article
Full-text available
The new methodological and technological developments of the last decade make it possible to resolve or, at least, attenuate the psychometric problems of forced-choice (FC) tests for the measurement of personality. In these tests, the person being tested is shown blocks of two or more sentences of similar social desirability, from which he or she must choose which one best represents him or her. Thus, FC tests aim to reduce response bias in self-report questionnaires. However, their use is not without risks and complications if they are not created properly. Fortunately, new psychometric models make it possible to model responses in this type of test and to optimize their construction. Moreover, they allow the construction of “on the fly” computerized adaptive FC tests (CAT-FC), in which each item is constructed on the spot, optimally matching sentences from a previously calibrated bank.
... In this sense, Bürkner et al. (2019) outline four main reasons for not using unequally keyed blocks. First, judging one's agreement with negatively keyed items can be cognitively demanding; compounded by the fact that the forced-choice format itself is already somewhat challenging (Sass et al., 2020), this may affect the response process and compromise construct validity. Second, negatively keyed items may add methodological variance (Dueber et al., 2019), forming a separate method factor. ...
Article
Full-text available
The use of multidimensional forced-choice questionnaires has been proposed as a means of improving validity in the assessment of non-cognitive attributes in high-stakes scenarios. However, the reduced precision of trait estimates in this questionnaire format is an important drawback. Accordingly, this article presents an optimization procedure for assembling pairwise forced-choice questionnaires while maximizing posterior marginal reliabilities. This procedure is performed through the adaptation of a known genetic algorithm (GA) for combinatorial problems. In a simulation study, the efficiency of the proposed procedure was compared with a quasi-brute-force (BF) search. For this purpose, five-dimensional item pools were simulated to emulate the real problem of generating a forced-choice personality questionnaire under the five-factor model. Three factors were manipulated: 1) the length of the questionnaire, 2) the relative item pool size with respect to the questionnaire’s length, and 3) the true correlations between traits. The recovery of the person parameters for each assembled questionnaire was evaluated through the squared correlation between estimated and true parameters, the root mean square error between the estimated and true parameters, the average difference between the estimated and true inter-trait correlations, and the average standard error for each trait level. The proposed GA offered more accurate trait estimates than the BF search within a reasonable computation time in every simulation condition. Such improvements were especially important when measuring correlated traits and when the relative item pool sizes were higher. A user-friendly online implementation of the algorithm was made available to the users.
... This survey was "forced-choice": Sass et al. revealed that most researchers have chosen "forced-choice" surveys to ensure fully completed responses. The first challenge was to complete the MOOC from start to finish, while "test motivation" served as the second challenge in obtaining data [25]. These results could therefore be considered to represent only the data of the "motivated participants" who, as with the MOOC itself, were more likely to complete the entire survey. ...
Article
Full-text available
Objectives In 2018, Harvard University provided a 10-week online course titled “Improving Global Health: Focusing on Quality and Safety” using the Massive Open Online Course (MOOC) web-based platform. The course was designed for those who care about health and healthcare and wish to learn more about how to measure and improve that care – for themselves, for their institutions, or for their countries. The goal of this course was to provide visual and written education tools for different countries and different age groups. Accordingly, the aim of this study is to evaluate the impressions and benefits of the group learning activity and the educational needs after this “Improving Global Health” course experience, using an online survey among the participants. Methods Sixty-six family medicine practitioners and trainees who were among the participants of the course constituted the universe of the study. These young General Practitioners/Family Physicians (GPs/FPs) from different countries organized among themselves to follow the course as a group activity. Two weeks after the course, an online survey was sent to all the participants of this group activity. Results Twenty-eight out of 66 participants (42.4%) completed the survey and provided feedback on their perspectives and experience. Most of them were female (70.4%) and had not attended any MOOC course before (63%). This international group achieved a completion rate of approximately 65% by the deadline and nearly 90% including those finishing afterward. The majority felt that the group activity proved beneficial and supportive in nature. Conclusions Well-structured, sustainable e-learning platforms will be the near future’s medical learning devices in a world without borders. Future studies should further explore facilitators and barriers among FPs for enrolling in and completing MOOCs. Furthermore, there is a need to evaluate how these group-learning initiatives may help participants incorporate lessons learned from the course into their daily practice.
... With respect to the theoretical assumptions on the response process, respondents have indeed reported in a think-aloud task that they make pairwise comparisons between all items of a block (Sass et al., 2018). In that study, where blocks of three items were used, 76% of participants reported to have no difficulty keeping in mind the information related to all statements in order to appraise the relative utility of all items. ...
Article
Full-text available
Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales. However, the derived trait scores are often unreliable and ipsative, making interindividual comparisons in high-stakes situations impossible. Several studies suggest that these problems vanish if the number of measured traits is high. To determine the necessary number of traits under varying sample sizes, factor loadings, and intertrait correlations, simulations were performed for the two most widely used scoring methods, namely the classical (ipsative) approach and Thurstonian item response theory (IRT) models. Results demonstrate that while especially Thurstonian IRT models perform well under ideal conditions, both methods yield insufficient reliabilities in most conditions resembling applied contexts. Moreover, not only the classical estimates but also the Thurstonian IRT estimates for questionnaires with equally keyed items remain (partially) ipsative, even when the number of traits is very high (i.e., 30). This result not only questions earlier assumptions regarding the use of classical scores in high-dimensional questionnaires, but it also raises doubts about many validation studies on Thurstonian IRT models because correlations of (partially) ipsative scores with external criteria cannot be interpreted in a usual way.
... A recent meta-analysis found that the overall effect size for faking of FC measures was d = 0.06 (Cao & Drasgow, 2019), which is substantially smaller than the average effect size reported in a meta-analysis of faking of RS measures (d = 0.26; Birkeland, Manson, Kisamore, Brannick, & Smith, 2006). In addition, Sass, Frick, Reips, and Wetzel (2018) did not find any significant differences in the self-reported test motivation of participants between four different variants of the MFC format (pairs, triplets, quads, pentads) and the RS format. ...
Preprint
The multidimensional forced-choice (MFC) format has been proposed as an alternative to rating scales (RS) that may be less susceptible to response biases. The goal of this study was to compare the validity of trait estimates from the MFC and the RS format when using normative scoring for both formats. We focused on construct validity and criterion-related validity. In addition, we investigated test-retest reliability over a period of six months. Participants were randomly assigned to the MFC (N = 593) or the RS (N = 622) version of the Big Five Triplets. In addition to self-ratings on the Big Five Triplets and other personality questionnaires and criteria, we also obtained other-ratings (N = 770) for the Big Five Triplets. The Big Five in the Big Five Triplets corresponded well with the Big Five in the Big Five Inventory, except for agreeableness in the MFC version. The majority of the construct validity coefficients differed between the MFC and the RS version, whereas criterion-related validities were very similar. The self- and other-rated Big Five Triplets showed higher correlations in the MFC format than in the RS format. The reliability of test scores on the Big Five and test-retest reliabilities were lower for MFC compared to RS. For the MFC format to be able to replace the RS format, more research is needed on how to obtain ideal constellations of items that are matched in their desirability.
... It is not surprising that the FC measure was considered harder to complete than the SS measure. This finding is in line with what was found with think-aloud techniques (Sass et al., 2018). ...
Article
Full-text available
Forced-choice (FC) measures are gaining popularity as an alternative assessment format to single-statement (SS) measures. However, a fundamental question remains to be answered: do FC and SS instruments measure the same underlying constructs? In addition, FC measures are theorized to be more cognitively challenging, so how would this feature influence respondents' reactions to FC measures compared to SS measures? We used both between- and within-subjects designs to examine the equivalence of the FC format and the SS format. As the results illustrate, FC measures scored by the Multi-unidimensional Pairwise Preference Model (MUPP) and SS measures scored with the Generalized Graded Unfolding Model (GGUM) showed strong equivalence. Specifically, both formats demonstrated similar marginal reliabilities and test-retest reliabilities, high convergent validities, good discriminant validities, and similar criterion-related validities with theoretically relevant criteria. In addition, the formats had little differential impact on respondents' general emotional and cognitive reactions, except that the FC format was perceived to be slightly more difficult and more time-saving.
Article
The aim of the current research was to provide recommendations to facilitate the development and use of anchoring vignettes (AVs) for cross-cultural comparisons in education. Study 1 identified six factors leading to order violations and ties in AV responses based on cognitive interviews with 15-year-old students. The factors were categorized into three domains: varying levels of AV format familiarity, differential interpretations of content, and individual differences in processing information. To inform the most appropriate approach to treating order violations and the re-scaling method, Study 2 conducted Monte Carlo simulations with the manipulation of three factors and the incorporation of five response styles. Study 2 found that the AV approach improved accuracy in score estimation by successfully controlling for response styles. The results also revealed that the reordering approach to treating order violations produced the most accurate estimation, combined with the re-scaling method assigning the middle value among possible scores. Along with strategies to develop AVs, strengths and limitations of the implemented nonparametric AV approach were discussed in comparison to item response theory modeling for response styles.
Article
There has been a growing interest in psychological measurements that use the multiple-alternative forced-choice (MAFC) response format for its resistance to response biases. Although several models have been proposed for the data obtained from such measurements, none have succeeded in incorporating response time information. Given that many psychological measurements are currently performed via computers, it would be beneficial to develop a joint model of an MAFC item response and response time. The present study proposes the first model that combines a cognitive process model underlying the observed response time with the forced-choice item response model. Specifically, the proposed model is based on the linear ballistic accumulator model of response time, which is substantially extended by reformulating its parameters to incorporate the MAFC item responses. The model parameters are estimated by the Markov chain Monte Carlo (MCMC) algorithm. A simulation study confirmed that the proposed approach could appropriately recover the parameters. Two empirical applications are reported to demonstrate the use of the proposed model and compare it with existing models. The results showed that the proposed model could be a useful tool for jointly modeling MAFC item responses and response times.
Article
Full-text available
A common concern with self-reports of personality traits in selection contexts is faking. The multidimensional forced-choice (MFC) format has been proposed as an alternative to rating scales (RS) that could prevent faking. The goal of this study was to compare the susceptibility of the MFC format and the RS format to faking in a simulated high-stakes setting when using normative scoring for both formats. Participants were randomly assigned to 3 groups (total N = 1,867) and filled out the Big Five Triplets once under an honest instruction and once under a fake-good instruction. Latent mean differences between the honest and fake-good administrations indicated that the Big Five domains were faked in the expected direction. Faking effects for all traits were larger for RS compared with MFC. Faking effects were also larger for the MFC version with mixed triplets compared with the MFC version with triplets that were fully matched regarding their social desirability. The MFC format does not prevent faking completely, but it reduces faking substantially. Faking can be further reduced in the MFC format by matching the items presented in a block regarding their social desirability.
Article
Although modern item response theory (IRT) methods of test construction and scoring have overcome ipsativity problems historically associated with multidimensional forced choice (MFC) formats, there has been little research on MFC differential item functioning (DIF) detection, where item refers to a block, or group, of statements presented for an examinee’s consideration. This research investigated DIF detection with three-alternative MFC items based on the Thurstonian IRT (TIRT) model, using omnibus Wald tests on loadings and thresholds. We examined constrained and free baseline model comparisons strategies with different types and magnitudes of DIF, latent trait correlations, sample sizes, and levels of impact in an extensive Monte Carlo study. Results indicated the free baseline strategy was highly effective in detecting DIF, with power approaching 1.0 in the large sample size and large magnitude of DIF conditions, and similar effectiveness in the impact and no-impact conditions. This research also included an empirical example to demonstrate the viability of the best performing method with real examinees and showed how a DIF and a DTF effect size measure can be used to assess the practical significance of MFC DIF findings.
Preprint
Full-text available
Forced-choice questionnaires have been proposed to avoid common response biases typically associated with rating scale questionnaires. To overcome ipsativity issues of trait scores obtained from classical scoring approaches of forced-choice items, advanced methods from item response theory (IRT) such as the Thurstonian IRT model have been proposed. For convenient model specification, we introduce the thurstonianIRT R package, which uses Mplus, lavaan, and Stan for model estimation. Based on practical considerations, we establish that items within one block need to be equally keyed to achieve similar social desirability, which is essential for creating forced-choice questionnaires that have the potential to resist faking intentions. However, according to extensive simulations, persons' trait scores obtained from Thurstonian IRT models applied to equally keyed items are not sufficiently accurate for either classical or Bayesian estimation methods. Moreover, inter-trait correlations are estimated with considerable bias. As a result, trait scores remain partially ipsative and, thus, do not allow for valid comparisons between persons. We conclude that Thurstonian IRT models should not be applied in high-stakes situations where persons are motivated to give fake answers.
Article
Full-text available
The success of Amazon Mechanical Turk (MTurk) as an online research platform has come at a price: MTurk has suffered from slowing rates of population replenishment, and growing participant non-naivety. Recently, a number of alternative platforms have emerged, offering capabilities similar to MTurk but providing access to new and more naïve populations. After surveying several options, we empirically examined two such platforms, CrowdFlower (CF) and Prolific Academic (ProA). In two studies, we found that participants on both platforms were more naïve and less dishonest compared to MTurk participants. Across the three platforms, CF provided the best response rate, but CF participants failed more attention-check questions and did not reproduce known effects replicated on ProA and MTurk. Moreover, ProA participants produced data quality that was higher than CF's and comparable to MTurk's. ProA and CF participants were also much more diverse than participants from MTurk.
Article
Full-text available
Response styles are a source of contamination in questionnaire ratings, and therefore they threaten the validity of conclusions drawn from marketing research data. In this article, the authors examine five forms of stylistic responding (acquiescence and disacquiescence response styles, extreme response style/response range, midpoint responding, and noncontingent responding) and discuss their biasing effects on scale scores and correlations between scales. Using data from large, representative samples of consumers from 11 countries of the European Union, the authors find systematic effects of response styles on scale scores as a function of two scale characteristics (the proportion of reverse-scored items and the extent of deviation of the scale mean from the midpoint of the response scale) and show that correlations between scales can be biased upward or downward depending on the correlation between the response style components. In combination with the apparent lack of concern with response styles evidenced in a secondary analysis of commonly used marketing scales, these findings suggest that marketing researchers should pay greater attention to the phenomenon of stylistic responding when constructing and using measurement instruments.
Article
Full-text available
We examined the effects of response biases on 360-degree feedback using a large sample (N=4,675) of organizational appraisal data. Sixteen competencies were assessed by peers, bosses and subordinates of 922 managers, as well as self-assessed, using the Inventory of Management Competencies (IMC) administered in two formats – Likert scale and multidimensional forced choice. Likert ratings were subject to strong response biases, making even theoretically unrelated competencies correlate highly. Modeling a latent common method factor, which represented non-uniform distortions similar to those of “ideal-employee” factor in both self- and other assessments, improved validity of competency scores as evidenced by meaningful second-order factor structures, better inter-rater agreement, and better convergent correlations with an external personality measure. Forced-choice rankings modelled with Thurstonian IRT yielded as good construct and convergent validities as the bias-controlled Likert ratings, and slightly better rater agreement. We suggest that the mechanism for these enhancements is finer differentiation between behaviors in comparative judgements, and advocate the operational use of the multidimensional forced-choice response format as an effective bias prevention method.
Article
Full-text available
This study investigated the stability of extreme response style (ERS) and acquiescence response style (ARS) over a period of 8 years. ERS and ARS were measured with item sets drawn randomly from a large pool of items used in an ongoing German panel study. Latent-trait-state-occasion and latent-state models were applied to test the relationship between time-specific (state) response style behaviors and time-invariant trait components of response styles. The results show that across different random item samples, on average between 49% and 59% of the variance in the state response style factors was explained by the trait response style factors. This indicates that the systematic differences respondents show in their preferences for certain response categories are remarkably stable over a period of 8 years. The stability of ERS and ARS implies that it is important to consider response styles in the analysis of self-report data from polytomous rating scales, especially in longitudinal studies aimed at investigating stability in substantive traits. Furthermore, the stability of response styles raises the question of the extent to which they might be considered trait-like latent variables themselves that could be of substantive interest.
Article
Full-text available
This article reports a comprehensive meta-analysis of the criterion-oriented validity of the Big Five personality dimensions assessed with forced-choice (FC) inventories. Six criteria (i.e., performance ratings, training proficiency, productivity, grade-point average, global occupational performance, and global academic performance) and three types of FC scores (i.e., normative, quasi-ipsative, and ipsative) served for grouping the validity coefficients. Globally, the results showed that the Big Five assessed with FC measures have similar or slightly higher validity than the Big Five assessed with single-stimulus (SS) personality inventories. Quasi-ipsative measures of conscientiousness (K = 44, N = 8,794; validity = .40) are found to be better predictors of job performance than normative and ipsative measures. FC inventories also showed similar reliability coefficients to SS inventories. Implications of the findings for theory and practice in academic and personnel decisions are discussed, and future research is suggested.
Article
Full-text available
Three socially aversive traits (Machiavellianism, narcissism, and psychopathy) have been studied as an overlapping constellation known as the Dark Triad. Here, we develop and validate the Short Dark Triad (SD3), a brief proxy measure. Four studies (total N = 1,063) examined the structure, reliability, and validity of the subscales in both community and student samples. In Studies 1 and 2, structural analyses yielded three factors, with the final 27 items loading appropriately on their respective factors. Study 3 confirmed that the resulting SD3 subscales map well onto the longer standard measures. Study 4 validated the SD3 subscales against informant ratings. Together, these studies indicate that the SD3 provides efficient, reliable, and valid measures of the Dark Triad of personalities.
Article
Full-text available
Although the purpose of questionnaire items is to obtain a person’s opinion on a certain matter, a respondent’s registered opinion may not reflect his or her “true” opinion because of random and systematic errors. Response styles (RSs) are a respondent’s tendency to respond to survey questions in certain ways regardless of the content, and they contribute to systematic error. They affect univariate and multivariate distributions of data collected by rating scales and are alternative explanations for many research results. Despite this, RS are often not controlled in research. This article provides a comprehensive summary of the types of RS, lists their potential sources, and discusses ways to diagnose and control for them. Finally, areas for further research on RS are proposed.
Article
Full-text available
Socially desirable responding (SDR) is typically defined as the tendency to give positive self-descriptions. Its status as a response style rests on the clarification of an underlying psychological construct. A brief history of such attempts is provided. Despite the growing consensus that there are two dimensions of SDR, their interpretation has varied over the years from minimalist operationalizations to elaborate construct validation. I argue for the necessity of demonstrating departure-from-reality in the self-reports of high SDR scorers: this criterion is critical for distinguishing SDR from related constructs. An appropriate methodology that operationalizes SDR directly in terms of self-criterion discrepancy is described. My recent work on this topic has evolved into a two-tiered taxonomy that crosses degree of awareness (conscious vs. unconscious) with content (agentic vs. communal qualities). Sufficient research on SDR constructs has accumulated to propose a broad reconciliation and integration.
Article
Full-text available
In multidimensional forced-choice (MFC) questionnaires, items measuring different attributes are presented in blocks, and participants have to rank order the items within each block (fully or partially). Such comparative formats can reduce the impact of numerous response biases often affecting single-stimulus items (aka rating or Likert scales). However, if scored with traditional methodology, MFC instruments produce ipsative data, whereby all individuals have a common total test score. Ipsative scoring distorts individual profiles (it is impossible to achieve all high or all low scale scores), construct validity (covariances between scales must sum to zero), criterion-related validity (validity coefficients must sum to zero), and reliability estimates. We argue that these problems are caused by inadequate scoring of forced-choice items and advocate the use of item response theory (IRT) models based on an appropriate response process for comparative data, such as Thurstone's law of comparative judgment. We show that when Thurstonian IRT modeling is applied (Brown & Maydeu-Olivares, 2011), even existing forced-choice questionnaires with challenging features can be scored adequately and that the IRT-estimated scores are free from the problems of ipsative data.
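To see why traditional scoring is ipsative in the sense described above, consider a minimal simulation (ours, with arbitrary block counts): under rank-sum scoring of fully ranked blocks, every respondent obtains the same total, whatever they answer.

```python
import random

def classical_total_score(n_blocks: int = 20, block_size: int = 3) -> int:
    """Rank-sum scoring: the top-ranked item in a block earns block_size - 1
    points, the next earns block_size - 2, ..., and the last earns 0."""
    total = 0
    for _ in range(n_blocks):
        ranks = random.sample(range(block_size), block_size)  # an arbitrary full ranking
        total += sum(block_size - 1 - rank for rank in ranks)
    return total

# 1,000 simulated respondents all obtain the identical total (here 60),
# so between-person comparisons on the total are meaningless:
print({classical_total_score() for _ in range(1000)})  # {60}
```

Scale scores still vary within a person, but because they must sum to this constant, raising one scale necessarily lowers another; this is the source of the distorted profiles and constrained covariances the abstract lists.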
Article
Full-text available
We begin this article with the assumption that attitudes are best understood as structures in long-term memory, and we look at the implications of this view for the response process in attitude surveys. More specifically, we assert that an answer to an attitude question is the product of a four-stage process. Respondents first interpret the attitude question, determining what attitude the question is about. They then retrieve relevant beliefs and feelings. Next, they apply these beliefs and feelings in rendering the appropriate judgment. Finally, they use this judgment to select a response. All four of the component processes can be affected by prior items. The prior items can provide a framework for interpreting later questions and can also make some responses appear to be redundant with earlier answers. The prior items can prime some beliefs, making them more accessible to the retrieval process. The prior items can suggest a norm or standard of comparison for making the judgment. Finally, the prior items can create consistency pressures or pressures to appear moderate.
Article
Full-text available
The effects of motivated distortion on forced-choice (FC) and normative inventories were examined in three studies. Study 1 examined the effects of distortion on the construct validity of the two item formats in terms of convergent and discriminant validity. The results showed that both types of measures were susceptible to motivated distortion; however, the FC items were better indicators of personality and less related to socially desirable responding when participants were asked to respond as if applying for a job. Study 2 considered the criterion-related validity of the inventories in terms of predicting supervisors' ratings of job performance, finding that distortion had a more deleterious effect on the validity of the normative inventory, with some enhancement of the validity of the FC inventory being observed. Study 3 investigated whether additional constructs are introduced into the measurement process when motivated respondents attempt to increase scores on FC items. Results of Study 3 indicated that individuals higher in cognitive ability tend to have more accurate theories about which traits are job-related and therefore are more successful at improving scores on FC inventories. Implications for using personality inventories in personnel selection are discussed.
Article
Full-text available
In this article, I show how item response models can be used to capture multiple response processes in psychological applications. Intuitive and analytical responses, agree-disagree answers, response refusals, socially desirable responding, differential item functioning, and choices among multiple options are considered. In each of these cases, I show that the response processes can be measured via pseudoitems derived from the observed responses. The estimation of these models via standard software programs that allow for missing data is also discussed. The article concludes with two detailed applications that illustrate the prevalence of multiple response processes.
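One common way to make the pseudoitem idea concrete is an IRTree-style decomposition of a 5-point agree-disagree response into midpoint, direction, and extremity processes. The sketch below is illustrative only; the recoding scheme and names are ours, not necessarily those used in the article.

```python
# Illustrative pseudoitem recoding of a 5-point agree-disagree response
# (1 = strongly disagree ... 5 = strongly agree) into three binary
# pseudoitems, each reflecting a different hypothesized response process.
# None marks pseudoitems that are structurally missing for a response,
# which is why estimation software allowing for missing data is needed.

def to_pseudoitems(response: int) -> dict:
    """Recode one observed response into (midpoint, direction, extremity)."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError("response must be an integer from 1 to 5")
    if response == 3:
        # Midpoint chosen: direction and extremity are undefined.
        return {"midpoint": 1, "direction": None, "extremity": None}
    return {
        "midpoint": 0,
        "direction": int(response > 3),        # agree (1) vs. disagree (0)
        "extremity": int(response in (1, 5)),  # endpoint category chosen?
    }

for r in range(1, 6):
    print(r, to_pseudoitems(r))
```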
Article
Full-text available
In a web experiment, participants were randomly assigned to two semantic differentials either made from discrete 5-point ordinal rating scales or from continuous visual analogue scales (VASs) with 250 gradations. Respondents adjusted their ratings with VASs more often to maximize the precision of answers, which had a beneficial effect on data quality. No side effects like differences in means, higher dropout, more nonresponse, or higher response times were observed. Overall, the combination of semantic differentials and VASs results in a number of advantages. Potential for further research is discussed.
Article
Full-text available
The use of online questionnaires is rapidly increasing. Despite their manifold advantages, not much is known about user behavior that can be measured outside the boundaries set by standard web technologies like HTML form elements. To show how the lack of knowledge about the user setting in web studies can be accounted for, we present a tool called UserActionTracer, with which it is possible to collect more behavior information than with any other paradata gathering tool, in order to (1) gather additional data unobtrusively from the process of answering questions and (2) visualize individual user behavior on web pages. In an empirical study on a large web sample (N = 1046) we observed and analysed online behaviors (e.g., clicking through). We found that only 10.5% of participants showed more than five single behaviors with highly negative influence on data quality in the whole online questionnaire (out of 132 possible single behavior judgments). Furthermore, results were validated by comparison with data from online address books. With the UserActionTracer it is possible to gain further insight into the process of answering online questionnaires.
Article
Full-text available
The comparative format used in ranking and paired comparisons tasks can significantly reduce the impact of uniform response biases typically associated with rating scales. Thurstone's (1927, 1931) model provides a powerful framework for modeling comparative data such as paired comparisons and rankings. Although Thurstonian models are generally presented as scaling models, that is, stimuli-centered models, they can also be used as person-centered models. In this article, we discuss how Thurstone's model for comparative data can be formulated as item response theory models so that respondents' scores on underlying dimensions can be estimated. Item parameters and latent trait scores can be readily estimated using a widely used statistical modeling program. Simulation studies show that item characteristic curves can be accurately estimated with as few as 200 observations and that latent trait scores can be recovered to a high precision. Empirical examples are given to illustrate how the model may be applied in practice and to recommend guidelines for designing ranking and paired comparisons tasks in the future.
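The core of the model can be simulated in a few lines: each stimulus has a normally distributed utility, and the probability that one stimulus is preferred to another is a normal ogive in the difference of their means. The parameter values below are invented for illustration.

```python
# Minimal sketch of Thurstone's law of comparative judgment: stimulus i has
# latent utility t_i ~ Normal(mu_i, sigma_i^2), utilities are independent,
# and i is preferred to j whenever t_i - t_j > 0, so that
#   P(i over j) = Phi((mu_i - mu_j) / sqrt(sigma_i^2 + sigma_j^2)).
import math

def p_prefer(mu_i: float, mu_j: float, var_i: float, var_j: float) -> float:
    z = (mu_i - mu_j) / math.sqrt(var_i + var_j)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

# Toy means and variances for three stimuli:
mu = {"A": 1.0, "B": 0.2, "C": -0.5}
var = {"A": 1.0, "B": 1.0, "C": 1.5}

for i, j in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(f"P({i} preferred to {j}) = {p_prefer(mu[i], mu[j], var[i], var[j]):.3f}")
```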
Article
Full-text available
Multidimensional forced-choice formats can significantly reduce the impact of numerous response biases typically associated with rating scales. However, if scored with classical methodology, these questionnaires produce ipsative data, which lead to distorted scale relationships and make comparisons between individuals problematic. This research demonstrates how item response theory (IRT) modeling may be applied to overcome these problems. A multidimensional IRT model based on Thurstone’s framework for comparative data is introduced, which is suitable for use with any forced-choice questionnaire composed of items fitting the dominance response model, with any number of measured traits, and any block sizes (i.e., pairs, triplets, quads, etc.). Thurstonian IRT models are normal ogive models with structured factor loadings, structured uniquenesses, and structured local dependencies. These models can be straightforwardly estimated using structural equation modeling (SEM) software Mplus. A number of simulation studies are performed to investigate how latent traits are recovered under various forced-choice designs and provide guidelines for optimal questionnaire design. An empirical application is given to illustrate how the model may be applied in practice. It is concluded that when the recommended design guidelines are met, scores estimated from forced-choice questionnaires with the proposed methodology reproduce the latent traits well.
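The first data-preparation step in this approach, recoding each ranked block into binary outcomes of all pairwise comparisons, can be sketched as follows (function and variable names are ours):

```python
# Recoding a fully ranked forced-choice block into binary pairwise outcomes:
# a block of n items yields n*(n-1)/2 binary pseudo-items, one per item pair.
from itertools import combinations

def block_to_binary(ranking: list) -> dict:
    """ranking: items ordered from 'most like me' to 'least like me'.
    Returns {(i, j): 1 if i was ranked above j, else 0} for each pair."""
    position = {item: rank for rank, item in enumerate(ranking)}
    items = sorted(position)  # fixed pair order, independent of the ranking
    return {(i, j): int(position[i] < position[j])
            for i, j in combinations(items, 2)}

# A triplet block ranked B > C > A yields three binary outcomes:
print(block_to_binary(["B", "C", "A"]))
# -> {('A', 'B'): 0, ('A', 'C'): 0, ('B', 'C'): 1}
```

These binary outcomes are then what the structured normal ogive model is fitted to.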
Article
Full-text available
Intelligence tests are widely assumed to measure maximal intellectual performance, and predictive associations between intelligence quotient (IQ) scores and later-life outcomes are typically interpreted as unbiased estimates of the effect of intellectual ability on academic, professional, and social life outcomes. The current investigation critically examines these assumptions and finds evidence against both. First, we examined whether motivation is less than maximal on intelligence tests administered in the context of low-stakes research situations. Specifically, we completed a meta-analysis of random-assignment experiments testing the effects of material incentives on intelligence-test performance on a collective 2,008 participants. Incentives increased IQ scores by an average of 0.64 SD, with larger effects for individuals with lower baseline IQ scores. Second, we tested whether individual differences in motivation during IQ testing can spuriously inflate the predictive validity of intelligence for life outcomes. Trained observers rated test motivation among 251 adolescent boys completing intelligence tests using a 15-min "thin-slice" video sample. IQ score predicted life outcomes, including academic performance in adolescence and criminal convictions, employment, and years of education in early adulthood. After adjusting for the influence of test motivation, however, the predictive validity of intelligence for life outcomes was significantly diminished, particularly for nonacademic outcomes. Collectively, our findings suggest that, under low-stakes research conditions, some individuals try harder than others, and, in this context, test motivation can act as a third-variable confound that inflates estimates of the predictive validity of intelligence for life outcomes.
Article
Full-text available
Personality, Personnel Assessment, and Test-Taking Motivation. Ability tests are frequently used instruments in personnel assessment, but test takers' motivation to complete them (test-taking motivation, TTM) can affect their validity. Previous research on this topic has shown that TTM is related to several personal variables, such as abilities, performance factors, and race. However, personality had not been studied as an explanatory factor of TTM. This study examined the relationships between TTM, personality, and motivational distortion. Personality was conceptualized using the five-factor model. We hypothesized that Neuroticism, Extraversion, Openness, and Conscientiousness would correlate with positive and negative attitudes toward test taking. Regarding the relationship between TTM and motivational distortion, we hypothesized that positive attitudes toward test taking would correlate with motivational distortion. Two samples were used: Sample A comprised 145 (unemployed) students, and Sample B comprised 187 employed participants, most of them in administrative jobs. The results indicate that the personality factors are related to TTM as hypothesized. Furthermore, participants' employment status had no effect on TTM, and motivational distortion was unrelated to TTM. The implications of these findings for personnel assessment are discussed.
Article
Full-text available
Interest in the problem of method biases has a long history in the behavioral sciences. Despite this, a comprehensive summary of the potential sources of method biases and how to control for them does not exist. Therefore, the purpose of this article is to examine the extent to which method biases influence behavioral research results, identify potential sources of method biases, discuss the cognitive processes through which method biases influence responses to measures, evaluate the many different procedural and statistical techniques that can be used to control method biases, and provide recommendations for how to select appropriate procedural and statistical remedies for different types of research settings.
Article
Full-text available
Recent research suggests multidimensional forced-choice (MFC) response formats may provide resistance to purposeful response distortion on personality assessments. It remains unclear, however, whether these formats provide normative trait information required for selection contexts. The current research evaluated score correspondences between an MFC format measure and 2 Likert-type measures in honest and instructed-faking conditions. In honest response conditions, scores from the MFC measure appeared valid indicators of normative trait standing. Under faking conditions, the MFC measure showed less score inflation than the Likert measure at the group level of analysis. In the individual-level analyses, however, the MFC measure was as affected by faking as was the Likert measure. Results suggest the MFC format is not a viable method to control faking.
Article
Full-text available
How do youths' personality reports differ from those of adults? To identify the year-by-year timing of developmental trends from late childhood (age 10) to early adulthood (age 20), the authors examined Big Five self-report data from a large and diverse Internet sample. At younger ages within this range, there were large individual differences in acquiescent responding, and acquiescence variability had pronounced effects on psychometric characteristics. Beyond the effects of acquiescence, self-reports generally became more coherent within domains, and better differentiated across domains, at older ages. Importantly, however, different Big Five domains showed different developmental trends. Extraversion showed especially pronounced age gains in coherence but no gains in differentiation. In contrast, Agreeableness and Conscientiousness showed large age gains in differentiation but only trivial gains in coherence. Neuroticism and Openness showed moderate gains in both coherence and differentiation. Comparisons of items that were relatively easy versus difficult to comprehend indicated that these patterns were not simply due to verbal comprehension. These findings have important implications for the study of personality characteristics and other psychological attributes in childhood and adolescence.
Article
Visual analogue scales (VASs) have shown superior measurement qualities in comparison to traditional Likert-type response scales in previous studies. The present study expands the comparison of response scales to properties of Internet-based personality scales in a within-subjects design. A sample of 879 participants filled out an online questionnaire measuring Conscientiousness, Excitement Seeking, and Narcissism. The questionnaire contained all instruments in both answer scale versions in a counterbalanced design. Results show comparable reliabilities, means, and SDs for the VAS versions of the original scales, in comparison to Likert-type scales. To assess the validity of the measurements, age and gender were used as criteria, because all three constructs have shown non-zero correlations with age and gender in previous research. Both response scales showed a high overlap and the proposed relationships with age and gender. The associations were largely identical, with the exception of an increase in explained variance when predicting age from the VAS version of Excitement Seeking (B10 = 125.1, ΔR² = .025). VASs showed similar properties to Likert-type response scales in most cases.
Book
Examines the psychological processes involved in answering different types of survey questions. The book proposes a theory about how respondents answer questions in surveys, reviews the relevant psychological and survey literatures, and traces out the implications of the theories and findings for survey practice. Individual chapters cover the comprehension of questions, recall of autobiographical memories, event dating, questions about behavioral frequency, retrieval and judgment for attitude questions, the translation of judgments into responses, special processes relevant to the questions about sensitive topics, and models of data collection. The text is intended for: (1) social psychologists, political scientists, and others who study public opinion or who use data from public opinion surveys; (2) cognitive psychologists and other researchers who are interested in everyday memory and judgment processes; and (3) survey researchers, methodologists, and statisticians who are involved in designing and carrying out surveys.
Book
1. Introduction
2. Respondents' understanding of survey questions
3. The role of memory in survey responding
4. Answering questions about dates and durations
5. Attitude questions
6. Factual judgments and numerical estimates
7. Attitude judgments and context effects
8. Mapping and formatting
9. Survey reporting of sensitive topics
10. Mode of data collection
11. Impact of the application of cognitive models to survey measurement
Article
The consistency of extreme response style (ERS) and non-extreme response style (NERS) across the latent variables assessed in an instrument is investigated. Analyses were conducted on several PISA 2006 attitude scales and the German NEO-PI-R. First, a mixed partial credit model (PCM) and a constrained mixed PCM were compared regarding model fit. If the constrained mixed PCM fit better, latent classes differed only in their response styles but not in the latent variable. For scales where this was the case, participants’ membership to NERS or ERS on each scale was entered into a latent class analysis (LCA). For both instruments, this second order LCA revealed that the response style was consistent for the majority of the participants across latent variables.
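Estimating the mixture models themselves requires specialized IRT software, but the behavior the latent classes separate can be illustrated with a crude descriptive proxy: the share of endpoint categories a respondent uses. The code below is only that proxy, not the mixed partial credit model; the data are invented.

```python
# Crude descriptive proxy for extreme (ERS) vs. non-extreme (NERS) response
# style on a 1..5 rating scale: the share of answers in endpoint categories.
# This is NOT the mixed PCM used in the study, which identifies the classes
# latently; it merely shows the answering patterns those classes capture.

def extreme_share(responses: list, low: int = 1, high: int = 5) -> float:
    """Proportion of responses falling in the two endpoint categories."""
    return sum(r in (low, high) for r in responses) / len(responses)

person_a = [1, 5, 5, 1, 5, 1, 5, 5]   # endpoint-heavy answering (ERS-like)
person_b = [2, 3, 4, 3, 2, 4, 3, 3]   # midscale answering (NERS-like)

for name, resp in [("A", person_a), ("B", person_b)]:
    print(f"person {name}: extreme share = {extreme_share(resp):.2f}")
```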
Article
In this study we assessed whether the predictive validity of personality scores is stronger when respondent test-taking motivation (TTM) is higher rather than lower. Results from a field sample comprising 269 employees provided evidence for this moderation effect for one trait, Steadfastness. However, for Conscientiousness, valid criterion prediction was only obtained at low levels of TTM. Thus, it appears that TTM relates to the criterion validity of personality testing differently depending on the personality trait assessed. Overall, these and additional findings regarding the nomological net of TTM suggest that it is a unique construct that may have significant implications when personality assessment is used in personnel selection.
Article
This study investigated the relation of the "Big Five" personality dimensions (Extraversion, Emotional Stability, Agreeableness, Conscientiousness, and Openness to Experience) to three job performance criteria (job proficiency, training proficiency, and personnel data) for five occupational groups (professionals, police, managers, sales, and skilled/semi-skilled). Results indicated that one dimension of personality, Conscientiousness, showed consistent relations with all job performance criteria for all occupational groups. For the remaining personality dimensions, the estimated true score correlations varied by occupational group and criterion type. Extraversion was a valid predictor for two occupations involving social interaction, managers and sales (across criterion types). Also, both Openness to Experience and Extraversion were valid predictors of the training proficiency criterion (across occupations). Other personality dimensions were also found to be valid predictors for some occupations and some criterion types, but the magnitude of the estimated true score correlations was small (ρ < .10). Overall, the results illustrate the benefits of using the 5-factor model of personality to accumulate and communicate empirical findings. The findings have numerous implications for research and practice in personnel psychology, especially in the subfields of personnel selection, training and development, and performance appraisal.
Article
We evaluated the effects of faking on mean scores and correlations with self-reported counterproductive behavior of integrity-related personality items administered in single-stimulus and forced-choice formats. In laboratory studies, we found that respondents instructed to respond as if applying for a job scored higher than when given standard or "straight-take" instructions. The size of the mean shift was nearly a full standard deviation for the single-stimulus integrity measure, but less than one third of a standard deviation for the same items presented in a forced-choice format. The correlation between the personality questionnaire administered in the single-stimulus condition and self-reported workplace delinquency was much lower in the job applicant condition than in the straight-take condition, whereas the same items administered in the forced-choice condition maintained their substantial correlations with workplace delinquency.
Article
When a student takes a test, his or her performance may be expected to be influenced by the perceived consequence of the test to the student. Motivation, anxiety, and ultimately performance will be affected by what the test means to the student in terms of results. This research investigates the relationships of test consequence, motivation, anxiety, and performance on a sample of undergraduates taking a child development course. An hourly examination was given under two experimental conditions; in one condition the exam counted as part of the course grade, and in the other condition it did not. Results indicated that the consequence of the test had a strong influence on motivation and a modest, but significant influence on performance. Motivation and anxiety were found to have opposite effects on performance. Results are discussed in terms of their relation to classroom practice and technical issues in measurement.
Article
This study explores the reported level of test-taking motivation and the relation between test-taking motivation and mathematics achievement in a sample of Swedish eighth-grade students (n = 343) participating in Trends in International Mathematics and Science Study (TIMSS) 2003. A majority of students reported that they were motivated to do their best in TIMSS. Test-taking motivation was positively related to mathematics achievement, but the effect was small and not statistically significant when other relevant variables were held constant. Further, gender comparisons showed that test-taking motivation was positively, but not significantly related to achievement in boys, and was unrelated to achievement in girls. This result was probably due to the larger variability in boys' ratings. It was concluded that the Swedish TIMSS mathematics result is unlikely to be affected by a lack of student motivation, but that further research on the relation between test-taking motivation and test achievement is warranted.
Article
Students (N = 416) viewed a videotaped lecture and then completed an objective examination based upon the lecture. The lecture was experimentally manipulated to vary in content coverage (the number of test questions covered) and the expressiveness with which it was delivered. Different groups of students received no external incentive to do well, were told before the lecture that they would receive money for each correctly answered question (incentive-to-learn-and-perform), or were told of the added incentive after the lecture but before the examination (incentive-to-perform). Better student performance was associated with added student incentive, better content coverage, and more lecturer expressiveness. However, the level of incentive interacted with both content coverage and expressiveness. Lecturer expressiveness had a large effect (eta2 = 9.4 per cent) when there was no added incentive, but no significant effect when an external incentive was added. Content coverage had a small effect (eta2 = 5.2 per cent) with no added incentive, a larger effect (eta2 = 13.0 per cent) for the incentive-to-perform condition, and the largest effect (eta2 = 26.5 per cent) for the incentive-to-learn-and-perform conditions. These findings suggest that lecturer expressiveness has a substantial impact when extrinsic motivation is low, and that added incentives have separate effects on motivation to learn and motivation to perform.
Article
The relative validities of forced-choice (ipsative) and Likert rating-scale item formats as criterion measures are examined. While there has been much debate about the relative technical and psychometric merits and demerits of ipsative instruments, the present research focused on the crucial question of whether the use of this format has any practical benefit in terms of improved validity. An analysis is reported from a meta-analysis data set. This demonstrates that higher operational validity coefficients (prediction of line-manager ratings of competencies) are associated with the use of forced-choice (r=.38) rather than rating scale (r=.25) item formats for the criterion measurement instrument when performance is rated by the same line managers on both formats and where the predictor is held constant. Thus the apparent criterion-related validity of a predictor can increase by 50% simply by changing the format of the criterion measurement instrument. The implications of this for practice are discussed.
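The 50% figure quoted above follows directly from the two coefficients reported:

```latex
% Relative gain in operational validity from switching the criterion
% instrument from the rating scale (RS) to the forced-choice (FC) format:
\frac{r_{\text{FC}} - r_{\text{RS}}}{r_{\text{RS}}} = \frac{.38 - .25}{.25} = .52 \approx 50\%.
```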
Article
The first phase of this research effort describes an effort to directly measure the attitudes and opinions of employment test takers toward the tests they just took; the instrument is called the Test Attitude Survey (TAS). Nine factors were developed which reflect test takers' expressed effort and motivation on the test, the degree of concentration, perceived test ease, and the like. Several studies were conducted showing that TAS factors were significantly sensitive to differences in test types and administration permitting the inference that the TAS possessed construct validity. The second phase of this study tested several propositions and hypotheses. In one study, it is shown that the applicants report significantly higher effort and motivation on the employment tests compared to incumbents, even when ability is held constant. A second study showed that a small but significant relationship exists between TAS factor scores, test performances, and the person factors. Moreover, some of the racial differences on test performances can be accounted for via the TAS factor scores; it is observed that after holding these TAS factors constant, racial differences on the employment tests scores diminished. In a third study, very limited evidence was found for the incremental and moderating effects of these attitudes, but there were several limitations to the study associated with small sample sizes, unknown reliabilities in the criterion scales, and so forth. Discussion focussed on the potential practical applications of the TAS instrument and factor scores. It is suggested that further research could have some utility in this domain.
Article
This paper proposes that when optimally answering a survey question would require substantial cognitive effort, some respondents simply provide a satisfactory answer instead. This behaviour, called satisficing, can take the form of either (1) incomplete or biased information retrieval and/or information integration, or (2) no information retrieval or integration at all. Satisficing may lead respondents to employ a variety of response strategies, including choosing the first response alternative that seems to constitute a reasonable answer, agreeing with an assertion made by a question, endorsing the status quo instead of endorsing social change, failing to differentiate among a set of diverse objects in ratings, saying ‘don't know’ instead of reporting an opinion, and randomly choosing among the response alternatives offered. This paper specifies a wide range of factors that are likely to encourage satisficing, and reviews relevant evidence evaluating these speculations. Many useful directions for future research are suggested.
Article
The role of test-taking attitudes in the decisions of applicants to withdraw from a selection process was examined. Measures of test-taking attitudes were administered to 3,290 police officer applicants. Interviews were conducted with 618 applicants who withdrew from the selection process. Comparative anxiety, motivation, and literacy scales were found to predict withdrawal, but the effects were quite small. African-Americans were more likely to withdraw. Small race differences were found on test attitude scales. The requirement of taking a test was not a major factor in applicant withdrawal; procedural fairness and several other factors appeared to play a greater role. A model of applicant withdrawal is proposed based on the qualitative data from applicants who withdrew.
Article
There is widespread concern that assessments which have no direct consequences for students, teachers or schools underestimate student ability, and that the extent of this underestimation increases as the students become ever more familiar with such tests. This issue is particularly relevant for international comparative studies such as the IEA's Third International Mathematics and Science Study (TIMSS) and the OECD's Programme for International Student Assessment (PISA). In the present experimental study, a short form of the PISA mathematical literacy test is used to explore whether the levels of test motivation and test performance observed in the context of the standard PISA assessment situation can be improved by raising the stakes of testing. The impact of (1) informational feedback, (2) grading, and (3) performance-contingent financial rewards on the personal value of performing well, perceived utility of participating in the test, intended and invested effort, task-irrelevant cognitions, and test performance is investigated. The central finding of the study is that the different treatment conditions make the various value components of test motivation equally salient. Consequently, no differences were found either with respect to intended and invested effort or to test performance.
Occupational Personality Questionnaires: Concept model manual and user's guide
Saville, P., Holdsworth, R., Nyfield, G., Cramp, L., & Mabey, W. (1993). Occupational Personality Questionnaires: Concept model manual and user's guide. Esher, England: Saville & Holdsworth Ltd.
The Big Five Triplets-Development of a multidimensional forced-choice questionnaire
Wetzel, E., & Frick, S. (2017). The Big Five Triplets-Development of a multidimensional forced-choice questionnaire. Manuscript in preparation.