Article

Comparing the validity of trait estimates from the multidimensional forced-choice format and the rating scale format


Abstract

The multidimensional forced-choice (MFC) format has been proposed as an alternative to rating scales (RS) that may be less susceptible to response biases. The goal of this study was to compare the validity of trait estimates from the MFC and the RS format when using normative scoring for both formats. We focused on construct validity and criterion-related validity. In addition, we investigated test-retest reliability over a period of six months. Participants were randomly assigned to the MFC (N = 593) or the RS (N = 622) version of the Big Five Triplets. In addition to self-ratings on the Big Five Triplets and other personality questionnaires and criteria, we obtained other-ratings (N = 770) for the Big Five Triplets. The Big Five in the Big Five Triplets corresponded well with the Big Five in the Big Five Inventory except for agreeableness in the MFC version. The majority of the construct validity coefficients differed between the MFC and the RS version, whereas criterion-related validities were very similar. The self- and other-rated Big Five Triplets showed higher correlations in the MFC format than in the RS format. The reliability of trait estimates on the Big Five and test-retest reliabilities were lower for MFC compared to RS. For the MFC format to be able to replace the RS format, more research on how to obtain ideal constellations of items that are matched in their desirability is needed. (PsycINFO Database Record (c) 2019 APA, all rights reserved).
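The normative scoring referred to in this abstract rests on the Thurstonian IRT (TIRT) approach, which begins by recoding each rank-ordered MFC block into binary pairwise comparisons. The sketch below illustrates only that recoding step; it is a minimal sketch with illustrative names, not the authors' code:

```python
from itertools import combinations

def triplet_to_pairwise(ranking):
    """Recode one MFC triplet into binary pairwise outcomes.

    `ranking` lists the three item indices of a block from most to
    least descriptive, e.g. (2, 0, 1). Each of the 3 possible item
    pairs (i, j) is coded 1 if item i was preferred to item j, else 0;
    a Thurstonian IRT model treats these binary outcomes as indicators
    of differences between latent item utilities.
    """
    position = {item: rank for rank, item in enumerate(ranking)}
    outcomes = {}
    for i, j in combinations(sorted(position), 2):
        outcomes[(i, j)] = int(position[i] < position[j])
    return outcomes

# A respondent ranks item 2 first, item 0 second, item 1 last:
print(triplet_to_pairwise((2, 0, 1)))
# {(0, 1): 1, (0, 2): 0, (1, 2): 0}
```

Each triplet thus yields three binary outcomes per respondent, which is the data layout the TIRT model is fit to.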


... Brown and Maydeu-Olivares (2011) suggested that MFC measures should be created by mixing positively and negatively keyed statements within a block, called a heteropolar block, to improve parameter estimation accuracy in the TIRT model application. Following their suggestion, recent empirical studies developed MFC measures by including negatively keyed statements within blocks (e.g., Lee et al., 2018; Ng et al., 2021; Walton et al., 2020; Wetzel & Frick, 2019). Although the inclusion of negatively keyed statements may bolster the estimation accuracy of MFC measures, researchers have raised concerns about negatively keyed statements in MFC measures for practical and psychometric reasons (e.g., Bürkner et al., 2019; Fisher et al., 2019; Lin & Brown, 2017; Ng et al., 2021). ...
... Research findings on the criterion-related validities of the two MFC scoring methods (i.e., PI and TIRT) have been mixed. One set of studies showed that the TIRT scoring method yields similar or better criterion-related validities than the PI scoring method (e.g., Lee et al., 2018; Walton et al., 2020; Wetzel & Frick, 2019). In contrast, another study showed that the PI scoring method yielded better criterion-related validity evidence than the TIRT scoring method (Fisher et al., 2019). ...
... Specifically, 14 out of 20 blocks were heteropolar blocks. Wetzel and Frick (2019) also constructed a 20-triplet MFC measure of Big Five personality, and 19 out of 20 triplets were designed as heteropolar blocks. Walton et al. (2020) developed a 20-triplet MFC measure of Big Five personality and designed all 20 blocks as heteropolar blocks by including at least one negatively keyed statement in each block. ...
Article
Multidimensional forced choice (MFC) personality tests have recently come to light as important personnel assessments in industrial and organizational psychology. For developing MFC measures, researchers have recommended including heteropolar blocks (i.e., blocks in which negatively and positively keyed statements are mixed) to improve scoring estimation accuracy. However, very few studies have explored the impact of heteropolar blocks on psychometric properties within the MFC context. In this study, we 1) explored how heteropolar blocks influence the reliability and validity of MFC tests through Monte Carlo simulations and 2) empirically demonstrated how MFC test designs associated with heteropolar blocks affect criterion-related validity using real examinees. Results show that the Thurstonian item response theory (TIRT) scoring method and higher intrablock discrimination yielded better reliability and criterion-related validity. In addition, the results suggest that one can achieve sufficient reliability (i.e., 0.87–0.90 on average) and validity (i.e., 0.40–0.45 on average) by using highly discriminating blocks, 20–40% of them heteropolar, with the TIRT model. Our empirical demonstration showed that criterion-related validity results can differ depending on the test designs of heteropolar blocks. Practical implications and future research topics are discussed.
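Several of the excerpts above characterize test designs by their share of heteropolar blocks. Under the definition quoted in this abstract (a block mixing positively and negatively keyed statements), that share is a simple property of the keying scheme. The helper below is a minimal, hypothetical sketch of that bookkeeping, not code from any of the cited studies:

```python
def share_heteropolar(block_keys):
    """Proportion of blocks mixing positively (+1) and negatively (-1)
    keyed items.

    `block_keys` is a list of blocks, each a list of item keyings.
    A block counts as heteropolar if both keyings occur in it.
    """
    hetero = sum(1 for keys in block_keys if {+1, -1} <= set(keys))
    return hetero / len(block_keys)

# 2 of 3 triplets mix keyings -> 0.67
print(round(share_heteropolar([[+1, +1, -1], [+1, +1, +1], [-1, +1, +1]]), 2))
```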
... In line with this idea, some researchers have argued that desirability should be viewed as a property of response options rather than of items (Kuncel & Tellegen, 2009). More generally, several researchers have observed changes in item parameters or slight changes in constructs between single-stimulus and MFC formats (e.g., Guenole et al., 2018; Wetzel & Frick, 2020) and even changes in item parameters within the MFC format, depending on which items were combined into blocks (Lin & Brown, 2017). To improve the construction of fake-proof MFC questionnaires, a method is needed to estimate the fakability of each MFC block (i.e., the extent to which it can be faked). ...
... The sample consisted of two subsamples: one laboratory sample and one sample from an online access panel. In both subsamples, participants were remunerated for their participation and some participants were excluded due to data quality checks (for details, see Wetzel & Frick, 2020). The final sample consisted of 1244 participants. ...
... The instructions detailed which attributes the university was looking for in its students, which amounted to low levels of neuroticism and high levels on the other Big Five traits (i.e., extraversion, openness, agreeableness, and conscientiousness). For a detailed description of the faking instructions, see Wetzel et al. (2021); for more information about the sample and other measures, see Wetzel and Frick (2020). ...
Article
Full-text available
The multidimensional forced-choice (MFC) format has been proposed to reduce faking because items within blocks can be matched on desirability. However, the desirability of individual items might not transfer to the item blocks. The aim of this paper is to propose a mixture item response theory model for faking in the MFC format that makes it possible to estimate the fakability of MFC blocks, termed the Faking Mixture model. Given current computing capabilities, within-subject data from both high- and low-stakes contexts are needed to estimate the model. A simulation showed good parameter recovery under various conditions. An empirical validation showed that matching was necessary but not sufficient to create an MFC questionnaire that can reduce faking. The Faking Mixture model can be used to reduce fakability during test construction.
... In both subsamples, we excluded some participants based on data quality checks (response time 2 SD below the average, incorrect responses to instructed response items), which led to final sample sizes of 910 for the laboratory sample and 957 for the access panel sample, respectively. More detailed information on data exclusions and the demographic make-up of the subsamples is available in Wetzel and Frick (2020). The complete sample across all three response format groups thus consisted of 1,867 persons. ...
... We only describe the measures relevant to this study in the following. For a description of other administered measures (personality questionnaires assessing the Big Five, HEXACO, and Dark Triad), see Wetzel and Frick (2020). ...
... Wetzel and Frick (2020) only analyzed data from the MFC-matched and RS groups from the first administration of the BFT with an honest instruction. There is no overlap in research questions or analyses between that study and this one, with the exception of the comparison of predictive validity between the honest and fake-good conditions, which also uses the criterion-related validities from the honest instruction for MFC-matched and RS (see also below). ...
Article
Full-text available
A common concern with self-reports of personality traits in selection contexts is faking. The multidimensional forced-choice (MFC) format has been proposed as an alternative to rating scales (RS) that could prevent faking. The goal of this study was to compare the susceptibility of the MFC format and the RS format to faking in a simulated high-stakes setting when using normative scoring for both formats. Participants were randomly assigned to 3 groups (total N = 1,867) and filled out the Big Five Triplets once under an honest instruction and once under a fake-good instruction. Latent mean differences between the honest and fake-good administrations indicated that the Big Five domains were faked in the expected direction. Faking effects for all traits were larger for RS compared with MFC. Faking effects were also larger for the MFC version with mixed triplets compared with the MFC version with triplets that were fully matched regarding their social desirability. The MFC format does not prevent faking completely, but it reduces faking substantially. Faking can be further reduced in the MFC format by matching the items presented in a block regarding their social desirability. (PsycInfo Database Record (c) 2020 APA, all rights reserved).
... We endeavored to create and validate an MFC measure of character from the CIVIC (Ng et al., 2018) given that the traits assessed are generally considered cross-culturally desirable. To the best of our knowledge, ours is the first empirical investigation a) comparing the MFC and SS formats b) using normative scores for both (employing the TIRT approach for the former) that c) used an MFC scale created specifically to meet best-practice recommendations for normative score recovery via TIRT (Brown et al., 2017; Brown & Maydeu-Olivares, 2011) and d) administered both formats under honest and faking conditions to assess how well the MFC format prevents attempts at faking (Wetzel & Frick, 2020). The results for the CIVIC-MFC under honest conditions indicate some success in creating a valid measure of character. ...
... This same pattern was also found with all but two concurrent validity coefficients: correlations with criteria were generally stronger for CIVIC-SS than for CIVIC-MFC scores. However, as mentioned in prior literature, the convergent and concurrent validities may be stronger for the CIVIC-SS than the CIVIC-MFC simply due to two methodological issues: a) common method variance and b) reliability (Heggestad et al., 2006; Wetzel & Frick, 2020). Since the validity scales themselves were composed of single-statement, Likert-type items, this artificially inflates correlations with the CIVIC-SS scores because of the common scale format. ...
... The current study does shed some light on the applicability of the MFC format, specifically using unequally keyed item blocks for latent trait estimation, and the TIRT scoring approach when attempting to create a valid measure of character that prevents faking. In theory, TIRT can be used to derive normative scores from the MFC format, which purportedly reduces faking (Dueber, Love, Toland, & Turner, 2019; Fisher et al., 2019; Walton, Cherkasova, & Roberts, 2019; Wetzel & Frick, 2020), but attaining both ends in practice requires scale construction considerations that may be difficult to address. Our initial results raise further questions that point to future directions in research on this topic. ...
Article
There has been reemerging interest within psychology in the construct of character, yet assessing it can be difficult due to the social desirability of character traits. Forced-choice formats offer one way to address response bias, but the traditional scoring methods (i.e., ipsative) associated with this format make comparing scores between people problematic. Nevertheless, recent advances in modeling item responding (Thurstonian IRT) enable scoring that recovers absolute standing on latent traits and allows for score comparisons between people. Based on recent work in character measurement (CIVIC), we developed a multidimensional forced-choice measure of character (CIVIC-MFC) and scored it using Thurstonian IRT. Initial validation using a sample of 798 participants demonstrated good support for factorial, convergent, and concurrent validity for scores on the CIVIC-MFC, although they did not demonstrate more faking resistance than scores on a Likert-type format version. Potential explanations are discussed.
... Traditionally, subjects rate the social desirability of the content of the item on a quasi-continuous scale or classify it on a discrete scale (rating method; e.g., Edwards, 1953; Jackson, 1964). These social desirability ratings can, for instance, be used to match items within blocks in multidimensional forced-choice (MFC) questionnaires (see Wetzel & Frick, 2020). ...
... The null effect on predictive validity is, however, in line with most previous studies investigating how SDR affects the criterion-related validity of personality measures. Regardless of whether SDR scale scores were partialed out (e.g., Barrick & Mount, 1996; Ones et al., 1996; Piedmont et al., 2000) or validity coefficients were compared between the rating scale and MFC formats (e.g., Brown & Maydeu-Olivares, 2013; Wetzel & Frick, 2020; Wetzel et al., 2021), the result has often been that predictive validity remains roughly unchanged. The present master thesis, with its findings based on item response theory (IRT) modeling of SDR, hence adds evidence to the assumption that SDR undermines the construct validity of personality measures but does not have detrimental effects on their criterion-related validity. ...
Thesis
Many psychometric models have been developed in recent years to account for the influence of response biases in rating scale data. Previous research on response bias modeling has mainly focused on response styles, which reflect preferences of respondents for certain rating scale categories irrespective of item content. The present master thesis, however, aims at modeling a different kind of response bias, namely socially desirable responding (SDR). The multidimensional nominal response model (NRM; e.g., Falk & Cai, 2016), a flexible item response theory (IRT) model that allows modeling response biases whose effect patterns vary between items, served as the modeling framework. For an empirical demonstration, responses from N = 3046 job applicants taking a personality test under high-stakes conditions were modeled. Effect patterns of SDR were specified by fixing the scoring weights of SDR in the multidimensional NRM to appropriate values that were collected in a pilot study with N = 63 participants. Results indicated that modeling SDR improved model fit over and above response styles and led to effectively adjusted estimates of substantive personality traits. Furthermore, the modeling of SDR was validated in a sample of N = 365 job incumbents taking the personality test under low-stakes conditions. However, while relationships between the degree of SDR and several covariates were found, modeling SDR did not alter the predictive validity of the personality measures. Implications for theory and practice as well as limitations and future research directions concerning the modeling of SDR are discussed.
... Roberts, 2015; Dueber, Love, Toland, & Turner, 2019; Guenole, Brown, & Cooper, 2018; Murano et al., 2020; Ng et al., 2020; Stark et al., 2014; Watrin, Geiger, Spengler, & Wilhelm, 2019; Wetzel & Frick, 2020). Despite the growing popularity of forced-choice personality measures for high-stakes assessments, matching items in terms of desirability has received considerably less research attention. ...
... Personality items. We used items composing the Big Five Triplets (BFT) forced-choice measure developed by Wetzel and Frick (2020). The BFT consists of 60 public-domain personality items adopted from the International Personality Item Pool (http://ipip.ori.org; ...
Preprint
Full-text available
The effectiveness of forced-choice personality measures in preventing applicant faking may depend on how closely items comprising each forced-choice item-block are matched in terms of their desirability for the job. Item desirability matching is routinely performed on empirically obtained item desirability ratings and different approaches have been used interchangeably to obtain them. On a set of item desirability ratings obtained with the two most commonly used approaches, we show that the choice of collection approach matters and may play an important role in the forced-choice block assembly.
... In this article, we seek to address this outstanding need by discussing existing methods and proposing a novel desirability matching approach. The current work seems particularly timely given recent breakthroughs in modeling and scoring forced-choice responses (Brown & Maydeu-Olivares, 2011; Stark et al., 2005) and the emerging popularity of forced-choice measures in organizational, educational, clinical, and other substantive domains (e.g., Anguiano-Carrasco et al., 2015; Dueber et al., 2019; Guenole et al., 2018; Murano et al., 2021; Ng et al., 2021; Stark et al., 2014; Walton et al., 2020; Watrin et al., 2019; Wetzel & Frick, 2020). ...
... Our focus on pairwise similarity was thus natural to provide a direct comparison with the traditional mean difference index. Recently, a format consisting of three items (i.e., triads or triplets) per forced-choice block has been gaining in popularity (e.g., Guenole et al., 2018; Lee et al., 2019; Murano et al., 2021; Ng et al., 2021; Walton et al., 2020; Watrin et al., 2019; Wetzel & Frick, 2020), because it seems to provide an optimal balance between the information gained and the cognitive burden placed on test takers. That said, future work should also consider similarity indices involving more than two items. ...
Article
Full-text available
The forced-choice method has been proposed as a viable strategy to prevent socially desirable responding (SDR) on self-report non-cognitive measures. The ability of the method to eliminate SDR stems from matching items that are perceived as equally desirable into forced-choice item-blocks. The gold standard in quantifying similarity between items in terms of desirability has been the “mean difference index”, that is, the absolute difference between items' mean desirability ratings. This index relies on the assumption that items have one true desirability value, as efficiently and unbiasedly estimated by their respective means, and may fail if this assumption does not hold. To circumvent this issue, we propose indexing similarity between items in terms of desirability with several robust measures of absolute agreement (i.e., inter-item agreement indices). Using an empirical example, we show that relying on the mean difference index may lead to suboptimal forced-choice item-block assembly by matching items with a relatively poor inter-item agreement with respect to desirability. R code for computing the proposed agreement indices on a set of desirability ratings is provided, as are recommendations for applied researchers.
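The mean difference index defined in this abstract, and the failure mode the authors describe, can be illustrated in a few lines. The agreement-style index below (mean absolute within-rater difference) is only an illustrative stand-in under the assumption that both items were rated by the same raters; it is not necessarily one of the indices the authors propose:

```python
import numpy as np

def mean_difference_index(ratings_i, ratings_j):
    """Absolute difference between two items' mean desirability ratings."""
    return abs(np.mean(ratings_i) - np.mean(ratings_j))

def mean_absolute_disagreement(ratings_i, ratings_j):
    """Average absolute rater-level difference (an agreement-style index),
    assuming both rating vectors come from the same raters in order."""
    return np.mean(np.abs(np.asarray(ratings_i) - np.asarray(ratings_j)))

# Equal means, but raters disagree sharply on which item is more desirable:
a = [1, 5, 1, 5]
b = [5, 1, 5, 1]
print(mean_difference_index(a, b))       # 0.0 -> items look perfectly matched
print(mean_absolute_disagreement(a, b))  # 4.0 -> raters see them as far apart
```

The example shows exactly the case the abstract warns about: identical means hide the fact that no individual rater perceives the two items as equally desirable.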
... Such scales share common-method variance with the rating version of the evaluated questionnaire, and the horse-race approach is therefore biased in favor of the rating scale. Only one study incorporated both FC and rating scale versions of the validated (Big Five) and validating (HEXACO) questionnaires as well as other-ratings (Wetzel & Frick, 2019). In this study, some inter-trait correlations of the FC questionnaire differed drastically from those of the rating scale version of the same questionnaire and from meta-analytic estimates for Big Five inter-correlations, suggesting potential issues in the estimation of inter-trait correlations. ...
... T-IRT estimates are not completely ipsative, but their correlation with external criteria might still be biased due to their partially ipsative nature. The published validity estimates for T-IRT estimates are typically comparable with rating scales (Anguiano-Carrasco et al., 2015; Brown & Maydeu-Olivares, 2013; Lee et al., 2018; Watrin et al., 2019; Wetzel & Frick, 2019). However, when effect sizes in favor of one method are small, even limited bias can change the preference of one method over the other. ...
Article
Full-text available
Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales. However, the derived trait scores are often unreliable and ipsative, making interindividual comparisons in high-stakes situations impossible. Several studies suggest that these problems vanish if the number of measured traits is high. To determine the necessary number of traits under varying sample sizes, factor loadings, and intertrait correlations, simulations were performed for the two most widely used scoring methods, namely the classical (ipsative) approach and Thurstonian item response theory (IRT) models. Results demonstrate that although Thurstonian IRT models in particular perform well under ideal conditions, both methods yield insufficient reliabilities in most conditions resembling applied contexts. Moreover, not only the classical estimates but also the Thurstonian IRT estimates for questionnaires with equally keyed items remain (partially) ipsative, even when the number of traits is very high (i.e., 30). This result not only questions earlier assumptions regarding the use of classical scores in high-dimensional questionnaires, but it also raises doubts about many validation studies on Thurstonian IRT models because correlations of (partially) ipsative scores with external criteria cannot be interpreted in the usual way.
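The ipsativity problem described in this abstract follows directly from classical forced-choice scoring, where each trait's score counts how strongly its items were preferred. A minimal sketch with hypothetical responses shows that every respondent's scores sum to the same design-fixed constant, so the scores carry only within-person information:

```python
from collections import Counter

def classical_fc_scores(response, traits):
    """Classical (ipsative) scoring of forced-choice rankings.

    `response` is a list of blocks, each a ranking of trait labels from
    most to least descriptive. Higher rank earns more points (2/1/0 in
    a triplet), so the total across traits is fixed by the design.
    """
    scores = Counter({t: 0 for t in traits})
    for ranking in response:
        for points, trait in enumerate(reversed(ranking)):
            scores[trait] += points
    return dict(scores)

traits = ["E", "A", "C"]
person_1 = [["E", "A", "C"], ["E", "C", "A"]]
person_2 = [["C", "A", "E"], ["A", "C", "E"]]
s1 = classical_fc_scores(person_1, traits)
s2 = classical_fc_scores(person_2, traits)
print(s1, sum(s1.values()))  # {'E': 4, 'A': 1, 'C': 1} 6
print(s2, sum(s2.values()))  # different profile, same total: 6
```

Because both profiles sum to the same constant, a high score on one trait can only come at the expense of another, which is why interindividual comparisons of such scores are problematic.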
... Models such as the multi-unidimensional pairwise preference model (MUPP; Stark et al., 2005), the multidimensional Zinnes and Griggs pairwise preference model (ZG-MUPP; Joo et al., 2021), and the Thurstonian item response theory model (TIRT; Brown & Maydeu-Olivares, 2011) have been shown to produce scores with normative properties (Brown & Maydeu-Olivares, 2011; Joo et al., 2020; Lee et al., 2019). A meta-analysis (Cao & Drasgow, 2019) and a recent primary study (Wetzel et al., 2020) have shown that MFC measures are less susceptible to faking than rating scale measures, and MFC measures have criterion-related validity similar to rating scale measures in research contexts (Wetzel & Frick, 2019; Zhang et al., 2020). In addition, MUPP personality tests, in particular, have been shown to predict citizenship behaviors, counterproductive work behaviors, and attrition in personnel screening environments (Stark et al., 2014), and to have utility for job classification. ...
Article
Full-text available
Multidimensional forced-choice (MFC) testing has been proposed as a way of reducing response biases in noncognitive measurement. Although early item response theory (IRT) research focused on illustrating that person parameter estimates with normative properties could be obtained using various MFC models and formats, more recent attention has been devoted to exploring the processes involved in test construction and how that influences MFC scores. This research compared two approaches for estimating multi-unidimensional pairwise preference model (MUPP; Stark et al., 2005) parameters based on the generalized graded unfolding model (GGUM; Roberts et al., 2000). More specifically, we compared the efficacy of statement and person parameter estimation based on a “two-step” process, developed by Stark et al. (2005), with a more recently developed “direct” estimation approach (Lee et al., 2019) in a Monte Carlo study that also manipulated test length, test dimensionality, sample size, and the correlations between generating person parameters for each dimension. Results indicated that the two approaches had similar scoring accuracy, although the two-step approach had better statement parameter recovery than the direct approach. Limitations, implications for MFC test construction and scoring, and recommendations for future MFC research and practice are discussed.
... Aside from using statistical models to control for response styles in the measurement of latent propensities, recent developments in psychometrics have considered alternative response formats as a means to reduce biasing effects, such as formats that distinguish categorical agreement judgments from ratings of response intensity (Böckenholt, 2017) or rankings of items pertaining to different latent constructs (the so-called multidimensional forced-choice format; Wetzel & Frick, 2020). The combination of statistical modeling in data analysis and innovative response formats in test construction allows researchers to enhance the validity of educational measurement and to better specify and test theoretical assumptions about latent response processes. ...
Preprint
Chapter accepted for the International Encyclopedia of Education, 4th edition.
... Whenever the Big5/FFM or HEXACO dimensions were assessed via an ad hoc created subset of items (from an established inventory), we required that the items used to assess a dimension covered all corresponding facets to ensure that the scales appropriately captured the breadth of the respective dimension(s). We also required that items were rated on rating scales, thus excluding effects based on alternative response formats such as a multidimensional forced-choice format (O'Neill et al., 2011; Wetzel & Frick, 2019), given that the resulting data are likely not directly comparable with data from rating scales. For both the Big5/FFM and the HEXACO model, we only considered operationalizations reflecting the original conceptualizations of the corresponding five/six dimensions (e.g., Ashton & Lee, 2007; Goldberg, 1990; McCrae & Costa, 1987). ...
Article
Full-text available
Models of basic personality structure are among the most widely used frameworks in psychology and beyond, and they have considerably advanced the understanding of individual differences in a plethora of consequential outcomes. Over the past decades, two such models have become most widely used: the Five Factor Model (FFM) or Big Five, respectively, and the HEXACO Model of Personality. However, there is no large-scale empirical evidence on the general comparability of these models. Here, we provide the first comprehensive meta-analysis on (i) the correspondence of the FFM/Big Five and HEXACO dimensions, (ii) the scope of trait content the models cover, and (iii) the orthogonality (i.e., degree of independence) of dimensions within the models. Results based on 152 (published and unpublished) samples and 6,828 unique effects showed that the HEXACO dimensions incorporate notable conceptual differences compared to the Big Five, resulting in a broader coverage of the personality space and less redundancy between dimensions. Moreover, moderator analyses revealed substantial differences between operationalizations of the Big Five. Taken together, these findings have important theoretical and practical implications for the understanding of basic personality dimensions and their assessment.
... The utilization of FC allowed us to conduct analyses without needing to handle missing data. According to previous studies, FC generally produces results similar to those of other methods when measuring sensitive issues (Wetzel & Frick, 2020). ...
Article
Full-text available
Based on lifestyle exposure theory (LET), this study examined online dating application (ODA) use and victimization experiences among adolescents using large cross-national samples of Finnish, American, Spanish, and South Korean young people between ages 15 and 18. According to logistic regression analyses in two substudies, ODA use was associated with a higher likelihood of victimization through online harassment, online sexual harassment, and other cybercrimes, as well as sexual victimization by adults and peers. According to mediation analyses, this relationship was mainly accounted for by the fact that ODA users engage in more risky activities in online communication and information sharing. Attention should be paid to the risks ODAs pose to vulnerable groups, such as young people with insufficient skills to regulate their social relationships online.
... It is also possible to (5) present participants with choices between equally desirable items that have different substantive content, or with items that assess the same substantive content but differ in desirability, or even with larger sets of items that vary systematically along both dimensions (Borkenau & Ostendorf, 1989;Peabody, 1967;Pettersson et al., 2012;Rojas et al., 2019;Wetzel & Frick, 2020). Again, each item's standing on both dimensions would have to be derived empirically (instead of just being assumed), which makes this approach very effortful as well. ...
Article
This paper presents a series of pre-registered analyses testing the same theoretically derived hypothesis: If (a) the attitudes that perceivers have toward targets contribute to the variance of judgments on most items, and (b) items’ rated social desirability values align very closely with the extent to which that is the case, then the product of two items’ mid-point-centered social desirability values should predict the amount of shared variance, and thus the correlation, between these items. This hypothesis applies equally to other ratings and self-ratings. Across samples, effect sizes ranged from r = .36 to r = .80 (average r = .61) and were statistically significant in every single case. We also found that the average effect is much larger for other-ratings (r = .71) than for self-ratings (r = .49). This difference was also replicable and is likely rooted in the greater relative importance of the attitude factor in other-ratings, as compared to self-ratings. An exploratory item resampling analysis suggested that scales may achieve good internal consistency, and correlate substantially with other scales, based solely on shared attitude variance. We discuss the relevance of these findings across different domains of psychological assessment, and possible ways of dealing with the issue.
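The hypothesis tested in this paper has a direct computational form: center each item's rated social desirability at the scale midpoint, multiply the centered values for every item pair, and correlate those products with the observed inter-item correlations. The sketch below mirrors that logic with hypothetical inputs; it is an illustration of the analysis idea, not the authors' pre-registered code:

```python
import numpy as np
from itertools import combinations

def attitude_prediction(desirability, item_corr, midpoint):
    """Correlate products of midpoint-centered desirability values with
    observed inter-item correlations across all item pairs."""
    d = np.asarray(desirability, dtype=float) - midpoint
    pairs = list(combinations(range(len(d)), 2))
    predictor = np.array([d[i] * d[j] for i, j in pairs])
    observed = np.array([item_corr[i, j] for i, j in pairs])
    return np.corrcoef(predictor, observed)[0, 1]

# Hypothetical: four items' mean desirability on a 1-7 scale (midpoint 4)
# and an observed inter-item correlation matrix.
des = [6.5, 6.0, 2.0, 1.5]
R = np.array([[1.00, 0.45, -0.40, -0.38],
              [0.45, 1.00, -0.35, -0.33],
              [-0.40, -0.35, 1.00, 0.42],
              [-0.38, -0.33, 0.42, 1.00]])
print(round(attitude_prediction(des, R, midpoint=4), 2))
```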
... We are not aware of any study that has ever applied such an approach. The possible influence of perceiver attitudes on judgments may be neutralized by having perceivers choose the most fitting item from a set of items with equal social (un-)desirability but different substantive content (Wetzel & Frick, 2020), or by using items that describe the same substantive content but are balanced in terms of evaluative tone (Borkenau & Ostendorf, 1992; Pettersson et al., 2012). One may also attempt to rephrase items in a way that they reflect only substance and no longer carry any evaluation (Bäckström, Björklund, & Larsson, 2009). ...
Preprint
Full-text available
Most psychometric research relies heavily on patterns of correlations between items on which perceivers describe targets. Apart from the targets' actual ("substantive") characteristics, this type of data has been shown to also reflect a number of other, non-substantive sources of variation. This is problematic because each of these (semantic redundancy, attitudes, formal response styles) may by itself account for correlations among items, which may then be misinterpreted in terms of substantive effects. We present an integrative theoretical account of how these non-substantive influences may affect the pattern of relationships between items. We also point out how this is relevant to the validity of conclusions drawn in research on "general factors" (of personality, psychopathology, and personality pathology) and on "network models". Furthermore, we discuss various ways of dealing with the problem, which is necessary before any correlations between items may be interpreted in terms of substantive associations between the targets' actual characteristics.
Article
Full-text available
The aim of this study is to evaluate the convergent and operational validity of a modified Latvian personality inventory (LPA-3, Perepjolkina, 2014) with a multidimensional forced-choice answer format (LMFI). Using three samples, a validity study of the LMFI was conducted. Convergent validity was evaluated by examining the relations between LMFI on one side and the Big Five Inventory (BFI) and the Machiavellianism scale on the other. Operational validity was evaluated by examining the relations with assessments of subjective job performance, counterproductive work performance and with a measure of scholarly significance. The results show good convergent and operational validity for five of the six LMFI factors. The Honesty-Humility measure still needs to be improved. In the future, predictive and discriminant validation studies should be conducted with more representative Latvian samples.
Article
Forced-choice format tests have been suggested as an alternative to Likert-scale measures for personnel selection due to their robustness to faking and response styles. This study compared the degrees of faking occurring in Likert-scale and forced-choice five-factor personality tests between South Korea and the United States. It was also examined whether the forced-choice format was effective at reducing faking in both countries. Data were collected from 396 incumbents participating in both honest and applicant conditions (NSK = 179, NUS = 217). Cohen's d values for within-subjects designs (dswithin) between the two conditions were used to measure the magnitudes of faking occurring in each format and country. In both countries, the degrees of faking occurring in the Likert-scale format were larger than those in the forced-choice format, and the magnitudes of faking across the five personality traits were larger in South Korea by 0.07 to 0.12 in dswithin. The forced-choice format appeared to successfully reduce faking in both countries, as the average dswithin decreased by 0.06. However, the patterns of faking occurring in the forced-choice format varied between the two countries. In South Korea, the degrees of faking in Openness and Conscientiousness increased, whereas those in Extraversion and Agreeableness substantially decreased. Potential factors leading to trait-specific faking under the forced-choice format were discussed in relation to cultural influences on the perception of personality traits and score estimation in Thurstonian item response theory (IRT) models. Finally, the potential adverse impact of using forced-choice formats in multicultural selection settings was elaborated on. The benefit of using forced-choice formats for cross-cultural selection settings is not yet clear because there is a lack of scholarly evidence on the performance of forced-choice formats with respect to faking occurring in different cultures. With the use of a forced-choice format personality test, the magnitudes of faking decreased in the United States for all five personality traits, whereas the magnitudes of faking occurring in Openness and Conscientiousness increased in South Korea. Practitioners should consider cultural differences in how applicants view target constructs if a forced-choice format is considered for cross-cultural/international personnel selection settings.
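The dswithin values in this abstract are within-subjects effect sizes comparing the same respondents across honest and applicant conditions. One common definition divides the mean change score by the standard deviation of the change scores; the sketch below uses that definition, which may differ in detail from the authors' exact computation:

```python
import numpy as np

def d_within(honest, applicant):
    """Within-subjects Cohen's d: mean change / SD of change scores."""
    diff = np.asarray(applicant, dtype=float) - np.asarray(honest, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Hypothetical conscientiousness scores for five incumbents, honest vs. applicant:
print(round(d_within([3.2, 3.8, 4.0, 3.5, 4.1], [3.9, 4.2, 4.1, 4.4, 4.3]), 2))
```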
Article
Full-text available
Multidimensional forced-choice (FC) questionnaires have been consistently found to reduce the effects of socially desirable responding and faking in non-cognitive assessments. Although FC has been considered problematic for providing ipsative scores under classical test theory, IRT models enable the estimation of non-ipsative scores from FC responses. However, while some authors indicate that blocks composed of opposite-keyed items are necessary to retrieve normative scores, others suggest that these blocks may be less robust to faking, thus impairing the assessment's validity. Accordingly, this article presents a simulation study investigating whether it is possible to retrieve normative scores using only positively keyed items in pairwise FC computerized adaptive testing (CAT). Specifically, the simulation addressed the effects of 1) different bank assembly methods (a randomly assembled bank, an optimally assembled bank, and blocks assembled on the fly considering every possible pair of items) and 2) block selection rules (i.e., the T-rule and the Bayesian D- and A-rules) on estimate accuracy, ipsativity, and overlap rates. Moreover, different questionnaire lengths (30 and 60) and trait structures (independent or positively correlated) were studied, and a non-adaptive questionnaire was included as a baseline in each condition. In general, very good trait estimates were retrieved despite using only positively keyed items. Although the best trait accuracy and lowest ipsativity were found using the Bayesian A-rule with questionnaires assembled on the fly, the T-rule under this method led to the worst results. This points to the importance of considering both aspects when designing FC CAT.
Article
Full-text available
The Thurstonian item response model (Thurstonian IRT model) allows deriving normative trait estimates from multidimensional forced-choice (MFC) data. In the MFC format, persons must rank-order items that measure different attributes according to how well the items describe them. This study evaluated the normativity of Thurstonian IRT trait estimates both in a simulation and empirically. The simulation investigated normativity and compared Thurstonian IRT trait estimates to estimates based on classical partially ipsative scoring, on dichotomous true-false (TF) data, and on rating scale data. The results showed that, with blocks of opposite-keyed items, Thurstonian IRT trait estimates were normative, in contrast to classical partially ipsative estimates. Unbalanced numbers of items per trait, few opposite-keyed items, positively correlated traits, or assessing fewer traits did not markedly decrease measurement precision. Measurement precision was lower than that of rating scale data. The empirical study investigated whether relative MFC responses provide a better differentiation of behaviors within persons than absolute TF responses. However, criterion validity was equal and construct validity (with constructs measured by rating scales) lower in MFC. Thus, Thurstonian IRT modeling of MFC data overcomes the drawbacks of classical scoring, but gains in validity may depend on eliminating common method biases from the comparison.
Preprint
Forced choice (FC) personality measures are increasingly popular in research and applied contexts. To date, however, no method for detecting faking behavior on this format has been both proposed and empirically tested. We introduce a new methodology for faking detection on FC measures, based on the assumption that individuals engaging in faking try to approximate the ideal response on each block of items. Individuals' responses are scored relative to the ideal using a model for rank-order data not previously applied to FC measures (the Generalized Mallows Model). Scores are then used as predictors of faking in a regularized logistic regression. In Study 1, we test our approach using cross-validation and contrast generic and job-specific ideal responses. Study 2 replicates our methodology on two measures, one matched and one mismatched on item desirability. We achieved between 80% and 92% balanced accuracy in detecting instructed faking, and predicted probabilities of faking correlated with self-reported faking behavior. We discuss how this approach, driven by trying to capture the faking process, differs methodologically and theoretically from existing faking detection paradigms, as well as measure- and context-specific factors impacting accuracy.
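The Generalized Mallows Model mentioned in this preprint is built on a rank distance, typically Kendall's tau distance, between an observed ranking and a central ("ideal") ranking. The sketch below shows that distance and a simple per-person aggregation; the aggregation is an illustration of the scoring idea, not the authors' exact procedure:

```python
from itertools import combinations

def kendall_distance(r1, r2):
    """Number of item pairs ordered differently in two rankings."""
    pos1 = {item: k for k, item in enumerate(r1)}
    pos2 = {item: k for k, item in enumerate(r2)}
    return sum(
        (pos1[i] < pos1[j]) != (pos2[i] < pos2[j])
        for i, j in combinations(pos1, 2)
    )

# Distance of a respondent's triplet rankings from the ideal responses;
# a small total distance means the responses hug the ideal, which could
# serve as one input feature for a faking classifier.
ideal = [["A", "B", "C"], ["C", "A", "B"]]
observed = [["A", "C", "B"], ["C", "A", "B"]]
print(sum(kendall_distance(o, i) for o, i in zip(observed, ideal)))  # 1
```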
Article
This research developed a new ideal point-based item response theory (IRT) model for multidimensional forced choice (MFC) measures. We adapted the Zinnes and Griggs (ZG; 1974) IRT model and the multi-unidimensional pairwise preference (MUPP; Stark et al., 2005) model, henceforth referred to as ZG-MUPP. We derived the information function to evaluate the psychometric properties of MFC measures and developed a model parameter estimation algorithm using Markov chain Monte Carlo (MCMC). To evaluate the efficacy of the proposed model, we conducted a simulation study under various experimental conditions such as sample sizes, number of items, and ranges of discrimination and location parameters. The results showed that the model parameters were accurately estimated when the sample size was as low as 500. The empirical results also showed that the scores from the ZG-MUPP model were comparable to those from the MUPP model and the Thurstonian IRT (TIRT) model. Practical implications and limitations are further discussed.
Article
Forced-choice (FC) assessments of noncognitive psychological constructs (e.g., personality, behavioral tendencies) are popular in high-stakes organizational testing scenarios (e.g., informing hiring decisions) due to their enhanced resistance against response distortions (e.g., faking good, impression management). The measurement precisions of FC assessment scores used to inform personnel decisions are of paramount importance in practice. Different types of reliability estimates are reported for FC assessment scores in current publications, while consensus on best practices appears to be lacking. In order to provide understanding and structure around the reporting of FC reliability, this study systematically examined different types of reliability estimation methods for Thurstonian IRT-based FC assessment scores: their theoretical differences were discussed, and their numerical differences were illustrated through a series of simulations and empirical studies. In doing so, this study provides a practical guide for appraising different reliability estimation methods for IRT-based FC assessment scores.
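One widely used estimator in this family is the empirical reliability of IRT trait estimates, which splits the observed score variance into true-score and error components. Whether this matches the estimators the authors compare is not settled by the abstract, so the sketch below should be read as one candidate method under that assumption:

```python
import numpy as np

def empirical_reliability(theta_hat, se):
    """Empirical reliability for EAP-type IRT scores:
    signal variance / (signal variance + mean squared standard error)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    error_var = float(np.mean(np.asarray(se, dtype=float) ** 2))
    signal_var = np.var(theta_hat, ddof=1)
    return signal_var / (signal_var + error_var)

# Hypothetical EAP estimates and posterior SDs for 500 examinees:
rng = np.random.default_rng(1)
theta_hat = rng.normal(0, 0.9, size=500)
se = np.full(500, 0.45)
print(round(empirical_reliability(theta_hat, se), 2))  # ~0.80
```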
Article
Although modern item response theory (IRT) methods of test construction and scoring have overcome ipsativity problems historically associated with multidimensional forced choice (MFC) formats, there has been little research on MFC differential item functioning (DIF) detection, where item refers to a block, or group, of statements presented for an examinee’s consideration. This research investigated DIF detection with three-alternative MFC items based on the Thurstonian IRT (TIRT) model, using omnibus Wald tests on loadings and thresholds. We examined constrained and free baseline model comparisons strategies with different types and magnitudes of DIF, latent trait correlations, sample sizes, and levels of impact in an extensive Monte Carlo study. Results indicated the free baseline strategy was highly effective in detecting DIF, with power approaching 1.0 in the large sample size and large magnitude of DIF conditions, and similar effectiveness in the impact and no-impact conditions. This research also included an empirical example to demonstrate the viability of the best performing method with real examinees and showed how a DIF and a DTF effect size measure can be used to assess the practical significance of MFC DIF findings.