Article

Judge Response Theory? A Call to Upgrade Our Psychometrical Account of Creativity Judgments.

Abstract

The Consensual Assessment Technique (CAT), and more generally the use of product creativity judgments, is a central and actively debated method to assess product and individual creativity. Despite constant interest in strategies to improve its robustness, we argue that most psychometric investigations and scoring strategies for CAT data remain constrained by a flawed psychometrical framework. We first describe how our traditional statistical account of multiple judgments, which largely revolves around Cronbach's alpha and sum/average scores, poses conceptual and practical problems that are largely imputable to the influence of classical test theory, such as misestimating the construct of interest, misestimating reliability and structural validity, underusing latent variable models, and reducing judge characteristics to a source of error. We then propose that the item-response theory framework, traditionally used for multi-item situations, be transposed to multiple-judge CAT situations as Judge Response Theory (JRT). After defining JRT, we present its multiple advantages, such as accounting for differences in individual judgment as a psychological process rather than as random error, giving a more accurate account of the reliability and structural validity of CAT data, and allowing the selection of complementary (rather than redundant) judges. The comparison of models and their availability in statistical packages are notably discussed as further directions.
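As a concrete illustration of the proposal, the following R sketch treats judges as the "items" of a polytomous IRT (graded response) model using the mirt package. The data, judge severities, and category cut-offs are simulated for illustration only; this is not the authors' code or data.

```r
library(mirt)

set.seed(1)
n <- 200
quality  <- rnorm(n)                       # latent product creativity
severity <- c(-0.5, 0, 0.3, 0.8)           # hypothetical judge severities
ratings  <- sapply(severity, function(s)
  findInterval(quality - s + rnorm(n, sd = 0.6), c(-1.5, -0.5, 0.5, 1.5)) + 1)
colnames(ratings) <- paste0("judge", 1:4)

# one latent dimension (product creativity); judges play the role of "items"
fit <- mirt(data.frame(ratings), model = 1, itemtype = "graded")
coef(fit, simplify = TRUE)$items    # judge discriminations and thresholds
theta <- fscores(fit)               # model-based creativity estimates
cor(theta[, 1], rowMeans(ratings))  # comparison with the usual average score
```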

... The authors use up-to-date methods, mainly relying on human rater judgments, a classical paradigm in the field [1]. Nevertheless, their work suffers from the scattered and inconsistent recommendations on rater-mediated assessment in creativity research [5,11]. Notably, their methods, based on linear mixed models (with intraclass correlation coefficients), make unrealistic assumptions about their data. ...
... While I do not claim that the validity of their findings is jeopardized here, it is clear that our field has yet to develop a consistent and robust methodological approach for rater-mediated measurement. As I have argued [8,11], item-response theory (IRT) provides a unified framework that makes realistic assumptions and allows researchers to account for item and rater effects (varying severities, difficulties, discriminations, etc.), along with their interactions [11,13,14]. Further, seeing how dispersed the literature on reliability estimation with human raters is in our field [5], and how it is usually disconnected from scoring procedures (e.g., the same average score is used regardless of the reliability estimate) [11], having a framework that uses a single model to estimate attributes (e.g., to measure an artwork's quality) and their uncertainty (i.e., standard error/reliability) is a key advance. ...
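Continuing the simulated sketch above, the lines below illustrate (in a hedged way) how a single fitted model can return both the attribute estimates and their uncertainty; all function names are from the mirt package, and `fit` is the graded response model from the earlier sketch.

```r
library(mirt)

# continuing the simulated 'fit' object from the sketch above
scores <- fscores(fit, full.scores.SE = TRUE)  # theta and its standard error
head(scores)                                   # columns F1 and SE_F1

marginal_rxx(fit)      # marginal (model-based) reliability of the judgments
empirical_rxx(scores)  # empirical reliability from the score/SE pairs

# conditional precision: where on the creativity continuum the panel is informative
plot(fit, type = "info")
```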
... However, it remains unknown how factors such as the nature of the creative task and the personality of the rater can affect how novelty and usefulness contribute to evaluations of creativity. Providing answers to these questions is of central importance to our understanding of how creativity is evaluated, defined, and perceived, and may inform the development of subjective creativity assessments that can account for variance across raters (Barbot, Hass, & Reiter-Palmon, 2019; Myszkowski & Storme, 2019). As a brief but important note, this study is concerned with the evaluation of exogenous ideas (i.e., ideas generated by others) as opposed to the evaluation of one's own ideas, which is likely to be a related but distinct evaluative process (Karwowski, Czerwonka, & Kaufman, 2020; Rodriguez, Cheban, Shah, & Watts, 2020; Runco & Smith, 1992). ...
... First, creativity research relies heavily on subjective assessments of creativity, and so understanding the interpersonal factors that cause variation in these assessments is key to developing strong and reliable measures. Indeed, the most common subjective assessment method, the CAT, has recently been criticized for not accounting for variation across raters (Barbot et al., 2019; Myszkowski & Storme, 2019). The limitations intrinsic to subjective assessments of creativity are well-known, and have stimulated the development of objective assessments including distributional semantics methods (Acar, Berthiaume, Grajzel, Dumas, & Flemister, 2021; Beaty & Johnson, 2021) and machine learning techniques (Cropley & Marrone, 2021; Edwards, Peng, Miller, & Ahmed, 2021). ...
... Our findings help extend a growing body of work that has examined how individuals consider novelty and usefulness when evaluating creativity (Acar et al., 2017; Caroff & Besançon, 2008; Diedrich et al., 2015; Runco & Charles, 1993; Storme & Lubart, 2012), and how variations in the evaluation of creativity relate to individual differences (Herman & Reiter-Palmon, 2011; Karwowski et al., 2020; Lee et al., 2017; Mastria et al., 2019; Mueller et al., 2012). Overall, our findings highlight the importance of considering contextual and interpersonal factors when researchers examine how creativity is evaluated, defined, and perceived, strengthening recent calls for creativity assessments that can account for variation across raters (Barbot et al., 2019; Myszkowski & Storme, 2019). Indeed, it seems likely that both the generation and evaluation of creative ideas may involve markedly different processes depending on both the individual in question and the context of the problem. ...
Article
Full-text available
According to the standard definition, creative ideas must be both novel and useful. While a handful of recent studies suggest that novelty is more important than usefulness to evaluations of creativity, little is known about the contextual and interpersonal factors that affect how people weigh these two components when making an overall creativity judgment. We used individual participant regressions and mixed-effects modeling to examine how the contributions of novelty and usefulness to ratings of creativity vary according to the context of the idea (i.e., how relevant it is to the real world) and the personality of the rater. Participants (N = 121) rated the novelty, usefulness, and creativity of ideas from two contexts: responses to the alternative uses task (AUT) and genuine suggestions for urban planning projects. We also assessed three personality traits of participants: openness, intellect, and risk-taking. We found that novelty contributed more to evaluations of creativity among AUT ideas than projects, while usefulness contributed more among projects than AUT ideas. Further, participants with higher openness and higher intellect placed a greater emphasis on novelty when evaluating AUT ideas, but a greater emphasis on usefulness when evaluating projects. No significant effects were found for the risk-taking trait.
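The kind of analysis described in this abstract could look roughly like the following lme4 sketch. The data frame `eval_data` and all variable names are hypothetical placeholders, not the authors' materials.

```r
library(lme4)

# 'eval_data' (hypothetical): one row per rater x idea, with centered predictors
#   creativity, novelty, usefulness : ratings on the study's scales
#   context  : factor, "AUT" vs "project"
#   openness : rater trait score
fit <- lmer(
  creativity ~ (novelty + usefulness) * context +
               (novelty + usefulness) * openness +
               (1 + novelty + usefulness | rater) + (1 | idea),
  data = eval_data
)
summary(fit)  # interaction terms test whether the weights shift with context and traits
```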
... With some exceptions (Akbari Chermahini, Hickendorff, & Hommel, 2012; Barbot, Tan, Randi, Santa-Donato, & Grigorenko, 2012; Forthmann, Celik, Holling, Storme, & Lubart, 2018; Forthmann et al., 2016; Myszkowski, 2019; Myszkowski & Storme, 2017; Sen, 2016; Silvia et al., 2008; Wang, Ho, Cheng, & Cheng, 2014), IRT procedures are rarely used in creativity research. Although the reason for this is unclear, it has been advanced (Myszkowski & Storme, 2019) that a central reason behind the underuse of such models in creativity research, as in psychological research in general, is that training in IRT is rarely part of psychology education, and that IRT modeling software is often less easily available than traditional CTT applications (Borsboom, 2006). As we will briefly discuss here, there are, however, several ways that one can use IRT modeling. ...
... IRT modeling of CAT data is rarely carried out (Myszkowski & Storme, 2019), and such applications may therefore raise issues that have not yet been identified. At this stage, however, a few can be anticipated, and specific attention was dedicated to offering solutions to them directly in "jrt". ...
... Recent perspectives on psychological measurement in general (Borsboom, 2006), and on the CAT specifically (Myszkowski & Storme, 2019), have extensively discussed the multiple conceptual and practical advantages of using measurement models, whether they take the form and name of item-response models, latent variable models, or factor analysis models (Mellenbergh, 1994), over the regular practice of sum scoring. Sum scores are certainly more popular than estimates of attributes obtained with measurement models. ...
Article
Full-text available
Although the Consensual Assessment Technique (CAT; Amabile, 1982) is considered a gold standard in the measurement of product attributes, including creativity (Baer & McKool, 2009), considerations on how to improve its scoring and psychometric modeling are rare. Recently, it was advanced (Myszkowski & Storme, 2019) that the framework of Item-Response Theory (IRT) is appropriate for CAT data and would provide several practical and conceptual benefits to both the psychometric investigation of the CAT and the scoring of creativity. However, the packages recommended for IRT modeling of ordinal data are hardly accessible to researchers unfamiliar with IRT, or offer minimal possibility to adapt their outputs to judgment data. Thus, the package "jrt" was developed for the open-source programming language R and is available on the Comprehensive R Archive Network (CRAN). Its main aim is to make IRT analyses easily applicable to CAT data, by automating model selection, by diagnosing and dealing with issues related to model-data incompatibilities, by providing quick, customizable, and publication-ready outputs for communication, and by guiding researchers new to IRT through the different methods available. We provide brief tutorials and examples for the main functions, which are further detailed in the online vignette and documentation on CRAN. We finally discuss the current limitations and anticipated extensions of the "jrt" package, and invite researchers to take advantage of its practicality.
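A minimal sketch of the workflow this abstract describes is given below, using the function names documented for the package on CRAN (jrt, jcc.plot, info.plot). The ratings are simulated for illustration; readers should check ?jrt for the exact arguments of the installed version.

```r
# install.packages("jrt")
library(jrt)

set.seed(2)
quality <- rnorm(150)
ratings <- sapply(1:6, function(j)
  findInterval(quality + rnorm(150, sd = 0.6), c(-1.5, -0.5, 0.5, 1.5)) + 1)
colnames(ratings) <- paste0("Judge_", 1:6)

fit <- jrt(data.frame(ratings))  # automated comparison/selection of polytomous IRT models
fit                              # retained model, judge parameters, reliability summary
jcc.plot(fit)                    # judge category curves
info.plot(fit)                   # information / reliability / standard error functions
```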
... These changes are triggered notably through inputs from the neuroscience of creativity (Benedek, Christensen, Fink, & Beaty, 2019), the emergence of digital assessments and new assessment paradigms (Barbot, 2018b; Hass, 2017; Loesche, Goslin, & Bugmann, 2018), and new statistical developments and approaches (e.g., Fürst, 2018; Myszkowski & Storme, 2019; Primi, Silvia, Jauk, & Benedek, 2019). These developments have provided new solutions and interpretations to enduring measurement problems (e.g., Forthmann, Szardenings, & Holling, 2018; Hornberg & Reiter-Palmon, 2017) and spurred renewed interest in the dynamization of creativity assessments (e.g., Jankowska, Czerwonka, Lebuda, & Karwowski, 2018). ...
... Measures of creative performance call upon the whole creative potential as engaged in a given simulated domain-based context ("product-based" tasks), such as producing a drawing, a short story, or a musical composition under standardized conditions. Resulting productions are generally rated by external observers using the CAT (Amabile, 1982) or related techniques well represented in this issue (Cseh & Jeffries, 2019; Glaveanu, 2019; Myszkowski & Storme, 2019; Primi, Silvia, Jauk, & Benedek, 2019) and further discussed later. Consistent with decades of findings, these measures are more domain-specific, in that they often engage a different combination of domain-relevant resources for the task at hand, including domain-specific knowledge (Barbot, 2018a). ...
... Several articles in the special issue directly discuss the CAT approach (Cseh & Jeffries, 2019; Glaveanu, 2019; Myszkowski & Storme, 2019). Cseh and Jeffries (2019) cover what we know about the CAT from these last 35 years of research. ...
Article
This commentary discusses common relevant themes that have been highlighted across contributions in this special issue on "Creativity Assessment: Pitfalls, Solutions, and Standards." We first highlight the challenges of operationalizing creativity through the use of a range of measurement approaches that are simply not tapping into the same aspect of creativity. We then discuss pitfalls and challenges of the three most popular measurement methods employed in the field, namely divergent thinking tasks, product-based assessment using the consensual assessment techniques, and self-report methodology. Finally, we point to two imperative standards that emerged across contributions in this collection of articles, namely transparency (need to accurately define, operationalize, and report on the specific aspect[s] of creativity studied) and homogenization of creativity assessment (identification and consistent use of an optimal "standard" measure for each major aspect of creativity). We conclude by providing directions on how the creativity research community and the field can meet these standards.
... While researchers have cautioned against using anything other than expert samples, empirical work suggests that laypersons may provide valid and reliable ratings when adequately prepared for the rating task (Hass et al., 2018; Storme et al., 2014). Either way, rater characteristics should be taken into account by considering the provided responses in sophisticated statistical models (Myszkowski & Storme, 2019; Primi et al., 2019; Robitzsch & Steinfeld, 2018). ...
... Judge response theory (JRT) refers to the adaptation of polytomous item response theory models [e.g., the graded response model (Samejima, 1969) or the generalized partial credit model (Muraki, 1992)] to human ratings in the context of creativity research (Myszkowski, 2021; Myszkowski & Storme, 2019). JRT explicitly models differences in the rating behavior of human judges, as reflected by severity (or leniency) effects and by differences between raters in their discrimination parameters. ...
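The two model families named here can be fitted and compared with the mirt package, as in the following sketch on simulated judge ratings (illustrative only, not drawn from the cited studies).

```r
library(mirt)

set.seed(3)
quality <- rnorm(200)
ratings <- sapply(1:5, function(j)
  findInterval(quality + rnorm(200, sd = 0.6), c(-1.5, -0.5, 0.5, 1.5)) + 1)

grm  <- mirt(data.frame(ratings), 1, itemtype = "graded")  # graded response model
gpcm <- mirt(data.frame(ratings), 1, itemtype = "gpcm")    # generalized partial credit model

anova(grm, gpcm)                  # AIC/BIC comparison of the two judge models
coef(grm, simplify = TRUE)$items  # slopes ~ discrimination; intercepts ~ severity
```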
Preprint
Full-text available
Human ratings are ubiquitous in creativity research. Yet rating responses to creativity tasks (typically several hundred or even thousands of responses per rater) is often time-consuming and expensive. Planned missing data designs, in which raters rate only a subset of the total number of responses, have recently been proposed as one possible way to decrease overall rating time and monetary costs. However, researchers also need ratings that meet psychometric standards, such as a certain degree of reliability, and psychometric work on planned missing designs is currently lacking in the literature. In this work, we introduce how judge response theory and simulations can be used to fine-tune the planning of missing data designs. We provide open code for the community and illustrate the proposed approach with a cost-effectiveness calculation based on a realistic example. We show that fine-tuning helps save rating time and monetary costs, while simultaneously targeting expected levels of reliability.
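A toy version of the idea (not the authors' code) might look like the following: simulate judge ratings from a graded response model with mirt, impose a planned missing design in which each product is rated by only a subset of judges, and compare marginal reliabilities.

```r
library(mirt)

set.seed(4)
n_judges <- 8; n_products <- 300
a <- matrix(rlnorm(n_judges, 0.2, 0.2))  # judge discriminations
d <- t(sapply(1:n_judges, function(i) sort(rnorm(4), decreasing = TRUE)))  # category intercepts
full <- simdata(a, d, n_products, itemtype = "graded")

# planned missingness: each product keeps ratings from only 4 randomly chosen judges
planned <- full
for (p in 1:n_products) planned[p, sample(n_judges, n_judges - 4)] <- NA

fit_full    <- mirt(data.frame(full),    1, itemtype = "graded")
fit_planned <- mirt(data.frame(planned), 1, itemtype = "graded")
c(full = marginal_rxx(fit_full), planned = marginal_rxx(fit_planned))
```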
... Nevertheless, the information matrix of the 4PL model could not be inverted to compute the parameter standard errors (decreasing the convergence tolerance and changing the estimation method did not solve this issue), which may be a sign that the estimates were unstable. Item characteristic curve plots, which, for binary models, present the expected probability of a correct response as a function of the latent ability θj, were produced using the R package "jrt" [24]. To keep the paper concise, only models with appropriate fit were plotted. ...
... However, as with the binary models, the information matrix of the 4PNL model could not be inverted to compute the parameter standard errors, which may be a sign that the estimates were unstable. As for the binary models, item category curve plots, which present the expected probability of selecting a category as a function of the latent ability θj, were computed using "jrt" [24]. Again, to keep the paper concise, only models with appropriate fit were reported. ...
Article
Full-text available
Assessing job applicants' general mental ability online poses psychometric challenges due to the necessity of having brief but accurate tests. Recent research (Myszkowski & Storme, 2018) suggests that recovering distractor information through Nested Logit Models (NLM; Suh & Bolt, 2010) increases the reliability of ability estimates in reasoning matrix-type tests. In the present research, we extended this result to a different context (online intelligence testing for recruitment) and in a larger sample (N = 2949 job applicants). We found that the NLMs outperformed the Nominal Response Model (Bock, 1970) and provided significant reliability gains compared with their binary logistic counterparts. In line with previous research, the gain in reliability was especially obtained at low ability levels. Implications and practical recommendations are discussed.
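A schematic mirt sketch of the approach described in this abstract is shown below. The simulated multiple-choice data and key are purely illustrative, and the nested logit itemtype name ("2PLNRM") follows the mirt documentation; this is not the authors' analysis.

```r
library(mirt)

set.seed(5)
n_items <- 12; n_persons <- 500
key     <- sample(1:4, n_items, replace = TRUE)   # correct option per item
ability <- rnorm(n_persons)

# crude simulated choices: the correct option is picked more often at high ability
choices <- sapply(1:n_items, function(i)
  ifelse(runif(n_persons) < plogis(ability), key[i],
         sample((1:4)[-key[i]], n_persons, replace = TRUE)))
colnames(choices) <- paste0("item", 1:n_items)

scored  <- key2binary(choices, key)                    # keyed 0/1 responses
fit_2pl <- mirt(data.frame(scored),  1, itemtype = "2PL")
fit_nlm <- mirt(data.frame(choices), 1, itemtype = "2PLNRM", key = key)  # nested logit

c(binary = marginal_rxx(fit_2pl), nested_logit = marginal_rxx(fit_nlm))
```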
... IRT assumes manifest variables (i.e., the response behavior to test items) and a latent variable (i.e., an underlying characteristic of the subjects). Despite its clear advantages (e.g., items with different difficulty levels and sample independence of test characteristics), IRT approaches usually assume only one latent variable, which is reflected in the correlation between the manifest variables [10][11][12][13]. ...
Article
Full-text available
Parallel test versions require a comparable degree of difficulty and must capture the same characteristics using different items. This can become challenging when dealing with multivariate items, which are very common in, for example, language or image data. Here, we propose a heuristic to identify and select similar multivariate items for the generation of equivalent parallel test versions. This heuristic includes: 1. inspection of correlations between variables; 2. identification of outlying items; 3. application of a dimension-reduction method, such as principal component analysis (PCA); 4. generation of a biplot (in the case of PCA, of the first two principal components) and grouping of the displayed items; 5. assignment of the items to parallel test versions; and 6. checking of the resulting test versions for multivariate equivalence, parallelism, reliability, and internal consistency. To illustrate the proposed heuristic, we applied it to the items of a picture naming task. From a pool of 116 items, four parallel test versions were derived, each containing 20 items. We found that our heuristic can help to generate parallel test versions that meet the requirements of classical test theory, while simultaneously taking several variables into account.
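Several steps of the heuristic (1, 3, 4, and 5) could be sketched in base R as follows, with hypothetical item properties standing in for the real picture-naming variables; the clustering-based assignment in the last step is only one possible way to group similar items.

```r
set.seed(6)
items <- data.frame(frequency    = rnorm(116),
                    word_length  = rnorm(116),
                    imageability = rnorm(116))  # hypothetical item properties
rownames(items) <- paste0("item", 1:116)

round(cor(items), 2)                 # step 1: correlations between variables
pca <- prcomp(items, scale. = TRUE)  # step 3: dimension reduction (PCA)
biplot(pca)                          # step 4: biplot of the first two PCs

# step 5 (one possibility): cluster items in PC space, then deal the items of
# each cluster across the four parallel versions
clusters <- kmeans(pca$x[, 1:2], centers = 20)$cluster
versions <- ave(seq_len(nrow(items)), clusters, FUN = seq_along) %% 4 + 1
table(versions)
```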
... However, more contemporary latent variable approaches, such as Judge Response Theory (JRT), model evaluations as reflecting an underlying latent construct driven not only by the creativity of the product but also by the properties of the judges (Myszkowski & Storme, 2019). When applied to divergent thinking tests, subjective scoring typically involves training raters (often undergraduate or graduate students) to rate the originality of responses using a continuous or ordinal scale (e.g., Silvia et al., 2008, 2009). ...
Preprint
Full-text available
The visual modality is central to both reception and expression of human creativity. Creativity assessment paradigms, such as structured drawing tasks (Barbot, 2018), seek to characterize this key modality of creative ideation. However, visual creativity assessment paradigms often rely on cohorts of expert or naïve raters to gauge the level of creativity of the outputs. This comes at the cost of substantial human investment in both time and labor. To address these issues, recent work has leveraged the power of machine learning techniques to automatically extract creativity scores in the verbal domain (e.g., SemDis; Beaty & Johnson, 2021). Yet, a comparably well-vetted solution for the assessment of visual creativity is missing. Here, we introduce AuDrA—an Automated Drawing Assessment platform to extract visual creativity scores from simple drawing productions. Using a collection of line drawings and human creativity ratings, we trained AuDrA and tested its generalizability to untrained drawing sets, raters, and tasks. Across 4 datasets, nearly 60 raters, and over 13,000 drawings, we found AuDrA scores to be highly correlated with human creativity ratings for new drawings on the same drawing task (r = .64 - .93; mean = .81). Importantly, correlations between AuDrA scores and human raters surpassed those between drawings’ elaboration (i.e., ink on the page) and human creativity raters, suggesting that AuDrA is sensitive to features of drawings beyond simple degree of complexity. We discuss future directions, limitations, and link the trained AuDrA model and a tutorial (https://osf.io/kqn9v/) to enable researchers to efficiently assess new drawings.
... All analyses were performed in R (version 4.1.0), using the following packages: psych (Revelle, 2021), TAM (Robitzsch, Kiefer, & Wu, 2021), mirt (Chalmers, 2012), jrt (Myszkowski & Storme, 2019), lordif (Choi, Gibbons, & Crane, 2016), and lavaan (Rosseel, 2012). The R script and datasets used are available for re-analysis in the Open Science Framework (OSF) archive (https://osf.io/r89za/). ...
Article
Self-report scales have become the most widely used instruments to capture people's self-perception of creativity. Previous studies, however, provided only limited insight into the psychometric properties of such measures. This paper reports an extensive item response theory (IRT) analysis of the Short Scale of Creative Self (SSCS): one of the most frequently used scales of creative self-concept. Based on samples from 14 studies (overall N > 26,000), we report IRT parameters for the items of the creative self-efficacy and creative personal identity scales. We examined whether the scores obtained on the SSCS depend on the length of the response scale (5- versus 7-point Likert scales) and whether latent scores are comparable across different data collection methods (online, paper-and-pencil, phone), age, and gender. The results confirmed the two-factor structure of the SSCS, the good psychometric properties of its items, as well as invariance with regard to response scales, age, gender, and method of data collection. At the same time, the items (and consequently the scales) were easy in the psychometric sense, thus providing much more reliable scores among individuals who scored low or medium in creative self-concept. Longer (7-point) and shorter (5-point) Likert scales performed similarly, with some psychometric arguments favoring fewer points on the scale. Gender differences were negligible (Cohen's d between 0.00 and 0.01). We discuss potential ways of further improving and developing the SSCS.
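An invariance analysis of this kind could be sketched with lavaan roughly as below. The data frame sscs, the item names and counts, and the grouping variable method are hypothetical placeholders, and the call assumes a lavaan version that accepts ordered = TRUE (otherwise, pass the item names to the ordered argument).

```r
library(lavaan)

# hypothetical data frame 'sscs': items cse1..cse6 (creative self-efficacy),
# cpi1..cpi5 (creative personal identity), and a grouping variable 'method'
model <- '
  cse =~ cse1 + cse2 + cse3 + cse4 + cse5 + cse6
  cpi =~ cpi1 + cpi2 + cpi3 + cpi4 + cpi5
'

configural <- cfa(model, data = sscs, group = "method",
                  ordered = TRUE, estimator = "WLSMV")
metric <- cfa(model, data = sscs, group = "method",
              ordered = TRUE, estimator = "WLSMV",
              group.equal = "loadings")
scalar <- cfa(model, data = sscs, group = "method",
              ordered = TRUE, estimator = "WLSMV",
              group.equal = c("loadings", "thresholds"))

lavTestLRT(configural, metric, scalar)  # scaled difference tests across invariance levels
```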
... Numerous studies have further validated the CAT and examined some of its most prominent features, including implicit biases induced by creator characteristics (such as the ethnicity or gender of product creators; see, e.g., Kaufman, Baer, Agars, & Loomis, 2010; Kaufman, Niu, Sexton, & Cole, 2010). In spite of the CAT's excellent reputation, Cseh and Jeffries (2019) pointed out that the methodology still lacked clear procedural standards and a full understanding of the effects of procedural choices, including the impact of rater characteristics (see, e.g., Myszkowski & Storme, 2019). In this light, Cseh and Jeffries highlighted "the need for new debate and a program of research to clarify, evidence, and harmonize CAT methodology" (Cseh & Jeffries, 2019, p. 159). ...
Article
Full-text available
Rationale: Creativity assessment can be influenced by rater characteristics, including social group membership, such as gender. As raters are often male, the gender composition of rater panels in the Consensual Assessment Technique (CAT) could introduce unintended implicit biases into this measurement methodology. The present study analyzed such biases by examining gender differences in creativity assessment. Method: We applied the CAT and asked male (n = 26) and female (n = 39) judges to rate the creativity of fashion outfits presented on Instagram. We then examined gender differences in mean creativity ratings and rater consistency (inter-rater reliability). In an additional qualitative analysis, we analyzed implicit theories of creativity of female and male raters by comparing the criteria that these raters applied when assessing creativity. Results: We found no systematic support for gender differences in the level of creativity ratings, but observed that rating consistency was significantly higher for female than for male judges. Additional content analysis suggested that female and male raters attached different relative importance to various assessment criteria, indicating gender differences in rating criteria. Discussion: Our study suggests that rater panel composition can indeed affect aspects of creativity assessment, although we do not obtain strong support for a gender-related bias in the CAT methodology.
... This sophistication is also evident in conceptual analyses (Cerrato, Siano, Marco, & Ricci, 2019; Kaufman, 2019), such as reviews of research on the consensual assessment technique (Cseh & Jeffries, 2019; Myszkowski & Storme, 2019). In particular, Myszkowski and Storme propose the use of item response theory models to improve the CAT's psychometric framework. ...
Article
Full-text available
In 1998, Plucker and Runco provided an overview of creativity assessment, noting current issues (fluency confounds, generality vs. specificity), recent advances (predictive validity, implicit theories), and promising future directions (moving beyond divergent thinking measures, reliance on batteries of assessments, translation into practice). In the ensuing quarter century, the field experienced large growth in the quantity, breadth, and depth of assessment work, suggesting another analysis is timely. The purpose of this paper is to review the 1998 analysis and identify current issues, advances, and future directions for creativity measurement. Recent advances include growth in assessment quantity and quality and use of semantic distance as a scoring technique. Current issues include mismatches between current conceptions of creativity and those on which many measures are based, the need for psychometric quality standards, and a paucity of predictive validity evidence. The paper concludes with analysis of likely future directions, including use of machine learning to administer and score assessments and refinement of our conceptual frameworks for creativity assessment. Although the 1998 paper was written within an academic climate of harsh criticism of creativity measurement, the current climate is more positive, with reason for optimism about the continued growth of this important aspect of the field.
... 159), such as careful and comprehensive reviews of the relevant psychometric evidence of the approach. Among the many issues that creativity researchers have identified about the SCA, methodological issues have received increased attention during the past decade (Wang & Long, 2022), most notably the indistinct description of interjudge reliability and construct validity, the confusion between different aspects of validity evidence (Kaufman & Baer, 2012), and the lack of connection between empirical evidence and psychometric and measurement theories (Myszkowski & Storme, 2019; Wang & Long, 2022). All these issues require the application of the most recent conceptual frameworks and theories in psychometrics and measurement, such as the 2014 Standards for Educational and Psychological Testing (hereafter the Standards; AERA et al.) and rater-mediated assessments, to guide our empirical review and ensure that the empirical evidence is understood and interpreted correctly, rigorously, and scientifically (Barbot et al., 2019). ...
Article
Full-text available
The importance of creativity for learning, an equitable education, and a competitive nation warrants a broader and deeper understanding of this topic, including how creativity is assessed. This review focuses on subjective creativity assessment, a popular assessment approach that uses judges' subjective definitions of creativity, and examines its reliability and validity evidence collected from 84 empirical studies under the theoretical frameworks of the 2014 Standards for Educational and Psychological Testing and rater-mediated assessments. The main findings include: 1) the reviewed studies vary across domains, characteristics of subjects/objects and raters, and rating instructions and scales; 2) the major reliability evidence was provided by Cronbach's alpha and correlations of rating scores, and the major validity evidence came from evidence based on relationships with other variables through the use of correlations; 3) Cronbach's alpha values differed through an interaction between domains and judges' expertise level, and correlations of rating scores differed by domain and judges' expertise level; 4) there was strong convergent validity evidence between creativity and novelty but weak discriminant validity evidence between creativity and technical goodness and liking. These findings suggest that the subjective creativity assessment approach shows a good level of reliability and validity but has some degree of unreliability and invalidity that needs to be addressed with good research practices and more advanced measurement theories and methods.
... Second, Myszkowski and Storme (2019) recently proposed judge response theory (JRT) as an alternative framework for judge analysis instead of classical test theory (CTT). In JRT, latent attributes (traits and/or classes) of a product and of a judge are used as predictors of observed judgments. ...
Article
History is replete with cases in which people have failed to recognize creative ideas generated by others. In various settings, people are responsible for evaluating ideas generated by others while not being involved in the idea generation process, and thus not exposed to the task. However, little is known about how this lack of task exposure affects creative forecasting. This study therefore examines the effect of task exposure on creative idea evaluation using 1864 German students who evaluated ideas on their creativity, originality, and feasibility. Their ratings were compared to ratings by content and creativity experts. The students were randomly assigned to one of the following conditions: task exposure (i.e., they had to generate and evaluate ideas for the same task) or no task exposure (i.e., they had to generate ideas for a different task than the idea evaluation task). The results show that task exposure improves students' ability to accurately recognize creative and original ideas, and their ability to discriminate between highly feasible and unfeasible ideas. As such, these findings suggest that task exposure is beneficial to creative idea forecasting. Together, the results highlight the importance of carefully reconsidering whether people should be exposed to a task before evaluating others' ideas.
... Future research should build on this study and further investigate the subject of temporal assessment stability and test-retest reliability of CAT ratings. A promising way to do so would be the application of more advanced techniques such as Many-Facet Rasch Modeling (MFRM; e.g., Eckes, 2015), an approach recently suggested (Myszkowski & Storme, 2019) and used (e.g., Primi et al., 2019) in the context of creativity assessment. MFRM has the potential to measure several rater and situational effects (including a time facet), assess the degree of rater agreement, and detect interaction effects. ...
Article
Full-text available
The consensual assessment technique (CAT) is a reliable and valid method to measure (product) creativity and is often considered the gold standard of creativity assessment. The reliability measure traditionally applied in CAT studies, inter-rater reliability, cannot capture time-sampling error, which is a particularly relevant source of error for specific applications of the CAT. Therefore, the present study investigated the test-retest reliability of CAT ratings. We asked raters (N = 61) for their creativity assessment of the same set of 90 fashion outfits at an initial rating session and at a follow-up session either 2 or 4 weeks later. We found that mean product ratings, the actual focus of interest in the CAT, were highly stable over time, as evidenced by consistency and agreement ICCs clearly exceeding .90. However, individual raters (partially) lacked temporal stability, indicating a drift in rater tendencies over time. Our findings support the CAT's reputation as a highly reliable measurement method, but question the temporal rating stability of the CAT's actual "measurement instrument," namely individual judges.
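The two flavors of ICC mentioned here (consistency versus absolute agreement) can be obtained with the psych package, as in this minimal simulated sketch of two rating sessions for the same products; none of the numbers relate to the study's data.

```r
library(psych)

set.seed(7)
true_quality <- rnorm(90)                       # 90 products
t1 <- true_quality + rnorm(90, sd = 0.3)        # mean rating, session 1
t2 <- true_quality + 0.2 + rnorm(90, sd = 0.3)  # session 2, with a slight drift

ICC(cbind(t1, t2))  # ICC2/ICC2k = absolute agreement, ICC3/ICC3k = consistency
cor(t1, t2)         # simple test-retest correlation of the product means
```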
... Before analyzing the effects of our variables on originality, we checked the inter-rater consistency of our measure using Cronbach's alpha. This coefficient assumes that the measure is unidimensional (Myszkowski & Storme, 2019). An EFA indicated that our measure was unidimensional. ...
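The two checks mentioned in this excerpt, Cronbach's alpha and a unidimensionality check via EFA, could be run with the psych package roughly as follows, on simulated judge ratings rather than the study's data.

```r
library(psych)

set.seed(8)
quality <- rnorm(150)
ratings <- sapply(1:5, function(j)
  findInterval(quality + rnorm(150, sd = 0.6), c(-1.5, -0.5, 0.5, 1.5)) + 1)

alpha(data.frame(ratings))$total                       # inter-judge consistency (alpha)
fa.parallel(data.frame(ratings), fa = "fa")            # parallel analysis: number of factors
print(fa(data.frame(ratings), nfactors = 1)$loadings)  # loadings of a one-factor EFA
```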
Preprint
Creativity is a crucial 21st century skill. Thus, finding ways to improve the creative potential of adults is essential. Games are an effective learning tool, and some studies have investigated the effects of video games and role-playing games on creative potential. However, less is known about the potential benefits of board games. The aim of the present study was to compare the effects of creative and non-creative board games. The sample consisted of 55 university students. We used a within-subject repeated-measurement design. We assessed creative potential using a divergent thinking task, using fluency and originality as indicators. We controlled the potential effects of mood states and enjoyment. Results indicate a positive effect, for participants with low creative potential, for both types of games.
... As a general measure of internal consistency, the Rasch average reliability among the ten AUT items in the MFRM was .90, and therefore MFRM-based scores were generated and saved in our dataset for further analysis. As an alternative to MFRM not applied here, interested readers should also see the application of item-response models to data from multiple human raters recently proposed by Myszkowski and Storme (2019). ...
Article
Full-text available
Within creativity research, interest and capability in utilizing text-mining models to quantify the Originality of participant responses to Divergent Thinking tasks has risen sharply over the last decade, with many extant studies fruitfully using such methods to uncover substantive patterns among creativity-relevant constructs. However, no systematic psychometric investigation of the reliability and validity of human-rated Originality scores, and scores from various freely available text-mining systems, exists in the literature. Here we conduct such an investigation with the Alternate Uses Task. We demonstrate that, despite their inherent subjectivity, human-rated Originality scores displayed the highest reliability at both the composite and latent factor levels. However, the text-mining system GloVe 840B was highly capable of approximating human-rated scores both in its measurement properties and its correlations to various creativity-related criteria including ideational Fluency, Elaboration, Openness, Intellect, and self-reported Creative Activities. We conclude that, in conjunction with other salient indicators of creative potential, text-mining models (and especially the GloVe 840B system) are capable of supporting reliable and valid inferences about Divergent Thinking. An open access system for producing the Originality scores that were psychometrically examined in this paper is available for free at our website: https://openscoring.du.edu/. Please use for your research and let us know if you encounter any bugs!
Article
Immersive virtual reality (IVR) takes advantage of exponential growth in our technological abilities to offer an array of new forms of entertainment, learning opportunities, and even psychological interventions and assessments. The field of creativity is a driving force in both large-scale innovations and everyday progress, and embedding creativity assessment in IVR programs has important practical implications for future research and interventions in this field. Creativity assessment, however, tends to rely either on traditional concepts or on newer, yet cumbersome, methods. Can creativity be measured within IVR? This study introduces the VIVA, a new IVR-based visual arts creativity assessment paradigm in which users create 3D drawings in response to a prompt. Productions are then rated with modern extensions of a classic product-based approach to creativity assessment. A sample of 67 adults completed the VIVA, which was further scored using item-response modeling. Results demonstrated the strong psychometric properties of the VIVA assessment, including its structural validity, internal reliability, and criterion validity with relevant criterion measures. Together, this study establishes a solid proof of concept of the feasibility of measuring creativity in IVR. We conclude by discussing directions for future studies and the broader importance and impact of this line of work for the field of creativity and virtual reality.
Article
Full-text available
Juries are a high-stakes practice in higher education to assess complex competencies. However common they are, research lags behind in detailing the psychometric qualities of juries, especially when rubrics or rating scales are used as the assessment tool. In this study, I analyze a case of a jury assessment (N = 191) of product development in which both internal teaching staff and external judges assess and fill in an analytic rating scale. Using polytomous Item Response Theory (IRT) analysis developed for the analysis of heterogeneous juries (i.e., Jury Response Theory or JRT), this study attempts to provide insight into the validity and reliability of the assessment tool used. The results indicate that JRT helps detect unreliable response patterns that reveal an excellence bias, i.e., a tendency not to score in the highest response category. The article concludes with a discussion on how to counter such bias when using rating scales or rubrics for summative assessment.
Article
This chapter provides a systematic, synthesizing, and critical review of the literature related to assessments of creativity in education from historical, theoretical, empirical, and practical standpoints. We examined the assessments used in the articles focusing on education that are published from January 2010 to May 2021 in eight creativity, psychological, and educational journals. We found that the assessments of creativity in education are split between psychological and education research and have increased international participation. Additionally, these assessments are more general than specific and focus more on cognitive than noncognitive aspects. Like previous reviews of assessments of creativity in general, this review showed that creativity in education is still mainly assessed by divergent thinking or creativity tests, self-report questionnaires, and product-based subjective techniques. We analyzed the benefits and drawbacks of each approach and highlighted many innovations in the assessment. We further discussed how the major assessment approaches address race, ethnicity, class, and gender issues in education. We concluded the review with recommendations for how to better assess creativity in education and how assessments of creativity in education contribute to our understanding of the creative educational experience and democratizing education.
Article
The assessment of creative problem solving (CPS) is challenging. Elements of an assessment procedure, such as the tasks that are used and the raters who assess those tasks, introduce variation in student scores that does not necessarily reflect actual differences in students' creative problem solving abilities. When creativity researchers evaluate assessment procedures, they often inspect these elements, such as tasks and raters, separately. We show that the use of Generalizability Theory allows researchers to investigate CPS assessments in a comprehensive and integrated way. In this paper, we first introduce this statistical framework and the choices creativity researchers need to make before applying Generalizability Theory to their data. Then, Generalizability Theory is applied in an analysis of a CPS assessment procedure. We highlight how alterations in the nature of the assessment procedure, such as changing the number of tasks or raters, may affect the quality of CPS scores. Finally, we present implications for the assessment of CPS and for creativity research in general.
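A generalizability analysis of a fully crossed person x task x rater design can be approximated with variance components from lme4, as in the hedged sketch below; the data are simulated (some interaction components may be estimated near zero), and this is not the authors' procedure.

```r
library(lme4)

# simulated fully crossed design: 50 persons x 4 tasks x 3 raters
set.seed(9)
d <- expand.grid(person = factor(1:50), task = factor(1:4), rater = factor(1:3))
d$score <- rnorm(50)[d$person] + rnorm(4, sd = 0.4)[d$task] +
           rnorm(3, sd = 0.4)[d$rater] + rnorm(nrow(d), sd = 0.6)

g_study <- lmer(score ~ 1 + (1 | person) + (1 | task) + (1 | rater) +
                  (1 | person:task) + (1 | person:rater) + (1 | task:rater),
                data = d)
vc <- as.data.frame(VarCorr(g_study))
v  <- setNames(vc$vcov, vc$grp)  # variance component per facet

# D-study: generalizability (relative) coefficient for n_t tasks and n_r raters
n_t <- 4; n_r <- 2
rel_err <- v["person:task"] / n_t + v["person:rater"] / n_r + v["Residual"] / (n_t * n_r)
v["person"] / (v["person"] + rel_err)
```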
Article
Full-text available
In this article, we discuss some limits of the methods most frequently used for testing creativity and describe the process of developing a new diagnostic tool: a self-report inventory of creativity and innovativeness based on a complex set of skills and competencies identified in a review of existing research, relevant literature, and available diagnostic tools. Our complex applied creativity construct involves imagination, idea creation, openness, flexibility, and features like braveness, analytical skills, assertiveness, and even empathy, which are all necessary to transform a new idea into a practical application beneficial for a company or society. We first gathered data from 125 respondents and analyzed the structure of the first version of the questionnaire (a reduction from 80 to 77 items was made based on correlation analysis). We then undertook a series of exploratory factor analyses. After the iterative process of eliminating all the items with low loadings or problematic cross-loadings, we suggested a six-factor solution for the new, second version of the Creatixo inventory, consisting of 47 items. On a new sample of 106 respondents, confirmatory factor analysis supported the six-factor structure of the inventory (GOF indices: CFI = .990, GFI = .909, RMSEA = .025, SRMR = .101). The Creatixo tool has shown good results in internal consistency measures; for example, McDonald's omega for the individual factors varies from .740 to .887. We are now about to gather a larger, quota-representative sample and cross-validate the current outputs of the psychometric analysis. We discuss further validation studies and some limits of the self-report inventory approach to assessing creativity, innovativeness, and productivity.
Article
Full-text available
Creativity has emerged as an important 21st-century competency. Although it is traditionally associated with arts and literature, it can also be developed as part of computing education. Therefore, this article presents a systematic mapping of approaches for assessing creativity based on the analysis of computer programs created by students. As a result, only ten approaches, reported in eleven articles, were encountered. These reveal the absence of a commonly accepted definition of product creativity customized to computing education, confirming only originality as one of the well-established characteristics. Several approaches seem to lack clearly defined criteria for effective, efficient, and useful creativity assessment. Diverse techniques are used, including rubrics, mathematical models, and machine learning, supporting manual and automated approaches. Few performed a comprehensive evaluation of the proposed approach regarding its reliability and validity. These results can help instructors to choose and adopt assessment approaches and guide researchers by pointing out shortcomings.
Article
The current study aimed to investigate whether board games could be used to improve creative potential. Games have proven to be effective learning tools, and some studies have indicated positive links between creativity and other types of games, namely video games and role-playing games. However, less is known regarding the potential benefits of board games for creativity. This exploratory study compared two types of board games: creative and non-creative board games. We used a within-subject repeated-measurement design, in which participants played both types of games across two sessions separated by one week. We assessed creative potential with a divergent thinking task, using fluency and originality as indicators. We controlled for openness, mood states, and enjoyment. Results suggest an improvement in originality after playing both types of games, whereas no differences were observed for fluency. Considering participants' baseline level, we found improvement for low-performing participants specifically, in both fluency and originality, although the analyses' limited statistical power may have affected the findings. These findings provide a first step in the study of creativity and board games and suggest that they could help temporarily improve one's divergent thinking capacity.
Article
Full-text available
Despite decades of extensive research on creativity, the field still combats psychometric problems when measuring individual differences in creative ability and people’s potential to achieve real-world outcomes that are both original and useful. We think these seemingly technical issues have a conceptual origin. We therefore propose a minimal theory of creative ability (MTCA) to create a consistent conceptual theory to guide investigations of individual differences in creative ability. Building on robust theories and findings in creativity and individual differences research, our theory argues that creative ability, at a minimum, must include two facets: intelligence and expertise. So, the MTCA simply claims that whenever we do something creative, we use most of our cognitive abilities combined with relevant expertise to be creative. MTCA has important implications for creativity theory, measurement, and practice. However, the MTCA isn’t necessarily true; it is a minimal theory. We discuss and reject several objections to the MTCA.
Article
Much attention has been given to the development and validation of measures of growth mindset and its impact on learning, but the previous work has largely been focused on general measures of growth mindset. This research was focused on establishing the psychometric properties of a reading mindset (RM) measure among a sample of upper elementary school students and validating the measure through its relations with standardized measures of word reading and comprehension. The RM measure was developed to capture student’s beliefs about their ability, learning goals, and effort during reading. Item response theory was used to select items that optimally measured the RM measure from a pool of existing items from previous research. The final five-item RM measure predicted reading comprehension outcomes above and beyond the effects of word reading, indicating that this measure may be an important tool for diagnosing noncognitive areas of improvement for developing readers. The implications, limitations, and future directions for expanding upon the measure were discussed.
Technical Report
Full-text available
Psychometric analysis and scoring of judgment data using polytomous Item-Response Theory (IRT) models, as described in Myszkowski and Storme (2019). A convenience function is used to automatically compare and select models, as well as to present a variety of model-based statistics. Plotting functions are used to present category curves, as well as information, reliability and standard error functions.
Article
Full-text available
Amabile's consensual assessment technique (CAT), which takes the consensus opinions of domain experts, is considered a "gold standard" of creativity assessment for research purposes. While several studies have identified how specific procedural choices impact the CAT's reliability as a measure, researchers' depth of knowledge about procedures and their effects still remains incomplete. This article explores gaps in the research by reviewing the CAT and creativity literature and aims to examine to what extent the creativity research community needs to revisit and reflect on the CAT and solidify protocols for its implementation. The conclusion highlights the need for new debate and a program of research to clarify, evidence, and harmonize CAT methodology while simultaneously preserving the CAT's flexibility. This would enable the development and sophistication of the CAT, including possible new assistive technologies, to further strengthen its use within the science of creativity.
Preprint
Amabile's Consensual Assessment Technique (CAT) – taking the consensus opinions of domain experts – is considered a 'gold standard' of creativity assessment for research purposes. While several studies have identified how specific procedural choices impact on the CAT's reliability as a measure, researchers’ depth of knowledge about procedures and their effects still remains incomplete. This paper explores gaps in the research by reviewing CAT and creativity literature, and aims to explore to what extent the creativity research community needs to revisit and reflect on the CAT and solidify protocols for its implementation. The conclusion highlights the need for new debate and a program of research to clarify, evidence, and harmonize CAT methodology, while simultaneously preserving the CAT’s flexibility. This would enable the development and sophistication of the CAT, including possible new assistive technologies, to further strengthen its use within the science of creativity.
Article
Full-text available
Empirical studies in psychology commonly report Cronbach's alpha as a measure of internal consistency reliability despite the fact that many methodological studies have shown that Cronbach's alpha is riddled with problems stemming from unrealistic assumptions. In many circumstances, violating these assumptions yields estimates of reliability that are too small, making measures look less reliable than they actually are. Although methodological critiques of Cronbach's alpha are being cited with increasing frequency in empirical studies, in this tutorial we discuss how this trend is not necessarily improving the methodology used in the literature. That is, many studies continue to use Cronbach's alpha without regard for its assumptions or merely cite methodological papers advising against its use to rationalize unfavorable Cronbach's alpha estimates. This tutorial first provides evidence that recommendations against Cronbach's alpha have not appreciably changed how empirical studies report reliability. Then, we summarize the drawbacks of Cronbach's alpha conceptually, without relying on mathematical or simulation-based arguments, so that these arguments are accessible to a broad audience. We continue by discussing several alternative measures that make less rigid assumptions and provide justifiably higher estimates of reliability compared with Cronbach's alpha. We conclude with empirical examples to illustrate the advantages of alternative measures of reliability, including omega total, Revelle's omega total, the greatest lower bound, and Coefficient H. A detailed software appendix is also provided to help researchers implement alternative methods.
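The alternatives listed in this abstract can be estimated in R roughly as follows: the sketch uses the psych package for omega total and the greatest lower bound, and computes Coefficient H from one-factor loadings (Hancock & Mueller's formula), all on simulated ratings; exact function names and output elements may differ slightly across psych versions.

```r
library(psych)

set.seed(10)
quality <- rnorm(200)
ratings <- data.frame(sapply(1:6, function(j)
  findInterval(quality + rnorm(200, sd = 0.6), c(-1.5, -0.5, 0.5, 1.5)) + 1))

omega(ratings, nfactors = 1)$omega.tot  # omega total (warnings about 1 factor are expected)
glb.fa(cor(ratings))$glb                # greatest lower bound

lambda <- fa(ratings, nfactors = 1)$loadings[, 1]  # standardized one-factor loadings
1 / (1 + 1 / sum(lambda^2 / (1 - lambda^2)))       # Coefficient H (Hancock & Mueller)
```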
Article
Full-text available
Intelligence assessment is often viewed as a narrow and ever-narrowing field, defined (as per IQ) by the measurement of finely distinguished cognitive processes. It is instructive, however, to remember that other, broader conceptions of intelligence exist and might usefully be considered for a comprehensive assessment of intellectual functioning. This article invokes a more holistic, systems theory of intelligence—the theory of successful intelligence—and examines the possibility of including in intelligence assessment a similarly holistic measure of creativity. The time and costs of production-based assessments of creativity are generally considered prohibitive. Such barriers may be mitigated by applying the consensual assessment technique using novice raters. To investigate further this possibility, we explored the question: how much do demographic factors such as age and gender and psychological factors such as domain-specific expertise, personality or self-perceived creativity affect novices’ unidimensional ratings of creativity? Fifty-one novice judges from three undergraduate programs, majoring in three disparate expertise domains (i.e., visual art, psychology and computer science) rated 40 child-generated Lego creatures for creativity. Results showed no differences in creativity ratings based on the expertise domains of the judges. However, judges’ personality and self-perception of their own everyday creativity appeared to influence the way they scored the creatures for creativity.
Article
Full-text available
The Consensual Assessment Technique is a powerful tool used by creativity researchers in which panels of expert judges are asked to rate the creativity of creative products such as stories, collages, poems, and other artifacts. Experts in the domain in question serve as judges; thus, for a study of creativity using stories and poems, a panel of writers and/or teachers of creative writing might judge the creativity of the stories, and a separate panel of poets and/or poetry critics might judge the creativity of the poems. The Consensual Assessment Technique is based on the idea that the best measure of the creativity of a work of art, a theory, a research proposal, or any other artifact is the combined assessment of experts in that field. Unlike other measures of creativity, such as divergent-thinking tests, the Consensual Assessment Technique is not based on any particular theory of creativity, which means that its validity (which has been well established empirically) is not dependent upon the validity of any particular theory of creativity. This chapter explains the Consensual Assessment Technique, discusses how it has been used in research, and explores ways it might be employed in assessment in higher education.
Article
Full-text available
The purpose of this study was to explore the reliability of measures of both individual and group creative work using the consensual assessment technique (CAT). The CAT was used to measure individual and group creativity among a population of pre-service music teachers enrolled in a secondary general music class (n = 23) and was evaluated for reliability from multiple perspectives. Consistency was calculated using Cronbach's alpha. Judges were found to be highly consistent for individual creativity (α = .90), individual craftsmanship (α = .87), group creativity (α = .86), and group craftsmanship (α = .81). Judges were much less consistent in their ratings of aesthetic sensitivity for individual compositions (α = .67) and group performances (α = .69). Absolute agreement was calculated using the intraclass correlation coefficient (ICC). Judges were found to be highly in agreement for individual creativity (ICC = .79), individual craftsmanship (ICC = .83), group creativity (ICC = .87), and group craftsmanship (ICC = .83). Judges were much less in agreement in their ratings of aesthetic sensitivity for individual compositions (ICC = .57) and group performances (ICC = .71). Judges' ratings for individual creativity were consistent over time, as evidenced by test-retest reliabilities of .89 (creativity), .83 (craftsmanship), and .79 (aesthetic sensibility). Results indicate, in agreement with prior research, that the CAT is a reliable measure of creativity. The researchers introduce the idea that absolute agreement might be a worthwhile construct to explore in future work on the measurement of creativity in music education.
Article
Full-text available
The aim of this work was to gather different perspectives on the "key ingredients" involved in creative writing by children from experts of diverse disciplines, including teachers, linguists, psychologists, writers, and art educators. Ultimately, we sought in the experts' convergence or divergence insights on the relative importance of the factors that may aid writing instruction, particularly for young children. We present a study using an expert knowledge elicitation method in which representatives from five domains of expertise pertaining to writing rated 28 factors (i.e., individual skills and attributes) covering six areas (general knowledge and cognition, creative cognition, conation, executive functioning, linguistic skills, and psychomotor skills) according to their importance for creative writing. A Many-Facets Rasch Measurement (MFRM) model permitted us to quantify the relative importance of these writing factors across domains of expertise, while controlling for expert severity and other systematic evaluation biases. The identified similarities and domain-specific differences in the expert views offer a new basis for understanding the conceptual gaps between the scientific literature on creative writing, writers' self-reflection on the act of writing creatively, and educators' practices in teaching creative writing. Bridging these diverse approaches, which are nonetheless relatively homogeneous within areas of expertise, appears useful for formulating a process-oriented writing pedagogy that better targets the skills needed to improve children's creative writing development.
Article
Full-text available
The Consensual Assessment Technique (CAT) is one of the most highly regarded assessment tools in creativity, but it is often difficult and/or expensive to assemble the teams of experts required by the CAT. Some researchers have tried using nonexpert raters in their place, but the validity of replacing experts with nonexperts has not been adequately tested. Expert (n = 10) and nonexpert (n = 106) creativity ratings of 205 poems were compared and found to be quite different, making the simple replacement of experts by nonexpert raters suspect. Nonexpert raters' judgments of creativity were inconsistent (showing low interrater reliability) and did not match those of the expert raters. Implications are discussed, including the appropriate selection of expert raters for different kinds of creativity assessment.
Article
Full-text available
Approaches to adaptive (tailored) testing based on item response theory are described and research results summarized. Through appropriate combinations of item pool design and use of different test termination criteria, adaptive tests can be designed (1) to improve both measurement quality and measurement efficiency, resulting in measurements of equal precision at all trait levels; (2) to improve measurement efficiency for test batteries using item pools designed for conventional test administration; and (3) to improve the accuracy and efficiency of testing for classification (e.g., mastery testing). Research results show that tests based on item response theory (IRT) can achieve measurements of equal precision at all trait levels, given an adequately designed item pool; these results contrast with those of conventional tests which require a tradeoff of bandwidth for fidelity/precision of measurements. Data also show reductions in bias, inaccuracy, and root mean square error of ability estimates. Improvements in test fidelity observed in simulation studies are supported by live-testing data, which showed adaptive tests requiring half the number of items as that of conventional tests to achieve equal levels of reliability, and almost one-third the number to achieve equal levels of validity. When used with item pools from conventional tests, both simulation and live-testing results show reductions in test battery length from conventional tests, with no reductions in the quality of measurements. Adaptive tests designed for dichotomous classification also represent improvements over conventional tests designed for the same purpose. Simulation studies show reductions in test length and improvements in classification accuracy for adaptive vs. conventional tests; live-testing studies in which adaptive tests were compared with "optimal" conventional tests support these findings. Thus, the research data show that IRT-based adaptive testing takes advantage of the capabilities of IRT to improve the quality and/or efficiency of measurement for each examinee.
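To make the selection mechanic concrete, the sketch below computes Fisher information under a two-parameter logistic (2PL) model for a few hypothetical items and picks the most informative one at the current ability estimate; it is a toy illustration of maximum-information item selection, not a full adaptive testing engine.

# Toy illustration of maximum-information item selection under a 2PL model
# (hypothetical item parameters and ability estimate)
item_info_2pl <- function(theta, a, b) {
  p <- 1 / (1 + exp(-a * (theta - b)))  # probability of a correct response
  a^2 * p * (1 - p)                     # Fisher information at theta
}

a <- c(0.8, 1.2, 1.5, 2.0)    # discrimination parameters
b <- c(-1.0, 0.0, 0.5, 1.5)   # difficulty parameters
theta_hat <- 0.3              # current provisional ability estimate

info <- item_info_2pl(theta_hat, a, b)
which.max(info)               # index of the next item to administer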
Article
Full-text available
Markov chain Monte Carlo (MCMC) methods enable a fully Bayesian approach to parameter estimation of item response models. In this simulation study, the authors compared the recovery of graded response model parameters using marginal maximum likelihood (MML) and Gibbs sampling (MCMC) under various latent trait distributions, test lengths, and sample sizes. Sample size and test length explained the largest amount of variance in item and person parameter estimates, respectively. There was little difference in item parameter recovery between MML and MCMC in samples with 300 or more respondents. MCMC recovered some item threshold parameters better in samples with 75 or 150 respondents. Bias in threshold parameter estimates depended on the generating value and the type of threshold. Person parameters were comparable between MCMC and MML/expected a posteriori for all test lengths.
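For reference, Samejima's graded response model, whose parameters are recovered in this simulation, can be written in a common logistic parameterization (notation mine) as:

P(X_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j\left(\theta_i - b_{jk}\right)\right]}, \qquad P(X_{ij} = k \mid \theta_i) = P(X_{ij} \ge k \mid \theta_i) - P(X_{ij} \ge k + 1 \mid \theta_i),

where a_j is the slope of item j and b_{jk} its k-th threshold; MML and MCMC are simply two ways of estimating these parameters.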
Article
Full-text available
Creativity assessment commonly uses open-ended divergent thinking tasks. The typical methods for scoring these tasks (uniqueness scoring and subjective ratings) are time-intensive, however, so it is impractical for researchers to include divergent thinking as an ancillary construct. The present research evaluated snapshot scoring of divergent thinking tasks, in which the set of responses receives a single holistic rating. We compared snapshot scoring to top-two scoring, a time-intensive, detailed scoring method. A sample of college students (n=226) completed divergent thinking tasks and measures of personality and art expertise. Top-two scoring had larger effect sizes, but snapshot scoring performed well overall. Snapshot scoring thus appears promising as a quick and simple approach to assessing creativity.
Article
Full-text available
An examinee-level (or conditional) reliability is proposed for use in both classical test theory (CTT) and item response theory (IRT). The well-known group-level reliability is shown to be the average of conditional reliabilities of examinees in a group or a population. This relationship is similar to the known relationship between the square of the conditional standard error of measurement (SEM) and the square of the group-level SEM. The proposed conditional reliability is illustrated with an empirical data set in the CTT and IRT frameworks.
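One common way of writing such a conditional reliability (given here for orientation; the exact formulation in the cited article may differ) relates it to the conditional standard error of measurement and, in IRT, to the information function:

\mathrm{rel}(\theta) = \frac{\sigma^2_{\theta}}{\sigma^2_{\theta} + \mathrm{SEM}(\theta)^2}, \qquad \mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}},

so that averaging the conditional quantities over examinees yields a group-level (marginal) reliability, in line with the relationship described above.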
Article
Full-text available
Divergent thinking is central to the study of individual differences in creativity, but the traditional scoring systems (assigning points for infrequent responses and summing the points) face well-known problems. After critically reviewing past scoring methods, this article describes a new approach to assessing divergent thinking and appraises its reliability and validity. In our new Top 2 scoring method, participants complete a divergent thinking task and then circle the 2 responses that they think are their most creative responses. Raters then evaluate the responses on a 5-point scale. Regarding reliability, a generalizability analysis showed that subjective ratings of unusual-uses tasks and instances tasks yield dependable scores with only 2 or 3 raters. Regarding validity, a latent-variable study (n=226) predicted divergent thinking from the Big Five factors and their higher-order traits (Plasticity and Stability). Over half of the variance in divergent thinking could be explained by dimensions of personality. The article presents instructions for measuring divergent thinking with the new method.
Article
Full-text available
This paper comments on an article by Schmidt and Hunter [Intelligence 27 (1999) 183.], who argue that the correction for attenuation should be routinely used in theory testing. It is maintained that Schmidt and Hunter's arguments are based on mistaken assumptions. We discuss our critique of Schmidt and Hunter in terms of two arguments against a routine use of the correction for attenuation within the classical test theory framework: (1) corrected correlations do not, as Schmidt and Hunter claim, provide correlations between constructs, and (2) corrections for measurement error should be made using modern test theory models instead of the classical model. The arguments that Schmidt and Hunter advance in favor of the correction for attenuation can be traced to an implicit identification of true scores with construct scores. First, we show that this identification confounds issues of validity and issues of reliability. Second, it is pointed out that equating true scores with construct scores is logically inconsistent with the classical test theory model itself. Third, it is argued that the classical model is not suited for detecting the dimensionality of test scores, which severely limits the interpretation of the corrected correlation coefficients. It is concluded that most measurement problems in psychology concern issues of validity, and that the correction for attenuation within classical test theory does not help in solving them.
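For context, the correction for attenuation at issue is the classical formula (standard notation, not quoted from the commented article):

\hat{\rho}_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}},

where r_{XY} is the observed correlation and r_{XX'} and r_{YY'} are the reliabilities of the two measures; the debate concerns whether the corrected value can be interpreted as a correlation between constructs.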
Article
Full-text available
Item response theory (IRT) is widely used in assessment and evaluation research to explain how participants respond to item-level stimuli. Several R packages can be used to estimate the parameters in various IRT models, the most flexible being the ltm (Rizopoulos 2006), eRm (Mair and Hatzinger 2007), and MCMCpack (Martin, Quinn, and Park 2011) packages. However, these packages have limitations: ltm and eRm can only analyze unidimensional IRT models effectively, and the exploratory multidimensional extensions available in MCMCpack require prior understanding of Bayesian estimation convergence diagnostics and are computationally intensive. Most importantly, multidimensional confirmatory item factor analysis methods have not been implemented in any R package. The mirt package was created for estimating multidimensional item response theory parameters for exploratory and confirmatory models by using maximum-likelihood methods. The Gauss-Hermite quadrature method used in traditional EM estimation (e.g., Bock and Aitkin 1981) is presented for exploratory item response models as well as for confirmatory bifactor models (Gibbons and Hedeker 1992). Exploratory and confirmatory models are estimated by a stochastic algorithm described by Cai (2010a,b). Various program comparisons are presented and future directions for the package are discussed.
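As a minimal illustration of the kind of model the package estimates, the sketch below fits a unidimensional graded response model to hypothetical polytomous ratings and extracts latent scores; the data are simulated for illustration only, and readers should consult the package documentation for the full interface.

# Minimal sketch: unidimensional graded response model with mirt
# (simulated data: 40 products rated 1-5 by 6 judges treated as "items")
library(mirt)

set.seed(1)
true_quality <- rnorm(40)
ratings <- as.data.frame(lapply(1:6, function(j)
  cut(true_quality + rnorm(40), breaks = 5, labels = FALSE)))
names(ratings) <- paste0("judge", 1:6)

fit <- mirt(ratings, model = 1, itemtype = "graded")  # MML estimation via EM
coef(fit, simplify = TRUE)                            # slopes and thresholds
fscores(fit, full.scores.SE = TRUE)                   # latent scores with SEs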
Article
Full-text available
The consensual assessment technique (CAT) is a measurement tool for creativity research in which appropriate experts evaluate creative products [Amabile, T. M. (1996). Creativity in context: Update to the social psychology of creativity. Boulder, CO: Westview]. However, the CAT is hampered by the time-consuming nature of the products (asking participants to write stories or draw pictures) and the ratings (getting appropriate experts). This study examined the reliability of ratings of sentence captions. Specifically, four raters evaluated 12 captions written by 81 undergraduates. The purpose of the study was to see whether the CAT could provide reliable ratings of captions across raters and across multiple captions and, if so, how many such captions would be required to generate reliable scores and how many judges would be needed. Using generalizability theory, we found that captions appear to be a useful way of measuring creativity with a reasonable level of reliability within the CAT framework.
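A generalizability analysis of this kind can be approximated by extracting variance components from a mixed model; the sketch below uses a deliberately simplified persons-by-raters design with simulated data (the actual study crossed persons, captions, and raters), so it illustrates the logic rather than reproducing the reported analysis.

# Simplified generalizability-style sketch: persons crossed with raters
# (simulated data; the coefficient below is a relative G coefficient)
library(lme4)

set.seed(1)
d <- expand.grid(person = factor(1:81), rater = factor(1:4))
d$score <- 3 + rnorm(81)[d$person] + rnorm(4)[d$rater] + rnorm(nrow(d), sd = 0.5)

fit <- lmer(score ~ 1 + (1 | person) + (1 | rater), data = d)
vc  <- as.data.frame(VarCorr(fit))
v   <- setNames(vc$vcov, vc$grp)

n_raters <- 4
v["person"] / (v["person"] + v["Residual"] / n_raters)  # generalizability coefficient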
Article
Full-text available
This discussion paper argues that both the use of Cronbach's alpha as a reliability estimate and as a measure of internal consistency suffer from major problems. First, alpha always has a value, which cannot be equal to the test score's reliability given the interitem covariance matrix and the usual assumptions about measurement error. Second, in practice, alpha is used more often as a measure of the test's internal consistency than as an estimate of reliability. However, it can be shown easily that alpha is unrelated to the internal structure of the test. It is further discussed that statistics based on a single test administration do not convey much information about the accuracy of individuals' test performance. The paper ends with a list of conclusions about the usefulness of alpha.
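For reference, the coefficient under discussion is computed (in standard notation) as:

\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{Y_i}}{\sigma^2_{X}}\right),

where k is the number of items (or judges), \sigma^2_{Y_i} the variance of item i, and \sigma^2_{X} the variance of the sum score; the paper's point is that this quantity is neither the test score's reliability nor an index of the test's internal structure.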
Article
Full-text available
This paper analyzes the theoretical, pragmatic, and substantive factors that have hampered the integration between psychology and psychometrics. Theoretical factors include the operationalist mode of thinking which is common throughout psychology, the dominance of classical test theory, and the use of "construct validity" as a catch-all category for a range of challenging psychometric problems. Pragmatic factors include the lack of interest in mathematically precise thinking in psychology, inadequate representation of psychometric modeling in major statistics programs, and insufficient mathematical training in the psychological curriculum. Substantive factors relate to the absence of psychological theories that are sufficiently strong to motivate the structure of psychometric models. Following the identification of these problems, a number of promising recent developments are discussed, and suggestions are made to further the integration of psychology and psychometrics.
Article
Full-text available
A rating response mechanism for ordered categories, which is related to the traditional threshold formulation but distinctively different from it, is formulated. In addition to the subject and item parameters two other sets of parameters, which can be interpreted in terms of thresholds on a latent continuum and discriminations at the thresholds, are obtained. These parameters are identified with the category coefficients and the scoring function of the Rasch model for polychotomous responses in which the latent trait is assumed unidimensional. In the case where the threshold discriminations are equal, the scoring of successive categories by the familiar assignment of successive integers is justified. In the case where distances between thresholds are also equal, a simple pattern of category coefficients is shown to follow.
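In the special case of equal threshold discriminations mentioned above, the model reduces to the familiar rating scale form, which can be written (standard notation, not quoted from the article) as:

P(X_{ni} = x) = \frac{\exp\left[\sum_{j=1}^{x}\left(\beta_n - \delta_i - \tau_j\right)\right]}{\sum_{k=0}^{m}\exp\left[\sum_{j=1}^{k}\left(\beta_n - \delta_i - \tau_j\right)\right]}, \qquad x = 0, \ldots, m,

with the empty sum for x = 0 taken as zero, \beta_n the person parameter, \delta_i the item location, and \tau_j the j-th threshold shared across items.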
Article
Creativity assessment with open-ended production tasks relies heavily on scoring the quality of a subject's ideas. This creates a faceted measurement structure involving persons, tasks (and ideas within tasks), and raters. Most studies, however, do not model possible systematic differences among raters. The present study examines individual rater differences in the context of a planned-missing design and their association with the reliability and validity of creativity assessments. It applies the many-facet Rasch model (MFRM) to model and correct for these differences. We reanalyzed data from 2 studies (Ns = 132 and 298) where subjects produced metaphors, alternate uses for common objects, and creative instances. Each idea was scored by several raters. We simulated several conditions of reduced load on raters where they scored subsets of responses. We then compared the reliability and validity of IRT-estimated scores (original vs. IRT-adjusted scores) under various conditions of missing data. Results show that (a) raters vary substantially in leniency-severity, so rater differences should be modeled; (b) when different combinations of raters assess different subsets of ideas, systematic rater differences confound subjects' scores, increasing measurement error and lowering criterion validity with external variables; and (c) MFRM adjustments effectively correct for rater effects, thus increasing correlations of scores obtained from partial data with scores obtained from full data. We conclude that MFRM is a powerful means to model rater differences and reduce rater load in creativity research.
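In one common formulation (adjacent-category log-odds; notation is generic rather than taken from the study), the many-facet Rasch model used here adds a rater severity term to the familiar Rasch structure:

\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k,

where \theta_n is the latent creativity of person n, \delta_i the difficulty of task or idea i, \lambda_j the severity of rater j, and \tau_k the threshold of rating category k; scores are then adjusted for the estimated \lambda_j.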
Article
Three estimation methods with robust corrections—maximum likelihood (ML) using the sample covariance matrix, unweighted least squares (ULS) using a polychoric correlation matrix, and diagonally weighted least squares (DWLS) using a polychoric correlation matrix—have been proposed in the literature, and are considered to be superior to normal theory-based maximum likelihood when observed variables in latent variable models are ordinal. A Monte Carlo simulation study was carried out to compare the performance of ML, DWLS, and ULS in estimating model parameters, and their robust corrections to standard errors, and chi-square statistics in a structural equation model with ordinal observed variables. Eighty-four conditions, characterized by different ordinal observed distribution shapes, numbers of response categories, and sample sizes were investigated. Results reveal that (a) DWLS and ULS yield more accurate factor loading estimates than ML across all conditions; (b) DWLS and ULS produce more accurate interfactor correlation estimates than ML in almost every condition; (c) structural coefficient estimates from DWLS and ULS outperform ML estimates in nearly all asymmetric data conditions; (d) robust standard errors of parameter estimates obtained with robust ML are more accurate than those produced by DWLS and ULS across most conditions; and (e) regarding robust chi-square statistics, robust ML is inferior to DWLS and ULS in controlling for Type I error in almost every condition, unless a large sample is used (N = 1,000). Finally, implications of the findings are discussed, as are the limitations of this study as well as potential directions for future research.
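To make the estimator choices concrete, the sketch below (hypothetical one-factor model on four five-category ordinal items, simulated data) shows how the three approaches are typically requested in lavaan, where WLSMV corresponds to DWLS with robust corrections, ULSMV to ULS with robust corrections, and MLR to robust ML on the raw scores.

# Sketch: one-factor CFA on ordinal indicators under three robust estimators
# (simulated five-category data; y1-y4 are hypothetical item names)
library(lavaan)

set.seed(1)
f <- rnorm(300)
d <- as.data.frame(lapply(1:4, function(i)
  cut(f + rnorm(300), breaks = 5, labels = FALSE)))
names(d) <- paste0("y", 1:4)

model <- 'quality =~ y1 + y2 + y3 + y4'

fit_dwls <- cfa(model, data = d, ordered = names(d), estimator = "WLSMV")  # DWLS, robust
fit_uls  <- cfa(model, data = d, ordered = names(d), estimator = "ULSMV")  # ULS, robust
fit_mlr  <- cfa(model, data = d, estimator = "MLR")                        # robust ML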
Article
Coefficient alpha, the most commonly used estimate of internal consistency, is often considered a lower bound estimate of reliability, though the extent of its underestimation is not typically known. Many researchers are unaware that coefficient alpha is based on the essentially tau-equivalent measurement model. It is the violation of the assumptions required by this measurement model that are often responsible for coefficient alpha's underestimation of reliability. This article presents a hierarchy of measurement models that can be used to estimate reliability and illustrates a procedure by which structural equation modeling can be used to test the fit of these models to a set of data. Test and data characteristics that can influence the extent to which the assumption of tau-equivalence is violated are discussed. Both heuristic and applied examples are used to augment the discussion.
Article
A structural equation model is described that permits estimation of the reliability index and coefficient of a composite test for congeneric measures. The method is also helpful in exploring the factorial structure of an item set, and its use in scale reliability estimation and development is illustrated. The resulting model-based estimator of composite reliability does not possess the general underestimation property of Cronbach's coefficient alpha.
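For congeneric measures, the composite reliability such a model yields is typically computed from the estimated loadings and unique variances (standard notation, given for orientation):

\omega = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_{ii}},

where \lambda_i are the factor loadings, \theta_{ii} the error variances, the factor variance is fixed to 1, and errors are assumed uncorrelated; unlike alpha, this estimator does not require (essential) tau-equivalence.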
Article
This study examined the application of the MML-EM algorithm to the parameter estimation problems of the normal ogive and logistic polytomous response models for Likert-type items. A rating-scale model was developed based on Samejima's (1969) graded response model. The graded response model includes a separate slope parameter for each item and an item response parameter. In the rating-scale model, the item response parameter is resolved into two parameters: the item location parameter, and the category threshold parameter characterizing the boundary between response categories. For a Likert-type questionnaire, where a single scale is employed to elicit different responses to the items, this item response model is expected to be more useful for analysis because the item parameters can be estimated separately from the threshold parameters associated with the points on a single Likert scale. The advantages of this type of model are shown by analyzing simulated data and data from the General Social Surveys. Index terms: EM algorithm, General Social Surveys, graded response model, item response model, Likert scale, marginal maximum likelihood, polytomous item response model, rating-scale model.
Article
The partial credit model (PCM) with a varying slope parameter is developed and called the generalized partial credit model (GPCM). The item step parameter of this model is decomposed to a location and a threshold parameter, following Andrich's (1978) rating scale formulation. The EM algorithm for estimating the model parameters is derived. The performance of this generalized model is compared on both simulated and real data to a Rasch family of polytomous item response models. Simulated data were generated and then analyzed by the various polytomous item response models. The results demonstrate that the rating formulation of the GPCM is quite adaptable to the analysis of polytomous item responses. The real data used in this study consisted of the National Assessment of Educational Progress (Johnson & Allen, 1992) mathematics data that used both dichotomous and polytomous items. The PCM was applied to these data using both constant and varying slope parameters. The GPCM, which provides for varying slope parameters, yielded better fit to the data than did the PCM.
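In a common parameterization (notation mine), the generalized partial credit model gives the probability of a response in category x of item j as:

P(X_{ij} = x \mid \theta_i) = \frac{\exp\left[\sum_{v=1}^{x} a_j\left(\theta_i - b_{jv}\right)\right]}{\sum_{c=0}^{m_j}\exp\left[\sum_{v=1}^{c} a_j\left(\theta_i - b_{jv}\right)\right]},

with the empty sum for x = 0 taken as zero; a_j is the item slope and b_{jv} the v-th step parameter (often decomposed into a location and a threshold). The partial credit model is the special case in which all a_j are equal (typically fixed to 1).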
Article
Generalized linear item response theory is discussed, which is based on the following assumptions: (1) A distribution of the response occurs according to given item format; (2) the item responses are explained by 1 continuous or nominal latent variable and p latent as well as observed variables that are continuous or nominal; (3) the responses to the different items of a test are independently distributed given the values of the explanatory variables; and (4) a monotone differentiable function g of the expected item response τ is needed such that a linear combination of the explanatory variables is a predictor of g(τ). It is shown that most of the well-known psychometric models are special cases of the generalized theory and that concepts such as differential item functioning, specific objectivity, reliability, and information can be subsumed under the generalized theory.
Article
States that both the popular creativity tests, such as the Torrance Tests of Creative Thinking, and the subjective assessment techniques used in some previous creativity studies are ill-suited to social psychological studies of creativity. A consensual definition of creativity is presented, and as a refinement of previous subjective methods, a reliable subjective assessment technique based on that definition is described. The results of 8 studies testing the methodology in elementary school and undergraduate populations in both artistic and verbal domains are presented, and the advantages and limitations of this technique are discussed. The present methodology can be useful for the development of a social psychology of creativity because of the nature of the tasks employed and the creativity assessments obtained. Creativity assessment is discussed in terms of the divergent aims and methods of personality psychology and social psychology.
Article
Generalizability theory consists of a conceptual framework and a methodology that enable an investigator to disentangle multiple sources of error in a measurement procedure. The roots of generalizability theory can be found in classical test theory and analysis of variance (ANOVA), but generalizability theory is not simply the conjunction of classical theory and ANOVA. In particular, the conceptual framework in generalizability theory is unique. This framework and the procedures of generalizability theory are introduced and illustrated in this instructional module using a hypothetical scenario involving writing proficiency.
Article
A number of models for categorical item response data have been proposed in recent years. The models appear to be quite different. However, they may usefully be organized as members of only three distinct classes, within which the models are distinguished only by assumptions and constraints on their parameters. “Difference models” are appropriate for ordered responses, “divide-by-total” models may be used for either ordered or nominal responses, and “left-side added” models are used for multiple-choice responses with guessing. The details of the taxonomy and the models are described in this paper.
Article
A unidimensional latent trait model for responses scored in two or more ordered categories is developed. This “Partial Credit” model is a member of the family of latent trait models which share the property of parameter separability and so permit “specifically objective” comparisons of persons and items. The model can be viewed as an extension of Andrich's Rating Scale model to situations in which ordered response alternatives are free to vary in number and structure from item to item. The difference between the parameters in this model and the “category boundaries” in Samejima's Graded Response model is demonstrated. An unconditional maximum likelihood procedure for estimating the model parameters is developed.
Hamilton, W. K., & Mizumoto, A. (2017). IRTShiny: Item response theory via Shiny (Version 1.2). Retrieved from https://CRAN.R-project.org/package=IRTShiny
Linacre, J. M., Engelhard, G., Jr., Tatum, D. S., & Myford, C. M. (1994). Measurement with judges: Many-faceted conjoint measurement. International Journal of Educational Research, 21, 569-577. http://dx.doi.org/10.1016/0883-0355(94)90011-6
Myszkowski, N. (2019). jrt: Item Response Theory Modeling and Scoring for Judgment Data (Version 1.0.0) [Computer software]. Retrieved from https://CRAN.R-project.org/package=jrt
Zickar, M. J., & Broadfoot, A. A. (2009). The partial revival of a dead horse? Comparing classical test theory and item response theory. In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 37-59). New York, NY: Routledge.