Article (publisher preview available)

Judge Response Theory? A Call to Upgrade Our Psychometrical Account of Creativity Judgments

Authors: Nils Myszkowski (Pace University) and Martin Storme (Université Paris Descartes)

Abstract

The Consensual Assessment Technique (CAT), and more generally the use of product creativity judgments, is a central and actively debated method to assess product and individual creativity. Despite a constant interest in strategies to improve its robustness, we argue that most psychometric investigations and scoring strategies for CAT data remain constrained by a flawed psychometrical framework. We first describe how our traditional statistical account of multiple judgments, which largely revolves around Cronbach's α and sum/average scores, poses conceptual and practical problems (such as misestimating the construct of interest, misestimating reliability and structural validity, underusing latent variable models, and reducing judge characteristics to a source of error) that are largely imputable to the influence of classical test theory. Then, we propose that the item–response theory framework, traditionally used for multi-item situations, be transposed to multiple-judge CAT situations in Judge Response Theory (JRT). After defining JRT, we present its multiple advantages, such as accounting for differences in individual judgment as a psychological process (rather than as random error), giving a more accurate account of the reliability and structural validity of CAT data, and allowing the selection of complementary (not redundant) judges. The comparison of models and their availability in statistical packages are notably discussed as further directions.
Keywords: classical test theory, item–response theory, consensual assessment technique, creativity judgment, creativity assessment
Although various methods have been imagined to assess creativity, a substantial amount of research relies on Amabile's (1982) Consensual Assessment Technique (CAT), which consists of asking experts to evaluate creative products (Baer & McKool, 2009). Extensive research has provided a set of methodological guidelines on how to best collect accurate judgments of creative products. However, these methodological recommendations are often about how to better prepare (e.g., Storme, Myszkowski, Çelik, & Lubart, 2014) or select judges (e.g., Kaufman, Baer, Cole, & Sexton, 2008). In contrast, there have been far fewer investigations of how to examine the robustness of CAT data or how to obtain accurate composite scores for the measured attribute.
To examine the robustness of CAT data and to derive composite scores for the attribute, researchers generally compute, respectively, Cronbach's α across judges and sum (or average) scores to aggregate judgments into a single score (Baer & McKool, 2009). There have been uses of latent variable models of judgment data (e.g., Myszkowski & Storme, 2017; Silvia et al., 2008) and discussions on how to investigate CAT data (e.g., Stefanic & Randles, 2015), but the general measurement framework to adopt to investigate the psychometric properties of CAT data and to obtain composite scores has not yet been discussed.
In this article, we discuss the typical psychometric investigations of CAT and creativity judgments, as well as describe the recurring challenges encountered. We trace them back to the underlying framework of Classical Test Theory (CTT) and subsequently present the framework of Item–Response Theory (IRT) as a more coherent and useful approach to CAT data.
The Limitations of Our Current Psychometrical Practice
While the CAT is an important advance in the measurement of product creativity, the statistical techniques commonly employed, in both psychometric investigations and scoring strategies, result in critical challenges. In this section, we point out the main ones.
The Issues of Sum/Average Scoring
Typically, to aggregate the scores of judges in CAT and thus estimate a product's creativity (in other words, to achieve its measurement), researchers compute sums/averages across judgments ...

Nils Myszkowski, Department of Psychology, Pace University; Martin Storme, Laboratoire Adaptations Travail-Individu, Université Paris Descartes.

Correspondence concerning this article should be addressed to Nils Myszkowski, Department of Psychology, Pace University, Room 1315, 41 Park Row, New York, NY 10038. E-mail: nmyszkowski@pace.edu
Psychology of Aesthetics, Creativity, and the Arts, 2019, Vol. 13, No. 2, 167–175. © 2019 American Psychological Association. ISSN 1931-3896. http://dx.doi.org/10.1037/aca0000225
... The authors use up-to-date methods, mainly relying on human rater judgments, a classical paradigm in the field [1]. Nevertheless, their work suffers from the scatteredness and inconsistency of the recommendations on rater-mediated assessment in creativity research [5,11]. Notably, their methods, based on linear mixed models (with intraclass correlation coefficients), make unrealistic assumptions regarding their data. ...
... While I do not claim that the validity of their findings is jeopardized here, it is clear that our field has yet to develop a consistent and robust methodological approach for rater-mediated measurement. As I have argued [8,11], item-response theory (IRT) provides a unified framework that makes realistic assumptions and allows one to account for item and rater effects (varying severities, difficulties, discrimination, etc.), along with their interactions [11,13,14]. Further, seeing how dispersed the literature is on reliability estimation with human raters in our field [5], and how it is usually disjointed from scoring procedures (e.g., the same average score is used regardless of the reliability estimate) [11], having a framework that uses a single model to estimate attributes (e.g., to measure an artwork's quality) and their uncertainty (i.e., standard error/reliability) is a key advance. ...
... A common approach is to ask human raters to provide subjective judgements for each response, in the spirit of the classic Consensual Assessment Technique (CAT; Amabile, 1982). Although subjective scoring can be reliable and valid (Amabile, 1982; Kaufman et al., 2007; Myszkowski & Storme, 2019), it is also time-consuming and resource-intensive, slowing the pace of research, and acting as a barrier for researchers and practitioners without the human resources to support subjective scoring methods such as the CAT. Recently, researchers have begun to rigorously test whether verbal creativity assessment can be automated using machine learning, with encouraging signs of progress, including strong correlations between computational metrics and human ratings (Acar et al., 2021; Beaty & Johnson, 2021; Buczak et al., 2023; Dumas et al., 2021; Stevenson et al., 2020). ...
... Consistent with the tenets of Classical Test or 'true score' theory, which assumes that observed data consist of a true underlying value plus random error, evaluations are most commonly averaged across experts to estimate the creativity of a given product (or modeled as a latent variable capturing the shared variance across raters). However, more contemporary latent variable approaches like Judge Response Theory (JRT), an application of Item Response Theory to rating data, model evaluations as an underlying latent construct driven not only by the creativity of the product but also by the properties of the judges/raters (Myszkowski & Storme, 2019). When applied to divergent thinking tests, subjective scoring typically involves training raters to rate the originality of responses, using a continuous or ordinal scale (e.g., Silvia et al., 2008, 2009). ...
Article
Full-text available
The visual modality is central to both reception and expression of human creativity. Creativity assessment paradigms, such as structured drawing tasks (Barbot, 2018), seek to characterize this key modality of creative ideation. However, visual creativity assessment paradigms often rely on cohorts of expert or naïve raters to gauge the level of creativity of the outputs. This comes at the cost of substantial human investment in both time and labor. To address these issues, recent work has leveraged the power of machine learning techniques to automatically extract creativity scores in the verbal domain (e.g., SemDis; Beaty & Johnson, 2021). Yet, a comparably well-vetted solution for the assessment of visual creativity is missing. Here, we introduce AuDrA – an Automated Drawing Assessment platform to extract visual creativity scores from simple drawing productions. Using a collection of line drawings and human creativity ratings, we trained AuDrA and tested its generalizability to untrained drawing sets, raters, and tasks. Across four datasets, nearly 60 raters, and over 13,000 drawings, we found AuDrA scores to be highly correlated with human creativity ratings for new drawings on the same drawing task (r = .65 to .81; mean = .76). Importantly, correlations between AuDrA scores and human ratings surpassed those between drawings' elaboration (i.e., ink on the page) and human creativity ratings, suggesting that AuDrA is sensitive to features of drawings beyond simple degree of complexity. We discuss future directions, limitations, and link the trained AuDrA model and a tutorial (https://osf.io/kqn9v/) to enable researchers to efficiently assess new drawings.
... Judge response theory (JRT) refers to the adaptation of polytomous item response theory models [e.g., the graded response model (Samejima, 1969) or the generalized partial credit model (Muraki, 1992)] for human ratings in the context of creativity research (Myszkowski, 2021; Myszkowski & Storme, 2019). JRT explicitly models differences in the rating behavior of human judges, as reflected by severity (or leniency) effects and differences between raters with respect to their discrimination parameter. ...
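For reference, a generic way to write such a judge-level graded response model is sketched below; the notation is illustrative and not taken verbatim from the cited papers. Each judge j has a discrimination a_j and ordered thresholds (severities) b_jk, and θ_p is the latent creativity of product p.

```latex
% Graded response model transposed to judges (illustrative notation):
% theta_p : latent creativity of product p
% a_j     : discrimination of judge j
% b_{jk}  : threshold (severity) of judge j for category k
P(X_{pj} \geq k \mid \theta_p) = \frac{1}{1 + \exp\!\left[-a_j\left(\theta_p - b_{jk}\right)\right]},
\qquad
P(X_{pj} = k \mid \theta_p) = P(X_{pj} \geq k \mid \theta_p) - P(X_{pj} \geq k+1 \mid \theta_p).
```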
Article
Full-text available
Human ratings are ubiquitous in creativity research. Yet the process of rating responses to creativity tasks—typically several hundred or thousands of responses per rater—is often time-consuming and expensive. Planned missing data designs, where raters only rate a subset of the total number of responses, have recently been proposed as one possible solution to decrease overall rating time and monetary costs. However, researchers also need ratings that adhere to psychometric standards, such as a certain degree of reliability, and psychometric work with planned missing designs is currently lacking in the literature. In this work, we introduce how judge response theory and simulations can be used to fine-tune planning of missing data designs. We provide open code for the community and illustrate our proposed approach by a cost-effectiveness calculation based on a realistic example. We clearly show that fine-tuning helps to save rating time and monetary costs, while simultaneously targeting expected levels of reliability.
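The general logic of such a simulation can be sketched as follows. This is a deliberately simplified linear rater model rather than a fitted JRT model, and all names and parameter values are illustrative: reliability of the mean-score composite is approximated by the squared correlation between composites and simulated true scores, for different numbers of raters per product.

```python
import numpy as np

rng = np.random.default_rng(1)

n_products, n_judges, n_reps = 200, 6, 200

def simulate_reliability(n_raters_per_product):
    """Approximate reliability of mean ratings under a planned-missing design."""
    rels = []
    for _ in range(n_reps):
        theta = rng.normal(size=n_products)               # true product creativity
        severity = rng.normal(scale=0.5, size=n_judges)   # judge leniency/severity
        noise = rng.normal(scale=0.8, size=(n_products, n_judges))
        ratings = theta[:, None] + severity[None, :] + noise
        # Planned missingness: each product is rated by a random subset of judges
        scores = np.empty(n_products)
        for p in range(n_products):
            raters = rng.choice(n_judges, size=n_raters_per_product, replace=False)
            scores[p] = ratings[p, raters].mean()
        r = np.corrcoef(scores, theta)[0, 1]
        rels.append(r ** 2)   # squared correlation with the simulated true scores
    return np.mean(rels)

for k in (2, 3, 4, 6):
    print(f"raters per product = {k}  approx. reliability = {simulate_reliability(k):.2f}")
```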
... Although some research has found strong correlations between expert and novice CAT ratings, other studies have found much weaker relationships (Freeman et al., 2017; Kaufman et al., 2008). There is evidence that expert and novice ratings are most similar when less complex works are judged (Galati, 2015), and other researchers have stated that future research could use factor analysis to determine how judge characteristics influence scores (Myszkowski & Storme, 2019). CAT ratings can also vary depending on whether instructions are open-ended or explicitly direct judges to avoid overlapping criteria such as aesthetic appeal and technical proficiency (Jeffries, 2015). ...
... IRT assumes manifest variables (i.e., the response behavior to test items) and a latent variable (i.e., an underlying characteristic of the subjects). Despite its clear advantages (e.g., items with different difficulty levels and sample independence of test characteristics), IRT approaches usually assume only one latent variable, which is reflected in the correlation between the manifest variables [10][11][12][13]. ...
Article
Full-text available
Parallel test versions require a comparable degree of difficulty and must capture the same characteristics using different items. This can become challenging when dealing with multivariate items, which are, for example, very common in language or image data. Here, we propose a heuristic to identify and select similar multivariate items for the generation of equivalent parallel test versions. This heuristic includes: 1. inspection of correlations between variables; 2. identification of outlying items; 3. application of a dimension-reduction method, such as principal component analysis (PCA); 4. generation of a biplot (in the case of PCA, of the first two principal components) and grouping of the displayed items; 5. assignment of the items to parallel test versions; and 6. checking the resulting test versions for multivariate equivalence, parallelism, reliability, and internal consistency. To illustrate the proposed heuristic, we applied it to the items of a picture-naming task. From a pool of 116 items, four parallel test versions were derived, each containing 20 items. We found that our heuristic can help to generate parallel test versions that meet the requirements of classical test theory, while simultaneously taking several variables into account.
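A minimal sketch of steps 3 to 5 of such a heuristic, using scikit-learn; the item features, the number of versions, and the round-robin assignment rule are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical item pool: 40 items described by 5 variables (e.g., frequency, length, ...)
item_features = rng.normal(size=(40, 5))

# Steps 3-4: reduce dimensionality and place items in the plane of the first two PCs
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(item_features))

# Step 5: sort items along the first PC and deal them out round-robin, so each
# version receives items spanning the whole range of the item pool
n_versions = 4
order = np.argsort(pcs[:, 0])
versions = {v: order[v::n_versions].tolist() for v in range(n_versions)}

for v, items in versions.items():
    print(f"Version {v + 1}: items {items}")

# Step 6 (rough check): the versions should have similar mean positions in the PC plane
for v, items in versions.items():
    print(f"Version {v + 1} mean PC scores:", pcs[items].mean(axis=0).round(2))
```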
Preprint
Full-text available
Creative thinking is a primary driver of innovation in science, technology, engineering, and math (STEM), allowing students and practitioners to generate novel hypotheses, flexibly connect information from diverse sources, and solve ill-defined problems. To foster creativity in STEM education, there is a crucial need for assessment tools for measuring STEM creativity that educators and researchers can apply to test how different teaching approaches impact scientific creativity in undergraduate education. In this work, we introduce the Scientific Creative Thinking Test (SCTT). The SCTT includes three subtests that assess cognitive skills important for STEM creativity: generating hypotheses, research questions, and experimental designs. In five studies with young adults, we demonstrate the reliability and validity of the SCTT—including test-retest reliability and convergent validity with measures of creativity and academic achievement—as well as measurement invariance across race/ethnicity and gender. In addition, we present a method for automatically scoring SCTT responses, training the large language model Llama 2 to produce originality scores that closely align with human ratings—demonstrating STEM-specific, automated creativity assessment for the first time. The full SCTT, along with the code to automatically score it, are available on a repository in the Open Science Framework.
Article
Full-text available
In this three‐study investigation, we applied various approaches to score drawings created in response to both Form A and Form B of the Torrance Tests of Creative Thinking‐Figural (broadly TTCT‐F) as well as the Multi‐Trial Creative Ideation task (MTCI). We focused on TTCT‐F in Study 1, and utilizing a random forest classifier, we achieved 79% and 81% accuracy for drawings only (r = .57; .54), 80% and 85% for drawings and titles (r = .59; .65), and 78% and 85% for titles alone (r = .54; .65), across Form A and Form B, respectively. We trained a combined model for both TTCT‐F forms concurrently with fine‐tuned vision transformer models (i.e., BEiT), observing accuracy on images of 83% (r = .64). Study 2 extended these analyses to 11,075 drawings produced for MTCI. With the feature‐based regressors, we found a Pearson correlation with human labels (rs = .80, .78, and .76 for AdaBoost and XGBoost, respectively). Finally, the vision transformer method demonstrated a correlation of r = .85. In Study 3, we re‐analyzed the TTCT‐F and MTCI data with unsupervised learning methods, which worked better for MTCI than TTCT‐F but still underperformed compared to supervised learning methods. Findings are discussed in terms of research and practical implications featuring Ocsai‐D, a new in‐browser scoring interface.
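The feature-based branch of this kind of pipeline can be sketched generically as follows (synthetic features and ratings as placeholders; this is not the AuDrA or Ocsai code): a regressor is trained on drawing features and its predictions are correlated with held-out human ratings.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for hand-crafted drawing features (e.g., ink density, stroke count)
n_drawings = 500
features = rng.normal(size=(n_drawings, 6))
# Synthetic human creativity ratings, loosely related to the features
human_ratings = features @ rng.normal(size=6) + rng.normal(scale=1.0, size=n_drawings)

X_train, X_test, y_train, y_test = train_test_split(
    features, human_ratings, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
r, _ = pearsonr(pred, y_test)
print(f"Correlation between automated scores and held-out human ratings: r = {r:.2f}")
```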
Article
Item-response theory (IRT) represents a key advance in measurement theory. Yet, it is largely absent from curricula, textbooks and popular statistical software, and often introduced through a subset of models. This Element, intended for creativity and innovation researchers, researchers-in-training, and anyone interested in how individual creativity might be measured, aims to provide 1) an overview of classical test theory (CTT) and its shortcomings in creativity measurement situations (e.g., fluency scores, consensual assessment technique, etc.); 2) an introduction to IRT and its core concepts, using a broad view of IRT that notably sees CTT models as particular cases of IRT; 3) a practical strategic approach to IRT modeling; 4) example applications of this strategy from creativity research and the associated advantages; and 5) ideas for future work that could advance how IRT could better benefit creativity research, as well as connections with other popular frameworks.
Article
Full-text available
Immersive virtual reality (IVR) takes advantage of exponential growth in our technological abilities to offer an array of new forms of entertainment, learning opportunities, and even psychological interventions and assessments. The field of creativity is a driving force in both large-scale innovations and everyday progress, and embedding creativity assessment in IVR programs has important practical implications for future research and interventions in this field. Creativity assessment, however, tends to rely on either traditional concepts or newer, yet cumbersome, methods. Can creativity be measured within IVR? This study introduces the VIVA, a new IVR-based visual arts creativity assessment paradigm in which users create 3D drawings in response to a prompt. Productions are then rated with modern extensions of a classic product-based approach to creativity assessment. A sample of 67 adults completed the VIVA, which was further scored using item-response modeling. Results demonstrated the strong psychometric properties of the VIVA assessment, including its structural validity, internal reliability, and criterion validity with relevant criterion measures. Together, this study established a solid proof-of-concept of the feasibility of measuring creativity in IVR. We conclude by discussing directions for future studies and the broader importance and impact of this line of work for the field of creativity and virtual reality.
Article
Full-text available
Creativity assessment with open-ended production tasks relies heavily on scoring the quality of a subject's ideas. This creates a faceted measurement structure involving persons, tasks (and ideas within tasks), and raters. Most studies, however, do not model possible systematic differences among raters. The present study examines individual rater differences in the context of a planned-missing design and its association with reliability and validity of creativity assessments. It applies the many-facet Rasch model (MFRM) to model and correct for these differences. We reanalyzed data from 2 studies (Ns = 132 and 298) where subjects produced metaphors, alternate uses for common objects, and creative instances. Each idea was scored by several raters. We simulated several conditions of reduced load on raters in which they scored subsets of responses. We then compared the reliability and validity of original versus IRT-adjusted scores under various conditions of missing data. Results show that (a) raters vary substantially on the leniency–severity dimension, so rater differences should be modeled; (b) when different combinations of raters assess different subsets of ideas, systematic rater differences confound subjects' scores, increasing measurement error and lowering criterion validity with external variables; and (c) MFRM adjustments effectively correct for rater effects, thus increasing the correlations of scores obtained from partial data with scores obtained from full data. We conclude that MFRM is a powerful means to model rater differences and reduce rater load in creativity research.
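For reference, the many-facet Rasch model underlying such adjustments can be written in its rating-scale form as follows (generic notation, not copied from the study):

```latex
% Many-facet Rasch model (rating-scale form), illustrative notation:
% theta_n : ability/creativity of person n
% delta_i : difficulty of task or item i
% alpha_j : severity of rater j
% tau_k   : threshold between categories k-1 and k
\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k
```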
Article
Full-text available
Empirical studies in psychology commonly report Cronbach's alpha as a measure of internal consistency reliability despite the fact that many methodological studies have shown that Cronbach's alpha is riddled with problems stemming from unrealistic assumptions. In many circumstances, violating these assumptions yields estimates of reliability that are too small, making measures look less reliable than they actually are. Although methodological critiques of Cronbach's alpha are being cited with increasing frequency in empirical studies, in this tutorial we discuss how the trend is not necessarily improving methodology used in the literature. That is, many studies continue to use Cronbach's alpha without regard for its assumptions or merely cite methodological papers advising against its use to rationalize unfavorable Cronbach's alpha estimates. This tutorial first provides evidence that recommendations against Cronbach's alpha have not appreciably changed how empirical studies report reliability. Then, we summarize the drawbacks of Cronbach's alpha conceptually without relying on mathematical or simulation-based arguments so that these arguments are accessible to a broad audience. We continue by discussing several alternative measures that make less rigid assumptions, which provide justifiably higher estimates of reliability compared to Cronbach's alpha. We conclude with empirical examples to illustrate advantages of alternative measures of reliability, including omega total, Revelle's omega total, the greatest lower bound, and Coefficient H. A detailed software appendix is also provided to help researchers implement alternative methods.
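As a quick reference, the standard formulas for Cronbach's alpha and for omega total under a one-factor model (with loadings λ_i and uniquenesses θ_i) are:

```latex
% Cronbach's alpha for k items/judges, with item variances sigma_i^2 and total-score variance sigma_X^2:
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)
% McDonald's omega total under a one-factor model, with loadings lambda_i and uniquenesses theta_i:
\omega_t = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}
```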
Article
Full-text available
Three estimation methods with robust corrections—maximum likelihood (ML) using the sample covariance matrix, unweighted least squares (ULS) using a polychoric correlation matrix, and diagonally weighted least squares (DWLS) using a polychoric correlation matrix—have been proposed in the literature, and are considered to be superior to normal theory-based maximum likelihood when observed variables in latent variable models are ordinal. A Monte Carlo simulation study was carried out to compare the performance of ML, DWLS, and ULS in estimating model parameters, and their robust corrections to standard errors, and chi-square statistics in a structural equation model with ordinal observed variables. Eighty-four conditions, characterized by different ordinal observed distribution shapes, numbers of response categories, and sample sizes were investigated. Results reveal that (a) DWLS and ULS yield more accurate factor loading estimates than ML across all conditions; (b) DWLS and ULS produce more accurate interfactor correlation estimates than ML in almost every condition; (c) structural coefficient estimates from DWLS and ULS outperform ML estimates in nearly all asymmetric data conditions; (d) robust standard errors of parameter estimates obtained with robust ML are more accurate than those produced by DWLS and ULS across most conditions; and (e) regarding robust chi-square statistics, robust ML is inferior to DWLS and ULS in controlling for Type I error in almost every condition, unless a large sample is used (N = 1,000). Finally, implications of the findings are discussed, as are the limitations of this study as well as potential directions for future research.
Article
Full-text available
Intelligence assessment is often viewed as a narrow and ever-narrowing field, defined (as per IQ) by the measurement of finely distinguished cognitive processes. It is instructive, however, to remember that other, broader conceptions of intelligence exist and might usefully be considered for a comprehensive assessment of intellectual functioning. This article invokes a more holistic, systems theory of intelligence—the theory of successful intelligence—and examines the possibility of including in intelligence assessment a similarly holistic measure of creativity. The time and costs of production-based assessments of creativity are generally considered prohibitive. Such barriers may be mitigated by applying the consensual assessment technique using novice raters. To investigate this possibility further, we explored the question: how much do demographic factors such as age and gender and psychological factors such as domain-specific expertise, personality or self-perceived creativity affect novices' unidimensional ratings of creativity? Fifty-one novice judges from three undergraduate programs, majoring in three disparate expertise domains (i.e., visual art, psychology and computer science), rated 40 child-generated Lego creatures for creativity. Results showed no differences in creativity ratings based on the expertise domains of the judges. However, judges' personality and self-perception of their own everyday creativity appeared to influence the way they scored the creatures for creativity.
Article
Full-text available
The Consensual Assessment Technique is a powerful tool used by creativity researchers in which panels of expert judges are asked to rate the creativity of creative products such as stories, collages, poems, and other artifacts. Experts in the domain in question serve as judges; thus, for a study of creativity using stories and poems, a panel of writers and/or teachers of creative writing might judge the creativity of the stories, and a separate panel of poets and/or poetry critics might judge the creativity of the poems. The Consensual Assessment Technique is based on the idea that the best measure of the creativity of a work of art, a theory, a research proposal, or any other artifact is the combined assessment of experts in that field. Unlike other measures of creativity, such as divergent-thinking tests, the Consensual Assessment Technique is not based on any particular theory of creativity, which means that its validity (which has been well established empirically) is not dependent upon the validity of any particular theory of creativity. This chapter explains the Consensual Assessment Technique, discusses how it has been used in research, and explores ways it might be employed in assessment in higher education.
Article
Full-text available
The purpose of this study was to explore the reliability of measures of both individual and group creative work using the consensual assessment technique (CAT). CAT was used to measure individual and group creativity among a population of pre-service music teachers enrolled in a secondary general music class (n = 23) and was evaluated from multiple perspectives for reliability. Consistency was calculated using Cronbach's alpha. Judges were found to be highly consistent for individual creativity (α = .90), individual craftsmanship (α = .87), group creativity (α = .86) and group craftsmanship (α = .81). Judges were much less consistent with their ratings of aesthetic sensitivity for individual compositions (α = .67) or group performances (α = .69). Absolute agreement was calculated using the intraclass correlation coefficient (ICC). Judges were found to be highly in agreement for individual creativity (α = .79), individual craftsmanship (α = .83), group creativity (α = .87) and group craftsmanship (α = .83). Judges were much less in agreement with their ratings of aesthetic sensitivity for individual compositions (α = .57) or group performances (α = .71). Judges' ratings for individual creativity were consistent over time, as evidenced by test-retest reliabilities of .89 (creativity), .83 (craftsmanship) and .79 (aesthetic sensibility). Results indicate, in agreement with prior research, that CAT is a reliable measure of creativity. The researchers introduce the idea that absolute agreement might be a worthwhile construct to explore in future work in the measurement of creativity in music education.
Article
This monograph is a part of a more comprehensive treatment of estimation of latent traits, when the entire response pattern is used. The fundamental structure of the whole theory comes from the latent trait model, which was initiated by Lazarsfeld as the latent structure analysis [Lazarsfeld, 1959], and also by Lord and others as a theory of mental test scores [Lord, 1952]. Similarities and differences in their mathematical structures and tendencies were discussed by Lazarsfeld [Lazarsfeld, 1960] and the recent book by Lord and Novick with contributions by Birnbaum [Lord & Novick, 1968] provides the dichotomous case of the latent trait model in the context of mental measurement.