Manuscript Post-Print
Citation:
Acar, S*., Organisciak, P*., & Dumas, D. (2024). Automated scoring of figural tests of creativity
with computer vision. Journal of Creative Behavior. Advance Online
Publication. https://doi.org/10.1002/jocb.677
Formerly titled: “A Comparison of Supervised and Unsupervised Learning Methods in
Automated Scoring of Figural Tests of Creativity”
Selcuk Acar1*, Peter Organisciak2*, Denis Dumas3,
1 Department of Educational Psychology, University of North Texas
2Library and Information Science, University of Denver
3 Department of Educational Psychology, The University of Georgia
Author’s Note
*Dr. Selcuk Acar and Dr. Peter Organisciak contributed equally to this work and share first-
authorship. We have no known conflict of interest to disclose. Data availability statement: Data
and analyses are available at
https://osf.io/q82ax/?view_only=69bff72648c64c6c992420b971e716ad. Input drawings for
TTCT are not shared due to copyright protection.
Corresponding author: Selcuk Acar, Department of Educational Psychology, Matthews
Hall, Room 304E, 1300 W. Highland Street, Denton, TX 76201.
selcuk.acar@unt.edu
Dr. Peter Organisciak: peter.organisciak@du.edu
Dr. Denis Dumas: denis.dumas@uga.edu
April 27, 2024
Abstract
In this three-study investigation, we applied various approaches to score drawings created in
response to both Form A and Form B of the Torrance Tests of Creative Thinking-Figural
(broadly TTCT-F) as well as the Multi-Trial Creative Ideation task (MTCI). We focused on
TTCT-F in Study 1, and utilizing a random forest classifier, we achieved 79% and 81% accuracy
for drawings only (r=.57; .54), 80% and 85% for drawings and titles (r=.59; .65), and 78% and
85% for titles alone (r=.54; .65), across Form A and Form B, respectively. We trained a
combined model for both TTCT-F forms concurrently with fine-tuned vision transformer models
(i.e., BEiT) observing accuracy on images of 83% (r=.64). Study 2 extended these analyses to
11,075 drawings produced for MTCI. With the feature-based regressors, we found a Pearson
correlation with human labels (rs = .80, .78, and .76 for AdaBoost, Random Forests, and XGBoost, respectively).
Finally, the vision transformer method demonstrated a correlation of r=.85. In Study 3, we re-
analyzed the TTCT-F and MTCI data with unsupervised learning methods, which worked better
for MTCI than TTCT-F but still underperformed compared to supervised learning methods.
Findings are discussed in terms of research and practical implications featuring Ocsai-D, a new
in-browser scoring interface.
Keywords: Creativity assessment; divergent thinking; Torrance Tests of Creative Thinking;
Multi-Trial Creative Ideation task; Supervised Learning; Vision Transformers; Computer Vision
Automated Scoring of Figural Tests of Creativity with Computer Vision
Creativity is a key factor for an innovation-driven workforce (Petrone, 2018), ranked at
the highest level of thinking in Bloom’s revised taxonomy (Anderson et al., 2001), and listed
among 21st century skills that schools need to foster (Partnership for 21st Century Skills, 2008).
Creative thinking has gained global recognition in education, as evidenced by its inclusion in the
Program for International Student Assessment (PISA; OECD, 2022) assessments. Further,
creativity is a key component of gifted education and its assessment. The field of gifted
education has been moving toward universal screening for gifted identification (Card &
Giuliano, 2016; Peters, 2022), but the cost of assessments is a major challenge to the
implementation of these efforts (McBee et al., 2016). This is especially so for creativity
assessment tools such as the Torrance Tests of Creative Thinking-Figural (TTCT-F), as manual
scoring of these tasks takes time and extensive training.¹ Such classic creativity tests are open-ended: participants are instructed to generate original, creative, and unexpected responses. Scoring these responses can be complicated in the absence of an answer key.
¹ For example, scoring adds an additional cost of $8.60 per student booklet as of 2023 on top of the cost of test booklets.
Efforts to automatically score creativity assessments have recently gained widely
recognized traction (Acar & Runco, 2019; Plucker, 2022), but this process has a relatively long
history. To our knowledge, the first report of documented evidence to apply automated scoring to
creativity tests dates to Paulus et al. (1970), who developed predictions of major creativity indices, namely fluency, flexibility, and originality (the number, diversity, and unusualness of produced responses, respectively), based on text features such as the number of specific notations (e.g., commas, periods, question marks), the number of words and sentences, and average word length (see Forthmann & Doebler, 2022, for a replication study). After a long silence, the second wave
started with Forster and Dunbar’s (2009) application of latent semantic analysis to divergent
thinking tasks, which operationalized originality as the semantic distance between a divergent
thinking prompt word and each individual response for it and utilized open-access software
based on the Touchstone Applied Sciences corpus (Landauer & Dumais, 1997). This was later
followed by others that used associative networks (Acar & Runco, 2014) to determine if the
produced responses overlap with those already listed in these networks (original if not found on
this list of associates), and those who explored latent semantic analysis with various corpora
(Forthmann et al., 2019), examined construct validation of these scores (Dumas & Dunbar,
2014), and tracked the idea generation process and latency in divergent thinking (Hass, 2017a,
2017b). Dumas et al. (2020) showed improvements in semantic distance scores by adopting
Global Vectors for Word Representation (GloVe; Pennington et al., 2014) into scoring of
divergent thinking responses given for the Alternate Uses Test (Guilford, Merrifield, & Wilson,
1958), and later work by Acar, Berthiaume et al. (2023) extended this to the Just Suppose Test in the verbal form of the Torrance Tests of Creative Thinking. Semantic distance scores based on
GloVe can now be used as open-access tools for researchers (see Beaty & Johnson, 2021;
Organisciak & Dumas, 2020). This phase is also the first in automated scoring to benefit from
the use of transfer learning (Pan & Yang, 2009), where research groups release trained models, not just architectures or procedures, for further use downstream. This reduces the burden on a researcher of training a detailed foundational model of verbal or visual language from scratch and has allowed the emergence of larger and more nuanced models. An applied researcher can adopt a well-trained model, usually released alongside an architecture (e.g., the GloVe models accompanying the GloVe paper; Pennington et al., 2014), and either use it directly for a task or, as described next, adapt it from a strong foundation.
Applications of Supervised Learning
The defining characteristic of this second wave is the application of a semantic distance
framework to verbal divergent thinking prompts (predominantly Alternate Uses Test) to score
originality. Despite its promise, this application has a few obvious limitations: it has only been applicable to verbal prompts, and originality is broader than semantic distance (Wilson et al., 1953). This second wave also represents an unsupervised approach to scoring divergent thinking tests: scores are obtained by machine learning algorithms without any use of labeled datasets (i.e., no classifier distinguishing original from unoriginal responses) and without any intervention of human ratings. Supervised learning involves training models on labeled datasets,
where each input is associated with a corresponding output, enabling the algorithm to learn the
mapping between inputs and outputs. This contrasts with unsupervised learning, where models
are trained on unlabeled data to identify patterns and structures without explicit guidance. Here,
algorithms explore the data autonomously to uncover underlying relationships. Put briefly, supervised learning relies on labeled examples to learn patterns for prediction tasks, as has
been applied in recent creativity work (Buczak et al., 2022; Organisciak et al., 2023), whereas
unsupervised learning does not (Alpaydin, 2020).
Organisciak et al. (2023) applied a large language modeling approach, one kind of
supervised learning method, to score seven different Alternate Uses Test datasets that were rated
by two or more human judges. They showed that the scores from the fine-tuned large language
models surpassed the semantic distance scores in approximating the human ratings consistently
across these seven datasets (r = .81 vs r < .26). Newer large language models, such as ChatGPT
and GPT-4, even surpass semantic distance approaches without a fully-supervised training
process, and can be prompted few-shot with a few examples of human judgments (Organisciak et
al., 2023). The success of the supervised learning models in scoring creativity tasks was
documented in other recent work, as well. Buczak et al. (2022) compared three different machine
learning algorithms (i.e., Random Forest, XGBoost, and Support Vector Regression) and found that long responses and responses with greater semantic distance produced better estimates of originality (see also Forthmann & Doebler, 2022). Stevenson et al. (2020) followed a
hybrid approach where responses were first clustered based on semantic distance scores, which is
unsupervised learning, and then these responses were used to train the scoring algorithm to
automatically predict new scores.
Automated Scoring of Figural Responses
As previously mentioned, the summarized works above have focused on verbal divergent
thinking tasks whereas creativity tasks can also be figural. In fact, even some of the verbal
response tasks involve figural prompts such as Toy Improvement (Torrance, 1998) and Pattern
Meanings (Wallach & Kogan, 1965). Extending automated scoring methods to figural tasks is a
remarkable improvement to broaden the scope of creativity assessment. This is important at least
in two ways. First, performances on verbal and figural creativity tasks tend to be distinct
(Richardson, 1986) and are only moderately correlated (Clapham, 2004; Kim, 2017; Ulger,
2015). Thus, use of figural tests provides information about participants that is not supplied by
the verbal tests (and vice versa). Second, figural tests may provide a unique advantage for a more
inclusive assessment that could be expected to be less biased by culture or language background (Lohman & Gambrell, 2012).²
² The term "figural tests" encompasses various types of tasks. One category of such tests, exemplified by the Multi-Trial Creative Ideation task, relies solely on figural elements, with both prompts and responses confined to shapes, figures, or squiggles, devoid of titles or explanations. Another category involves tasks that utilize a figural prompt to elicit verbal responses, as seen in Line Meanings (Wallach & Kogan, 1965). A third category combines these approaches, with figural prompts and responses incorporating both figural and verbal components, such as completed drawings accompanied by titles. The TTCT-F falls into this third category because one of the five major scoring indices focuses on the verbal output, prompting consideration in future research of whether performance on tasks involving titles presents any disadvantages influenced by cultural or language backgrounds.
Creativity researchers have made remarkable progress in applying automated scores to
figural tests of creativity. Cropley and Marrone (2022) used image classification with
convolutional neural networks (Hussain et al., 2018; LeCun & Bengio, 1995), a machine
learning method that processes visual data, applied to the Test of Creative Thinking-Drawing Production (TCT-DP; Urban & Jellen, 1996). Cropley and Marrone (2022) tested five classification schemes, ranging from the simplest binary classification (low creative vs. high creative) to seven levels of classification (1 = far below average, 7 = phenomenal). After using these classifications to train the CNN algorithms, they found a high level of agreement between the automated and manual classifications across the various schemes (.83 to .94).
Patterson et al. (2023) applied similar methods to the Multi-Trial Creative Ideation task (MTCI;
Barbot, 2018), which consists of incomplete visual prompts that participants are directed to
turn into a meaningful drawing. Their dataset (13,000 total drawings, 1,104 for validation, 2,216
for test) was larger than Cropley and Marrone’s (414 total drawings; ~60 test drawings)
and used 50 human raters’ scores to train ResNet, a deep convolutional neural network (CNN;
He et al., 2016) with 70% of the dataset. Their Automated Drawing Assessment (AuDrA)
platform showed a strong correlation with human raters (average r = .76) and exceeded their
baseline comparison method, which defined creativity as the amount of “ink” on a white background, demonstrating that the model learned to discern the creativity in the drawings beyond the mere amount of ink.
The Present Study
Machine learning offers practical solutions for creativity assessment that can encourage
an affordable, sustainable, and inclusive assessment in both verbal and figural modalities.
Because remarkable improvements have already been accomplished in the verbal modality, we
focused on improving our methods in the figural tests. To facilitate this goal, the present work
complements work by Cropley and Marrone (2022) and Patterson et al. (2023) and applies
computational scoring methods to two figural creativity tests in Study 1 and Study 2, and uses
unsupervised learning methods in Study 3 to assess performance gains between the two
approaches. In Study 1, we compare multiple supervised machine learning systems on the
TTCT-F Forms A and B. In supervised learning, a system is trained to learn from labels previously generated by human raters, and its generalizability is subsequently evaluated on held-out data. Two
bodies of supervised methods are evaluated. First, a feature-based classification approach is
employed, trained on image features extracted with a model called CLIP, for Contrastive
Language-Image Pre-training (Radford et al., 2021). The classifiers evaluated are Random
Forest trees (Breiman, 2001), AdaBoost (Freund & Schapire, 1997), and XGBoost (Chen &
Guestrin, 2016). Random Forests and XGBoost have previously shown strong results with
automated scoring of verbal originality (Buczak et al., 2022). All of these are improvements to
decision trees, where a model learns a set of decision points that branch toward a decision. In
each case, the algorithm receives a numeric representation of the input figural response (the CLIP embedding) and tries to learn a tree for predicting its originality. Since decision trees can be led down a garden path toward a non-optimal decision, the three approaches are ensemble learners that learn their final model from multiple iterations of tentative ‘weak’ trees.
The second body of supervised learning approaches that we evaluate employs fine-tuned
vision transformer models, specifically the Vision Transformer (ViT) and Bidirectional Encoder
representation from Image Transformers (BEiT) models (Bao et al., 2021; Dosovitskiy et al.,
2021). This is adopted as our primary approach, which we refer to as Open Creativity Scoring with Artificial Intelligence for Drawings (Ocsai-D). Vision Transformers are a distinct approach from CNNs, breaking images down into small patches that are processed as sequences, similar to how modern large language models handle text (Vaswani et al., 2017). In
contrast to convolutional neural networks (CNNs) such as MobileNet, applied by Cropley and
Marrone (2022), and ResNet, applied by Patterson et al. (2023), Vision Transformers scale up to
bigger models more capably and can understand a full image at once, allowing a higher ceiling
on their potential performance. Whereas CNNs start by interpreting relationships in localized
sections of an image, a vision transformer considers relationships across the full image at once,
using a mechanism called self-attention to determine what requires more focus (Vaswani et al.,
2017). Training and evaluations for Study 1 used an 85/5/10 train/validation/test split. For the feature classifiers, the validation data was not needed and was included with the testing data.
In Study 2, we apply the same supervised learning methods to a new context, the Multi-
Trial Creative Ideation (MTCI; Barbot, 2018), on the dataset previously studied in Patterson et
al. (2023). In addition to offering a look at the generalizability of our supervised learning
methods across tests, evaluating on this dataset builds on and provides a direct comparison with
Patterson et al. (2023), achieving state-of-the-art performance. The MTCI does not include
participant-generated text labels, so supervised learning is exclusively on visual elements.
In Study 3, we evaluate unsupervised methods relying on semantic distance, akin to a
popular approach from automated verbal originality measurement (e.g., Acar, Berthiaume et al.,
2023; Beaty & Johnson 2021; Dumas et al., 2021). Unsupervised methods have the benefit of not
being dependent on needing human-judged training responses. Here we use semantic distance in
CLIP between figural responses and the original prompt as a proxy for originality, evaluating
three different instantiations of the approach: distance from a blank prompt, distance from an
average response, and distance from the TTCT-F zero-originality list. Whereas traditional
computer vision models learn from visual similarities, the combined training alongside text leads
distances in CLIP to be more semantically oriented.
Across all three studies, the reported values reflect response-level analyses that are averaged for the purpose of parsimony rather than individual-level analyses where responses
given by a participant are aggregated by summation or through a latent variable model.
Study 1: Supervised Scoring on TTCT-F
In Study 1, we evaluated computational scores obtained from supervised learning
methods with the TTCT-F data. TTCT-F is the most well-known test in the field of creativity
that has age and grade-based norms, allowing gifted identification to be conducted based on a
threshold determined by the local schools or school districts. Automation of TTCT-F could save
schools’ limited gifted education resources from assessment and allow them to be allocated to tasks less prone to automation, such as teaching (Frey & Osborne, 2017). This change could also
encourage more schools to bring creativity assessment back to gifted identification due to its
affordability. Further, these newly developed machine learning methods can lead to the
development of a new generation of creativity assessments that are even more affordable due to the lack of necessity to buy commercial tests of creativity such as the TTCT.³
³ Test booklets and scoring are sold separately. Even when scoring is free, test booklets still cost some money.
Structure of TTCT-F
TTCT-Figural is a timed test of creative thinking comprising three activities, each taking
10 minutes. Activity 1 involves a single prompt where respondents are asked to construct a
picture using the visual prompt. Activity 2 consists of 10 visual prompts spread over two pages, where each individual prompt is completed into a picture or drawing. Participants are
asked to provide a title for each drawing, which is used for scoring alongside the drawings.
TTCT-F is scored for five major indices and 13 creativity strengths. The five major indices are (a) fluency, scored in Activities 2 and 3 as the number of meaningful, non-repeating drawings; (b) originality, scored on all three activities against zero-originality lists of commonly produced responses, which identify the responses that are not on the list; (c) elaboration, scored on all three activities based on the amount of detail and elegance in the drawing responses; (d) abstractness of titles, scored on Activities 1 and 2 in terms of how abstract the title given for a drawing is; and (e) resistance to premature closure, scored only on Activity 2 as openness to exploring a variety of ideas rather than jumping to quick conclusions. These five major indices are the backbone of the total creativity score and can only
be quantified by a trained expert. The training process takes about two full days of training followed by a test cycle in which trainees submit their ratings to prove their proficiency. According to the
anecdotal evidence from those with the certification (including the first author of the current
paper), at least several cycles of evaluation and feedback lasting for a few weeks to months
precede the certification. This explains the cost of scoring facing the schools that intend to use
TTCT-F for gifted identification.
Research Question
Supervised learning seeks to learn the parameters of a task from held-out examples of the
desired response. It can be trained as either a regressor, which predicts a continuous variable, or a classifier, which predicts discrete classes. The TTCT-F is a binary classification
challenge, where a response is either original (1) or not original (0). Here, we posed the
following research questions:
1. Can a classification model be trained on the combination of image-derived computer
measures and manual judgments to predict originality?
2. Is image classification on the TTCT-F improved by the inclusion of participant-provided
text titles for responses?
TTCT-F provides both image and text features that can be used to train a classifier. Here
we examine if images (produced drawings), text (titles of the drawings), and their combined use
(titles and drawings) result in greater scoring accuracy. Further, we compared two bodies of
approaches: a) traditional feature-based machine learning classifiers on CLIP outputs and b)
transfer learning by fine-tuning large computer vision models, specifically ones using Vision
Transformer architectures, to our target domain. The final approach, which we refer to as Open
Creativity Scoring with AI for Drawings (Ocsai-D), is adopted as our primary contribution. We also examined whether some of these methods work better with TTCT-F data than others.
Method
Participants
Our research team used a specific dataset from Engerman et al. (2010), where expert
raters coded individual scores for each prompt, allowing us to examine responses at the prompt
level. The sample consisted of 464 freshman college students who were enrolled in a Historically
Black College and University, with a 75.9% Black student population. They completed forms A
and B of TTCT-F as part of a STEM intervention project. Our research team had no access to
individually identifiable demographic information of the study participants.
Measures
TTCT-F Form A. TTCT-F includes three activities, but we focused on Activities 1 and 2
in the dataset because the scores and drawings for each individual prompt were more easily
analyzed than those in Activity 3. We conducted the analyses at the prompt level. Both Activities
1 and 2 are scored on originality and elaboration, which we used in our present analyses.
Procedures
We evaluated two approaches to training classifiers to predict figural originality from
prior information. The first approach applies traditional feature-based machine learning
classifiers on the vector outputs of multimodal models, specifically CLIP. The second approach
finetunes pre-trained vision transformer models, specifically ViT (Dosovitskiy et al., 2021) and
BEiT (Bao et al., 2021).
Feature-based classifiers seek to predict an output variable, a binary originality class in
our case, from a set of input variables. For the input features, we adopt the multimodal CLIP
model (Radford et al. 2021), specifically the CLIP ViT-B/32 sized model. With CLIP, each
image is assigned a place in a 512-dimensional space, and the embedding is used as the input
features for classifiers. The primary value of CLIP is that it is a multi-modal model capable of
projecting text and images into the same space. This encourages ‘nearness’ in the model to be
more semantic than in a model trained solely on visual features, and allows text and images to
be compared directly. CLIP achieves this by learning from a large dataset of images paired with
descriptive texts. This training process leverages the symbolic nature of written language, a
system of signifiers that represents concepts (de Saussure, 1913). Language thus serves as a
bridge linking different visual representations through shared meanings. For example, consider
three images: a cartoon of a dog, a cubist painting of a dog, and a photograph of a dog. All three
images may vary visually, but just like the preceding sentence itself demonstrates, the language
ties them together as representations of dogs.
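As a minimal sketch of this feature-extraction step (assuming the open-source Hugging Face implementation of CLIP and its public ViT-B/32 checkpoint rather than the authors' exact tooling; the file name and example title are hypothetical):

```python
# Minimal sketch: extract CLIP ViT-B/32 embeddings for a drawing and its title.
# Assumes the Hugging Face `transformers` CLIP implementation; the authors'
# exact pipeline may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("response_drawing.png").convert("RGB")   # hypothetical response file
title = "a dancing tree"                                    # hypothetical participant title

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)    # shape (1, 512)

    text_inputs = processor(text=[title], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)       # shape (1, 512)

# Concatenating both views yields the 1,024-dimensional image + text feature vector.
combined = torch.cat([image_emb, text_emb], dim=-1)         # shape (1, 1024)
```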
For the first research question we evaluate image features. However, TTCT-F figures are
accompanied by a text caption, and for the second research question, classification on CLIP-derived text embeddings, as well as on a 1,024-dimensional concatenation of the text and image embeddings, is also evaluated.
Three types of feature-based classifiers are compared: Random Forest trees (Breiman, 2001),
AdaBoost (Freund & Schapire, 1997), and XGBoost (Chen & Guestrin, 2016). Each is an
ensemble classifier, which combines the predictive power of multiple smaller models to generate
more accurate and robust predictions than a single model. XGBoost and Random Forests have
previously been used in verbal originality scoring (Buczak et al., 2022). As ensemble classifiers,
they are instantiated with a number of ‘weak learner’ estimators which are used to compose the
final strong model. Each experiment reported here is instantiated with 300 estimators and trained
directly with CLIP features, varying between experiments on whether CLIP-encoded images,
text or a concatenation of both was used. Learning rate was set to 0.2 for XGBoost and 1.0 for
AdaBoost, and parameters were kept constant across experiments, without employing hyper-
parameter tuning.
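A minimal sketch of this classifier setup, assuming CLIP features for one activity's responses have already been extracted to an array and using scikit-learn and xgboost with the hyperparameters reported above (file names are hypothetical):

```python
# Sketch of the feature-based classifiers on CLIP embeddings: 300 estimators,
# learning rate 0.2 for XGBoost and 1.0 for AdaBoost, no hyperparameter tuning.
# X is an (n_responses, 512) or (n_responses, 1024) array of CLIP features for
# one activity; y holds the binary originality labels. File names are hypothetical.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.load("activity2_clip_features.npy")
y = np.load("activity2_originality_labels.npy")
# 15% held out, matching validation + test being pooled for the feature classifiers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=300),
    "adaboost": AdaBoostClassifier(n_estimators=300, learning_rate=1.0),
    "xgboost": XGBClassifier(n_estimators=300, learning_rate=0.2),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred), f1_score(y_test, pred))
```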
Vision transformer models are a novel type of deep neural network computer vision
model trained with an architecture called a transformer (Vaswani et al., 2017), which originated in text modeling. For example, in the BEiT architecture we apply, images are split into non-overlapping 16x16-pixel square patches, where each unique patch is assigned a unique token.
These patch tokens are then fed sequentially into a transformer-based model, which aims to learn
relationships between patches in a self-supervised way, masking (hiding from the model)
random patches. The objective of the model is to generate a good guess at what the missing patch
is, based on the surrounding context. The transformer architecture employs a concept called
attention (Bahdanau et al., 2016), which offers an efficient way to model sequences by
selectively focusing more on certain parts of the sequence (e.g., parts of the image) than others.
Altogether, this initial training step of vision transformer models functionally resembles how
large language models are trained, and the naming of Bidirectional Encoder Representations
from Image Transformers (BEiT, Bao et al., 2021) is an allusion to the first notable transformer-
based language model, Bidirectional Encoder Representations from Transformers (BERT,
Devlin et al., 2018).
As its input, BEiT takes the images of responses broken down into 16x16-pixel patches; each unique patch is assigned a token ID, and the sequence of patches is used in the
model. The pre-trained BEiT model that we adopt from Bao et al. (2023) was trained to predict
masked patches in a Transformer architecture (Vaswani et al., 2017): that is, the relationship of
elements in an image was learned by seeing if a set of hidden patches can be predicted based on
the context of the patches around them. To that initial pre-trained model, having learned image
space in a self-supervised way but without any supervised semantic understanding yet, Bao et al. (2023) attached a classification layer and trained it on a large collection of labeled images; the models we adopt were trained on the ImageNet-22k class dataset. In this work, we take that model, replace the final classification layer with our own originality classification layer while keeping all other layers and weights unfrozen, and then train the model on the TTCT-F training dataset using the same architecture as the original models (i.e., ViT or BEiT).
Training was done in 9 epochs, with the AdamW optimizer (Loshchilov & Hutter 2019), a max
learning rate of 5e-5, and a scheduler that did a warmup for 20% of the total steps. Batch size
and gradient accumulation steps were varied across model sizes, to allow for bigger batches
when GPU memory allowed: 20 and 8 for larger models, 48 and 4 otherwise.
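A minimal sketch of this fine-tuning setup, assuming the Hugging Face transformers implementation of BEiT and a public ImageNet-22k checkpoint (the checkpoint name, file names, and data layout are illustrative assumptions, not the authors' exact code):

```python
# Sketch of the fine-tuning described above: replace the pre-trained head with a
# binary originality head, then train for 9 epochs with AdamW, a max learning
# rate of 5e-5, 20% linear warmup, and gradient accumulation (20 x 8 shown here,
# as reported for the larger models). File names and CSV layout are hypothetical.
import pandas as pd
import torch
from PIL import Image
from torch.optim import AdamW
from transformers import (BeitForImageClassification, BeitImageProcessor,
                          get_linear_schedule_with_warmup)

checkpoint = "microsoft/beit-large-patch16-224-pt22k-ft22k"   # assumed checkpoint name
processor = BeitImageProcessor.from_pretrained(checkpoint)
model = BeitForImageClassification.from_pretrained(
    checkpoint,
    num_labels=2,                  # new originality classification layer
    ignore_mismatched_sizes=True,  # discard the old classification layer's weights
)

meta = pd.read_csv("ttct_train_labels.csv")   # hypothetical: columns "file", "original"
# Loads every drawing into memory at once; acceptable for a sketch.
inputs = processor([Image.open(f).convert("RGB") for f in meta["file"]], return_tensors="pt")
labels = torch.tensor(meta["original"].tolist())

num_epochs, batch_size, accum = 9, 20, 8
steps_per_epoch = (len(labels) + batch_size - 1) // batch_size
total_updates = max(1, num_epochs * steps_per_epoch // accum)
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, int(0.2 * total_updates), total_updates)

model.train()
for epoch in range(num_epochs):
    for step, start in enumerate(range(0, len(labels), batch_size)):
        batch = inputs["pixel_values"][start:start + batch_size]
        out = model(pixel_values=batch, labels=labels[start:start + batch_size])
        (out.loss / accum).backward()          # accumulate gradients over `accum` batches
        if (step + 1) % accum == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```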
Data and analyses are available at
https://osf.io/q82ax/?view_only=69bff72648c64c6c992420b971e716ad. Input drawings for
TTCT-F are not shared for copyright reasons.
Results
As previously explained, we ran the analyses adopting two approaches: a feature-based classification approach and fine-tuned vision transformer models. Below, we report our findings
for these two approaches, respectively.
Feature Classification Methods
We tested the performance of three different classifiers (AdaBoost, Random Forests,
XGBoost) in three conditions of response: image, text, and image and text. For each approach,
an individual classifier was trained for each activity. Table 1 presents the results of feature
classification, shown separately for the two different booklets of the TTCT-F, and reported for
the image-only models as well as the text-only and text + image model. The differences between
classifiers were small, and Random Forests are reported here; full results with XGBoost and
Adaboost are presented in the Appendix. Reported is accuracy, the total correct proportion of
class judgements, as well as precision, recall, and the F1 Score, which provide a fuller picture of
performance beyond overall correctness. Precision is the ratio of predicted positive classes that are truly positive, recall is the ratio of correctly predicted labels in a class to the total number of true labels
in that class, and F1 is the harmonic mean between the precision and recall measures. We also
report Pearson correlations, as r. Finally, we provided a weighted mean of results from both
booklets, reported as combined.
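For reference, a small sketch of how these metrics can be computed for one form and modality, assuming held-out binary labels and one classifier's predictions (e.g., from the feature-classifier sketch in the Procedures; file names are hypothetical):

```python
# Sketch of the metrics reported in Table 1 for one form and modality.
# `y_true` and `y_pred` are held-out binary labels and classifier predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.load("form_a_test_labels.npy")        # hypothetical held-out labels
y_pred = np.load("form_a_test_predictions.npy")   # hypothetical classifier predictions

accuracy = accuracy_score(y_true, y_pred)    # total correct proportion of class judgements
precision = precision_score(y_true, y_pred)  # predicted-original responses that are truly original
recall = recall_score(y_true, y_pred)        # truly original responses recovered by the model
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
r, _ = pearsonr(y_true, y_pred)              # Pearson correlation, reported as r
# The 'combined' rows weight each form's metrics by its sample size.
```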
Table 1
Feature Classification on TTCT-F with Random Forest Trees
TTCT-F Form   Modality        Accuracy   Precision   Recall   F1
Form A        Image             .79        .77         .92    .84
Form A        Image & Text      .80        .78         .92    .85
Form A        Text              .78        .76         .90    .83
Form B        Image             .81        .80         .95    .87
Form B        Image & Text      .85        .84         .96    .90
Form B        Text              .85        .84         .96    .89
Combined      Image             .80        .78         .93    .85
Combined      Image & Text      .82        .81         .94    .87
Combined      Text              .81        .80         .93    .86
N = 3,000 (Form A figures), 2,790 (Form A figures with text), 2,572 (Form B figures), 2,430 (Form B figures with text); combined measures are sample weighted; TTCT-F = Torrance Tests of Creative Thinking-Figural.
Fine-tuned Vision Transformer Models
Here, we report on the results of our computer vision model fine-tuning. In contrast to the
feature-based classification approach, where entirely new classifiers are trained on latent CLIP
representations of images, here a pre-trained deep neural network, the BEiT model (Bao et al., 2023), is unlocked and ‘fine-tuned’ to our problem space. There are different BEiT models
available with different instantiations of model size, image size, and pre-training parameters.
Here, we use 224 x 224 pixel inputs, adopting models that were pre-trained on the ImageNet 22k
dataset (Russakovsky et al., 2015). While we expect that larger BEiT models would perform
better (Bao et al., 2023), a comparison of the difference between the ‘base’ and ‘large’ model
sizes is also provided, as well as a comparison of BEiT with an antecedent vision transformer
model, ViT (Dosovitskiy et al., 2021).
Training is performed with up to 10 epochs, with early stopping possible based on
validation data. We compared a variety of image input transformation techniques to add
robustness to the models, such as randomized horizontal flips or color inversions. However, for a
standardized task that is very dependent on the relationship of added ink to the base prompt ink,
these did not aid performance, and only models with untransformed inputs are reported. A
benefit of untransformed images is that the inputs did not need to be tagged with any additional
encoding for marking which question prompt an image is associated with. Rather, each prompt is
recognizable through the pixels that remain consistently dark across all responses to that prompt.
TTCT-F only classifies originality as a binary variable, and the model in turn treats the
problem in the same manner. However, our model’s classification layer uses softmax, where it
predicts the likelihood of each class on a continuous scale, and the final label is assigned from the
most likely class. Thus, we can also measure Pearson correlation with the raw continuous scores,
before the class assignment, to see more signal about how well the model understands the task
before it must flatten its prediction to a binary class. This is labelled ‘continuous r’ in the results.
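A short sketch of how this continuous score can be obtained, continuing the fine-tuning sketch in the Procedures section (`model` and `processor` come from that sketch; file names are hypothetical):

```python
# Sketch of the 'continuous r' measure: correlate the softmax probability of the
# "original" class with the binary human labels before flattening to a class.
import numpy as np
import pandas as pd
import torch
from PIL import Image
from scipy.stats import pearsonr

test_meta = pd.read_csv("ttct_test_labels.csv")   # hypothetical: columns "file", "original"
pixel_values = processor([Image.open(f).convert("RGB") for f in test_meta["file"]],
                         return_tensors="pt")["pixel_values"]
human = np.array(test_meta["original"])

model.eval()
with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits     # shape (n_test, 2)
probs = torch.softmax(logits, dim=-1)[:, 1].numpy()      # P(original), a continuous 0-1 score
preds = (probs >= 0.5).astype(int)                       # the flattened binary class

continuous_r, _ = pearsonr(probs, human)                 # 'continuous r' in Table 2
binary_r, _ = pearsonr(preds, human)                     # r on the assigned classes
```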
As seen in Table 2, BEiT achieved superior performance to ViT, specifically with
respect to the F1, and BEiT Large in particular was comparable to the combined random forest
trees classifier reported in Table 1.
Table 2
Vision Transformer Results
Model         Accuracy   Precision   Recall   F1    r     Continuous r
BEiT            .79        .77         .96    .85   .55       .64
BEiT Large      .83        .83         .91    .87   .63       .64
ViT             .78        .77         .92    .84   .51       .55
ViT Large       .80        .82         .88    .84   .56       .60
BEiT = Bidirectional Encoder Representations from Image Transformers; ViT = Vision Transformer.
Discussion
Study 1, which employed feature-classification and vision transformer models, produced overlapping and promising results. Since F1 values are indicative of balanced overall
performance, we focus on them to compare the strength of various scoring options. Accordingly,
feature-classification models with random forest trees and BEiT Large have the same
performance (F1 = .87) when the former used both image (drawings) and text (titles given for the
drawings). There are relative benefits to each approach, a primary difference being that BEiT
Large resulted in one trained model for all items, while the random forests trained a classifier for
each item. Between the two forms, feature classification with random forests worked better for
Form B (F1 = .90) than Form A (F1 = .85), implying the possibility of variance in performance
in relation to the prompts and types of drawings they invoke. Finally, while the combination of
images and text consistently yielded the strongest results, both sources individually yielded
strong results, including models that ignored images and only considered text labels.
Study 2: Supervised Scoring on the MTCI
To test the usefulness of the supervised learning methods tested in Study 1 with a
different dataset, we evaluated the methods of Study 1 on drawings produced for the Multi-Trial
Creative Ideation task (MTCI; Barbot, 2018). The dataset was previously used in Patterson et al.
(2023) who applied supervised learning methods, using a trained ResNet (He et al., 2016) model,
with the same goal of automated originality scoring. An important aspect of this dataset was that
the responses were rated by a large group of judges, whose scores produced an aggregate
continuous score rather than a binary classification. Study 1 of the present paper as well as
Cropley and Marrone’s (2022) work both utilized ordered categorical originality outcomes, but
this might be a limitation to the effectiveness of model training. In the absence of such
limitations related to the nature of the outcome data, the performance of various supervised
learning methods can be more effectively assessed.
In Study 2, we addressed the following research question:
Can the models trained on the combination of computer measures and manual judgments in
Study 1 also predict creativity in the drawings produced for different figural tasks (i.e., Multi-
Trial Creative Ideation task)?
Method
Measures
Participants completed the drawings in response to MTCI (Barbot, 2018) “incomplete
shapes” task. Most of the drawings (~95%) were based on abstract stimuli that resemble Activity
2 or 3 in TTCT-F, and Line Meanings in Wallach and Kogan (1965). Each drawing was then
rated by 50 mostly female (60%) undergraduate raters (mean age = 19.4) whose scores reached an
average intraclass correlation of .90, indicating satisfactory consistency. Ratings were conducted
on a 5-point scale (1 = not at all creative; 5 = very creative). Raters were asked to focus on the idea
rather than the technical quality of the drawings. Judge Response Theory (JRT; Myszkowski &
Storme, 2019) was used to derive a final ground truth label reflecting the collective judgement of
the individual raters. JRT, which seeks to account for differences among rater habits, was found
to have a net-neutral effect on performance by Patterson et al. (2023); however, we retained it for
a more direct comparison to their work. Reconciling multiple judgments had not been necessary
in the previous study, where one expertly trained judge assessed each TTCT-F response.
Procedures
Training was largely consistent with Study 1, with a few updates to acknowledge
differences in the MTCI. Rather than a binary classification problem, training for both the feature-based and vision transformer approaches is treated as a regression problem, seeking to predict a
continuous variable. Additionally, there are no text labels associated with test responses, so the
CLIP-based feature extraction only evaluated image inputs. The vision transformer training
parameters were consistent with Study 1, with an increase to 12 epochs to reflect the larger dataset. As with Study 1, the MSE and RMSE measures are based on a score normalized to 0-1,
scaled linearly from the minimum to the maximum values in the data.
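A minimal sketch of these Study 2 adjustments, assuming the same Hugging Face BEiT setup sketched for Study 1 (the checkpoint and file names are illustrative assumptions):

```python
# Sketch of the Study 2 changes: a single-output regression head and a 0-1
# min-max normalization of the MTCI ratings used for MSE/RMSE.
import numpy as np
from transformers import BeitForImageClassification

ratings = np.load("mtci_ratings.npy")                                   # hypothetical aggregate (JRT-based) ratings
targets = (ratings - ratings.min()) / (ratings.max() - ratings.min())   # linear scaling to 0-1

model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-large-patch16-224-pt22k-ft22k",   # assumed checkpoint name
    num_labels=1,                                     # one continuous output
    problem_type="regression",                        # MSE loss instead of cross-entropy
    ignore_mismatched_sizes=True,
)

# After fine-tuning (12 epochs, otherwise as in Study 1), evaluation reduces to:
preds = np.load("mtci_test_predictions.npy")          # hypothetical model outputs on the test split
test_targets = np.load("mtci_test_targets.npy")       # hypothetical normalized test ratings
mse = float(np.mean((preds - test_targets) ** 2))
rmse = float(np.sqrt(mse))
```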
Results
As presented in Table 3, both feature classifiers with CLIP and vision transformers
reached a high accuracy, but overall the vision transformers slightly exceeded the feature classifiers.
Among the feature classifiers, AdaBoost performed the best (r = .80) whereas BEiT Large was
the best vision transformer (r = .85). These findings are comparable to the findings in Patterson
et al. (2023) and the vision transformer models proved slightly superior.
Table 3
Supervised Learning Results For MTCI
Classifier                          r     MSE      RMSE    R2
Baseline
  AuDrA (Patterson et al., 2023)   .80   0.0088   0.094   .45
Feature Classifiers with CLIP
  AdaBoost                         .80   0.0092   0.096   .62
  Random Forests                   .79   0.0094   0.097   .60
  XGBoost                          .77   0.0098   0.099   .59
Vision Transformers
  BEiT                             .82   0.0077   0.088   .52
  BEiT Large                       .85   0.0067   0.082   .63
  ViT                              .81   0.0081   0.090   .52
  ViT Large                        .83   0.0074   0.086   .58
Training Count = 7,755; Testing Count = 2,216; MTCI = Multi-Trial Creative Ideation; AuDrA = Automated Drawing Assessment; CLIP = Contrastive Language-Image Pre-training; BEiT = Bidirectional Encoder Representations from Image Transformers; ViT = Vision Transformer; r = Pearson correlation; MSE = Mean Squared Error; RMSE = Root Mean Squared Error; R2 = Coefficient of Determination.
Discussion
Our re-analyses of the MTCI data showed that the best of the feature classification models, AdaBoost, had the same performance as AuDrA (r = .80), but the vision transformer models, specifically BEiT Large, surpassed them (r = .85). AuDrA used the ResNet image classification method, which is a deep convolutional neural network, whereas we employed a set of feature
classification models and vision transformers. This performance also exceeds the performance on
the TTCT-F in Study 1 (rs= .59, .65), which had a different task structure (i.e., involving both
image and text), scoring method (ordered binary), and number of raters (i.e., only one).
Study 3: Unsupervised Scoring with Semantic Distance
To serve as a comparison to the supervised learning methods evaluated in Study 1 and
Study 2, we turn to an evaluation of unsupervised scoring methods, using semantic distance as a
measure of originality with the same datasets utilized in these studies. The major difference
between supervised and unsupervised methods is that the former involves the use of classifiers or
regressors, which learn from existing data to either predict the outcome class, or a continuous
target variable. In contrast, unsupervised approaches are naïve, and their scores are not learned
by a system based on past knowledge of the problem. Rather, the effort is in engineering a metric
that is predictive of the target measure without needing examples for training.
In Study 3, the primary metric is semantic distance in CLIP space. This adopts, for
images, a technique that has been applied to automated scoring of verbal originality (e.g., Beaty
& Johnson 2021; Dumas et al., 2021), where the measure of ‘originality’ is taken by proxy of
‘distance’ in a model trained to embed similar concepts together and dissimilar concepts further
apart. Whereas verbal scoring research applied text models such as GloVe (Pennington et al.,
2014) and Word2Vec (Mikolov et al., 2013), in the current figural scoring context, CLIP
(Radford et al., 2021) is trained to understand distance among images and text, a property that
allows it to be used as an encoder for text-to-image generation models like DALL-E (Ramesh et
al., 2021). We evaluate four instantiations of unsupervised scoring, which vary the target from
which a response’s distance is measured.
The first approach evaluated is distance from average (AVG-DIST), which measures the
distance of the response from an average of what past responses looked like. While it does
require past data, negating one benefit of unsupervised models, this approach alludes to the
scoring of the various forms of the TTCT-F, which use zero-originality lists that reflect common
responses in the past. Scoring originality based on infrequency of a response relative to the pool
of responses by a given sample is a common practice in the field of creativity (see Runco &
Acar, 2012). A unique response that is given only by one individual or those by a small minority
(e.g., less than 5%) can be potentially considered more original than another idea that is
produced by 40% of the participants. AVG-DIST aims to automate what has been manually
conducted over the years in classic divergent thinking scoring by creating a lexicon of responses to
determine the frequency of each response in a given sample. AVG-DIST compares each
individual drawing to the rest of the pool of drawings for the same prompt by recognizing each
one of them and assessing the semantic similarity to the overall pool. Here, lower semantic
similarity refers to a higher semantic distance, a greater departure from the pool of drawings, and
thus a higher level of originality.
The second is distance from blank (BLANK-DIST), which compares each individual
drawing to the original prompt that the drawing sprang from. In this measure, the unfilled initial
prompt doodle (either an ambiguous image that looks like a black egg as in Activity 1 or a
doodle in Activity 2) and the filled-in participant response are both interpreted by CLIP, and the
response is considered original if CLIP’s semantic interpretation is far from what it thought the
prompt looked like. This mimics some of the design of the TTCT-F, where a prompt initially
appears as something (e.g., an egg, a bird) and a creative response does not lean into the obvious
resemblance. It is likely that originality can be achieved by successfully departing from the
prompt itself. Creativity is often defined as the ability to think outside the box and involves
rejecting existing paradigms and conventions (Sternberg & Lubart, 1995). Participants might
demonstrate that ability by making a drawing that is unlike the prompt that led to it. Drawings
that are only minimally different from the prompt can be seen as having limited divergence from
the stimulus and may have an increased similarity to the drawings of others.
Next, we measure distance from zero-originality lists (ZERO-DIST). This method replaces the criterion of the overall pool or the prompt itself (as in AVG-DIST and BLANK-DIST, respectively) with the zero-originality lists. The zero-originality lists are the basis of human scoring of originality
in TTCT-F. They consist of a set of commonly given responses for an individual prompt based
on the past participants’ pool of responses (see Acar et al., 2021). When a response is found in
the zero-originality list, it is scored as zero, and scored as one otherwise. To computationally
compare the similarity of images and text, a combined model which can represent both is
needed. For ZERO-DIST, one such set of models, CLIP, was used (Radford et al., 2021). In this
method, we used the zero-originality list as if they represented the figural prompt and assessed
the degree to which the drawings departed from it. A larger semantic distance from the response
figure to zero-originality lists implies a greater originality. We used the Open Creativity Scoring
website (Organisciak et al., 2024) to obtain the originality scores by matching each zero-
originality list (list of common names of the objects drawn for each figural prompt) along with
the drawing name for each individual prompt as determined by computer vision methods.
Similar to the method above where images are compared to the zero-originality list,
TITLES-ZERO uses the zero-originality lists, but it is compared to the actual titles of the
drawings as given by the respondents. Thus, it measures semantic distance of the image titles to
the zero-originality list, without regard to the images themselves. As explained previously,
TTCT-F includes instructions to give a title for each drawing, and they are used for scoring
abstractness of titles. When these titles are different from the responses on the zero-originality
list, this would be reflected as higher semantic distance that can be a marker of originality and
abstractness of these titles. Titles and zero-originality lists were paired and the semantic distance between
the two was calculated via Open Creativity Scoring (Organisciak et al., 2024).
In addition to these semantic approaches to scoring images for originality, we also measured the correlation of ‘black ink’ with originality scores. Black ink, operationalized as the proportion of dark pixels on a white background, can be one source of evidence for elaboration.
Elaboration is one of the most difficult indices to score in TTCT-F as judging what components
of a drawing are basic and what is extra involves a greater degree of subjectivity in scoring. The
black ink measure is quite easy for computers to calculate, correlates well with expert-judged
elaboration on the TTCT-F (r=.54 on Form A, r=.43 on Form B), and has been similarly applied
by Patterson et al. (2023).
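A minimal sketch of the black ink measure, assuming a grayscale rendering of the response and an arbitrary darkness threshold (the threshold value and file name are assumptions):

```python
# Sketch of the black ink measure: the proportion of dark pixels in a grayscale
# rendering of the response, used as a rough proxy for elaboration.
import numpy as np
from PIL import Image

def black_ink_proportion(path, threshold=128):
    """Share of pixels darker than `threshold` on a 0-255 grayscale."""
    pixels = np.array(Image.open(path).convert("L"))
    return float((pixels < threshold).mean())

score = black_ink_proportion("response_drawing.png")   # hypothetical response file
```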
All in all, we posed the following research questions in Study 3:
1. How well do semantic distance scores in the image space correlate with expert-scored
originality, where semantic distance scores represent average dissimilarity to other
responses given by the study sample (AVG-DIST)?
2. How well does semantic distance in the image space correlate with originality, where
semantic distance scores represent dissimilarity of the drawing to the original prompt
doodle (BLANK-DIST)?
3. How well does semantic distance in the image space correlate with originality where
semantic distance scores represent similarity of the drawings to the zero-originality list
(ZERO-DIST)?
4. How well do semantic distance scores correlate with expert scores of originality and
abstractness of titles when semantic distance is scored as the distance between the zero-
originality lists and the titles of the drawings (TITLES-ZERO)?
5. How well does black ink, as a proxy for elaboration, correlate with expert scores of
originality?
Procedures
Using publisher-rated scores of originality, elaboration, and abstractness of titles, we
devised unsupervised automated scores. An unsupervised method is not ‘tuned’ to the data, but
rather engineered to use an author-defined signal as a marker of the target variable, in this case
originality and elaboration. For originality, we used the CLIP embeddings of images and text for
AVG-DIST, BLANK-DIST, and ZERO-DIST. For TITLES-ZERO, we only used text inputs,
while for elaboration, we measured the amount of black ink on a white page using only the
image inputs.
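A minimal sketch of these distance scores for a single prompt, assuming CLIP embeddings computed as in Study 1 and using cosine distance (file names, array shapes, and the choice to average embeddings for AVG-DIST are illustrative assumptions):

```python
# Sketch of the unsupervised distance scores. `resp_embs` holds the CLIP image
# embeddings of all responses to one prompt, `blank_emb` the embedding of the
# unfilled prompt, and `zero_emb` a CLIP text embedding of the prompt's
# zero-originality list. Higher distance is treated as higher originality.
import numpy as np

def cosine_distance(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

resp_embs = np.load("prompt3_image_embeddings.npy")    # hypothetical, shape (n_responses, 512)
blank_emb = np.load("prompt3_blank_embedding.npy")     # hypothetical, shape (1, 512)
zero_emb = np.load("prompt3_zero_list_embedding.npy")  # hypothetical, shape (1, 512)

# AVG-DIST: distance of each response from the average of responses to this prompt.
avg_emb = resp_embs.mean(axis=0, keepdims=True)
avg_dist = cosine_distance(resp_embs, avg_emb).ravel()

# BLANK-DIST: distance of each response from the unfilled prompt doodle.
blank_dist = cosine_distance(resp_embs, blank_emb).ravel()

# ZERO-DIST: distance of each response from the zero-originality list text.
zero_dist = cosine_distance(resp_embs, zero_emb).ravel()
```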
Results
Accuracy of unsupervised scores
We tested research questions 1 through 5 by running the correlations between the
TTCT-F and MTCI human-judged scores and the automated unsupervised scores. Correlations are
presented in Table 4. Echoing the evidence of Studies 1 and 2 even more drastically, they
show that originality is much easier to predict for MTCI than the TTCT-F; possible reasons
for this are discussed later. For TTCT-F, the computer vision measures (AVG-DIST,
BLANK-DIST, and ZERO-DIST) are overall weakly correlated with expert ratings and at
times negatively so. These scores ignored the titles of the drawings. The mean correlations
of these scores with the human ratings were small and inconsistent across Form A and Form B (average rs = -.10 and .12 for AVG-DIST and .05 and .12 for BLANK-DIST in Form A and Form B,
respectively). Thus, these two unsupervised methods overall cannot successfully predict the
human scores with or without the titles. On the other hand, ZERO-DIST and TITLES-
ZERO scores, which focused on semantic distance of the images or image titles to the zero-
originality lists, were more promising. Overall, these methods produced correlations that
were stronger with the human scores (rs = .18 and .24, respectively) and consistently
positive across individual prompts.
Last, we examined the correlations between black ink as a basic measure of
elaboration and originality scores to assess the potential confound of elaboration in
originality. We found that black ink scores are highly correlated with MTCI creativity ratings (r
= .52), but very weakly correlated with originality scores in TTCT-F Form A (r = .05) and
TTCT-F Form B (r = .08).
Table 4
Correlations between manual and automated originality and elaboration scores
Test, Form   Avg-Dist   Blank-Dist   Zero-Dist   Titles-Zero   Black Ink
Form A         -.10        -.05         .18          .23          .05
Form B          .12         .12         .03          .24          .08
MTCI            .38         .44         n/a          n/a          .52
AVG-DIST = semantic distance to the response pool; BLANK-DIST = semantic distance to the prompt; ZERO-DIST = semantic distance of the image to the zero-originality list; TITLES-ZERO = semantic distance of the image title to the zero-originality list; Black ink = proportion of dark-to-light pixels; MTCI = Multi-Trial Creative Ideation. All correlations significant at p < .01 except Form B/Zero-Dist (p = .11).
Discussion
In Study 3, we explored the usefulness of several unsupervised learning methods. Our
findings showed that AVG-DIST and BLANK-DIST methods worked better for MTCI (rs = .38
and .44) than TTCT-F (rs = -.10 to .12). These differences are likely due to the restricted scoring range in TTCT-F coupled with the use of a single expert judge. On the other hand, titles
given to the drawings can successfully predict originality of the responses when their semantic
distance to zero-originality lists was employed (rs = .23, .24). While this performance is markedly lower than that of the supervised learning methods, it is in the ballpark of the past performance
of the semantic distance models (see Organisciak et al., 2023). Importantly, MTCI originality
scores are strongly related to the black ink measure of elaboration whereas TTCT-F scores were not. This is likely
because MTCI drawings were rated by non-experts (college students) whereas TTCT-F is scored
by trained scorers who follow very specific guidelines. The scores of the former may be
reflective of the art bias (Runco, 2008).
General Discussion
Automation has long been expected by scholars in creativity. Guilford (1950), a pioneer
of creativity research and assessment, predicted the success of “these thinking machines.” Recent
years have seen rapid improvements in adopting natural language models for scoring verbal tests
such as Alternate Uses Test (e.g., Organisciak et al., 2023). Here, we evaluated similar
applications of computer vision models to scoring divergent thinking in figural tests.
Complementing findings seen in recent work (Patterson et al., 2023), our findings
demonstrate that artificial intelligence applications are highly promising for scoring figural
creativity tests, and provide a comparison between many methods (i.e., various approaches of
supervised and unsupervised learning) and tests (i.e., MTCI, TTCT-F Form A, and TTCT-F
Form B) which can inform future test and scoring design. Taken in concert with findings by
Patterson et al. (2023) and Cropley and Marrone (2022), we observe a number of insights on the
application of automated scoring to figural tests of creativity.
Supervised learning is much stronger than unsupervised, and the latter is not tractable.
The most important finding is that the supervised learning methods outperform the
methods of unsupervised learning based on our analyses with TTCT-F Forms A and B as well as
MTCI. The improvements with supervised learning are not necessarily surprising, given what is
seen with verbal tests (Buczak et al., 2022; Organisciak et al., 2023) and more generally in
machine learning, but the degree of the improvement is instructive. Feature-based classifiers or
regressors resulted in correlations of up to .64 on the TTCT-F, and up to .80 for MTCI. Fine-
tuned vision transformers were even more effective, up to r = .85 on the MTCI. These values are
remarkably higher than the unsupervised learning methods that primarily relied on the semantic
distance methods (r < .25 and < .45 for TTCT-F and MTCI, respectively). Whereas for verbal testing there may still be instances where the benefits of unsupervised methods outweigh their lower performance (perhaps in cases where reliable and knowledgeable human judgments are not available to train a supervised model), the approach evaluated here is generally a non-starter for figural testing.
The quality of the human ratings is extremely important: multiple judges who use a rating scale provide better outcomes.
A key factor in the success of supervised learning is the quality of human ratings used in
training the classifiers. If the human ratings are robust and precise, supervised learning methods
“learn” the scoring patterns better, increasing their accuracy. The difference in accuracy between
models for the TTCT-F and MTCI may be partially attributed to data quality. Whereas the former has binary originality scores (original vs. unoriginal) assigned by one expert rater, the latter has continuous data rated by 50 raters. The continuous scale provides much more signal for learning the nuances of the measure, while the larger number of judges ensures a more reliable consensus ground truth. We observed higher correlations with human ratings for the MTCI than for the TTCT-F (rs = .79 vs. .62, respectively, with Random Forests). This provides an important clue for future applications of supervised learning to figural tasks: subjective ratings should involve a large group of judges who use a continuous (or at least non-binary) scale to obtain continuous data, even when the original scoring framework is binary, as is the case with the TTCT-F. The continuous scale may also be one of the factors behind the better performance of unsupervised methods for the MTCI than the TTCT-F on AVG-DIST (r = .38 for MTCI, r = -.10 for TTCT-F Form A, and r = .18 for TTCT-F Form B) and BLANK-DIST (r = .44 for MTCI, r = .06 for TTCT-F Form A, and r = .12 for TTCT-F Form B). Overall, we observe that the approach taken to human ratings, both in the granularity of the scale and in the redundancy of human judges, is a significant differentiating factor in supporting automated scoring.
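To make this point concrete, here is a small simulated illustration in Python (toy numbers, not study data): averaging many raters on a continuous scale preserves gradations of quality that a single dichotomized judgment collapses.

```python
# Toy simulation: consensus of many continuous ratings vs. one binary judge.
# All quantities are simulated for illustration; none are study data.
import numpy as np

rng = np.random.default_rng(0)
true_quality = rng.uniform(0, 1, size=1000)                    # latent originality of 1000 drawings
raters = true_quality + rng.normal(0, 0.25, size=(50, 1000))   # 50 noisy judges on a continuous scale
consensus = raters.mean(axis=0)                                # continuous consensus ground truth
single_binary = (raters[0] > 0.5).astype(int)                  # a single judge, dichotomized

print(np.corrcoef(consensus, true_quality)[0, 1])      # near 1: most of the signal is retained
print(np.corrcoef(single_binary, true_quality)[0, 1])  # noticeably lower: signal is lost
```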
The binary originality scores in the TTCT-F are related to the way zero-originality scores are developed and used. Older scoring guidelines (Torrance, 1966), for example, used three levels of scores, awarding 2 points when a response appeared in less than 2% of the sample and 1 point when it appeared in between 2% and 5% of the sample. Cropley (1967) used even more thresholds (4 points for < 1% of the population; 3 points for 1–2%; 2 points for 3–6%; 1 point for 7–15%; and 0 points for > 15%). Importantly, this method is sample-dependent and very time consuming, given the necessity of creating a new pool of responses for every new sample. While zero-originality lists alleviate this problem and provide faster manual scoring, the absence of partial scores undermines score precision. This problem in turn reduces the training efficiency of automated scores. Thus, an alternative use of the zero-originality list could be to identify responses at varying levels of statistical frequency, or such lists could be developed with this strategy in mind. Future studies should examine whether the use of partial or incremental scores, as well as multiple scorers (e.g., using judge response theory; see Myszkowski & Storme, 2019), yields substantial gains in the predictive validity of automated scores.
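To make these scoring schemes concrete, the short Python sketch below expresses both sets of thresholds as functions of a response's relative frequency in the current sample; the cut-offs follow the guidelines as reconstructed above, and the function names are ours.

```python
# Frequency-threshold originality scoring, as described above. Applying either
# function requires response frequencies tallied within the current sample,
# which is what makes these methods sample-dependent.
def torrance_1966_points(frequency: float) -> int:
    """2 points if the response appears in < 2% of the sample, 1 point for 2-5%, else 0."""
    if frequency < 0.02:
        return 2
    if frequency <= 0.05:
        return 1
    return 0

def cropley_1967_points(frequency: float) -> int:
    """Finer-grained thresholds: < 1%, 1-2%, 3-6%, 7-15%, > 15% of the sample."""
    if frequency < 0.01:
        return 4
    if frequency <= 0.02:
        return 3
    if frequency <= 0.06:
        return 2
    if frequency <= 0.15:
        return 1
    return 0

print(torrance_1966_points(0.03), cropley_1967_points(0.03))  # a response given by 3% scores 1 and 2
```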
Vision transformers are the most tractable methodological approach.
We also found some performance differences between the two approaches of supervised
learning, namely feature-based classification and transfer learning by fine-tuning vision
transformers such as ViT (Dosovitskiy et al., 2021) and BEiT (Bao et al., 2021). For the TTCT-
F, both approaches show comparable results. However, we adopt the latter as the underpinning of Open Creativity Scoring with Artificial Intelligence for Drawings (Ocsai-D), specifically a model fine-tuned from the BEiT architecture, for a number of reasons. First, while the feature-based
classifiers performed comparably to the ResNet-based AuDrA system of past work (Patterson et
al., 2023), the vision transformers outperformed both. Generally, vision transformers do not
fundamentally hold a higher ceiling than convolutional neural networks like ResNet, but they are
thought to scale more easily (Dosovitskiy et al., 2021; Paul & Chen, 2022). This scalability can
be beneficial in transfer learning, where a large model is pretrained once in a single compute-
intensive process and then fine-tuned for many tailored purposes downstream. Transfer learning, employed by both Ocsai-D (fine-tuned BEiT-Large) and AuDrA (fine-tuned ResNet), makes it possible to build applications on data-rich models that would be too costly for most researchers to train from scratch, and better scalability can lead to better base models from the institutions that can afford to train them.
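For readers who want a sense of what this fine-tuning looks like in practice, the sketch below adapts a pretrained BEiT checkpoint to emit a single continuous score with the Hugging Face transformers library. The checkpoint name, file name, rating value, and hyperparameters are illustrative assumptions, not the exact Ocsai-D training configuration.

```python
# Minimal sketch: fine-tune BEiT as a single-output regressor for drawing scores.
import torch
from torch.optim import AdamW
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification

processor = BeitImageProcessor.from_pretrained("microsoft/beit-large-patch16-224")
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-large-patch16-224",
    num_labels=1,                  # one continuous originality/creativity score
    problem_type="regression",     # use MSE loss on the single logit
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head
)
optimizer = AdamW(model.parameters(), lr=5e-5)

# One illustrative training step on a single (drawing, human rating) pair.
drawing = Image.open("example_drawing.png").convert("RGB")  # hypothetical file
rating = torch.tensor([[0.72]])                             # hypothetical human rating

inputs = processor(images=drawing, return_tensors="pt")
loss = model(**inputs, labels=rating).loss                  # MSE between logit and rating
loss.backward()
optimizer.step()
optimizer.zero_grad()

# At inference time, the single logit is the predicted score.
with torch.no_grad():
    score = model(**processor(images=drawing, return_tensors="pt")).logits.item()
print(score)
```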
There are, however, benefits to a feature-classifier approach. For one, resource use is lower; these models can even be trained on regular computers without graphics processing units. This can be
reproduced with the associated code for this paper, for which the feature classification was
primarily run on a 2021 Apple MacBook Pro with an M1 Pro processor. In contrast, the vision
transformers require a system with a capable graphics processing unit (GPU), though they can
be run on free or low-cost cloud services, such as Google’s Colaboratory.
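As an illustration of how lightweight the feature-classifier route can be, the sketch below fits a random forest on precomputed feature vectors with scikit-learn and reports the same accuracy and r metrics used in the Appendix; it runs comfortably on a CPU-only laptop. The features and labels here are random placeholders standing in for real image features and human originality labels, so the printed numbers are not meaningful.

```python
# CPU-friendly feature-classifier sketch with placeholder data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3506, 512))      # placeholder feature vectors, one row per drawing
y = rng.integers(0, 2, size=3506)     # placeholder 0/1 originality labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=506, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
# Correlating the predicted probability of "original" with the labels mirrors
# the r column reported in the Appendix table.
print("r:", pearsonr(clf.predict_proba(X_test)[:, 1], y_test)[0])
```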
Here, again, we consider the future ceiling of the approaches measured. While architectures and the models trained with them may change, the underlying paradigm of deep neural networks with transfer learning is where most new machine learning contributions appear. As we saw in the improvements from ViT to BEiT in both Studies 1 and 2, a system adopting one such model can easily adapt to innovations from a new, similar model.
Although the TTCT-F is ostensibly a figural test, the text labels complementing its images are surprisingly effective.
Another key lesson learned in our analyses relates to the differences in test structure: the TTCT-F includes drawings and titles, whereas the MTCI includes drawings only. We focused on the TTCT-F to see whether the use of drawings alone, titles alone, or their combination produces better outcomes. We found that the use of titles alone can be quite successful, although not as successful as their combined use with the drawings: correlations rose to .62 with text and drawings together, from .56 (drawings only) and .60 (titles only). This finding shows that asking for titles can be a good practice in test design for increasing the accuracy of automated scoring of figural tests, although it implies adding a verbal component to a figural test.
Especially in the context of data collection with young children, titles might be particularly
useful in cases where adult researchers (or an artificial intelligence model) scoring the task
cannot tell what the child meant to draw. There are some figural creativity tasks such as Line
Meanings and Pattern Meanings (Wallach & Kogan, 1965) where participants list what the
presented visuals look like without engaging in any drawing. Our findings with text-only input are promising for extending our methods to such versions of figural divergent thinking tasks.
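A minimal sketch of the combined drawings-plus-titles variant is below: image features and a text embedding of the title are concatenated before fitting the same kind of classifier. The dimensionalities and feature sources are placeholders, not the study's exact representations.

```python
# Feature-level fusion of image features and title embeddings (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2790                                   # matches the Form A image-and-title training count
image_feats = rng.normal(size=(n, 512))    # placeholder image features
title_feats = rng.normal(size=(n, 300))    # placeholder title embeddings
y = rng.integers(0, 2, size=n)             # placeholder 0/1 originality labels

X = np.hstack([image_feats, title_feats])  # simple concatenation of the two modalities
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print(X.shape, clf.n_features_in_)         # (2790, 812) 812
```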
The inclusion of drawing titles seems to help with unsupervised learning methods, as
well. When four different semantic distance scoring methods were compared, the most
successful one was TITLES-ZERO (r = .23 and .24 in Form A and B, respectively), which
measured semantic distance between the zero-originality list items and the actual title given for
the drawing. Notably, this method ignored the drawing altogether and focused on the common
responses generated in the norming data (zero-originality lists) and the title of the drawing.
Remarkably, this success was observed despite the possibility of participants not giving descriptive
titles. In fact, respondents are encouraged to give “surprising names” to the drawings in TTCT-F
(Torrance, 1998). Although this is a good practice to allow for imaginative naming of the
creative works through abstraction (Welling, 2007), asking for a descriptive title may be preferable if the goal is to enhance the accuracy of scoring the creativity and originality of the idea behind the drawing,
especially in cases where the depiction of the drawing is ambiguous. On the other hand, language
and cultural bias (Gould, 1996) and performance differences by socio-economic status especially
in verbal cognitive assessments are long-standing considerations (Herrnstein & Murray, 2010).
Whether the inclusion of titles in scoring and test structure introduces or expands cultural or
language bias should be examined in future research. Likewise, the extent to which the combined use of verbal and figural outputs influences construct validity needs further investigation, given the evidence on the distinctiveness of verbal and figural creativity (Richardson, 1986).
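To make the TITLES-ZERO idea concrete, the hedged sketch below scores a title by its cosine distance from a zero-originality list using a general-purpose sentence encoder. The encoder, the example list, the example title, and the nearest-item aggregation are illustrative assumptions rather than the exact resources or aggregation used in this study.

```python
# Semantic distance of a drawing's title from a zero-originality list (illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder

zero_originality = ["house", "tree", "sun", "face", "flower"]  # hypothetical list items
title = "a lighthouse guarding a paper sea"                    # hypothetical participant title

title_vec = model.encode(title, normalize_embeddings=True)
zero_vecs = model.encode(zero_originality, normalize_embeddings=True)

# One reasonable aggregation: cosine distance to the nearest common response,
# so that titles far from every listed response score as more original.
titles_zero = 1.0 - float(np.max(zero_vecs @ title_vec))
print(round(titles_zero, 3))
```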
There is a black ink confound for MTCI but not TTCT-F, likely owing to fundamental
differences in scoring.
Relatedly, an instructive difference between the tests in Study 1 and Study 2 was noted in
the confound of black ink with originality. Our findings supported the use of the black ink method as an indicator of elaboration for the TTCT-F (rs = .54 and .43 on Form A and Form B, respectively). While black ink explained expert-generated scores of elaboration where we had them (i.e., for the TTCT-F), it was also highly correlated with MTCI ratings of creativity (r = .52), but not with expert-rated scores of originality in TTCT-F Form A (r = .05) and TTCT-F Form B (r = .07). This difference
between tests is notable because the creativity literature has observed a confounding influence on
originality by fluency (Acar, 2023; Acar et al., 2022; Clark & Mirels, 1970; Forthmann et al.,
2020; Hocevar, 1979; Silvia, 2008) and elaboration (Forthmann et al., 2019; see also Maio et al.,
2020) with verbal tasks. It is sensible to expect a confound by elaboration on the figural tasks
because elaboration is scored for the amount of elegance and detail in the drawings, which may
be relevant to layperson views of creativity such as the art bias (Runco, 2008, 2015; Glăveanu,
2014). For some raters, higher ratings of creativity may be associated with what looks more
stylish, elegant, and artistic, and those ratings may or may not always reflect the originality of
the idea driving the drawing. Currently, the reasons for the disparity may be speculative, but we
believe that the design and training of the TTCT-F are responsible for more complete separation
of originality and elaboration. For one, the raters of the MTCI drawings were college students
who did not go through specific training in creativity, whereas the TTCT-F was scored by a trained
expert. The raters in the former are more likely to use their implicit theories of creativity that
may include the art bias. Second, in implementing zero-originality lists, TTCT-F raters are
pushed to look exclusively at the content of the figure in judging originality, not the expression
of it. Such a strict scoring method may limit, if not eliminate, the raters’ implicit perceptions of
creativity that influence the originality scores in TTCT-F. This also means that, in addition to the
benefits inherited from its continuous ratings given by a large sample of raters, the better
suitability of the MTCI to automated scoring is also partially explained by the presence of a very
easily measurable confound in the black ink, something previously observed by Patterson et al.
(2023). Nonetheless, it seems that some of the strengths of the TTCT-F in avoiding the common
elaboration confound are rooted in the same mechanisms that lead to the previously discussed data
quality weaknesses (i.e., the dichotomous originality ratings), and it remains to be considered
how a future figural test might balance the two.
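For clarity, the black ink index discussed throughout can be computed in a few lines. The sketch below assumes it is simply the proportion of dark pixels in a grayscale rendering of the drawing; the binarization threshold is an illustrative choice, not necessarily the setting used here.

```python
# Proportion of "inked" (dark) pixels in a drawing, a simple elaboration proxy.
import numpy as np
from PIL import Image

def black_ink_proportion(path: str, threshold: int = 128) -> float:
    gray = np.asarray(Image.open(path).convert("L"))  # grayscale: 0 = black, 255 = white
    return float((gray < threshold).mean())           # share of pixels darker than the threshold

print(black_ink_proportion("example_drawing.png"))    # hypothetical file
```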
Data Release
The code for training and running all models is shared at
https://osf.io/q82ax/?view_only=69bff72648c64c6c992420b971e716ad. The models trained on
the cleaner MTCI data are available on Hugging Face
(https://huggingface.co/collections/massivetexts/ocsai-d-66299982e080f4ac17fca02f), in three versions: Ocsai-D Base and Ocsai-D Large, trained on the corresponding sizes of BEiT, and Ocsai-D Web. Web is a special applied version of the large model trained with more of the data (90% rather than 70%) and augmented with blank-image training to ensure that blank responses are bound to zero originality. Evaluated against its smaller test set, it achieved r = 0.86 and RMSE = 0.082.
Finally, an interface to try the system in-browser is available at
https://openscoring.du.edu/draw (Figure 1). Again, this uses the version of the models trained on
MTCI, given their strong performance. The code for this scoring system can also be run on a
personal computer by researchers or adapted for other goals.
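For researchers who prefer to score locally rather than through the browser interface, a model from the collection can be loaded with the transformers library along the lines below. The repository id is a guess at the naming within the linked collection, so check the collection page for the exact ids.

```python
# Hedged sketch: load an Ocsai-D checkpoint from Hugging Face and score one drawing.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "massivetexts/ocsai-d-large"  # hypothetical id; see the collection URL above
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModelForImageClassification.from_pretrained(repo)

drawing = Image.open("example_drawing.png").convert("RGB")  # hypothetical file
with torch.no_grad():
    logits = model(**processor(images=drawing, return_tensors="pt")).logits
print("predicted originality:", logits.squeeze().item())    # single regression output
```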
Implications for Future Research
There are several implications of this work. First, we have solid evidence that automated scoring can be successfully extended to figural tests. Consequently, the costs associated with scoring figural creativity tasks can be reduced drastically. This may help more creativity
tests to be used in research and practice, increase the likelihood of their inclusion in universal
screening efforts for gifted identification, and ultimately result in creativity becoming a more
essential consideration within educational and psychological research. Such an effort may also require developing new sets of prompts that can use our platform, if transfer to new prompts is proven effective. Future research can examine this possibility; if successful, this would also enhance test security through a continuously changing set of prompts. Further, there may be added
benefits of measuring creativity beyond verbal tests to be more inclusive of various demographic
groups whose level of language proficiency may be an impediment to their verbal test performance but not to their figural test performance (Acar, Dumas et al., 2023). Regardless, verbal and figural creativity tests do not highly overlap (Clapham, 2004; Kim, 2017; Richardson, 1986; Ulger, 2015), indicating the superiority of their combined use over the use of either one
alone. Future research may use the lessons from the present work to focus on the development of
new figural tasks that can be scored by supervised learning. This way, the material costs of the
tests (the price of test booklets) may also be reduced, and a greater variety of tasks could be
incorporated within a single test.
Limitations and Future Directions
Our findings overall were stronger with the MTCI than the TTCT-F, very likely because of the binary classification problem with the TTCT-F. The performance of the tested models should be revisited with alternative, non-binary human ratings for the TTCT-F, although that implies the original scoring guidelines would have to be set aside. Further, the automated scores in this
study were compared against human ratings of the same task in order to compare various scoring methods; future research may examine the validity of these methods against other external criteria, such as performance on a different test of creativity. Additionally, both datasets consisted of drawings by college students, and future research should replicate these findings with children's drawings.
Children have historically been the main target participants of drawing assessments of creativity,
and clearly, they draw in different ways than adults do, which might make the automated scoring
process more challenging. Most importantly, our findings provide useful clues for designing tasks that can be successfully scored by the methods examined in this work.
References
Acar, S. (2023). Does the task structure impact the fluency confound in divergent thinking? An
investigation with TTCT-Figural. Creativity Research Journal, 35(1), 1-14.
https://doi.org/10.1080/10400419.2022.2044656
Acar, S., & Runco, M. A. (2014). Assessing associative distance among ideas elicited by tests of
divergent thinking. Creativity Research Journal, 26(2), 229-238.
https://doi.org/10.1080/10400419.2014.901095
Acar, S., & Runco, M. A. (2019). Divergent thinking: New methods, recent research, and
extended theory. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 153-158.
https://doi.org/10.1037/aca0000231
Acar, S., Berthiaume, K., Grajzel, K., Dumas, D., Flemister, T., & Organisciak, P. (2023).
Applying automated originality scoring to the verbal form of Torrance Tests of Creative
Thinking. Gifted Child Quarterly, 67(1), 3-17.
https://doi.org/10.1177/00169862211061874
Acar, S., Dumas, D., Organisciak, P., & Berthiaume, K. (2023). Measuring original thinking in
elementary school: Development and validation of a computational psychometric
approach. Journal of Educational Psychology.
Acar, S., Branch, M. J., Burnett, C., & Cabra, J. F. (2021). Assessing the universality of the zero
originality lists of the Torrance Tests of Creative Thinking (TTCT)-Figural: An
examination with African American college students. Gifted Child Quarterly, 65(4), 354-
369. https://doi.org/10.1177/00169862211012964
Acar, S., Ogurlu, U., & Zorychta, A. (2022). Exploration of discriminant validity in divergent
thinking tasks: A meta-analysis. Psychology of Aesthetics, Creativity, and the Arts.
Advance online publication. https://doi.org/10.1037/aca0000469
Alpaydin, E. (2020). Introduction to machine learning. MIT press.
Anderson, L.W. (Ed.), Krathwohl, D.R. (Ed.), Airasian, P.W., Cruikshank, K.A., Mayer, R.E.,
Pintrich, P.R., Raths, J., & Wittrock, M.C. (2001). A taxonomy for learning, teaching,
and assessing: A revision of Bloom’s Taxonomy of Educational Objectives (Complete
edition). Longman.
Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to
Align and Translate (arXiv:1409.0473). arXiv. https://doi.org/10.48550/arXiv.1409.0473
Bao, H., Dong, L., Piao, S., & Wei, F. (2021). BeiT: BERT Pre-Training of Image Transformers
(arXiv:2106.08254). arXiv. http://arxiv.org/abs/2106.08254
Barbot, B. (2018). The Dynamics of creative ideation: Introducing a new assessment paradigm.
Frontiers in Psychology, 9, 2529. https://doi.org/10.3389/fpsyg.2018.02529
Beaty, R. E., & Johnson, D. R. (2021). Automating creativity assessment with SemDis: An open
platform for computing semantic distance. Behavior Research Methods, 53(2), 757–780.
https://doi.org/10.3758/s13428-020-01453-w
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
Buczak, P., Huang, H., Forthmann, B., & Doebler, P. (2022). The machines take over: A
comparison of various supervised learning approaches for automated scoring of divergent
thinking tasks. The Journal of Creative Behavior. https://doi.org/10.1002/jocb.559
Card, D., & Giuliano, L. (2016). Universal screening increases the representation of low-income
and minority students in gifted education. Proceedings of the National Academy of
Sciences, 113(48), 13678-13683.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 785–794. https://doi.org/10.1145/2939672.2939785
Clapham, M. M. (2004). The convergent validity of the Torrance Tests of Creative Thinking and
Creativity Interest Inventories. Educational and Psychological Measurement, 64, 828-
841. https://doi.org/10.1177/0013164404263883
Clark, P. M., & Mirels, H. L. (1970). Fluency as a pervasive element in the measurement of
creativity. Journal of Educational Measurement, 7(2), 83–86.
https://doi.org/10.1111/j.1745-3984.1970.tb00699.x
Cropley, A. J. (1967). Creativity, intelligence, and achievement. Alberta Journal of Educational
Research, 13, 51–58.
Cropley, D. H., & Marrone, R. L. (2022). Automated scoring of figural creativity using a
convolutional neural network. Psychology of Aesthetics, Creativity, and the
Arts. Advance online publication. https://doi.org/10.1037/aca0000510
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding.
https://arxiv.org/abs/1810.04805v2
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T.,
Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021).
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
(arXiv:2010.11929). arXiv. https://doi.org/10.48550/arXiv.2010.11929
Dumas, D., Organisciak, P., & Doherty, M. (2021). Measuring divergent thinking originality
with human raters and text-mining models: A psychometric comparison of
methods. Psychology of Aesthetics, Creativity, and the Arts, 15(4), 645–663. https://doi.org/10.1037/aca0000319
Engerman, K., Alexandridis, K., Drost, D., & Michailidis, S. (2010). The pedagogical use of
creative problem solving. In D. Pardlow & M. A. Trent (Eds.), Cultivating visionary
leadership by learning for global success: Beyond the language and literature classroom
(pp. 196-207). Cambridge Scholars Publishing.
Forster, E. A., & Dunbar, K. N. (2009). Creativity evaluation through latent semantic analysis. In
N. A. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the
Cognitive Science Society (pp. 602–607). Cognitive Science Society.
Forthmann, B., & Doebler, P. (2022). Fifty years later and still working: Rediscovering Paulus et
al.’s (1970) automated scoring of divergent thinking tests. Psychology of Aesthetics,
Creativity, and the Arts. Advance online publication. https://doi.org/10.1037/aca0000518
Forthmann, B., Oyebade, O., Ojo, A., Günther, F., & Holling, H. (2019). Application of latent
semantic analysis to divergent thinking is biased by elaboration. The Journal of Creative
Behavior, 53(4), 559-575. https://doi.org/10.1002/jocb.240
Forthmann, B., Szardenings, C., & Holling, H. (2020). Understanding the confounding effect of
fluency in divergent thinking scores: Revisiting average scores to quantify artifactual
correlation. Psychology of Aesthetics, Creativity, and the Arts, 14(1), 94–112.
https://doi.org/10.1037/aca0000196
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and
an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Frey, C. B., & Osborne, M. A. (2017). The future of employment: How susceptible are jobs to
computerisation? Technological Forecasting and Social Change, 114, 254-280.
https://doi.org/10.1016/j.techfore.2016.08.019
Glăveanu, V. P. (2014). Revisiting the “art bias” in lay conceptions of creativity. Creativity
Research Journal, 26(1), 11-20.
Gould, S. J. (1996). Mismeasure of man. WW Norton & company.
Guilford, J. P. (1950). Creativity. American Psychologist, 5(9), 444–454.
https://doi.org/10.1037/h0063487
Guilford, J. P., Merrifield, P. R., & Wilson, R. C. (1958). Unusual uses test. Sheridan
Psychological Services.
Hass, R. W. (2017a). Semantic search during divergent thinking. Cognition, 166, 344-357.
https://doi.org/10.1016/j.cognition.2017.05.039
Hass, R. W. (2017b). Tracking the dynamics of divergent thinking via semantic distance: Analytic methods and theoretical implications. Memory & Cognition, 45, 233–244.
https://doi.org/10.3758/s13421-016-0659-y
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
https://doi.org/10.1109/CVPR.2016.90
Herrnstein, R. J., & Murray, C. (2010). The bell curve: Intelligence and class structure in
American life. Simon and Schuster.
Hocevar, D. (1979). Ideational fluency as a confounding factor in the measurement of
originality. Journal of Educational Psychology, 71(2), 191–196.
https://doi.org/10.1037/0022-0663.71.2.191
Hussain, M., Bird, J. J., & Faria, D. R. (2018). A study on CNN transfer learning for image
classification. Paper presented at the UK Workshop on Computational Intelligence,
Nottingham, UK.
Kim, K. H. (2017). The Torrance Tests of Creative Thinking-Figural or Verbal: Which one
should we use? Creativity. Theories–Research–Applications, 4(2), 302–321.
https://doi.org/10.1515/ctra-2017-0015
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic
analysis theory of acquisition, induction, and representation of knowledge. Psychological
Review, 104(2), 211–240. https://doi.org/10.1037/0033-295X.104.2.211
LeCun, Y., & Bengio, Y. (1995). Convolutional networks for images, speech, and time series. In
M. A. Arbib (Ed.). The handbook of brain theory and neural networks (pp. 276-279).
MIT Press.
Lohman, D. F., & Gambrell, J. L. (2012). Using nonverbal tests to help identify academically
talented children. Journal of Psychoeducational Assessment, 30(1), 25-44.
https://doi.org/10.1177/0734282911428194
Loshchilov, I., & Hutter, F. (2018). Decoupled Weight Decay Regularization. International
Conference on Learning Representations.
Maio, S., Dumas, D., Organisciak, P., & Runco, M. (2020). Is the reliability of objective
originality scores confounded by elaboration? Creativity Research Journal, 32(3), 201-
205. https://doi.org/10.1080/10400419.2020.1818492
McBee, M. T., Peters, S. J., & Miller, E. M. (2016). The impact of the nomination stage on gifted
program identification: A comprehensive psychometric analysis. Gifted Child
Quarterly, 60(4), 258-278. https://doi.org/10.1177/0016986216656256
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In C. J. Burges, L.
Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.). Advances in neural
information processing systems, 26.
Myszkowski, N., & Storme, M. (2019). Judge response theory? A call to upgrade our
psychometrical account of creativity judgments. Psychology of Aesthetics, Creativity, and
the Arts, 13(2), 167–175. https://doi.org/10.1037/aca0000225
OECD (2022). PISA 2022 Assessment and Analytical Framework.
https://www.oecd.org/pisa/publications/pisa-2021-assessment-and-analytical-
framework.htm
Organisciak, P., Dumas, D., Acar, S., & de Chantal, P.L. (2024). Open creativity scoring
[Computer software]. University of Denver. https://openscoring.du.edu/
Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2023). Beyond semantic distance:
Automated scoring of divergent thinking greatly improves with large language models.
Thinking Skills and Creativity, 49, 101356. https://doi.org/10.1016/j.tsc.2023.101356
Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Partnership for 21st Century Skills. (2006). A state leader’s action guide to 21st century skills: A new
vision for education. Partnership for 21st Century Skills.
Patterson, J. D., Barbot, B., Lloyd-Cox, J., & Beaty, R. E. (2023). AuDrA: An automated
drawing assessment platform for evaluating creativity. Behavior Research Methods, 1-18.
https://doi.org/10.3758/s13428-023-02258-3
Paulus, D. H., Renzulli, J.S., & Archambault, F.X., Jr. (1970). Computer simulation of human
ratings of creativity. Final report. Storrs, CT: School of Education, University of
Connecticut.
Pennington, J., Socher, R., & Manning, C. (2014, October). GloVe: Global vectors for word representation. In A. Moschitti, B. Pang, & W. Daelemans (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing (pp. 1532–1543). Association for Computational Linguistics. http://dx.doi.org/10.3115/v1/D14-1162
Peters, S. J. (2022). The challenges of achieving equity within public school gifted and talented
programs. Gifted Child Quarterly, 66(2), 82-94.
https://doi.org/10.1177/00169862211002535
Petrone, P. (2019). Why creativity is the most important skill in the world.
https://learning.linkedin.com/blog/top-skills/why-creativity-is-the-most-important-skill-in-
the-world
Plucker, J. A. (2022). The patient is thriving! Current issues, recent advances, and future
directions in creativity assessment. Creativity Research Journal, 1-13.
https://doi.org/10.1080/10400419.2022.2110415
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A.,
Mishkin, P., & Clark, J. (2021). Learning transferable visual models from natural
language supervision. International Conference on Machine Learning, 8748–8763.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I.
(2021). Zero-Shot Text-to-Image Generation (arXiv:2102.12092). arXiv.
https://doi.org/10.48550/arXiv.2102.12092
Richardson, A. G. (1986). Two factors of creativity. Perceptual and Motor Skills, 63(2), 379-
384. https://doi.org/10.2466/pms.1986.63.2.379
Runco, M. A. (2008). Creativity and education. New Horizons in Education, 56(1), n1.
Runco, M. A. (2015). Meta-creativity: Being creative about creativity. Creativity Research
Journal, 27(3), 295-298. https://doi.org/10.1080/10400419.2015.1065134
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., & Bernstein, M. (2015). Imagenet large scale visual recognition challenge.
International Journal of Computer Vision, 115, 211–252.
Saussure, F. de. (1916). Cours de linguistique générale (C. Bally & A. Sechehaye, Eds.). Librairie Payot & Cie.
https://hdl.handle.net/2027/coo.31924026441687
Silvia, P. J. (2008). Creativity and intelligence revisited: A latent variable analysis of Wallach
and Kogan. Creativity Research Journal, 20(1), 34–39.
https://doi.org/10.1080/10400410701841807
Sternberg, R. J., & Lubart, T. I. (1995). Defying the crowd: Cultivating creativity in a culture of
conformity. Free press.
Stevenson, C., Smal, I., Baas, M., Dahrendorf, M., Grasman, R., Tanis, C., Scheurs, E., Sleiffer,
D., & van der Maas, H. (2020). Automated AUT scoring using a big data variant of the
Consensual Assessment Technique: Final technical report. Modeling Creativity Project,
Universiteit van Amsterdam.
http://modelingcreativity.org/blog/wp-content/uploads/2020/07/ABBAS_report_200711_final.pdf
Torrance, E. P. (1966). Torrance tests of creative thinking: Norms-technical manual (Research
Edition). Princeton, NJ: Personnel Press.
Torrance, E. P. (1998). Torrance Test of Creative Thinking: Manual for scoring and interpreting
results, verbal forms A&B. Scholastic Testing Service.
Ulger, K. (2015). The structure of creative thinking: Visual and verbal areas. Creativity Research
Journal, 27, 102–106. https://doi.org/10.1080/10400419.2015.992689
Urban, K. K., & Jellen, H. G. (1996). Test for Creative Thinking Drawing Production (TCT-DP)
manual. Frankfurt Germany: Swets Test Services.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., &
Polosukhin, I. (2017). Attention is all you need. ArXiv:1706.03762 [Cs].
http://arxiv.org/abs/1706.03762
Wallach, M. A., & Kogan, N. (1965). Modes of thinking in young children: A study of the
creativity-intelligence distinction. Holt, Rinehart & Winston.
Welling, H. (2007). Four mental operations in creative cognition: The importance of
abstraction. Creativity Research Journal, 19(2-3), 163-177.
Wilson, R. C., Guilford, J. P., & Christensen, P. R. (1953). The measurement of individual
differences in originality. Psychological Bulletin, 50(5), 362–370.
https://doi.org/10.1037/h0060857
Appendix
Full Feature Classifier Results for TTCT-F

| TTCT Form | Modality | Classifier | Training Count | Testing Count | Accuracy | Precision | Recall | F1 | r |
|---|---|---|---|---|---|---|---|---|---|
| Form A | Image | AdaBoost | 3000 | 506 | 0.78 | 0.78 | 0.87 | 0.82 | 0.55 |
| Form A | Image | Random Forests | 3000 | 506 | 0.79 | 0.77 | 0.92 | 0.84 | 0.57 |
| Form A | Image | XGBoost | 3000 | 506 | 0.80 | 0.79 | 0.89 | 0.84 | 0.57 |
| Form A | Image & Text | AdaBoost | 2790 | 466 | 0.82 | 0.82 | 0.89 | 0.85 | 0.63 |
| Form A | Image & Text | Random Forests | 2790 | 466 | 0.80 | 0.78 | 0.92 | 0.85 | 0.59 |
| Form A | Image & Text | XGBoost | 2790 | 466 | 0.82 | 0.81 | 0.91 | 0.85 | 0.62 |
| Form A | Text | AdaBoost | 2790 | 466 | 0.78 | 0.80 | 0.84 | 0.82 | 0.54 |
| Form A | Text | Random Forests | 2790 | 466 | 0.78 | 0.76 | 0.90 | 0.83 | 0.54 |
| Form A | Text | XGBoost | 2790 | 466 | 0.77 | 0.77 | 0.87 | 0.82 | 0.53 |
| Form B | Image | AdaBoost | 2572 | 468 | 0.76 | 0.82 | 0.82 | 0.82 | 0.46 |
| Form B | Image | Random Forests | 2572 | 468 | 0.81 | 0.80 | 0.95 | 0.87 | 0.54 |
| Form B | Image | XGBoost | 2572 | 468 | 0.78 | 0.81 | 0.89 | 0.84 | 0.49 |
| Form B | Image & Text | AdaBoost | 2430 | 438 | 0.83 | 0.86 | 0.89 | 0.87 | 0.61 |
| Form B | Image & Text | Random Forests | 2430 | 438 | 0.85 | 0.84 | 0.96 | 0.90 | 0.65 |
| Form B | Image & Text | XGBoost | 2430 | 438 | 0.82 | 0.83 | 0.91 | 0.87 | 0.58 |
| Form B | Text | AdaBoost | 2430 | 438 | 0.82 | 0.85 | 0.88 | 0.86 | 0.58 |
| Form B | Text | Random Forests | 2430 | 438 | 0.85 | 0.84 | 0.96 | 0.89 | 0.65 |
| Form B | Text | XGBoost | 2430 | 438 | 0.84 | 0.84 | 0.92 | 0.88 | 0.62 |
TTCT-F = Torrance Tests of Creative Thinking-Figural.
Figure 1
Ocsai-D Online Interface
... Recent research on automated scoring of creative thinking tasks looks promising in this regard Beaty & Johnson, 2021;Buczak et al., 2023;Organisciak et al., 2023;Zielińska et al., 2023). The ability of automated scoring to mimic human judgments of creative performance has been impressive in recent studies (Acar et al., , 2025Cropley & Marrone, 2025;DiStefano et al., 2024;Goecke et al., 2024;Johnson et al., 2023;Patterson et al., 2024;Zielińska et al., 2023). While there are still challenges to overcome for high-stakes decision-making (e.g., job applications or college admissions; Johnson et al., 2023) it is likely that automatic scoring will be applied in various assessment contexts. ...
... This approach works also very well for other languages than English when responses in the original language (e.g., Polish) are translated into the English language. Supervised approaches of automated scoring have also been shown to work well for a variety of other verbal tasks such as metaphor tasks, creative problem-solving tasks, and creative writing tasks, as well as figural creative thinking tasks (Acar et al., 2025;Cropley & Marrone, 2025). ...
Article
Full-text available
Automated scoring with machine learning has received considerable attention, especially for the alternative uses test (AUT) of divergent thinking. In particular, supervised learning on word embeddings and approaches based on large language models predict human ratings with sufficient accuracy for many practical purposes. However, since automated systems will potentially be used without a “human in the loop,” the robustness of any model is crucial, not only for the validity of basic research, but also for applications. We investigate the potential of adversarial examples (AEs) as robustness checks. AEs are synthetic responses obtained by subtle permutations of original responses that are assigned a substantially different score by the automated system. Specifically, we propose to employ synonyms to obtain semantically close synthetic responses. A synthetic response is an AE when the deviation between the automated scores of the synthetic response and the original response is at least as large as the prediction error of the automated system. A random forest trained on Global Vectors for Word Representation (GloVe) word embeddings predicts human ratings of 2,690 distinct original AUT answers (stimuli: brick and paperclip) and for 42% AEs can be found, including both single and multiword responses. On average, the AEs have slightly higher automated scores than the originals. While a small fraction of the AEs are invalid AUT responses, pronounced large deviations indicate that the prediction model is nonrobust. Retraining including the AEs in the training data improves the robustness. We recommend using AEs routinely to assess the robustness of automated scoring systems.
... In another study, Acar et al. (2024) explored automated scoring methods for TTCT-F and MTCI. They employed Random Forest classifiers on TTCT-F, achieving an accuracy of up to 85%, and further enhanced the scoring accuracy by incorporating both image and title information. ...
... They employed Random Forest classifiers on TTCT-F, achieving an accuracy of up to 85%, and further enhanced the scoring accuracy by incorporating both image and title information. Additionally, they tested Vision Transformer models such as BEiT and ViT, which demonstrated higher correlations (r = 0.85) on the MTCI dataset, indicating the superior potential of Vision Transformers in handling figural creativity tests (Acar et al. 2024). ...
Article
Full-text available
This study proposes a multimodal deep learning model for automated scoring of image-based divergent thinking tests, integrating visual and semantic features to improve assessment objectivity and efficiency. Utilizing 708 Chinese high school students’ responses from validated tests, we developed a system combining pretrained ResNet50 (image features) and GloVe (text embeddings), fused through a fully connected neural network with MSE loss and Adam optimization. The training set (603 images, triple-rated consensus scores) showed strong alignment with human scores (Pearson r = 0.810). Validation on 100 images demonstrated generalization capacity (r = 0.561), while participant-level analysis achieved 0.602 correlation with total human scores. Results indicate multimodal integration effectively captures divergent thinking dimensions, enabling simultaneous evaluation of novelty, fluency, and flexibility. This approach reduces manual scoring subjectivity, streamlines assessment processes, and maintains cost-effectiveness while preserving psychometric rigor. The findings advance automated cognitive evaluation methodologies by demonstrating the complementary value of visual-textual feature fusion in creativity assessment.
... For each drawing, we obtain human ratings from two experts raters and get automated scores from two recently released tools for automated assessment of drawing creativity, AuDrA (Patterson, Barbot, Lloyd-Cox, & Beaty, 2024) and OSC-figural (Acar, Organisciak, & Dumas, 2023). Research suggests that AI-based evaluation methods favor AI-generated responses, whereas human evaluators prefer human-created outputs (Laurito et al., 2024). ...
Preprint
Full-text available
Can we derive computational metrics to quantify visual creativity in drawings across intelligent agents, while accounting for inherent differences in technical skill and style? To answer this, we curate a novel dataset consisting of 1338 drawings by children, adults and AI on a creative drawing task. We characterize two aspects of the drawings -- (1) style and (2) content. For style, we define measures of ink density, ink distribution and number of elements. For content, we use expert-annotated categories to study conceptual diversity, and image and text embeddings to compute distance measures. We compare the style, content and creativity of children, adults and AI drawings and build simple models to predict expert and automated creativity scores. We find significant differences in style and content in the groups -- children's drawings had more components, AI drawings had greater ink density, and adult drawings revealed maximum conceptual diversity. Notably, we highlight a misalignment between creativity judgments obtained through expert and automated ratings and discuss its implications. Through these efforts, our work provides, to the best of our knowledge, the first framework for studying human and artificial creativity beyond the textual modality, and attempts to arrive at the domain-agnostic principles underlying creativity. Our data and scripts are available on GitHub.
Article
Full-text available
This study presents the Cebeci Test of Creativity (CTC), a novel computerized assessment tool designed to address the limitations of traditional open‐ended paper‐and‐pencil creativity tests. The CTC is designed to overcome the challenges associated with the administration and manual scoring of traditional paper and pencil creativity tests. In this study, we present the first validation of CTC, demonstrating strong internal and external validity across two studies with a large sample size of over 14,000 students in grades 1–8. The results provide support for the proposed unidimensional factor structure of CTC, with robust reliability (ω = 0.833 and 0.872). Analyses of measurement invariance showed that the unidimensional factor structure of CTC holds consistently across all grade levels, with factor loadings exhibiting notable similarity. Additionally, the item intercepts demonstrate considerable uniformity across grades 3–5. The composite CTC scores were positively correlated with creative self‐efficacy but not with Standard Progressive Matrices. The outcomes of our study indicate that CTC is a valuable and efficient tool for assessing creativity in educational settings. Its scalability and comprehensive evaluation of four key dimensions of creative ideation (i.e., fluency, flexibility, originality, and elaboration) make it particularly advantageous for educators seeking to assess students' creative potential.
Preprint
Full-text available
The integration of artificial intelligence (AI) into creative work continues to expand, yet its impact on human creativity itself—beyond simply providing ideas—remains uncertain. We reposition AI’s role from idea generator to idea evaluator, using trained models to provide real-time feedback on human-generated ideas. Across two studies—a preregistered online experiment involving individuals with varying levels of expertise (N = 554) and a large-scale naturalistic experiment during a year-long museum exhibit (N = 36,198)—participants generated solutions to real-world problems or created visual sketches. AI feedback significantly improved participant originality in both verbal and visual creative domains. Mediation analyses revealed these gains were partly driven by changes in individuals’ self-evaluation of their own originality, implying a key role for metacognition—the ability to monitor, control and regulate one’s thinking. These findings suggest that AI’s potential extends beyond generation to include idea evaluation, helping humans assess and refine their ideas through real-time feedback.
Preprint
Full-text available
Automated scoring with machine learning has received considerable attention, especially for the Alternative Uses Test (AUT) of divergent thinking. In particular, supervised learning on word embeddings and approaches based on large language models predict human ratings with sufficient accuracy for many practical purposes. However, since automated systems will potentially be used without a ``human in the loop'', the robustness of any model is crucial, not only for the validity of basic research, but also for applications. We investigate the potential of adversarial examples (AEs) as robustness checks. AEs are synthetic responses obtained by subtle permutations of original responses that are assigned a substantially different score by the automated system. Specifically, we propose to employ synonyms to obtain semantically close synthetic responses. A synthetic response is an AE, when the deviation of the automated scores of the synthetic response and the original response is at least as large as the prediction error of the automated system. A random forest trained on GloVe word embeddings predicts human ratings of 2,690 distinct original AUT answers (stimuli: brick, paperclip) and for 42% AEs can be found, including both single and multi-word responses. On average, the AEs have slightly higher automated scores than the originals. While a small fraction of the AEs are invalid AUT responses, pronounced large deviations indicate that the prediction model is non-robust. Retraining including the AEs in the training data improves the robustness. We recommend to use AEs routinely to assess the robustness of automated scoring systems.
Article
Full-text available
Who should evaluate the originality and task-appropriateness of a given idea has been a perennial debate among psychologists of creativity. Here, we argue that the most relevant evaluator of a given idea depends crucially on the level of expertise of the person who generated it. To build this argument, we draw on two complimentary theoretical perspectives. The model of domain learning (MDL) suggests that, for novices in a domain, creativity is by-necessity self-referenced, but as expertise develops, more socially referenced creativity is possible. Relatedly, the four-C model posits four forms of creativity that fall along a continuum of social impact: mini-c, little-c, Pro-c, and Big-C. We show that the MDL implies a learning trajectory that connects the four Cs because, as socially referenced creativity develops, greater societal impact becomes available to a creator. Then, we describe four sources of evaluations that become relevant as an individual learns: judgments from the creators themselves, their local community, consumers of the idea, and finally, critics in the domain. We suggest that creators’ judgments are of essential importance for mini-c, community judgments are paramount for little-c, Pro-c requires either positive evaluations from consumers or critics, and Big-C requires both consumers and critics to evaluate an idea positively for an extended time. We identify key insights and imperatives for the field: aligning our measures (both human and AI scored) with the most relevant evaluations of ideas to support the reliability and validity of our measurements, using evaluations as feedback for learners to support the development of creative metacognition, and the importance of considering domain differences when evaluating ideas.
Article
Full-text available
Automated scoring is a current hot topic in creativity research. However, most research has focused on the English language and popular verbal creative thinking tasks, such as the alternate uses task. Therefore, in this study, we present a large language model approach for automated scoring of a scientific creative thinking task that assesses divergent ideation in experimental tasks in the German language. Participants are required to generate alternative explanations for an empirical observation. This work analyzed a total of 13,423 unique responses. To predict human ratings of originality, we used XLM‐RoBERTa (Cross‐lingual Language Model‐RoBERTa), a large, multilingual model. The prediction model was trained on 9,400 responses. Results showed a strong correlation between model predictions and human ratings in a held‐out test set (n = 2,682; r = 0.80; CI‐95% [0.79, 0.81]). These promising findings underscore the potential of large language models for automated scoring of scientific creative thinking in the German language. We encourage researchers to further investigate automated scoring of other domain‐specific creative thinking tasks.
Article
Full-text available
Creativity is highly valued in both education and the workforce, but assessing and developing creativity can be difficult without psychometrically robust and affordable tools. The open-ended nature of creativity assessments has made them difficult to score, expensive, often imprecise, and therefore impractical for school- or district-wide use. To address this challenge, we developed and validated the Measure of Original Thinking for Elementary School (MOTES) in five phases, including the development of the item pool and test instructions, expert validation, cognitive pilots, and validation of the automated scoring and latent test structure. MOTES consists of three game-like computerized activities (uses, examples, and sentences subscales), with eight items in each for a total of 24 items. Using large language modeling techniques, MOTES is scored for originality by our open-access artificial intelligence platform with a high level of agreement with independent subjective human ratings across all three subscales at the response level (rs = .79, .91, and .85 for uses, examples, and sentences, respectively). Confirmatory factor analyses showed a good fit with three factors corresponding to each game, subsumed under a higher-order originality factor. Internal consistency reliability was strong for both the subscales (H = 0.82, 0.85, and 0.88 for uses, examples, and sentences, respectively) and the higher-order originality factor (H = 0.89). MOTES scores showed moderate positive correlations with external creative performance indicators as well as academic achievement. The implications of these findings are discussed in relation to the challenges of assessing creativity in schools and research.
Article
Full-text available
The visual modality is central to both reception and expression of human creativity. Creativity assessment paradigms, such as structured drawing tasks Barbot (2018), seek to characterize this key modality of creative ideation. However, visual creativity assessment paradigms often rely on cohorts of expert or naïve raters to gauge the level of creativity of the outputs. This comes at the cost of substantial human investment in both time and labor. To address these issues, recent work has leveraged the power of machine learning techniques to automatically extract creativity scores in the verbal domain (e.g., SemDis; Beaty & Johnson 53 , 757–780, 2021). Yet, a comparably well-vetted solution for the assessment of visual creativity is missing. Here, we introduce AuDrA – an Automated Drawing Assessment platform to extract visual creativity scores from simple drawing productions. Using a collection of line drawings and human creativity ratings, we trained AuDrA and tested its generalizability to untrained drawing sets, raters, and tasks. Across four datasets, nearly 60 raters, and over 13,000 drawings, we found AuDrA scores to be highly correlated with human creativity ratings for new drawings on the same drawing task ( r = .65 to .81; mean = .76). Importantly, correlations between AuDrA scores and human raters surpassed those between drawings’ elaboration (i.e., ink on the page) and human creativity raters, suggesting that AuDrA is sensitive to features of drawings beyond simple degree of complexity. We discuss future directions, limitations, and link the trained AuDrA model and a tutorial ( https://osf.io/kqn9v/ ) to enable researchers to efficiently assess new drawings.
Article
Full-text available
In 1998, Plucker and Runco provided an overview of creativity assessment, noting current issues (fluency confounds, generality vs. specificity), recent advances (predictive validity, implicit theories), and promising future directions (moving beyond divergent thinking measures, reliance on batteries of assessments, translation into practice). In the ensuing quarter century, the field experienced large growth in the quantity, breadth, and depth of assessment work, suggesting another analysis is timely. The purpose of this paper is to review the 1998 analysis and identify current issues, advances, and future directions for creativity measurement. Recent advances include growth in assessment quantity and quality and use of semantic distance as a scoring technique. Current issues include mismatches between current conceptions of creativity and those on which many measures are based, the need for psychometric quality standards, and a paucity of predictive validity evidence. The paper concludes with analysis of likely future directions, including use of machine learning to administer and score assessments and refinement of our conceptual frameworks for creativity assessment. Although the 1998 paper was written within an academic climate of harsh criticism of creativity measurement, the current climate is more positive, with reason for optimism about the continued growth of this important aspect of the field.
Article
Full-text available
Automated scoring of divergent thinking tasks is a current hot topic in creativity research. Most of the debated approaches are unsupervised machine learning approaches and researchers seemingly just started to evaluate supervised approaches. Hence, rediscovering the seminal work of Paulus et al. (1970) came as a big surprise to us. More than 50 years ago, they derived prediction formulas for an automated scoring of the Torrance Test of Creative Thinking (Torrance, 1966) that was based on a set of text mining variables (e.g., average word length, word counts, and so forth). They found quite impressive cross-validation results. This work reintroduces Paulus et al.’s (1970) approach and investigates how it performs compared with semantic distance scoring. The main contribution of Paulus et al.’s (1970) neglected masterpiece on divergent thinking assessment is echoed by the findings of this work: Creative quality of responses can be well predicted by means of simple text mining statistics. The validity was also stronger as compared with semantic distance. Importantly, using the Paulus et al. (1970) features in a state-of-the-art supervised machine learning approach does not outperform the simple stepwise regression used by Paulus et al. (1970). Yet, we found that supervised machine learning can outperform the Paulus et al. (1970) approach, when semantic distance is added to the set of prediction variables. We discuss challenges that are expected for future research that aim at combining unsupervised approaches based on word embeddings and supervised learning relying on text-mining features.
Article
Full-text available
One of the abiding challenges in creativity research is assessment. Objectively scored tests of creativity such as the Torrance Tests of Creativity and the test of Creative Thinking–Drawing Production (TCT-DP; Urban & Jellen, 1996) offer high levels of reliability and validity but are slow and expensive to administer and score. As a result, many creativity researchers default to simpler and faster self-report measures of creativity and related constructs (e.g., creative self-efficacy, openness). Recent research, however, has begun to explore the use of computational approaches to address these limitations. Examples include the Divergent Association Task (Olson et al., 2021) that uses computational methods to rapidly assess the semantic distance of words, as a proxy for divergent thinking. To date, however, no research appears to have emerged that uses methods drawn from the field of artificial intelligence to assess existing objective, figural (i.e., drawing) tests of creativity. This article describes the application of machine learning, in the form of a convolutional neural network, to the assessment of a figural creativity test—the TCT-DP. The approach shows excellent accuracy and speed, eliminating traditional barriers to the use of these objective, figural creativity tests and opening new avenues for automated creativity assessment.
Article
Full-text available
Traditionally, researchers employ human raters for scoring responses to creative thinking tasks. Apart from the associated costs this approach entails two potential risks. First, human raters can be subjective in their scoring behavior (inter‐rater‐variance). Second, individual raters are prone to inconsistent scoring patterns (intra‐rater‐variance). In light of these issues, we present an approach for automated scoring of Divergent Thinking (DT) Tasks. We implemented a pipeline aiming to generate accurate rating predictions for DT responses using text mining and machine learning methods. Based on two existing data sets from two different laboratories, we constructed several prediction models incorporating features representing meta information of the response or features engineered from the response’s word embeddings that were obtained using pre‐trained GloVe and Word2Vec word vector spaces. Out of these features, word embeddings and features derived from them proved to be particularly effective. Overall, longer responses tended to achieve higher ratings as well as responses that were semantically distant from the stimulus object. In our comparison of three state‐of‐the‐art machine learning algorithms, Random Forest and XGBoost tended to slightly outperform the Support Vector Regression.
Article
Discriminant validity has been a major concern regarding the indices of divergent thinking (DT). The present study used a meta-analytic approach to understand the nature of the problem by synthesizing the research published between 2009 and 2019. Three data sets obtained from 242 different studies were analyzed with a multilevel modeling approach, each consisting of one of the paired correlations between fluency, flexibility, and originality. Overall, fluency correlates more strongly with flexibility (r = .79, k = 247, m = 178, N = 46,933) than with originality (r = .62, k = 465, m = 250, N = 58,656), showing lower discriminant validity with the former than with the latter. This is the case even when the traditional scoring methods of flexibility (r = .81, k = 228, m = 171, N = 45,991) and originality (rs = .66 and .71, ks = 60 and 102, ms = 49 and 66, Ns = 10,514 and 10,535, respectively) were used. Among the three pairs of correlations, the flexibility-originality pair had the lowest correlation (r = .56, k = 250, m = 161, N = 41,251). Moderator analyses indicated that the averaging approach to aggregation and mixed task modality (i.e., the use of both verbal and figural DT tasks) are useful strategies for improving the discriminant validity of fluency with both flexibility and originality. Specific scoring techniques, such as semantic distance and inverse proportional scoring, improved discriminant validity for originality, whereas the use of multiple items undermined discriminant validity for flexibility.
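For readers unfamiliar with the pooling step, the following is a minimal sketch, with simulated effect sizes and hypothetical column names, of a multilevel synthesis of correlations: Fisher-z transform each study-level correlation, fit a random-intercept model with study as the grouping factor, and back-transform the pooled estimate. Dedicated meta-analysis packages (e.g., metafor in R) would weight by sampling variance and model moderators more carefully.

```python
# Pooling correlations with a simple random-intercept multilevel model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
k = 60  # simulated number of effect sizes
df = pd.DataFrame({
    "study": rng.integers(0, 20, k),                       # effects nested in studies
    "r": np.clip(rng.normal(0.6, 0.15, k), -0.95, 0.95),   # simulated correlations
})
df["z"] = np.arctanh(df["r"])  # Fisher z transform

model = smf.mixedlm("z ~ 1", df, groups=df["study"]).fit()
pooled_r = np.tanh(model.params["Intercept"])  # back-transform to r
print(round(pooled_r, 2))
```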
Article
Automated scoring for divergent thinking (DT) seeks to overcome a key obstacle to creativity measurement: the effort, cost, and reliability of scoring open-ended tests. For a common test of DT, the Alternate Uses Task (AUT), the primary automated approach casts the problem as a semantic distance between a prompt and the resulting idea in a text model. This work presents an alternative approach that greatly surpasses the performance of the best existing semantic distance approaches. Our system, Ocsai, fine-tunes deep neural network-based large language models (LLMs) on human-judged responses. Trained and evaluated against one of the largest collections of human-judged AUT responses, with 27 thousand responses collected from nine past studies, our fine-tuned large language models achieved up to r = 0.81 correlation with human raters, greatly surpassing current systems (r = 0.12–0.26). Further, learning transfers well to new test items, and the approach remains robust with small numbers of training labels. We also compare prompt-based zero-shot and few-shot approaches using GPT-3, ChatGPT, and GPT-4. This work also suggests a limit to the underlying assumptions of the semantic distance model: a purely semantic approach that uses the stronger language representation of LLMs, while still improving on existing systems, does not achieve improvements comparable to our fine-tuned system. The increase in performance can support stronger applications and interventions in DT and opens automated DT scoring to new areas for improving and understanding this branch of methods.
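The following is a minimal sketch of the general supervised fine-tuning setup. A small encoder model with a regression head stands in here for the large language models Ocsai actually fine-tunes; the prompt/response formatting, ratings, and hyperparameters are illustrative only.

```python
# Fine-tuning a transformer as a regressor on human-rated AUT responses.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

texts = ["brick: use it as a doorstop", "brick: grind it into pigment for paint"]
ratings = torch.tensor([[2.0], [4.5]])  # hypothetical human originality ratings

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=ratings)  # regression head uses MSE loss
out.loss.backward()                   # an optimizer step would follow in training
print(float(out.loss))
```

The key contrast with semantic distance scoring is that the target is the human rating itself rather than a geometric property of the embedding space.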
Article
Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy. What remains largely unexplored is their robustness evaluation and attribution. In this work, we study the robustness of the Vision Transformer (ViT; Dosovitskiy et al., 2021) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), namely Big Transfer (BiT; Kolesnikov et al., 2020). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications of why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than a comparable variant of BiT. Our analyses of image masking, Fourier spectrum sensitivity, and spread of the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness. Code for reproducing our experiments is available at https://git.io/J3VO0.
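As a simple robustness probe in the spirit of the comparison above, the following sketch (assuming the timm library; the perturbation and random input are stand-ins, not the ImageNet-A/-C style datasets used in the paper) loads a pre-trained ViT and a comparable CNN and checks whether their top-1 predictions change under a small perturbation.

```python
# Comparing prediction stability of a ViT and a ResNet under a perturbation.
import timm
import torch

vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
cnn = timm.create_model("resnet50", pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)          # stand-in for a real image
noisy = x + 0.1 * torch.randn_like(x)    # simple additive perturbation

with torch.no_grad():
    for name, m in [("ViT", vit), ("ResNet-50", cnn)]:
        clean_pred = m(x).argmax(dim=1).item()
        noisy_pred = m(noisy).argmax(dim=1).item()
        print(name, "prediction changed:", clean_pred != noisy_pred)
```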
Article
Fluency confound (FC) has been a widely studied issue in divergent thinking (DT) tasks. In this study, the impact of DT task structure on FC was examined by focusing on activity-level data from the Torrance Tests of Creative Thinking (TTCT)-Figural. The TTCT-Figural involves two different task structures: prompts in Activities 1 and 2 are designed for a single response each, for a total of 11 responses, whereas Activity 3 allows up to 30 responses through repeated presentation of a single prompt. The differences between the two task structures were tested by analyzing data from 477 adults. Correlations computed with and without Activity 3 scores indicated that FC is largely avoided when Activity 3 is dropped from the total score calculation. Additionally, FC increased when bonus originality points were excluded from the analyses. These findings indicate that FC can be avoided when DT task structure is designed to restrict fluency. Furthermore, confirmatory factor analyses supported a two-factor structure even when Activity 3 scores were removed; thus, a two-factor structure is unlikely to be the result of FC. Implications are discussed in the context of divergent thinking assessments and the identification of students for gifted and talented programs.
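The fluency-confound check itself is straightforward to express. Below is a minimal sketch with simulated scores and hypothetical column names (the n = 477 matches the sample size reported above, but the values are random): correlate an originality total with fluency computed with and without the repeated-prompt Activity 3.

```python
# Comparing originality-fluency correlations with and without Activity 3.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 477  # sample size reported in the study; scores here are simulated
df = pd.DataFrame({
    "fluency_act12": rng.integers(5, 12, n),    # capped single-response prompts
    "fluency_act3": rng.poisson(12, n),         # open-ended repeated prompt
    "originality_total": rng.normal(10, 3, n),
})
df["fluency_total"] = df["fluency_act12"] + df["fluency_act3"]

print(df["originality_total"].corr(df["fluency_total"]))   # with Activity 3
print(df["originality_total"].corr(df["fluency_act12"]))   # Activity 3 dropped
```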