Psychology of Aesthetics, Creativity, and the Arts
© 2013 American Psychological Association. Online First Publication, September 30, 2013. doi:10.1037/a0033644
Assessment of Divergent Thinking by Means of the Subjective
Top-Scoring Method: Effects of the Number of Top-Ideas and
Time-on-Task on Reliability and Validity
Mathias Benedek, Caterina Mühlmann, Emanuel Jauk, and Aljoscha C. Neubauer
University of Graz

Author Note: Mathias Benedek, Caterina Mühlmann, Emanuel Jauk, and Aljoscha C. Neubauer, Department of Psychology, University of Graz, Graz, Austria. This research was supported by a grant from the Austrian Science Fund (FWF): P23914. Correspondence concerning this article should be addressed to Mathias Benedek, Department of Psychology, University of Graz, Maiffredygasse 12b, 8010 Graz, Austria. E-mail: mathias.benedek@uni-graz.at
Divergent thinking tasks are commonly used as indicators of creative potential, but traditional scoring
methods of ideational originality face persistent problems such as low reliability and lack of convergent
and discriminant validity. Silvia et al. (2008) have proposed a subjective top-2 scoring method, where
participants are asked to select their two most creative ideas, which then are evaluated for creativity. This
method was found to avoid problems with discriminant validity, and to outperform other scoring methods
in terms of convergent validity. These findings motivate a more general, systematic analysis of the
subjective top-scoring method. Therefore, this study examined how reliability and validity of the
originality and fluency scores depend on the number of top-ideas and on time-on-task. The findings
confirm that subjective top-scoring avoids the confounding of originality with fluency. The originality
score showed good internal consistency, and evidence of reliability was found to increase as a function
of the number of top-ideas and of time-on-task. Convergent validity evidence, however, was highest for
a time-on-task of about 2 to 3 minutes and when using a medium number of about three top-ideas.
Reasons for these findings are discussed together with possible limitations of this study and future
directions. The article also presents some general recommendations for the assessment of divergent
thinking with the subjective top-scoring method.
Keywords: originality, fluency, time-on-task, reliability, validity
Divergent thinking tests have had long-standing popularity in
creativity research but also have faced persistent debates about
their limitations. A common issue is that divergent thinking ability
may be considered a useful indicator of creative potential, but it
may not generalize to a broader conceptualization of creativity, which, for example, also includes real-life creative achievement
(Runco & Acar, 2012). A second common issue is related to the
unsatisfactory psychometric properties (i.e., objectivity, reliability,
and validity) of divergent thinking scores. These psychometric
issues need to be resolved to establish confidence in the use of
divergent thinking tasks for the assessment of creative potential
and for the study of the cognitive and neurocognitive mechanisms
underlying creative ideation (e.g., Benedek, Könen & Neubauer,
2012; Fink & Benedek, in press; Gilhooly et al., 2007; Nusbaum
& Silvia, 2011). In the past few years, strong efforts have been
made to further examine divergent thinking tests in the light of
different methodological considerations and to propose new solutions to common issues (e.g., Plucker, Qian & Wang, 2011; Runco,
Okuda & Thurston, 1987; Silvia, Martin & Nusbaum, 2009; Silvia
et al., 2008). This study aims to extend these developments by a
systematic examination of the psychometric properties of subjective scoring methods.
Divergent thinking tasks require participants to generate creative solutions to given open problems. A large number of different divergent thinking tasks have been devised (e.g., the alternate uses tasks ask participants to find creative uses for a commodity item such as a brick; cf., Benedek, Fink & Neubauer, 2006), and a variety of
different measures have been proposed for scoring responses generated in these tasks (e.g., Torrance, 2008). These measures commonly involve a scoring of the fluency and originality (or creativity) of ideas, which can be considered to reflect the quantity and
quality of ideation performance. The scoring of ideational fluency
is straightforward as it essentially requires counting the number of
relevant responses. In contrast, the scoring of originality is more
complex and can be achieved by different methods. In the uniqueness scoring, the originality of responses is defined by their statistical infrequency. For example, infrequent responses (p ≤ 5–10%) are usually defined as unusual or unique, whereas more frequent responses are considered to be common (e.g., Runco, 2008; Torrance, 1974). The originality score then is obtained by
counting the number of unique responses. Although this method
appears to allow for an objective scoring, a number of serious objections have been raised, including the issue that statistical infrequency may not be a valid indicator of creativity because it
does not account for the appropriateness of responses (Silvia et al.,
2008). As an alternative, in the subjective scoring method, external
judges are used to evaluate all responses for creativity (i.e., unusualness and appropriateness; cf., Amabile, 1982), and the ratings are finally summed. The good interrater reliability of this method can be seen as an argument for its objectivity, but the evaluation of large amounts of responses by different judges is still very laborious.
The uniqueness scoring and the subjective scoring method,
however, also face a more general methodological issue. Ideational
fluency has been recognized as a contaminating factor for all other scores (Hocevar, 1979a, 1979b; Kaufman, Plucker & Baer, 2008; Michael & Wright, 1989; Runco et al., 1987). According to the scoring techniques outlined above, the scoring of originality is directly related to the number of responses (i.e., the fluency score). A person who gives more responses thus is more likely to get points for originality. This accounts for the extremely high correlations of fluency with originality scores, which often range from r = .80 to
.90 (e.g., Mouchiroud & Lubart, 2001; Torrance, 2008). It has been
argued that these marked correlations do not support discriminant
validity (Plucker et al., 2011; Silvia et al., 2008). Moreover, after
the effect of fluency is partialed out, the reliability evidence of the
originality score is usually very low (Hocevar, 1979a, 1979b;
Runco et al., 1987), although one study found that reliability is still adequate for gifted children performing figural tasks (Runco & Albert, 1985). The reliability and validity of originality scores hence appear to be substantially affected by their correlation with ideational fluency.
Because ideational originality is conceived as an essential qualitative factor of divergent thinking ability, a number of suggestions
were made on how to control for the confounding influence of
ideational fluency. One suggestion is that the evaluations should be
based on the entire set of ideas rather than single ideas (i.e., scoring
of ideational pools, or snapshot scoring; Runco & Mraz, 1992;
Silvia et al., 2009). This method allows for a very quick overall
assessment but was found to yield only moderate evidence of
reliability. It was also proposed to divide total originality by the
number of ideas (i.e., average scoring, or ratio scoring). This
method has some merits (e.g., Plucker et al., 2011; Silvia et al.,
2008), but again it sometimes was found to show very low reliability evidence (Runco, Okuda & Thurston, 1987), and it should
be noted that average originality might not be a valid indicator of the ability to come up with the most creative ideas. Another possibility is to
focus on a constant number of responses (Clark & Mirels, 1970).
The examinees can, for example, be instructed to produce a predefined number of responses (e.g., generate three creative responses; Hocevar, 1979b). This method controls for fluency, but
no longer allows for the implicit assessment of fluency. As an
alternative, the scoring can be restricted to a predefined number of
responses from the entire response set (Michael & Wright, 1989).
This method can be called subjective top-scoring. Recently, Silvia
et al. (2008) have adopted this approach by proposing the top-2
scoring method. This method asks the examinees to indicate their
two most creative responses per task, and only these two responses
then are evaluated. The top-2 scoring of originality was shown to
avoid excessive correlations with fluency and to perform better
than the snapshot scoring or average scoring (Silvia, 2011; Silvia
et al., 2008, 2009). Similarly, Reiter-Palmon, Illies, Cross, Buboltz
and Nimps (2009), using more complex, real-life divergent thinking tasks, found that using the single most creative response (i.e.,
top-1 scoring) is suitable to overcome a confounding with fluency.
Evaluating people by their best responses reflects a maximum
performance condition (Runco, 1986) and acknowledges that the
ability to select one’s best ideas is important for creativity (Smith,
Ward & Finke, 1995). This may mean that generative and evaluative processes become confounded (Runco, 2008), but examinees were found to be quite discerning in selecting their most creative ideas, which supports the validity of this procedure (Silvia,
2008). Finally, from a practical point of view, this method also
enhances the efficiency of the rating procedure. Silvia et al. (2008)
reported that, by using top-2 scoring, only about 28% of the total
number of ideas had to be evaluated.
Recently, Plucker et al. (2011) have compared different methods
of originality scoring with respect to reliability and validity. The
methods included uniqueness scoring, average uniqueness (i.e.,
dividing uniqueness by fluency), uniqueness of the first or last 10
ideas, and subjective, rater-based scorings of the entire response
set (summative score) or the first or last 10 ideas. Using only two
items of the instances task, reliabilities ranged between .37 and
.62. The average uniqueness score was found to show somewhat
higher correlations with self-report creativity measures and negative correlations with fluency. It was concluded that this method
could be favored over subjective rater-based methods. However,
because in this study participants were not asked to select their
most creative ideas, the authors suggested that examining the
reliability and validity of top-ideas “is also promising and should
be the subject of additional study” (p. 15).
Main Research Questions
The main idea underlying subjective top-scoring of originality is to focus on a constant number of top-ideas, thereby avoiding the confounding with fluency that undermines discriminant validity. The top-2
scoring method, which focuses on the two most creative ideas, was
found to perform well in terms of reliability and validity as
compared to other scoring methods such as uniqueness scoring
(Silvia et al., 2008). In recent investigations, we have also used a
subjective scoring of divergent thinking tasks and found that
correlations with intelligence crucially depended on the scoring
method (i.e., top-2 vs. average originality; Jauk, Benedek, Dunst & Neubauer, 2013). Moreover, we observed that top-3 scoring resulted in
a somewhat higher reliability evidence of the originality score as
compared to top-1 or top-2 scoring (Benedek, Franz, Heene &
Neubauer, 2012). This raises the question of the extent to which the number of top-ideas actually affects the reliability, and perhaps also the validity, of the originality score. Moreover, because the number
and originality of ideas depend on time-on-task (e.g., Beaty &
Silvia, 2012; Mednick, 1962), the most adequate number of top-
ideas might also depend on the duration of divergent thinking
tasks. Therefore, this study takes a close look at different realizations of the subjective top-scoring method and their effects
on the psychometric properties of originality scores. Specifically,
we want to examine systematically to what extent (a) the actual
number of top-ideas and (b) the time-on-task affect (1) the corre-
lation of originality scores with fluency, (2) the reliability of
originality scores, and (3) the convergent validity of originality
scores. Additionally, we also examine the effect of task duration
on the reliability and validity of fluency scores. We thereby hope
to reveal further information about the adequate assessment of ideational originality, ensuring both high psychometric quality and efficient scoring procedures.
Method
Participants
A sample of 105 participants (51 females) took part in this
study. The age ranged from 18 to 51 years (M = 23.80, SD = 3.97). Forty-nine percent of participants were students of psychology at the University of Graz, 38% were majoring in different
fields, and the remaining 13% were nonstudents. Participants were
invited to take part in a study on creativity and personality and
were offered credits for participation in empirical investigations (if applicable) and individual feedback on their personality structure in
exchange for participation. The only requirement for participation
was basic computer literacy. All participants gave written informed consent. The study was approved by the local ethics
committee.
Tasks and Material
We used six divergent thinking tasks timed for five minutes
each. The tasks included three alternate uses tasks (“car tire,”
“glass bottle,” and “knife”) and three instances tasks (“what could
be round?”, “what could make a loud noise?”, “what could be used
for faster locomotion?”). Tasks were administered by a self-devised computer program written in Matlab (The MathWorks, Natick, MA), which allows for the acquisition of time-stamped responses. There is evidence that computer-based assessment of divergent thinking is highly comparable to paper-and-pencil assessment (Lau & Cheung, 2010).
In an initial general instruction, participants were told that they would be presented with some questions for which they should try to “generate as many different unusual and creative responses as
possible.” They were asked to express their ideas as succinctly as
possible, and to write each idea into the input box and then press
the enter-key to add it to their idea list. Participants were told that
there were “some minutes” of time for each task and that the program would proceed automatically as soon as the time was over. By giving
participants no exact information about the total or remaining task
duration, we hoped that they would keep entering every idea as soon as it came to mind and would not develop specific task strategies tailored to a 5-minute task time. After a task was started,
participants were presented with the specific task instructions at the top of the screen (e.g., “What could make a loud noise? Name all the
unusual and creative responses that you can think of.”). Below,
there was an editable input box where ideas could be entered.
Every idea was added to a list placed below the input box. Two
time events were recorded for each idea: (1) the time when the
participant started entering the idea, and (2) the time when writing
was complete and the idea was added to the list. We only used the former time event, because it can be considered the
time when the idea actually came to mind, whereas the latter time
event depends on the length of the idea and the typing speed.
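To make this logging scheme concrete, the following is a minimal Python sketch of how the two time events per idea could be recorded; the original task program was written in Matlab, so all names and structures here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of per-idea timestamp logging (illustrative, not the
# authors' Matlab implementation).
import time
from dataclasses import dataclass, field

@dataclass
class IdeaRecord:
    text: str        # the idea as typed by the participant
    t_onset: float   # seconds since task start when typing began (used for scoring)
    t_submit: float  # seconds since task start when the idea joined the list

@dataclass
class TaskLog:
    prompt: str
    t_start: float = field(default_factory=time.monotonic)
    ideas: list = field(default_factory=list)

    def typing_started(self) -> float:
        # Called on the first keystroke in the input box.
        return time.monotonic() - self.t_start

    def submit(self, text: str, t_onset: float) -> None:
        # Called when the enter key adds the idea to the list.
        self.ideas.append(IdeaRecord(text, t_onset, time.monotonic() - self.t_start))
```

Only `t_onset` feeds into the time-on-task analyses below, mirroring the choice described above.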
After all tasks were completed, ideas were ranked for creativity
by the participants. To this end, participants were presented with
lists showing all their ideas within a single task with the ideas
being arranged in randomized order. They were asked to rearrange
the position of the ideas until the sequence of ideas in the list
reflected the creativity of ideas as subjectively appraised by the
participants. Ideas were rearranged by selecting them and moving
them up or down by means of specific buttons. At the end, the
topmost idea in the final list should be the most creative one, the
second idea in the list should be the second-most creative one, and
the last idea in the list should be the least creative one. This was
done separately for each of the six tasks.
We also measured self-reported ideational behavior by means of
a German version of the Runco Ideational Behavior Scale (RIBS; Runco, Plucker & Lim, 2001). Personality structure was assessed
by means of the five-factor inventory NEO-FFI (Borkenau &
Ostendorf, 1993).
Scoring of Divergent Thinking Tasks
The ideas generated in the divergent thinking tasks were scored
for fluency and originality. Fluency scores simply reflect the
number of ideas generated after a given time. Originality scores
were computed according to the subjective top-scoring method
using the creativity evaluations obtained by external judges.
External originality ratings. Participants generated a total of
10,921 ideas in the six divergent thinking tasks. All ideas were
pooled and identical ideas were removed, resulting in a final set of 6,229 nonredundant ideas. Eight external raters were asked to
evaluate the creativity of the ideas on a scale ranging from 0 (not creative) to 3 (very creative). All raters received an initial training, which made them familiar with the scale (e.g., they were
informed that ideas can be considered “highly creative” when
they are perceived as original and useful, and probably only a few people will come up with them). The judges evaluated a small
subset of ideas and after that discussed their ratings. Because of
the large amount of ideas, each judge then evaluated the ideas
of only half of the tasks, so that finally there were four independent ratings for each idea. The interrater reliability between
the four judges was ICC = .68, .80, .65, .60, .51, and .68 for the tasks “car tire,” “glass bottle,” “knife,” “round,” “loud noise,”
and “faster locomotion,” respectively. The creativity of a single
idea was defined as the average creativity rating given by the
four external judges.
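As an illustration of this aggregation step, here is a minimal Python sketch; the data layout and function names are assumptions for illustration, not the authors' actual pipeline. Ideas are deduplicated once across the pool, each nonredundant idea receives four ratings on the 0–3 scale, and its creativity is the mean of those ratings.

```python
# Sketch of pooling, deduplication, and rater aggregation (assumed layout).
from collections import defaultdict
from statistics import mean

def pool_ideas(responses):
    """responses: raw idea strings from all participants for one task."""
    # Deduplicate on a normalized form so identical ideas are rated only once.
    return sorted({r.strip().lower() for r in responses})

def idea_creativity(ratings_by_judge):
    """ratings_by_judge: {judge_id: {idea: rating in 0..3}} for one task."""
    collected = defaultdict(list)
    for judgements in ratings_by_judge.values():
        for idea, rating in judgements.items():
            collected[idea].append(rating)
    # Average over the (here, four) judges who rated this task's ideas.
    return {idea: mean(r) for idea, r in collected.items()}
```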
Subjective top-scoring. The main idea of the top-scoring
method is that the originality score is based on the creativity
evaluations of a predefined number of top-ideas. The top-ideas are
identified by the participants themselves according to their subjective appraisal of the creativity of their ideas. For example, Silvia
et al. (2008) used a top-2 scoring, where participants marked their
two most creative ideas, which then were considered for scoring.
Generally, all kinds of top-scores can be computed. For example,
for a top-1 score only the single most creative idea within a task
would be considered, whereas for a top-3 score the three most
creative ideas would be included.
A first specific aim of this study was to examine the effect of the
number of top-ideas. This was made possible by having participants sort all their ideas for creativity, as that allows for a post hoc
classification of any number of top-ideas. Second, we aimed to
also consider the effect of time-on-task. In this study, time-on-task
can theoretically vary from zero to five minutes (i.e., the total time
for each task). For a specific time-on-task lower than five minutes,
for example, 3 minutes, the scores were computed based on the
data available after 3 minutes. To illustrate this method, let us
consider the following example: Assume that a participant gener-
ated four ideas at 30, 60, 120, and 240 seconds, and afterward ranked them 2nd, 4th, 3rd, and 1st (i.e., the first idea being second-most
creative, the second idea being least creative, and so on). Then, for
computing the top-2 originality score for a time-on-task of 3
minutes, only the two most creative ideas within the first 3 minutes
are considered, hence, the first and the third idea. The originality
score was finally computed by averaging the creativity evaluations
of the considered top-ideas. If a participant generated fewer ideas than the number of top-ideas, the creativity evaluations of all available ideas were averaged.
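The following runnable Python sketch implements this computation for the worked example above. The judge ratings attached to the ideas are hypothetical, and all names are illustrative.

```python
# Sketch of subjective top-scoring with a time-on-task cutoff. Each idea
# is a tuple (onset_seconds, participant_rank, judge_rating), where rank 1
# is the idea the participant judged most creative and the rating is the
# judges' average creativity score (0-3; hypothetical values below).
from statistics import mean

def top_k_originality(ideas, k, time_on_task):
    # Keep only ideas generated within the given time-on-task (in seconds).
    available = [i for i in ideas if i[0] <= time_on_task]
    if not available:
        return float("nan")
    # Select the k ideas the participant ranked as most creative; if fewer
    # than k are available, all available ideas are used, as in the study.
    top = sorted(available, key=lambda i: i[1])[:k]
    return mean(i[2] for i in top)

# The worked example: four ideas at 30, 60, 120, and 240 s, ranked 2nd,
# 4th, 3rd, and 1st. Top-2 scoring at a time-on-task of 3 minutes (180 s)
# selects the first and third ideas.
ideas = [(30, 2, 2.25), (60, 4, 0.75), (120, 3, 1.50), (240, 1, 2.75)]
print(top_k_originality(ideas, k=2, time_on_task=180))  # (2.25 + 1.50) / 2 = 1.875
```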
Procedure
Participants were tested in small groups of up to five people in
a computer room. They first performed the six divergent thinking
tasks, which were presented in a randomized order. The divergent
thinking tasks were preceded by a short exercise (entering words starting with the letter “F”) to familiarize participants with the general procedure of a computer-based idea generation task. After completion of the six tasks, the participants ranked their ideas for
creativity. Finally, the participants completed the personality inventory and the ideational behavior scale. The whole session took
about one hour.
Analysis Plan
The main analyses include correlation analysis of fluency and
originality scores, reliability analyses (i.e., internal consistency of
scores), and validity analyses (i.e., correlations with external criteria). In all of these analyses, two experimental factors are varied
systematically: First, the factor top-ideas is varied from 1 to 10
ideas (see the section on the top-scoring method for further details). Additionally, this factor also includes the value “all ideas,” where all ideas given by a participant were considered; this hence corresponds to an average originality score (cf., Silvia et al., 2008). This
factor applies only to the originality score, not to the fluency score. Second, the factor time-on-task is varied from 1 to 5 minutes
(i.e., scores are computed for time-on-task of 1, 2, 3, 4, and 5
minutes). In total, the scoring hence was computed for 55 different
conditions (11 top-idea conditions by 5 time-on-task conditions).
The results of these analyses are visualized by means of contour
plots (see Figures 1, 2, and 3).
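As a sketch of how this 55-condition grid could be computed, the following Python code reuses the top_k_originality() helper from the scoring sketch above; the data layout is an assumption (data[p][t] holds the list of (onset_seconds, participant_rank, judge_rating) tuples for participant p on task t), and cells with no ideas yield NaN and would need explicit handling in a real analysis.

```python
# Sketch of the 11 top-idea conditions x 5 time-on-task conditions grid.
import numpy as np

def cronbach_alpha(scores):
    """scores: participants x items matrix (here, the six tasks)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def score_grid(data):
    results = {}
    top_conditions = list(range(1, 11)) + [None]  # None = "all ideas" (average score)
    for k in top_conditions:
        for minutes in range(1, 6):
            t = 60 * minutes
            # participants x tasks matrices of originality and fluency scores
            orig = np.array([[top_k_originality(ideas, k or len(ideas), t)
                              for ideas in tasks] for tasks in data])
            flu = np.array([[sum(1 for i in ideas if i[0] <= t)
                             for ideas in tasks] for tasks in data])
            results[(k, minutes)] = {
                "alpha_originality": cronbach_alpha(orig),
                "r_orig_fluency": np.corrcoef(orig.mean(axis=1),
                                              flu.mean(axis=1))[0, 1],
            }
    return results
```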
Results
Descriptive Statistics and Preliminary Analyses
Participants generated on average 17 ideas (SD = 6.36) within the task time of 5 minutes. The fluency of ideas was significantly higher in the three instances tasks (M = 21.95, SD = 8.61) than in the three alternate uses tasks (M = 12.72, SD = 4.96), t(104) = 15.83, p < .01. As can be seen in Table 1, the total number of ideas
steadily increases over time; the rate of idea production, however, steadily declines from the first minute onward. Considering the alternate uses tasks, half of the participants had created four ideas or more within the first minute of the task, but the increase flattened to one additional idea in the last minute of the task (difference between median values after 4 and 5
minutes; see Table 1). It can also be seen that there are large
individual differences in fluency scores. Whereas the least fluent 10% of participants generated only nine ideas or fewer after working on the instances tasks for five minutes, the most
productive 10% of the sample generated more than three times
as many ideas (i.e., 32.7).
A principal factor analysis with Varimax rotation and Kaiser
normalization was performed for the six fluency scores and the six
average originality scores derived from the alternate uses as well
as the instances tasks. The analysis extracted two factors (according to the Kaiser criterion and the scree test; KMO = .83), explaining 67% of total variance. Further evidence
for a two-factorial solution comes from the minimum average
partial test (MAP test: Velicer, Eaton, & Fava, 2000), which
returned two components as the number of factors to extract from
the 12 measures. This two-factor solution clearly revealed a fluency factor and an originality factor, with all six fluency and originality scores loading on the corresponding factors (unspecific
loadings were below .25). In other words, we obtained evidence
for score-specific factors rather than task-specific factors, which
supports the feasibility of aggregating fluency and originality
scores across tasks.
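The two-factor check could be sketched as follows. Scikit-learn offers maximum-likelihood factor analysis with varimax rotation, whereas the paper used principal factor extraction with Kaiser normalization, which scikit-learn does not provide, so the loadings would differ somewhat; the input layout is an assumption.

```python
# Sketch of a two-factor solution over the 12 scores (assumed layout:
# columns are the six fluency scores followed by the six average
# originality scores; ML factor analysis, not principal factor extraction).
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

def two_factor_loadings(X):
    """X: participants x 12 score matrix; returns a 2 x 12 loading matrix."""
    Z = StandardScaler().fit_transform(X)
    fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
    fa.fit(Z)
    # A clean solution shows the six fluency scores loading on one factor
    # and the six originality scores on the other (cross-loadings < .25).
    return fa.components_
```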
How Do Scoring Conditions Affect the Correlation of
Fluency and Originality?
We computed correlations between originality and fluency
scores for different numbers of top-responses at varying time-on-
task. It was assumed that the subjective top-scoring method can
avoid excessively high correlations between originality and fluency because it focuses on a constant number of ideas. In line with
these expectations, correlations were found to be close to zero
ranging between –.30 and .06 (see Figure 1). Ideational fluency
and originality even showed significant negative correlations when
using the average score and considering task times of 3 minutes or
less.
Figure 1. Correlation of fluency and originality scores depending on the number of top-ideas and time-on-task. Correlation coefficients exceeding r = .19 are considered statistically significant given the sample size of N = 105.
For comparison with previous studies, we also computed summative originality scores by summing up the creativity evaluations of all ideas produced by participants within a task. As expected, these summative creativity scores showed extremely high correlations with the fluency scores of r = .83, .87, .90, .90, and .91 for time-on-task of 1, 2, 3, 4, and 5 minutes, respectively.
Table 1
Number of Ideas Generated After a Time-on-Task of 1 to 5 Minutes in the Alternate Uses Tasks and the Instances Tasks
1 min 2 min 3 min 4 min 5 min
Alternate uses 2.0/4.3/6.7 3.5/7.0/10.3 4.8/9.0/13.7 5.3/11.0/17.0 6.3/12.0/19.0
Instances 2.5/7.7/11.3 4.8/12.0/17.7 6.5/15.3/23.7 8.4/18.7/28.3 9.0/21.7/32.7
Note. The three values in each cell denote the 10th, 50th, and 90th percentile values.
How Do Scoring Conditions Affect Reliability?
To examine how the scoring conditions affect reliability of fluency
and originality scores, we computed their internal consistency (Cronbach's alpha). The fluency score shows high reliability (α = .83) even for a short time-on-task of 1 minute. The reliability evidence increases with time-on-task up to α = .89 for a time-on-task of 5 minutes (see Figure 2A).
Figure 2. Reliability (Cronbach's alpha) of (A) the fluency score and (B) the originality score depending on the number of top-ideas and time-on-task.
The reliability evidence of the originality score also was found to
generally increase with an increasing number of top-ideas and with
increasing time-on-task (see Figure 2B). Reliability was lowest when
only a single top-idea was considered (i.e., top-1 score), staying below
an alpha of .60, but it increased substantially by including some
additional top-ideas. For example, at a time-on-task of 2 minutes
using top-2, top-3, or top-4 scoring increased reliability to .70, .75, or
.77, respectively.
Time-on-task also generally increased reliability evidence, but this
is especially true when a larger number of top-ideas is considered. For
the top-1 scoring, the increase of time-on-task from 1 minute to 5
minutes only causes an increase in reliability from .56 to .59, whereas
it increases from .71 to .83 for the top-5 scoring. A decent alpha
coefficient of at least .75 could be obtained only by using a time-on-
task of 2 minutes (or higher) and when using at least the top-3 ideas.
An alpha of .80 was obtained when using top-4 scoring with a time-on-task of 4 minutes or top-5 scoring with a time-on-task of 3 minutes. Reliability of the originality score peaked at an alpha of .87
when using the average score (i.e., using all ideas) at a time-on-task of
5 minutes.
How Do Scoring Conditions Affect Validity?
The effect of the scoring method on the convergent validity
evidence of ideational fluency and originality was tested by
means of correlations with the external criteria of self-reported
ideational behavior and the personality factor openness. The
fluency score showed significant positive correlations with both
external criteria ranging from .26 to .33 for the ideational
behavior scale and from .25 to .30 for openness, respectively
(see Figures 3A and 3C). In addition, there is a small trend for correlations to increase with time-on-task.
The originality score showed no significant correlations with
the ideational behavior scale, but just a weak trend toward
positive correlations (see Figure 3B). With respect to openness,
the originality scores generally showed significant positive
correlations (see Figure 3D). These correlations were highest (r = .35 to .38) for a time-on-task of 2 minutes when using top-2 to top-8 scoring. The correlations were substantially lower but still significant (r = .21 to .26) for the average scoring method, which considers all ideas.
Figure 3. Correlation of fluency (A, C) and originality (B, D) with self-reported ideational behavior and openness depending on the number of top-ideas and time-on-task. Correlation coefficients exceeding r = .19 are considered statistically significant given the sample size of N = 105.
Discussion
Can Subjective Top-Scoring Avoid the Confounding of
Originality and Fluency?
One major aim of the subjective top-scoring method is to avoid
the usually high dependency of qualitative measures of divergent thinking (e.g., ideational originality) on the number of ideas
generated by participants (i.e., ideational fluency). We were able to replicate the common finding that a summative scoring of originality (i.e., computing a sum of the creativity evaluations of all ideas generated by a participant) results in extremely high correlations with the fluency score, ranging between r = .80 and .90 (cf.,
Mouchiroud & Lubart, 2001; Torrance, 2008). In contrast, when
originality scores were computed by means of the top-scoring
method, correlations with fluency were largely close to zero. This
is in line with the finding of Silvia et al. (2008), who also obtained
no significant correlation with fluency when using top-2 scoring.
The results hence confirm that the subjective top-scoring method
avoids the confounding of originality scores with fluency.
The average score is a special case: it uses all ideas, but because ratings are averaged rather than summed, a high positive correlation with fluency can be avoided. For the average
score (and to a smaller extent also for the top-9 or top-10 score) we
even observed small negative correlations, at least when time-on-task was short. This result can probably be attributed to the existence of people who focus on fluency rather than the creativity of their ideas and thus were able to generate large amounts of responses.
This strategy probably involves the generation of a large number of highly common responses, which then results in a low average originality score as compared to those who focus on the creativity of their ideas (Reiter-Palmon et al., 2009).
Psychometric Properties of the Fluency Score
We obtained high internal consistency for the ideational fluency
scores of the six divergent thinking tasks. Alpha coefficients
slightly increased with time-on-task but already settled above .85
for times-on-task of 2 minutes or more. This suggests that ideational fluency can be reliably assessed even with short divergent thinking tasks. We further obtained significant positive correlations of fluency with self-reported ideational behavior and openness, supporting the general validity of this score. These correlations also showed a slight increase with time-on-task, which can probably be attributed to the corresponding increases in reliability.
Psychometric Properties of the Originality Score
The top-scoring method was found to result in dependable
originality scores. Although interrater reliability was moderate for
some tasks, the internal consistency between the six different
divergent thinking tasks reached Cronbach’s alpha levels well
beyond .80 for some scoring conditions. This level of reliability
bears comparison with other well-established constructs of cognitive ability. Together with the findings derived from the factor analysis, this indicates that originality scores coming from different
divergent thinking tasks share a substantial amount of common
variance. Although divergent thinking tasks may not be fully
interchangeable with respect to their cognitive demands (Guilford,
1967; Kuhn & Holling, 2009; Silvia, 2011), our results support the
feasibility of computing aggregate scores across different divergent thinking tasks to obtain a reliable total originality score. The
reliability, however, was found to be sensitive to scoring conditions (i.e., top-ideas and time-on-task). Reliability was lowest when only a single top-idea was considered, but it could be increased substantially by including some additional top-ideas (Benedek, Franz et al., 2012), and was highest for the average score, which makes use of all ideas (Silvia et al., 2008). A straightforward explanation for this is that the aggregated evaluations of a larger number of ideas allow for a more reliable assessment, just
as any test increases reliability by extending the number of relevant items. Also, considering more ideas could compensate for any
discrepancies between the participants and the raters about what
are considered to be the most creative ideas.
A higher time-on-task was found to increase reliability at least
for scores using four or more top-ideas. This suggests that scoring
a high number of ideas makes more sense when there is enough
time for participants to generate large numbers of ideas. A task
time of 2 or 3 minutes apparently already worked quite well for
most scores; further increases in task time only added small
increases in reliability.
We also examined correlations with other common indicators of
creativity to estimate effects of task properties on the validity of
the originality score. A priori, one could assume that the correlation pattern would generally match that of reliability, as any lack of reliability necessarily impairs validity coefficients. Interestingly,
this was not the case. Whereas the reliability evidence of originality scores was highest for average scoring at 5 minutes time-on-
task, the correlation with openness for this score was lowest. The
highest validity coefficients were obtained for a task time of 2
minutes using a medium number of about 3 to 6 top-ideas. This
raises the question of why correlations did not increase with an increasing number of top-ideas just as reliability did. It has to be remembered that people were instructed to generate as many unusual and
creative ideas as possible. Highly creative people presumably were able to generate many unusual ideas, of which, however, only some are very creative and thus truly indicative of their potential
for creative thought. Hence, when all ideas are considered, such as
in the average scoring, the evaluations of more and less creative
ideas become mixed up. This would result in a moderate total
creativity score for a highly creative person, which could equally be attained by a less creative person who just generated a few moderately creative ideas. It hence can be concluded that subjective
top-scoring may provide more valid scores than average scoring,
even though the latter method may be somewhat more reliable in
terms of internal consistency.
The question remains why the validity did not increase steadily
with time-on-task like reliability did. This might be explained by
the fact that originality generally increases over time (e.g., Beaty
& Silvia, 2012; Mednick, 1962; Piers & Kirchner, 1971) but
creative people overcome common ideas more quickly than less
creative people (Benedek & Neubauer, in press). As a consequence, after a short time-on-task, creative people may already
have come up with highly original ideas whereas less creative
people have not. As the time-on-task proceeds, less creative people eventually also come up with more creative ideas, whereas highly creative people can hardly further improve their performance to the same extent. Hence, the discernment between high and low
creative people (i.e., validity) may be higher for shorter task times
than for excessively long ones.
Originality showed significant correlations only with openness
but not with self-reported ideational behavior. The absence of significant correlations with ideational behavior suggests that the ideational behavior questionnaire is more indicative of ideational fluency (two sample items read “I come up with a lot of ideas or
solutions to problems” or “I have always been an active thinker—I
have lots of ideas”; Runco et al., 2001).
Recommendations/Implications for Scoring of
Divergent Thinking Tasks
Some straightforward recommendations concerning the adequate assessment of ideational fluency and originality can be
derived from the results of this study. For ideational fluency, it
appears to be quite simple to obtain a reliable and valid score. This
can be achieved by using divergent thinking tasks with a short task time of about two minutes. The originality score, however, appears
to be more sensitive to task and scoring properties. First, originality scores were found to be more valid when using tasks with durations of about 2 to 3 minutes. This substantiates the common practice of using similar task durations. Using shorter or much
longer tasks, however, might negatively affect the validity of
scores. Second, the top-scoring method should consider a medium
number of about three to six ideas. Using much fewer or much
more ideas (e.g., Plucker et al., 2011) may result in less valid
scores. Considering that using a higher number of top-ideas also
implies that a higher total number of ideas has to be subjected to
ratings, it could be a good compromise to use three top-ideas. For
a time-on-task of 2 minutes, participants generated on average 10
ideas. Using only the three most creative ideas would help to
reduce the rating effort by about 70% as compared to having to
evaluate all ideas. Similar rates were reported by Silvia et al.
(2008) for top-2 scoring. Moreover, more than 90% of participants
generated three or more ideas within two minutes.
Some Limitations of This Study and Future Directions
Some limitations of this study need to be addressed. Time-on-
task was varied as an experimental variable by analyzing the
performance data available at different times within the task. This
was done to estimate scores that could be obtained for tasks of
different length. Although this method is efficient, results obtained
for, say, a time-on-task of 2 minutes might not fully generalize to studies that explicitly use 2-minute tasks. Differences might, for example, relate to higher effects of fatigue, because performing six divergent thinking tasks of five minutes each probably involves more cognitive effort than six tasks of only two minutes. Moreover, people might apply different idea generation strategies when they know that tasks are shorter. We tried, however, to
minimize these effects by not telling participants about the exact
task time and by not giving them any information about the
remaining task time.
A similar argument applies to the experimental variation of the
number of top-ideas. The post hoc selection of a specific number
of top-ideas may not fully generalize to the corresponding instruction to select a specific number of top-ideas. Some people who had
generated large amounts of ideas reported that they found it
difficult to arrange them all properly for creativity. This issue
might be less prominent for shorter tasks and when people are just
asked to identify their three or five most creative ideas. Taken
together, it could be assumed that shorter task durations and the
selection of a low number of most creative ideas may cause lower
fatigue and more accurate judgments, which might eventually have
additional positive effects on the psychometric properties of the
originality score. Further limitations include the sample size, and
the specific tasks which were selected for this study. For example,
for more complex divergent thinking tasks (e.g., Reiter-Palmon et
al., 2009), which often show lower fluency, the most adequate
number of top-ideas might differ. The present findings hence await
replication with larger samples, using other divergent thinking
tasks, and employing further criteria for examining the validity of
scores.
There are also some additional methodological issues that could
be addressed in future research. First of all, one might consider
using separate tasks for assessing fluency and originality. We
derived both scores from the same tasks (e.g., Torrance, 2008) and
instructed participants to generate as many unusual and creative
ideas as possible. This could be considered a kind of double task that permits participants to use different strategies, focusing either on the fluency or the creativity of ideas. Future work might therefore
attempt to assess fluency and originality with separate tasks using
specific task instructions to focus either only on fluency or only on
originality of ideas. Although this procedure may require a larger
total number of tasks, it might help to further increase the validity
of scores. Finally, it should be noted that using a specific number
of top-ideas also implies the possibility that some participants do not actually generate as many ideas within the given time. There are different ways to handle this. In this study,
we then used all available ideas of the participant. Another possibility would be to assign missing ideas the lowest possible creativity rating (i.e., a creativity rating of zero). This would
implicitly penalize very low fluency. Some side analyses indicated
that such originality scores can again be highly correlated with
fluency, at least for high numbers of top-ideas. This scoring
approach could, however, be useful in studies that decide not to
use separate fluency scorings but still allow for a moderate influence of fluency on the originality score.
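A minimal sketch of this zero-penalty variant follows, building on the hypothetical (onset_seconds, participant_rank, judge_rating) records used in the earlier scoring sketch; it is an assumed illustration, not the side analyses' actual code.

```python
# Sketch of the zero-penalty variant: missing top-ideas are padded with
# the lowest possible rating (0) instead of averaging only the available
# ideas, which implicitly penalizes very low fluency.
from statistics import mean

def top_k_originality_penalized(ideas, k, time_on_task):
    available = [i for i in ideas if i[0] <= time_on_task]
    top = sorted(available, key=lambda i: i[1])[:k]
    ratings = [i[2] for i in top]
    ratings += [0.0] * (k - len(ratings))  # pad missing ideas with zeros
    return mean(ratings)
```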
Conclusions
This study provides further evidence of the usefulness of the
subjective top-scoring method for the assessment of ideational
originality (cf., Silvia et al., 2008). Using subjective top-scoring
ensures that ideational originality scores overcome the issues often
associated with this score, such as a lack of discriminant validity
with respect to fluency. Moreover, adequate scoring methods help
to obtain a highly reliable and valid originality score. As an
example, a top-3 originality score for 2 minutes time-on-task
showed a higher correlation with openness than fluency did. Adequate scoring of ideational originality hence may provide researchers with a powerful indicator of creative potential, besides and beyond fluency.
References
Amabile, T. (1982). Social psychology of creativity: A consensual assess-
ment technique. Journal of Personality and Social Psychology, 43,
997–1013. doi:10.1037/0022-3514.43.5.997
Beaty, R. E., & Silvia, P. J. (2012). Why do ideas get more creative across
time? An executive interpretation of the serial order effect in divergent
thinking tasks. Psychology of Aesthetics, Creativity, and the Arts. Ad-
vance online publication. doi:10.1037/a0029171
Benedek, M., Fink, A., & Neubauer, A. (2006). Enhancement of ideational
fluency by means of computer-based training. Creativity Research Jour-
nal, 18, 317–328. doi:10.1207/s15326934crj1803_7
Benedek, M., Franz, F., Heene, M., & Neubauer, A. C. (2012). Differential
effects of cognitive inhibition and intelligence on creativity. Personality
and Individual Differences, 53, 480–485. doi:10.1016/j.paid.2012.04.014
Benedek, M., Könen, T., & Neubauer, A. C. (2012). Associative abilities
underlying creativity. Psychology of Aesthetics, Creativity, and the Arts,
6, 273–281. doi:10.1037/a0027059
Benedek, M., & Neubauer, A. C. (in press). Revisiting Mednick’s model on
creativity-related differences in associative hierarchies. Evidence for a
common path to uncommon thought. Journal of Creative Behavior.
doi:10.1002/jocb.35
Borkenau, P., & Ostendorf, F. (1993). NEO-Fünf-Faktoren Inventar (NEO-
FFI) nach Costa und McCrae [NEO-Five factor inventory after Costa
and McCrae]. Göttingen, Germany: Hogrefe.
Clark, P. M., & Mirels, H. L. (1970). Fluency as a pervasive element in the
measurement of creativity. Journal of Educational Measurement, 7,
83– 86. doi:10.1111/j.1745-3984.1970.tb00699.x
Fink, A., & Benedek, M. (in press). EEG Alpha power and creative
ideation. Neuroscience and Biobehavioral Reviews. Advance online
publication. doi:10.1016/j.neubiorev.2012.12.002
Gilhooly, K. J., Fioratou, E., Anthony, S. H., & Wynn, V. (2007). Diver-
gent thinking: Strategies and executive involvement in generating novel
uses for familiar objects. British Journal of Psychology, 98, 611– 625.
Guilford, J. P. (1967). The nature of human intelligence. New York, NY:
McGraw-Hill.
Hocevar, D. (1979a). A comparison of statistical infrequency and subjec-
tive judgment as criteria in the measurement of originality. Journal of
Personality Assessment, 43, 297–299. doi:10.1207/s15327752jpa4303_13
Hocevar, D. (1979b). Ideational fluency as a confounding factor in the
measurement of originality. Journal of Educational Psychology, 71,
191–196. doi:10.1037/0022-0663.71.2.191
Jauk, E., Benedek, M., Dunst, B., & Neubauer, A. C. (2013). The rela-
tionship between intelligence and creativity: New support for the thresh-
old hypothesis by means of empirical breakpoint detection. Intelligence,
41, 212–221. doi:10.1016/j.intell.2013.03.003
Kaufman, J. C., Plucker, J. A., & Baer, J. (2008). Essentials of creativity
assessment. Hoboken, NJ: Wiley.
Kuhn, J. T., & Holling, H. (2009). Measurement invariance of divergent
thinking across gender, age, and school forms. European Journal of
Psychological Assessment, 25, 1–7. doi:10.1027/1015-5759.25.1.1
Lau, S., & Cheung, P. C. (2010). Creativity assessment: Comparability of
the electronic and paper-and-pencil versions of the Wallach–Kogan
Creativity Tests. Thinking Skills and Creativity, 5, 101–107. doi:10.1016/j.tsc.2010.09.004
Mednick, S. A. (1962). The associative basis of the creative process.
Psychological Review, 69, 220 –232. doi:10.1037/h0048850
Michael, W. B., & Wright, C. R. (1989). Psychometric issues in the
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
8BENEDEK, MU
¨HLMANN, JAUK, AND NEUBAUER
assessment of creativity. In J. A. Glover, R. R. Ronning & C. R.
Reynolds (Eds.), Handbook of creativity (pp. 33–52). New York, NY:
Plenum Press. doi:10.1007/978-1-4757-5356-1_2
Mouchiroud, C., & Lubart, T. (2001). Children’s original thinking: An
empirical examination of alternative measures derived from divergent
thinking tasks. The Journal of Genetic Psychology: Research and The-
ory on Human Development, 162, 382– 401. doi:10.1080/
00221320109597491
Nusbaum, E. C., & Silvia, P. J. (2011). Are intelligence and creativity
really so different? Fluid intelligence, executive processes, and strategy
use in divergent thinking. Intelligence, 39, 36–45. doi:10.1016/j.intell.2010.11.002
Piers, E. V., & Kirchner, E. P. (1971). Productivity and uniqueness in
continued word association as a function of subject creativity and
stimulus properties. Journal of Personality, 39(2), 264–276. doi:10.1111/j.1467-6494.1971.tb00041.x
Plucker, J. A., Qian, M., & Wang, S. (2011). Is originality in the eye of the
beholder? Comparison of scoring techniques in the assessment of diver-
gent thinking. The Journal of Creative Behavior, 45, 1–22. doi:10.1002/j.2162-6057.2011.tb01081.x
Reiter-Palmon, R., Illies, M. Y., Cross, L. K., Buboltz, C., & Nimps, T.
(2009). Creativity and domain specificity: The effect of task type on
multiple indexes of creative problem-solving. Psychology of Aesthetics,
Creativity, and the Arts, 3, 73– 80. doi:10.1037/a0013410
Runco, M. A. (1986). Maximal performance on divergent thinking tests by gifted, talented, and nongifted children. Psychology in the Schools, 23, 308–315. doi:10.1002/1520-6807(198607)23:3<308::AID-PITS2310230313>3.0.CO;2-V
Runco, M. A. (2008). Commentary: Divergent thinking is not synonymous
with creativity. Psychology of Aesthetics, Creativity, and the Arts, 2,
93–96. doi:10.1037/1931-3896.2.2.93
Runco, M. A., & Acar, S. (2012). Divergent thinking as an indicator of
creative potential. Creativity Research Journal, 24, 66 –75. doi:10.1080/
10400419.2012.652929
Runco, M. A., & Albert, R. S. (1985). The reliability and validity of
ideational originality in the divergent thinking of academically gifted
and nongifted children. Educational and Psychological Measurement,
45, 483–501.
Runco, M. A., & Mraz, W. (1992). Scoring divergent thinking tests using
total ideational output and a creativity index. Educational and Psycho-
logical Measurement, 52, 213–221.
Runco, M. A., Okuda, S. M., & Thurston, B. J. (1987). The psychometric
properties of four systems for scoring divergent thinking tests. Journal
of Psychoeducational Assessment, 5, 149–156. doi:10.1177/073428298700500206
Runco, M. A., Plucker, J. A., & Lim, W. (2001). Development and
psychometric integrity of a measure of ideational behavior. Creativity
Research Journal, 13, 393– 400. doi:10.1207/S15326934CRJ1334_16
Silvia, P. J. (2008). Discernment and creativity: How well can people
identify their most creative ideas? Psychology of Aesthetics, Creativity,
and the Arts, 2, 139 –146. doi:10.1037/1931-3896.2.3.139
Silvia, P. J. (2011). Subjective scoring of divergent thinking: Examining
the reliability of unusual uses, instances, and consequences tasks. Think-
ing Skills and Creativity, 6, 24 –30. doi:10.1016/j.tsc.2010.06.001
Silvia, P. J., Martin, C., & Nusbaum, E. C. (2009). A snapshot of creativity:
Evaluating a quick and simple method for assessing divergent thinking.
Thinking Skills and Creativity, 4, 79 – 85. doi:10.1016/j.tsc.2009.06.005
Silvia, P. J., Winterstein, B. P., Willse, J. T., Barona, C. M., Cram, J. T., Hess, K. I., . . . Richard, C. A. (2008). Assessing creativity with
divergent thinking tasks: Exploring the reliability and validity of new
subjective scoring methods. Psychology of Aesthetics, Creativity, and
the Arts, 2, 68 – 85. doi:10.1037/1931-3896.2.2.68
Smith, S. M., Ward, T. B., & Finke, R. A. (1995). The creative cognition
approach. Cambridge, MA: MIT Press.
Torrance, E. P. (1974). Torrance Tests of Creative Thinking: Norms-
technical manual, verbal forms A and B. Bensenville, IL: Scholastic
Testing Service.
Torrance, E. P. (2008). Torrance Tests of Creative Thinking: Norms-
technical manual, verbal forms A and B. Bensenville, IL: Scholastic
Testing Service.
Velicer, W. F., Eaton, C. A., & Fava, J. L. (2000). Construct explication
through factor or component analysis: A review and evaluation of
alternative procedures for determining the number of factors or compo-
nents. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in
human assessment (pp. 41–71). Boston, MA: Kluwer. doi:10.1007/978-1-4615-4397-8_3
Received October 22, 2012
Revision received January 28, 2013
Accepted March 4, 2013
Article
Full-text available
Researchers and educators interested in creative writing need a reliable and efficient tool to score the creativity of narratives, such as short stories. Typically, human raters manually assess narrative creativity, but such subjective scoring is limited by labor costs and rater disagreement. Large language models (LLMs) have shown remarkable success on creativity tasks, yet they have not been applied to scoring narratives, including multilingual stories. In the present study, we aimed to test whether narrative originality—a component of creativity—could be automatically scored by LLMs, further evaluating whether a single LLM could predict human originality ratings across multiple languages. We trained three different LLMs to predict the originality of short stories written in 11 languages. Our first monolingual model, trained only on English stories, robustly predicted human originality ratings (r = .81). This same model—trained and tested on multilingual stories translated into English—strongly predicted originality ratings of multilingual narratives (r ≥ .73). Finally, a multilingual model trained on the same stories, in their original language, reliably predicted human originality scores across all languages (r ≥ .72). We thus demonstrate that LLMs can successfully score narrative creativity in 11 different languages, surpassing the performance of the best previous automated scoring techniques (e.g., semantic distance). This work represents the first effective, accessible, and reliable solution for the automated scoring of creativity in multilingual narratives.
... Traditional scoring methods for originality in divergent thinking tests typically assign one point to unique answers (those given by only one participant) and zero points to nonunique answers, reflecting creativity's originality (Silvia et al. 2008). Benedek et al. (2013) proposed a subjective top-scoring method where participants self-select their most creative ideas. However, this approach demonstrated high concordance with external ratings. ...
Article
Full-text available
This study proposes a multimodal deep learning model for automated scoring of image-based divergent thinking tests, integrating visual and semantic features to improve assessment objectivity and efficiency. Utilizing 708 Chinese high school students’ responses from validated tests, we developed a system combining pretrained ResNet50 (image features) and GloVe (text embeddings), fused through a fully connected neural network with MSE loss and Adam optimization. The training set (603 images, triple-rated consensus scores) showed strong alignment with human scores (Pearson r = 0.810). Validation on 100 images demonstrated generalization capacity (r = 0.561), while participant-level analysis achieved 0.602 correlation with total human scores. Results indicate multimodal integration effectively captures divergent thinking dimensions, enabling simultaneous evaluation of novelty, fluency, and flexibility. This approach reduces manual scoring subjectivity, streamlines assessment processes, and maintains cost-effectiveness while preserving psychometric rigor. The findings advance automated cognitive evaluation methodologies by demonstrating the complementary value of visual-textual feature fusion in creativity assessment.
... Scores from three models (SBERT_mpnet, SBERT_MiniLM, and SimCSE) in TransDis were used and averaged to compute a final score. At an individual level, we aggregated the originality score using a top-3 scoring method to avoid the confounding effect with the fluency score, as employed in prior studies (Benedek et al., 2013;Yang et al., 2023). ...
Article
Full-text available
Metaphors are common and useful in real-world creations; yet, the semantic structure of creative metaphors is not well understood. This study used large language models to quantify semantic distances among the target, base, and common feature in 1,212 metaphors from 133 participants. The findings revealed that higher ratings of metaphor creativity were associated with greater target–feature and base–feature semantic distances rather than the target–base distance. Notably, a pervasive information gap was identified in participant-generated metaphors, where the target–feature semantic distance was greater than the base–feature distance. Information gap, semantic distances, and their interaction positively contributed to the creativity of metaphors, supporting the Conceptual Metaphor Theory and the associative combination view of creativity. Furthermore, information gap predicted divergent thinking and creative self-identity at the participant level. Our work extends the associative theory of creativity in the use of figurative speech: creative metaphors combine distant associations by bridging the target and base through a common feature, thereby enriching the target domain with novel insights.
Article
Creativity tests, like creativity itself, vary widely in their structure and use. These differences include instructions, test duration, environments, prompt and response modalities, and the structure of test items. A key factor is task structure, referring to the specificity of the number of responses requested for a given prompt. Classic creativity assessments often use divergent thinking tasks, which allow for multiple responses. In contrast, other measures, such as insight tasks or the Remote Associates Test, require a single correct answer. This distinction suggests that a creativity test's correlates could depend on its placement along the convergent–divergent continuum. The PISA Creative Thinking assessment leans toward the divergent end, as none of its items require a single correct answer. However, it differs from traditional divergent thinking tests by not explicitly instructing participants to generate as many responses as possible. Instead, PISA items allow varying numbers of responses—some requiring one, others two or three. This variation reflects different levels of divergence, with one‐response items being more convergent than three‐response items. We argue that this difference in task structure should be considered when examining the relationship between PISA creativity scores and factors like academic achievement and socioeconomic status.
Article
OBJECTIVE In low-grade glioma (LGG), although awake surgery (AS) with intraoperative functional mapping helps to minimize neurological and cognitive deficits, its impact on artistic abilities has received less attention. This study is the first to assess the capacity of professional or semiprofessional artists to resume various art activities following AS for LGG. METHODS Artists who underwent AS for an IDH-mutated WHO grade 2 glioma with connectome-based resection using cortico-subcortical electrostimulation were consecutively selected. Real-time, tailored multitasking was performed throughout the resection, but no additional tasks related to artistic abilities were introduced. RESULTS Nineteen patients were included, consisting of 15 professional artists (5 architects, 2 comedians, 2 musicians, 2 dancers, 1 sculptor, 1 plastic artist, 1 writer, and 1 art professor) and 4 semiprofessional artists—2 musicians (1 professor of chemistry, 1 informatician), 1 poet (theater administrator), and 1 painter (social worker). This consecutive cohort included 10 men (52.6%) and 9 women (47.4%) who underwent AS for LGG. Of the 19 patients, 16 were right-handed, the mean age was 36.8 ± 9.7 years, and the mean Karnofsky Performance Scale score was 94.7 ± 6.9. There were 11 left-sided and 8 right-sided tumors distributed across the 5 lobes (mean preoperative volume 52.8 ± 39.4 cm ³ ). All patients were fully active before surgery, except for 1 architect with intractable epilepsy. Postoperatively, no permanent deficits were observed, except 1 case of voluntary induced hemianopia (5.3%). The mean Karnofsky Performance Scale score was 95.7 ± 5 at 3 months after surgery. All patients returned to their artistic practice at the semiprofessional or professional level, and none reported a subjective loss of creativity. The mean extent of resection was 91.2% ± 8.6% (mean residual tumoral volume 5 ± 5.8 cm ³ ). There were 12 astrocytomas and 7 oligodendrogliomas. Only 1 patient received immediate adjuvant therapy. Five patients (26.3%) underwent subsequent AS. The mean follow-up duration was 7.6 ± 3.1 years since the initial AS. All patients except 3 (84.2%) were still alive at the last follow-up (1 died from an unrelated cause). There were no significant differences between professional and semiprofessional artists, except for a higher rate of reoperation in the latter subgroup (p = 0.037). CONCLUSIONS These original data show that AS with intraoperative continuous multitasking enabled semiprofessional and professional artists with LGG to resume their artistic work following surgery. This suggests that, although artistic creativity should be more systematically considered in surgical neuro-oncology, even for nonprofessional artists, there is nonetheless no need to introduce specific tests during surgery.
Article
Extensive research has shown that cognitive style is a non-negligible potential influencer of domains of human functioning, such as learning, creativity, and cooperation among individuals. However, the dichotomy of cognitive style is contradictory to the fact that cognitive style is a continuous variable, and the dichotomy loses information about the strength of people’s performance between the poles of cognitive style. To solve this problem, this study developed a computerized continuous scoring system (CCS) based on Python’s OpenCV library, and achieved continuous scoring of the test of cognitive style, with the Embedded Figure Test as an example. An empirical study was implemented to compare the performance of dichotomous scoring and CCS. The results show that CCS can accurately extract the traces of participants’ responses and achieve continuous scoring, supplementing the information on the strength of people’s cognitive styles between the two poles, and the performance of CCS-based tests such as discrimination, reliability, and validity are significantly improved compared with the dichotomous scoring. Given the high reproducibility of CCS, it is expected to be applied to scoring other continuity characteristics in the future.
Article
Full-text available
Divergent thinking (DT) ability is widely regarded as a central cognitive capacity underlying creativity, but its assessment is challenged by the fact that DT tasks yield a variable number of responses. Various approaches for the scoring of DT tasks have been proposed, which differ in how responses are evaluated and aggregated within a task. The present study aimed to identify methods that maximize psychometric quality while also reducing the confounding effect of DT fluency. We compared traditional scoring approaches (summative and average scoring) to more recent methods such as snapshot as well as top‐ and max‐scoring. We further explored the moderating role of task complexity as well as metacognitive abilities. A sample of 300 participants was recruited via Prolific. Reliability evidence was assessed in terms of internal consistency, concurrent criterion validity in terms of correlations with real‐life creative behavior, creative self‐beliefs, and openness. Findings confirm that alternative aggregation methods reduce the confounding effect of DT fluency. Reliability tends to increase as a function of the number of included responses with three responses as a minimal requirement for decent reliability evidence. Convergent validity was highest for snapshot as well as max‐scoring when using a medium number of three ideas.
Article
Previous research has highlighted the benefits of real-time automated feedback in enhancing originality in divergent thinking tasks. In this preregistered study, we sought to replicate these findings, investigating whether improvements in creative ideation persist after feedback is discontinued, and assess the impact on evaluation accuracy. A total of 230 participants were given three divergent thinking tasks (Alternate Uses tests), with or without semantic distance feedback in the first two trials. The third task was always performed without feedback. Participants were then asked to rate the originality of the ideas they produced in this last trial. Their evaluations were compared against originality scores calculated based on semantic distance and Large Language Models (LLM) for converging evidence. The results aligned with previous findings, showing that feedback was effective in improving overall levels of originality across the first two trials. Importantly, this effect carried over to the third trial after feedback was discontinued. However, feedback did not enhance evaluation accuracy, as participants in both conditions achieved relatively high levels of accuracy in rating the originality of their own ideas. We offer possible explanations for this unexpected result and discuss the study’s findings in the broader context of metacognition.
Article
Full-text available
Divergent thinking (DT) tests are among the most popular techniques for measuring creativity. However, the validity evidence for DT tests, as applied in educational settings, is inconsistent partly due to different scoring methods. This study explored the reliability and validity issues of various techniques for administering and scoring two DT tests. Results show distinct differences among several methods for scoring these DT tests and suggest that the percentage scoring method (i.e., dividing originality scores by fluency scores) may be the most appropriate scoring strategy. The potential impact on educational research and practice is discussed in detail.
Article
Full-text available
Fifty years ago, Mednick [Psychological Review, 69 (1962) 220] proposed an elaborate model that aimed to explain how creative ideas are generated and why creative people are more likely to have creative ideas. The model assumes that creative people have flatter associative hierarchies and as a consequence can more fluently retrieve remote associative elements, which can be combined to form creative ideas. This study aimed at revisiting Mednick's model and providing an extensive test of its hypotheses. A continuous free association task was employed and association performance was compared between groups high and low in creativity, as defined by divergent thinking ability and self-report measures. We found that associative hierarchies do not differ between low and high creative people, but creative people showed higher associative fluency and more uncommon responses. This suggests that creativity may not be related to a special organization of associative memory, but rather to a more effective way of accessing its contents. The findings add to the evidence associating creativity with highly adaptive executive functioning.
Article
Full-text available
The serial order effect—the tendency for later responses to a divergent thinking task to be better than earlier ones—is one of the oldest and most robust findings in modern creativity work. But why do ideas get better? Using new methods that afford a fine-grained look at temporal trajectories, we contrasted two explanations: the classic spreading activation account and a new account based on executive and strategic aspects of creative thought. After completing measures of fluid intelligence and personality, a sample of young adults (n � 133) completed a 10-min unusual uses task. Each response was time-stamped and then rated for creativity by three raters. Multilevel structural equation models estimated the trajectories of creativity and fluency across time and tested if intelligence moderated the effects of time. As in past work, creativity increased sharply with time and flattened slightly by the task’s end, and fluency was highest in the task’s first minute and then dropped sharply. Intelligence, however, moderated the serial order effect—as intelligence increased, the serial order effect diminished. Taken together, the findings are more consistent with a view that emphasizes executive processes, particularly processes involved in the strategic retrieval and manipulation of knowledge, than the simplespreading of activation to increasingly remote concepts.
Article
Full-text available
Although creativity is an important part of cognitive, social, and emotional activity, high-quality creativity assessments are lacking. This article describes the rationale for and development of a measure of creative ideation. The scale is based on the belief that ideas can be treated as the products of original, divergent, and creative thinking - a claim J. P. Guilford (1967) made years ago. Guilford himself assessed ideation with tests of divergent thinking, although through the years scores from these tests have only moderate predictive validity. This may be because previous research has relied on inappropriate criteria. For this reason, the Runco Ideational Behavior Scale (RIBS) was developed. It can be used as a criterion of creative ideation. Most items describe actual behaviors (i.e., overt actions and activities) that clearly reflect an individual's use of, appreciation of, and skill with ideas. Results obtained using both exploratory and confirmatory factor analysis are reported in this article. These suggest the presence of 1 or 2 latent factors within the scale. Based on the theoretical underpinnings of the scale, a 1-factor solution was judged to be more interpretable than a 2-factor solution. Analyses also supported the discriminant validity of the RIBS.
Article
Full-text available
The relationship between intelligence and creativity has been subject to empirical research for decades. Nevertheless, there is yet no consensus on how these constructs are related. One of the most prominent notions concerning the interplay between intelligence and creativity is the threshold hypothesis, which assumes that above-average intelligence represents a necessary condition for high-level creativity. While earlier research mostly supported the threshold hypothesis, it has come under fire in recent investigations. The threshold hypothesis is commonly investigated by splitting a sample at a given threshold (e.g., at 120 IQ points) and estimating separate correlations for lower and upper IQ ranges. However, there is no compelling reason why the threshold should be fixed at an IQ of 120, and to date, no attempts have been made to detect the threshold empirically. Therefore, this study examined the relationship between intelligence and different indicators of creative potential and of creative achievement by means of segmented regression analysis in a sample of 297 participants. Segmented regression allows for the detection of a threshold in continuous data by means of iterative computational algorithms. We found thresholds only for measures of creative potential but not for creative achievement. For the former the thresholds varied as a function of criteria: When investigating a liberal criterion of ideational originality (i.e., two original ideas), a threshold was detected at around 100 IQ points. In contrast, a threshold of 120 IQ points emerged when the criterion was more demanding (i.e., many original ideas). Moreover, an IQ of around 85 IQ points was found to form the threshold for a purely quantitative measure of creative potential (i.e., ideational fluency). These results confirm the threshold hypothesis for qualitative indicators of creative potential and may explain some of the observed discrepancies in previous research. In addition, we obtained evidence that once the intelligence threshold is met, personality factors become more predictive for creativity. On the contrary, no threshold was found for creative achievement, i.e. creative achievement benefits from higher intelligence even at fairly high levels of intellectual ability.
Article
Full-text available
Divergent thinking (DT) tests are very often used in creativity studies. Certainly DT does not guarantee actual creative achievement, but tests of DT are reliable and reasonably valid predictors of certain performance criteria. The validity of DT is described as reasonable because validity is not an all-or-nothing attribute, but is, instead, a matter of degree. Also, validity only makes sense relative to particular criteria. The criteria strongly associated with DT are detailed in this article. It also summarizes the uses and limitations of DT, conceptually and psychometrically. After the psychometric evidence is reviewed, alternative tests and scoring procedures are described, including several that have only recently been published. Throughout this article related processes, such as problem finding and evaluative thinking, are linked to DT.
Chapter
In this first section, an overview of the remaining seven divisions of the chapter is presented. The second section affords a brief description of categories of instrumentation to provide the reader with a foundation within which the psychometric issues in the assessment of creativity can be viewed. In the third section, psychometric concerns pertaining to construct validity, content validity, and criterion-related validity are addressed. Subsequent to an abbreviated review of the meaning of reliability, concerns regarding optimal approaches to the estimation of reliability of creativity measures are considered in the fourth section. The impact upon reliability and validity of scoring procedures is examined in the fifth section with particular emphasis on the presence of fluency as a confounding factor in the interpretation of scores of measures of divergent thinking. The sixth section provides a survey of a number of the difficulties encountered in establishing normative data for the understanding of scores on measures of creativity. A cursory exploration of a few selected issues in the administration of tests of creativity follows in the seventh section, and the eighth section contributes a concluding statement.
Article
Divergent thinking tests are probably the most commonly employed measures of creative potential and have demonstrated adequate psychometric properties with many populations. Recently, however, a partial correlation evaluation revealed that the indices drawn from divergent thinking tests are highly redundant. That is, in the nongifted population, ideational “originality” and “flexibility” were seriously confounded by ideational “fluency,” and hence were not reliable indices of divergent thinking. Because the ideation of gifted individuals is qualitatively and quantitatively different from that of nongifted individuals, the present investigation utilized partial correlation procedures in order to compare the reliability of ideational originality in academically gifted and nongifted intermediate school children (N = 225). The results indicated that the divergent thinking interitem and intertest correlations of the gifted children were significantly larger than those of the nongifted children. Still, ideational originality was adequately reliable after fluency was controlled only in the figural (nonverbal) divergent thinking tests.
Article
The traditional system for scoring divergent thinking tests has been criticized for its lack of predictive and discriminant validity. The present investigation was conducted to evaluate alternative scoring systems. Two divergent thinking tests (Uses and Line Meanings) were administered to 120 seventh- and eighth-grade children; the psychometric properties (i.e., reliability, and predictive and discriminant validity) of four scoring systems were evaluated and compared. Correlational analysis indicated (a) that the Uses test had notably higher validity coefficients than Line Meanings; (b) that the summation score (the sum of fluency, originality, and flexibility), the uncommon score (the number of ideas given by less than 5% of the sample), and the weighted-fluency score had the highest validity coefficients; (c) that ratio scores (e.g., flexibility divided by fluency) were generally unreliable and invalid; and (d) that all divergent thinking test scores were unrelated to IQ, but were related to achievement test scores. These findings have practical and theoretical implications for the testing of divergent thinking and creativity.