Assessment of Divergent Thinking by Means of the Subjective
Top-Scoring Method: Effects of the Number of Top-Ideas and
Time-on-Task on Reliability and Validity
Mathias Benedek, Caterina Mühlmann, Emanuel Jauk, and Aljoscha C. Neubauer
University of Graz
Divergent thinking tasks are commonly used as indicators of creative potential, but traditional scoring
methods of ideational originality face persistent problems such as low reliability and lack of convergent
and discriminant validity. Silvia et al. (2008) have proposed a subjective top-2 scoring method, where
participants are asked to select their two most creative ideas, which then are evaluated for creativity. This
method was found to avoid problems with discriminant validity, and to outperform other scoring methods
in terms of convergent validity. These findings motivate a more general, systematic analysis of the
subjective top-scoring method. Therefore, this study examined how reliability and validity of the
originality and fluency scores depend on the number of top-ideas and on time-on-task. The findings
confirm that subjective top-scoring avoids the confounding of originality with fluency. The originality
score showed good internal consistency, and evidence of reliability was found to increase as a function
of the number of top-ideas and of time-on-task. Convergent validity evidence, however, was highest for
a time-on-task of about 2 to 3 minutes and when using a medium number of about three top-ideas.
Reasons for these findings are discussed together with possible limitations of this study and future
directions. The article also presents some general recommendations for the assessment of divergent
thinking with the subjective top-scoring method.
Keywords: originality, fluency, time-on-task, reliability, validity
Divergent thinking tests have had long-standing popularity in
creativity research but also have faced persistent debates about
their limitations. A common issue is that divergent thinking ability
may be considered a useful indicator of creative potential, but it
may not generalize to a broader conceptualization of creativity
that, for example, also includes real-life creative achievement
(Runco & Acar, 2012). A second common issue is related to the
unsatisfactory psychometric properties (i.e., objectivity, reliability,
and validity) of divergent thinking scores. These psychometric
issues need to be resolved to establish confidence in the use of
divergent thinking tasks for the assessment of creative potential
and for the study of the cognitive and neurocognitive mechanisms
underlying creative ideation (e.g., Benedek, Könen & Neubauer,
2012; Fink & Benedek, in press; Gilhooly et al., 2007; Nusbaum
& Silvia, 2011). In the past few years, strong efforts have been
made to further examine divergent thinking tests in the light of
different methodological considerations and to propose new solu-
tions to common issues (e.g., Plucker, Qian & Wang, 2011; Runco,
Okuda & Thurston, 1987; Silvia, Martin & Nusbaum, 2009; Silvia
et al., 2008). This study aims to extend these developments by a
systematic examination of the psychometric properties of subjec-
tive scoring methods.
Divergent thinking tasks require participants to generate cre-
ative solutions to given open problems. A large number of differ-
ent divergent thinking tasks have been devised (e.g., the alternate
uses tasks ask participants to find creative uses for an everyday object such as
a brick; cf., Benedek, Fink & Neubauer, 2006), and a variety of
different measures have been proposed for scoring responses gen-
erated in these tasks (e.g., Torrance, 2008). These measures com-
monly involve a scoring of the fluency and originality (or creativ-
ity) of ideas, which can be considered to reflect the quantity and
quality of ideation performance. The scoring of ideational fluency
is straightforward as it essentially requires counting the number of
relevant responses. In contrast, the scoring of originality is more
complex and can be achieved by different methods. In uniqueness
scoring, the originality of responses is defined by their statistical
infrequency. For example, infrequent responses (p < 5–10%)
are usually defined as unusual or unique, whereas more
frequent responses are considered to be common (e.g., Runco,
2008; Torrance, 1974). The originality score then is obtained by
counting the number of unique responses. Although this method
appears to allow for an objective scoring, a number of serious
objections have been raised, including the issue that statistical
infrequency may not be a valid indicator of creativity because it
Mathias Benedek, Caterina Mühlmann, Emanuel Jauk, and Aljoscha C.
Neubauer, Department of Psychology, University of Graz, Graz, Austria.
This research was supported by a grant from the Austrian Science Fund
(FWF): P23914.
Correspondence concerning this article should be addressed to Mathias
Benedek, Department of Psychology, University of Graz, Maiffredygasse
12b, 8010 Graz, Austria. E-mail: mathias.benedek@uni-graz.at
Psychology of Aesthetics, Creativity, and the Arts, 2013, Vol. 7, No. 3. © 2013 American Psychological Association. DOI: 10.1037/a0033644
does not account for the appropriateness of responses (Silvia et al.,
2008). As an alternative, in the subjective scoring method, external
judges are used to evaluate all responses for creativity (i.e., un-
usualness and appropriateness; cf., Amabile, 1982) and ratings are
finally summed. The good interrater reliability of this method can be
seen as an argument for a certain degree of objectivity, but the
evaluation of large numbers of responses by different judges is still
very laborious.
The uniqueness scoring and the subjective scoring method,
however, also face a more general methodological issue. Ideational
fluency has been shown to act as a contaminating factor for all
other scores (Hocevar, 1979a, 1979b; Kaufman, Plucker & Baer,
2008; Michael & Wright, 1989; Runco et al., 1987). Under the
scoring techniques outlined above, the scoring of originality is
directly related to the number of responses (i.e., the fluency score). A
person who gives more responses thus is more likely to get points
for originality. This accounts for the extremely high correlations of
fluency with originality scores, which often range from r = .80 to
.90 (e.g., Mouchiroud & Lubart, 2001; Torrance, 2008). It has been
argued that these marked correlations do not support discriminant
validity (Plucker et al., 2011; Silvia et al., 2008). Moreover, after
the effect of fluency is partialed out, the reliability evidence of the
originality score is usually very low (Hocevar, 1979a, 1979b;
Runco et al., 1987), although one study found that reliability remains adequate
for gifted children performing figural tasks (Runco & Albert,
1985). The reliability and validity of originality scores hence
appear to be substantially affected by the correlation with ide-
ational fluency.
Because ideational originality is conceived as an essential qual-
itative factor of divergent thinking ability, a number of suggestions
were made on how to control for the confounding influence of
ideational fluency. One suggestion is that the evaluations should be
based on the entire set of ideas rather than single ideas (i.e., scoring
of ideational pools, or snapshot scoring; Runco & Mraz, 1992;
Silvia et al., 2009). This method allows for a very quick overall
assessment but was found to yield only moderate evidence of
reliability. It was also proposed to divide total originality by the
number of ideas (i.e., average scoring, or ratio scoring). This
method has some merits (e.g., Plucker et al., 2011; Silvia et al.,
2008), but it, too, was sometimes found to show very low reliability
evidence (Runco, Okuda & Thurston, 1987), and it should
be noted that average originality might not be a valid indicator of the ability
to come up with the most creative ideas. Another possibility is to
focus on a constant number of responses (Clark & Mirels, 1970).
The examinees can, for example, be instructed to produce a pre-
defined number of responses (e.g., generate three creative re-
sponses; Hocevar, 1979b). This method controls for fluency, but
no longer allows for the implicit assessment of fluency. As an
alternative, the scoring can be restricted to a predefined number of
responses from the entire response set (Michael & Wright, 1989).
This method can be called subjective top-scoring. Recently, Silvia
et al. (2008) have adopted this approach by proposing the top-2
scoring method. This method asks the examinees to indicate their
two most creative responses per task, and only these two responses
then are evaluated. The top-2 scoring of originality was shown to
avoid excessive correlations with fluency and to perform better
than the snapshot scoring or average scoring (Silvia, 2011; Silvia
et al., 2008, 2009). Similarly, Reiter-Palmon, Illies, Cross, Buboltz
and Nimps (2009), using more complex, real-life divergent think-
ing tasks, found that using the single most creative response (i.e.,
top-1 scoring) is suitable to overcome a confounding with fluency.
Evaluating people by their best responses reflects a maximum
performance condition (Runco, 1986) and acknowledges that the
ability to select one’s best ideas is important for creativity (Smith,
Ward & Finke, 1995). This may mean that generative and
evaluative processes become confounded (Runco, 2008), but ex-
aminees were found to be quite discerning in selecting their most
creative ideas, which supports the validity of this procedure (Silvia,
2008). Finally, from a practical point of view, this method also
enhances the efficiency of the rating procedure. Silvia et al. (2008)
reported that by using top-2 scoring only about 28% of the total
number of ideas had to be evaluated.
Recently, Plucker et al. (2011) have compared different methods
of originality scoring with respect to reliability and validity. The
methods included uniqueness scoring, average uniqueness (i.e.,
dividing uniqueness by fluency), uniqueness of the first or last 10
ideas, and subjective, rater-based scorings of the entire response
set (summative score) or the first or last 10 ideas. Using only two
items of the instances task, reliabilities ranged between .37 and
.62. The average uniqueness score was found to show somewhat
higher correlations with self-report creativity measures and nega-
tive correlations with fluency. It was concluded that this method
could be favored over subjective rater-based methods. However,
because in this study participants were not asked to select their
most creative ideas, the authors suggested that examining the
reliability and validity of top-ideas “is also promising and should
be the subject of additional study” (p. 15).
Main Research Questions
The main idea underlying subjective top-scoring of originality is
to focus on a constant number of top-ideas and thereby avoid a confounding
with fluency, which would undermine its discriminant validity. The top-2
scoring method, which focuses on the two most creative ideas, was
found to perform well in terms of reliability and validity as
compared to other scoring methods such as uniqueness scoring
(Silvia et al., 2008). In recent investigations, we have also used a
subjective scoring of divergent thinking tasks and found that
correlations with intelligence crucially depended on the scoring
method (i.e., top-2 vs. average originality; Jauk, Benedek, Dunst, &
Neubauer, 2013). Moreover, we observed that top-3 scoring resulted in
a somewhat higher reliability evidence of the originality score as
compared to top-1 or top-2 scoring (Benedek, Franz, Heene &
Neubauer, 2012). This raises the question to what extent the
number of top-ideas actually affects reliability and maybe also the
validity of the originality score. Moreover, because the number
and originality of ideas depend on time-on-task (e.g., Beaty &
Silvia, 2012; Mednick, 1962), the most adequate number of top-
ideas might also depend on the duration of divergent thinking
tasks. Therefore, this study aims to take a closer look at different
realizations of the subjective top-scoring method and their effects
on the psychometric properties of originality scores. Specifically,
we want to examine systematically to what extent (a) the actual
number of top-ideas and (b) the time-on-task affect (1) the corre-
lation of originality scores with fluency, (2) the reliability of
originality scores, and (3) the convergent validity of originality
scores. Additionally, we also examine the effect of task duration
on the reliability and validity of fluency scores. We thereby hope
to reveal further information about the adequate assessment of
ideational originality, ensuring high psychometric quality but also
efficient scoring procedures.
Method
Participants
A sample of 105 participants (51 females) took part in this
study. Age ranged from 18 to 51 years (M = 23.80, SD =
3.97). Forty-nine percent of the participants were psychology students
at the University of Graz, 38% were majoring in other
fields, and the remaining 13% were nonstudents. Participants were
invited to take part in a study on creativity and personality and
were offered course credit (if applicable) and individual feedback on
their personality structure in
exchange for participation. The only requirement for participation
was basic computer literacy. All participants gave written in-
formed consent. The study was approved by the local ethics
committee.
Tasks and Material
We used six divergent thinking tasks timed for five minutes
each. The tasks included three alternate uses tasks (“car tire,”
“glass bottle,” and “knife”) and three instances tasks (“what could
be round?”, “what could make a loud noise?”, “what could be used
for faster locomotion?”). Tasks were administered by a self-
devised computer program written in Matlab (The MathWorks,
Natick, MA), which allows for the acquisition of time-stamped re-
sponses. There is evidence that computer-based assessment of
divergent thinking is highly comparable to paper-and-pencil assess-
ment (Lau & Cheung, 2010).
In an initial general instruction, participants were told that they
would be presented with some questions for which they should try to
“generate as many different unusual and creative responses as
possible.” They were asked to express their ideas as succinctly as
possible, to write each idea into the input box, and then to press
the enter key to add it to their idea list. Participants were told that
there would be “some minutes” of time for each task and that the program
would proceed automatically as soon as the time was over. By giving
participants no exact information about the total or remaining task
duration, we hoped that they would keep entering every idea as
soon as it came to mind rather than develop specific task strategies
tailored to a 5-minute task time. After a task was started,
participants were presented with the specific task instructions at the top of
the screen (e.g., “What could make a loud noise? Name all the
unusual and creative responses that you can think of.”). Below,
there was an editable input box where ideas could be entered.
Every idea was added to a list placed below the input box. Two
time events were recorded for each idea: (1) the time when the
participant started entering the idea, and (2) the time when writing
was complete and the idea was added to the list. We used only the
former time event, because it better reflects the
time when the idea actually came to mind, whereas the latter time
event depends on the length of the idea and the typing speed.
After all tasks were completed, ideas were ranked for creativity
by the participants. To this end, participants were presented with
lists showing all their ideas within a single task with the ideas
being arranged in randomized order. They were asked to rearrange
the positions of the ideas until the sequence in the list
reflected their own subjective appraisal of the ideas' creativity.
Ideas were rearranged by selecting them and moving
them up or down by means of dedicated buttons. In the final list, the
topmost idea was to be the most creative one, the
second idea the second-most creative one, and
the last idea the least creative one. This was
done separately for each of the six tasks.
We also measured self-reported ideational behavior by means of
a German version of the Runco Ideational Behavior Scale (RIBS;
Runco, Plucker & Lim, 2001). Personality structure was assessed
by means of the five-factor inventory NEO-FFI (Borkenau &
Ostendorf, 1993).
Scoring of Divergent Thinking Tasks
The ideas generated in the divergent thinking tasks were scored
for fluency and originality. Fluency scores simply reflect the
number of ideas generated after a given time. Originality scores
were computed according to the subjective top-scoring method
using the creativity evaluations obtained by external judges.
External originality ratings. Participants generated a total of
10,921 ideas in the six divergent thinking tasks. All ideas were
pooled and identical ideas were removed, resulting in a final set of
6,229 nonredundant ideas. Eight external raters were asked to
evaluate the creativity of the ideas on a scale ranging from 0 (not
creative) to 3 (very creative). All raters received an initial train-
ing, which made them familiar with the scale (e.g., they were
informed that ideas can be considered “highly creative” when
they are perceived as original and useful, and probably only a few
people would come up with them). The judges evaluated a small
subset of ideas and then discussed their ratings. Because of
the large number of ideas, each judge then evaluated the ideas
of only half of the tasks, so that finally there were four inde-
pendent ratings for each idea. The interrater reliability between
the four judges was ICC = .68, .80, .65, .60, .51, and .68 for the
tasks “car tire,” “glass bottle,” “knife,” “round,” “loud noise,”
and “faster locomotion,” respectively. The creativity of a single
idea was defined as the average creativity rating given by the
four external judges.
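The aggregation of the external ratings can be sketched in a few lines. The following Python snippet is only an illustration under assumed data structures (a long-format table of ratings and the pingouin package for the ICC are assumptions; the article does not state which software or which ICC variant was used); it is not the authors' analysis code.

```python
# Minimal sketch of the rating aggregation step (assumed data layout,
# not the authors' original code). `ratings` is a long-format table with
# one row per (task, idea, judge) combination and a 0-3 creativity rating.
import pandas as pd
import pingouin as pg  # assumption: pingouin is used here only for illustration

ratings = pd.read_csv("idea_ratings.csv")  # hypothetical file: task, idea, judge, rating

# Interrater reliability per task across the four judges
# (which ICC variant the authors report is not specified; ICC2k is shown here).
for task, sub in ratings.groupby("task"):
    icc = pg.intraclass_corr(data=sub, targets="idea", raters="judge", ratings="rating")
    print(task, icc.loc[icc["Type"] == "ICC2k", "ICC"].round(2).values)

# The creativity of a single idea = mean rating across the four judges.
idea_creativity = (ratings.groupby(["task", "idea"])["rating"]
                          .mean()
                          .rename("creativity"))
```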
Subjective top-scoring. The main idea of the top-scoring
method is that the originality score is based on the creativity
evaluations of a predefined number of top-ideas. The top-ideas are
identified by the participants themselves according to their sub-
jective appraisal of the creativity of their ideas. For example, Silvia
et al. (2008) used a top-2 scoring, where participants marked their
two most creative ideas which then were considered for scoring.
Generally, all kinds of top-scores can be computed. For example,
for a top-1 score only the single most creative idea within a task
would be considered, whereas for a top-3 score the three most
creative ideas would be included.
A first specific aim of this study was to examine the effect of the
number of top-ideas. This was made possible by having partici-
pants sort all their ideas for creativity, as that allows for a post hoc
classification of any number of top-ideas. Second, we aimed to
also consider the effect of time-on-task. In this study, time-on-task
can theoretically vary from zero to five minutes (i.e., the total time
for each task). For a specific time-on-task lower than five minutes,
for example, 3 minutes, the scores were computed based on the
data available after 3 minutes. To illustrate this method, let us
consider the following example: Assume that a participant gener-
ated four ideas at 30, 60, 120, and 240 seconds, and afterward
ranked them 2nd, 4th, 3rd, and 1st (i.e., the first idea being second-most
creative, the second idea being least creative, and so on). Then, for
computing the top-2 originality score for a time-on-task of 3
minutes, only the two most creative ideas within the first 3 minutes
are considered, hence the first and the third idea. The originality
score was finally computed by averaging the creativity evaluations
of the considered top-ideas. If a participant generated fewer ideas
than the number of top-ideas, the creativity evaluations of the
available ideas were averaged.
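The following sketch mirrors the scoring rules just described (filter ideas by onset time, take the participant's k best-ranked ideas, average their judged creativity, and fall back to all available ideas if fewer than k were produced). The data structure, function name, and the creativity values in the example are hypothetical; only the selection logic follows the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Idea:
    onset_s: float      # time (s) when the participant started typing the idea
    rank: int           # participant's creativity rank (1 = most creative)
    creativity: float   # mean creativity rating of the external judges (0-3)

def top_originality(ideas: List[Idea], k: int, time_on_task_s: float) -> float:
    """Top-k originality for one task: mean judged creativity of the k ideas
    ranked most creative by the participant among those generated within
    the given time-on-task (hypothetical helper mirroring the text)."""
    available = [i for i in ideas if i.onset_s <= time_on_task_s]
    if not available:
        return float("nan")
    best = sorted(available, key=lambda i: i.rank)[:k]  # uses all ideas if fewer than k
    return sum(i.creativity for i in best) / len(best)

# Worked example from the text: ideas at 30, 60, 120, 240 s, ranked 2nd, 4th, 3rd, 1st.
# The creativity values are made up for illustration.
example = [Idea(30, 2, 2.5), Idea(60, 4, 0.5), Idea(120, 3, 1.0), Idea(240, 1, 3.0)]
# Top-2 score for 3 minutes: only the first and the third idea qualify.
print(top_originality(example, k=2, time_on_task_s=180))  # (2.5 + 1.0) / 2 = 1.75
```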
Procedure
Participants were tested in small groups of up to five people in
a computer room. They first performed the six divergent thinking
tasks, which were presented in a randomized order. The divergent
thinking tasks were preceded by a short exercise (entering words
starting with the letter “F”) so that participants could become familiar with the general
procedure of a computer-based idea generation task. After com-
pletion of the six tasks, the participants ranked their ideas for
creativity. Finally, the participants completed the personality in-
ventory and the ideational behavior scale. The whole session took
about one hour.
Analysis Plan
The main analyses include correlation analysis of fluency and
originality scores, reliability analyses (i.e., internal consistency of
scores), and validity analyses (i.e., correlations with external cri-
teria). In all of these analyses two experimental factors are varied
systematically: First, the factor top-ideas is varied from 1 to 10
ideas (see section on top-scoring method for further details). Ad-
ditionally, this factor also includes the value “all ideas,” where all
ideas given by a participant were considered; this hence corre-
sponds to an average originality score (cf., Silvia et al., 2008). This
factor applies only to the originality score, not to the fluency
score. Second, the factor time-on-task is varied from 1 to 5 minutes
(i.e., scores are computed for time-on-task of 1, 2, 3, 4, and 5
minutes). In total, the scoring hence was computed for 55 different
conditions (11 top-idea conditions by 5 time-on-task conditions).
The results of these analyses are visualized by means of contour
plots (see Figures 1, 2, and 3).
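As an illustration of this design, the sketch below loops over the 5 × 11 grid of conditions and collects the originality–fluency correlation for each cell (reliability and validity coefficients would be collected analogously). It reuses the hypothetical Idea objects and top_originality helper sketched in the Method section and assumes a nested data structure of participants' ideas; how scores were aggregated across the six tasks is not fully specified in the article, so simple averaging is used here.

```python
import numpy as np

TIMES_S = [60, 120, 180, 240, 300]          # time-on-task: 1 to 5 minutes
TOP_K = list(range(1, 11)) + [None]         # 1-10 top-ideas plus "all ideas" (average score)

# `data[participant][task]` is assumed to be a list of Idea objects (see earlier sketch).
def condition_scores(data, k, t):
    fluency, originality = [], []
    for tasks in data.values():
        # aggregate across the six tasks by averaging task-level scores (assumption)
        flu = np.mean([sum(i.onset_s <= t for i in ideas) for ideas in tasks.values()])
        k_eff = k if k is not None else 10**9   # None = "all ideas" = average score
        ori = np.mean([top_originality(ideas, k_eff, t) for ideas in tasks.values()])
        fluency.append(flu)
        originality.append(ori)
    return np.array(fluency), np.array(originality)

# 11 x 5 grid of originality-fluency correlations (the basis for a contour plot)
corr_grid = np.full((len(TOP_K), len(TIMES_S)), np.nan)
for row, k in enumerate(TOP_K):
    for col, t in enumerate(TIMES_S):
        flu, ori = condition_scores(data, k, t)
        corr_grid[row, col] = np.corrcoef(flu, ori)[0, 1]
```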
Results
Descriptive Statistics and Preliminary Analyses
Participants generated on average 17 ideas (SD = 6.36) within
the task time of 5 minutes. The fluency of ideas was significantly
higher in the three instances tasks (M = 21.95, SD = 8.61) than in
the three alternate uses tasks (M = 12.72, SD = 4.96), t(104) =
15.83, p < .01. As can be seen in Table 1, the total number of ideas
steadily increases over time; however, the rate of idea production
steadily declines from the first minute onward. Con-
sidering the alternate uses tasks, half of the participants had generated
four or more ideas within the first minute of the task, but the
increase flattened to about one additional idea in the last minute
of the task (difference between median values after 4 and 5
minutes; see Table 1). It can also be seen that there are large
individual differences in fluency scores. Whereas the least
fluent 10% of participants generated only nine ideas or fewer
after working on the instances tasks for five minutes, the most
productive 10% of the sample generated more than three times
as many ideas (i.e., 32.7).
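The percentile values in Table 1 can be derived directly from the time-stamped responses; a minimal sketch, assuming a mapping from participants to idea onset times for a given task type (variable names are hypothetical):

```python
import numpy as np

# onsets[participant] = list of idea onset times (s) for one task type,
# pooled or averaged across the three tasks of that type (assumed layout).
def fluency_percentiles(onsets, time_on_task_s, qs=(10, 50, 90)):
    counts = [sum(t <= time_on_task_s for t in times) for times in onsets.values()]
    return np.percentile(counts, qs)

# e.g., 10th/50th/90th percentiles of the number of ideas after 3 minutes:
# print(fluency_percentiles(onsets_instances, 180))
```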
A principal factor analysis with Varimax rotation and Kaiser
normalization was performed for the six fluency scores and the six
average originality scores derived from the alternate uses as well
as the instances tasks. The analysis extracted two factors (accord-
ing to the Kaiser criterion and according to the Scree test;
K-M-O = .83), explaining 67% of total variance. Further evidence
for a two-factorial solution comes from the minimum average
partial test (MAP test: Velicer, Eaton, & Fava, 2000), which
returned two components as the number of factors to extract from
the 12 measures. This two-factor solution clearly revealed a flu-
ency factor and an originality factor with all six fluency and
originality scores loading on corresponding factors (unspecific
loadings were below .25). In other words, we obtained evidence
for score-specific factors rather than task-specific factors, which
supports the feasibility of aggregating fluency and originality
scores across tasks.
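For illustration, a corresponding analysis could be run as follows with the factor_analyzer package; this is an assumption about tooling (the article does not say which software was used for the factor analysis), and the MAP test reported in the text is not included in this sketch.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo

# scores: 105 x 12 data frame with the six fluency and six average-originality scores
scores = pd.read_csv("dt_scores.csv")  # hypothetical file

kmo_per_item, kmo_total = calculate_kmo(scores)   # reported as K-M-O = .83 in the text
fa = FactorAnalyzer(n_factors=2, rotation="varimax", method="principal")
fa.fit(scores)

loadings = pd.DataFrame(fa.loadings_, index=scores.columns,
                        columns=["Factor1", "Factor2"])
print(round(kmo_total, 2))
print(loadings.round(2))          # expected: fluency and originality load on separate factors
print(fa.get_factor_variance())   # proportion of variance explained (about 67% in the text)
```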
How Do Scoring Conditions Affect the Correlation of
Fluency and Originality?
We computed correlations between originality and fluency
scores for different numbers of top-responses at varying time-on-
task. It was assumed that the subjective top-scoring method can
avoid excessively high correlations between originality and flu-
ency because it focuses on a constant number of ideas. In line with
these expectations, correlations were found to be close to zero
ranging between –.30 and .06 (see Figure 1). Ideational fluency
and originality even showed significant negative correlations when
using the average score and considering task times of 3 minutes or
less.
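The r = .19 significance threshold referred to here and in the figure captions follows from the usual t test for a Pearson correlation with n = 105; as a quick check (two-tailed, alpha = .05):

```latex
% Critical correlation for n = 105: df = n - 2 = 103, t_{.975,103} \approx 1.98
r_{\mathrm{crit}} = \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2} + df}}
                  = \frac{1.98}{\sqrt{1.98^{2} + 103}} \approx .19
```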
For reasons of comparison with previous studies, we also com-
puted summative originality scores by summing up the creativity
evaluations of all ideas produced by participants within a task. As
expected, these summative creativity scores showed extremely
high correlations with the fluency scores of r = .83, .87, .90, .90,
and .91 for time-on-task of 1, 2, 3, 4, and 5 minutes, respectively.
Table 1
Number of Ideas Generated After a Time-on-Task of 1 to 5 Minutes in the Alternate Uses Tasks and the Instances Tasks

                  1 min           2 min           3 min           4 min            5 min
Alternate uses    2.0/4.3/6.7     3.5/7.0/10.3    4.8/9.0/13.7    5.3/11.0/17.0    6.3/12.0/19.0
Instances         2.5/7.7/11.3    4.8/12.0/17.7   6.5/15.3/23.7   8.4/18.7/28.3    9.0/21.7/32.7

Note. The three values in each cell denote the 10th, 50th, and 90th percentile values.
How Do Scoring Conditions Affect Reliability?
To examine how the scoring conditions affect reliability of fluency
and originality scores, we computed their internal consistency (Cron-
bach’s alpha). The fluency score shows high reliability (α = .83) even
for a short time-on-task of 1 minute. The reliability evidence increases
with time-on-task up to α = .89 for a time-on-task of 5 minutes (see
Figure 2A).
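Internal consistency here refers to Cronbach's alpha across the six task scores, recomputed for every scoring condition. A minimal sketch of the computation, assuming a 105 × 6 matrix of task scores (hypothetical variable name):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a participants x tasks score matrix (here 105 x 6):
    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# e.g., alpha of the six fluency scores at 1 minute time-on-task (reported as .83):
# print(cronbach_alpha(fluency_scores_1min))
```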
The reliability evidence of the originality score also was found to
generally increase with an increasing number of top-ideas and with
increasing time-on-task (see Figure 2B). Reliability was lowest when
only a single top-idea was considered (i.e., the top-1 score), staying below
an alpha of .60, but it increased substantially when some
additional top-ideas were included. For example, at a time-on-task of 2 minutes,
using top-2, top-3, or top-4 scoring increased reliability to .70, .75, or
.77, respectively.
Time-on-task also generally increased reliability evidence, but this
is especially true when a larger number of top-ideas is considered. For
the top-1 scoring, the increase of time-on-task from 1 minute to 5
minutes raised reliability only from .56 to .59, whereas
reliability increased from .71 to .83 for the top-5 scoring. A decent alpha
coefficient of at least .75 could be obtained only with a time-on-
task of 2 minutes (or more) and when using at least the top-3 ideas.
An alpha of .80 was obtained when using top-4 scoring with a
time-on-task of 4 minutes or top-5 scoring with a time-on-task of 3
minutes. Reliability of the originality score peaked at an alpha of .87
when using the average score (i.e., using all ideas) at a time-on-task of
5 minutes.
How Do Scoring Conditions Affect Validity?
The effect of the scoring method on the convergent validity
evidence of ideational fluency and originality was tested by
means of correlations with the external criteria of self-reported
ideational behavior and the personality factor openness. The
fluency score showed significant positive correlations with both
external criteria, ranging from .26 to .33 for the ideational
behavior scale and from .25 to .30 for openness
(see Figures 3A and 3C). In addition, there is a small trend for
correlations to increase with time-on-task.
The originality score showed no significant correlations with
the ideational behavior scale, only a weak trend toward
positive correlations (see Figure 3B). With respect to openness,
the originality scores generally showed significant positive
correlations (see Figure 3D). These correlations were highest
(r = .35 to .38) for a time-on-task of 2 minutes when using
top-2 to top-8 scoring. The correlations were substantially
lower but still significant (r = .21 to .26) for the average
scoring method, which considers all ideas.
Figure 1. Correlation of fluency and originality scores depending on the
number of top-ideas and time-on-task. Correlation coefficients exceeding
r = .19 are considered statistically significant given the sample size of
n = 105.
Figure 2. Reliability (Cronbach’s alpha) of (A) the fluency score and (B) the originality score depending on
the number of top-ideas and time-on-task.
Discussion
Can Subjective Top-Scoring Avoid the Confounding of
Originality and Fluency?
One major aim of the subjective top-scoring method is to avoid
the usually high dependency of qualitative measures of divergent
thinking (e.g., ideational originality) on the number of ideas
generated by participants (i.e., ideational fluency). We were able to
replicate the common finding that a summative scoring of origi-
nality (i.e., computing a sum of the creativity evaluations of all
ideas generated by a participant) results in extremely high corre-
lations with the fluency score, ranging between .80 and .90 (cf.,
Mouchiroud & Lubart, 2001; Torrance, 2008). In contrast, when
originality scores were computed by means of the top-scoring
method, correlations with fluency were largely close to zero. This
is in line with the finding of Silvia et al. (2008), who also obtained
no significant correlation with fluency when using top-2 scoring.
The results hence confirm that the subjective top-scoring method
avoids the confounding of originality scores with fluency.
The average score is a special case: it uses all ideas, but
because ratings are averaged rather than summed, a high positive
correlation with the fluency of ideas can be avoided. For the average
score (and to a smaller extent also for the top-9 or top-10 score) we
even observed small negative correlations, at least when time-on-
task was short. This result can probably be attributed to the
existence of people who focus on fluency rather than on the creativity of
ideas and thus were able to generate large numbers of responses.
This strategy probably involves the generation of a large number
of highly common responses, which then results in a low average
originality score compared to those who focus on the cre-
ativity of ideas (Reiter-Palmon et al., 2009).
Psychometric Properties of the Fluency Score
We obtained high internal consistency for the ideational fluency
scores of the six divergent thinking tasks. Alpha coefficients
slightly increased with time-on-task but already settled above .85
for times-on-task of 2 minutes or more. This suggests that ide-
ational fluency can be reliably assessed even with short divergent
thinking tasks. We further obtained significant positive correla-
tions of fluency with self-reported ideational behavior and open-
ness, supporting the general validity of this score. These correla-
tions also showed a slight increase with time-on-task, which can
probably be attributed to the corresponding increases in reliability.
Psychometric Properties of the Originality Score
The top-scoring method was found to result in dependable
originality scores. Although interrater reliability was moderate for
some tasks, the internal consistency between the six different
divergent thinking tasks reached Cronbach’s alpha levels well
Figure 3. Correlation of fluency (A, C) and originality (B, D) with self-reported ideational behavior and
openness depending on the number of top-ideas and time-on-task. Correlation coefficients exceeding r = .19 are
considered statistically significant given the sample size of n = 105.
beyond .80 for some scoring conditions. This level of reliability
bears comparison with other well-established constructs of cogni-
tive ability. Together with the findings derived from factor anal-
ysis, this indicates that originality scores coming from different
divergent thinking tasks share a substantial amount of common
variance. Although divergent thinking tasks may not be fully
interchangeable with respect to their cognitive demands (Guilford,
1967; Kuhn & Holling, 2009; Silvia, 2011), our results support the
feasibility of computing aggregate scores across different diver-
gent thinking tasks to obtain a reliable total originality score. The
reliability, however, was found to be sensitive to scoring condi-
tions (i.e., top-ideas and time-on-task). Reliability was lowest
when only a single top-idea was considered, but it could be
increased substantially by including some additional top-ideas
(Benedek, Franz et al., 2012), and was highest for the average
score, which makes use of all ideas (Silvia et al., 2008). A straight-
forward explanation for this is that aggregating the evaluations of a
larger number of ideas allows for a more reliable assessment, just
as any test becomes more reliable when the number of rele-
vant items is extended. Also, considering more ideas could compensate for any
discrepancies between the participants and the raters about which
ideas are considered the most creative.
A higher time-on-task was found to increase reliability at least
for scores using four or more top-ideas. This suggests that scoring
a high number of ideas makes more sense when there is enough
time for participants to generate large numbers of ideas. A task
time of 2 or 3 minutes apparently already worked quite well for
most scores; further increases in task time only added small
increases in reliability.
We also examined correlations with other common indicators of
creativity to estimate effects of task properties on the validity of
the originality score. A priori, one could assume that the correla-
tion pattern would generally match that of reliability as any lack of
reliability necessarily impairs validity coefficients. Interestingly,
this was not the case. Whereas the reliability evidence of original-
ity scores was highest for average scoring at 5 minutes time-on-
task, the correlation with openness for this score was lowest. The
highest validity coefficients were obtained for a task time of 2
minutes using a medium number of about 3 to 6 top-ideas. This
raises the question why correlations did not increase with increas-
ing number of top-ideas just as reliability did. It has to be remem-
bered that people were instructed to generate as many unusual and
creative ideas as possible. Highly creative people presumably were
able to generate many unusual ideas, of which, however, only
some are very creative and thus truly indicative of their potential
for creative thought. Hence, when all ideas are considered, as
in the average scoring, the evaluations of more and less creative
ideas become mixed. This would result in a moderate total
creativity score for a highly creative person, which could equally be
attained by a less creative person who just generated a few mod-
erately creative ideas. It hence can be concluded that subjective
top-scoring may provide more valid scores than average scoring,
even though the latter method may be somewhat more reliable in
terms of internal consistency.
The question remains why the validity did not increase steadily
with time-on-task like reliability did. This might be explained by
the fact that originality generally increases over time (e.g., Beaty
& Silvia, 2012; Mednick, 1962; Piers & Kirchner, 1971) but
creative people overcome common ideas more quickly than less
creative people (Benedek & Neubauer, in press). As a conse-
quence, after a short time-on-task, creative people may already
have come up with highly original ideas, whereas less creative
people have not. As the time-on-task proceeds, less creative people
eventually also come up with more creative ideas, whereas highly
creative people can hardly improve their performance further to
the same extent. Hence, the discrimination between more and less
creative people (i.e., validity) may be higher for shorter task times
than for excessively long ones.
Originality showed significant correlations only with openness
but not with self-reported ideational behavior. The absence of signif-
icant correlations with ideational behavior suggests that the ide-
ational behavior questionnaire is more indicative of ideational
fluency (two sample items read “I come up with a lot of ideas or
solutions to problems” and “I have always been an active thinker—I
have lots of ideas”; Runco et al., 2001).
Recommendations/Implications for Scoring of
Divergent Thinking Tasks
Some straightforward recommendations concerning the ade-
quate assessment of ideational fluency and originality can be
derived from the results of this study. For ideational fluency, it
appears to be quite simple to obtain a reliable and valid score. This
can be achieved by using divergent thinking tasks with a short task
time of about two minutes. The originality score, however, appears
to be more sensitive to task and scoring properties. First, original-
ity scores were found to be more valid when using tasks with
durations of about 2 to 3 minutes. This substantiates the common
practice of using similar task durations. Using shorter or much
longer tasks, however, might negatively affect the validity of
scores. Second, the top-scoring method should consider a medium
number of about three to six ideas. Using many fewer or many
more ideas (e.g., Plucker et al., 2011) may result in less valid
scores. Considering that using a higher number of top-ideas also
implies that a higher total number of ideas has to be subjected to
ratings, using three top-ideas could be a good compromise. For
a time-on-task of 2 minutes, participants generated on average 10
ideas. Using only the three most creative ideas would help to
reduce the rating effort by about 70% as compared to having to
evaluate all ideas. Similar rates were reported by Silvia et al.
(2008) for top-2 scoring. Moreover, more than 90% of participants
generated three or more ideas within two minutes.
Some Limitations of This Study and Future Directions
Some limitations of this study need to be addressed. Time-on-
task was varied as an experimental variable by analyzing the
performance data available at different times within the task. This
was done to estimate scores that could be obtained for tasks of
different length. Although this method is efficient, results obtained
for, say, a time-on-task of 2 minutes might not fully
generalize to studies that explicitly use 2-minute tasks. Differ-
ences might, for example, relate to stronger effects of fatigue, because
performing six divergent thinking tasks of five minutes each probably
involves more cognitive effort than six tasks of only two min-
utes. Moreover, people might apply different idea generation strat-
egies when they know that tasks are shorter. We tried, however, to
minimize these effects by not telling participants about the exact
task time and by not giving them any information about the
remaining task time.
A similar argument applies to the experimental variation of the
number of top-ideas. The post hoc selection of a specific number
of top-ideas may not fully generalize to the corresponding instruc-
tion to select a specific number of top-ideas. Some people who had
generated large numbers of ideas reported that they found it
difficult to rank them all properly for creativity. This issue
might be less prominent for shorter tasks and when people are just
asked to identify their three or five most creative ideas. Taken
together, it could be assumed that shorter task durations and the
selection of a small number of most creative ideas may cause less
fatigue and more accurate judgments, which might eventually have
additional positive effects on the psychometric properties of the
originality score. Further limitations include the sample size and
the specific tasks selected for this study. For example,
for more complex divergent thinking tasks (e.g., Reiter-Palmon et
al., 2009), which often show lower fluency, the most adequate
number of top-ideas might differ. The present findings hence await
replication with larger samples, using other divergent thinking
tasks, and employing further criteria for examining the validity of
scores.
There are also some additional methodological issues that could
be addressed in future research. First of all, one might consider
using separate tasks for assessing fluency and originality. We
derived both scores from the same tasks (e.g., Torrance, 2008) and
instructed participants to generate as many unusual and creative
ideas as possible. This could be considered a kind of dual task,
which permits participants to use different strategies, focus-
ing either on fluency or on the creativity of ideas. Future work might therefore
attempt to assess fluency and originality with separate tasks, using
specific task instructions to focus either only on fluency or only on
originality of ideas. Although this procedure may require a larger
total number of tasks, it might help to further increase the validity
of scores. Finally, it should be noted that using a specific number
of top-ideas also implies the possibility that some par-
ticipants do not actually generate that many ideas within the
given time. There are different ways to handle this. In this study,
we used all available ideas of such a participant. Another possi-
bility would be to assign missing ideas the lowest possible
creativity rating (i.e., a creativity rating of zero). This would
implicitly penalize very low fluency. Some side analyses indicated
that such originality scores can again be highly correlated with
fluency, at least for high numbers of top-ideas. This scoring
approach could, however, be useful in studies that decide not to
use separate fluency scorings but still allow for a moderate influ-
ence of fluency on the originality score.
Conclusions
This study provides further evidence of the usefulness of the
subjective top-scoring method for the assessment of ideational
originality (cf., Silvia et al., 2008). Using subjective top-scoring
ensures that ideational originality scores overcome the issues often
associated with this score, such as a lack of discriminant validity
with respect to fluency. Moreover, adequate scoring methods help
to obtain a highly reliable and valid originality score. As an
example, a top-3 originality score for 2 minutes time-on-task
showed a higher correlation with openness than fluency did. Ad-
equate scoring of ideational originality hence may provide re-
searchers with a powerful indicator of creative potential, over
and above fluency.
References
Amabile, T. (1982). Social psychology of creativity: A consensual assess-
ment technique. Journal of Personality and Social Psychology, 43,
997–1013. doi:10.1037/0022-3514.43.5.997
Beaty, R. E., & Silvia, P. J. (2012). Why do ideas get more creative across
time? An executive interpretation of the serial order effect in divergent
thinking tasks. Psychology of Aesthetics, Creativity, and the Arts. Ad-
vance online publication. doi:10.1037/a0029171
Benedek, M., Fink, A., & Neubauer, A. (2006). Enhancement of ideational
fluency by means of computer-based training. Creativity Research Jour-
nal, 18, 317–328. doi:10.1207/s15326934crj1803_7
Benedek, M., Franz, F., Heene, M., & Neubauer, A. C. (2012). Differential
effects of cognitive inhibition and intelligence on creativity. Personality
and Individual Differences, 53, 480 – 485. doi:10.1016/j.paid.2012.04
.014
Benedek, M., Könen, T., & Neubauer, A. C. (2012). Associative abilities
underlying creativity. Psychology of Aesthetics, Creativity, and the Arts,
6, 273–281. doi:10.1037/a0027059
Benedek, M., & Neubauer, A. C. (in press). Revisiting Mednick’s model on
creativity-related differences in associative hierarchies. Evidence for a
common path to uncommon thought. Journal of Creative Behavior.
doi:10.1002/jocb.35
Borkenau, P., & Ostendorf, F. (1993). NEO-Fünf-Faktoren Inventar (NEO-
FFI) nach Costa und McCrae [NEO-Five factor inventory after Costa
and McCrae]. Göttingen, Germany: Hogrefe.
Clark, P. M., & Mirels, H. L. (1970). Fluency as a pervasive element in the
measurement of creativity. Journal of Educational Measurement, 7,
83– 86. doi:10.1111/j.1745-3984.1970.tb00699.x
Fink, A., & Benedek, M. (in press). EEG Alpha power and creative
ideation. Neuroscience and Biobehavioral Reviews. Advance online
publication. doi:10.1016/j.neubiorev.2012.12.002
Gilhooly, K. J., Fioratou, E., Anthony, S. H., & Wynn, V. (2007). Diver-
gent thinking: Strategies and executive involvement in generating novel
uses for familiar objects. British Journal of Psychology, 98, 611– 625.
Guilford, J. P. (1967). The nature of human intelligence. New York, NY:
McGraw-Hill.
Hocevar, D. (1979a). A comparison of statistical infrequency and subjec-
tive judgment as criteria in the measurement of originality. Journal of
Personality Assessment, 43, 297–299. doi:10.1207/s15327752jpa4303_13
Hocevar, D. (1979b). Ideational fluency as a confounding factor in the
measurement of originality. Journal of Educational Psychology, 71,
191–196. doi:10.1037/0022-0663.71.2.191
Jauk, E., Benedek, M., Dunst, B., & Neubauer, A. C. (2013). The rela-
tionship between intelligence and creativity: New support for the thresh-
old hypothesis by means of empirical breakpoint detection. Intelligence,
41, 212–221. doi:10.1016/j.intell.2013.03.003
Kaufman, J. C., Plucker, J. A., & Baer, J. (2008). Essentials of creativity
assessment. Hoboken, NJ: Wiley.
Kuhn, J. T., & Holling, H. (2009). Measurement invariance of divergent
thinking across gender, age, and school forms. European Journal of
Psychological Assessment, 25, 1–7. doi:10.1027/1015-5759.25.1.1
Lau, S., & Cheung, P. C. (2010). Creativity assessment: Comparability of
the electronic and paper-and-pencil versions of the Wallach–Kogan
Creativity Tests. Thinking Skills and Creativity, 5, 101–107. doi:
10.1016/j.tsc.2010.09.004
Mednick, S. A. (1962). The associative basis of the creative process.
Psychological Review, 69, 220 –232. doi:10.1037/h0048850
Michael, W. B., & Wright, C. R. (1989). Psychometric issues in the
assessment of creativity. In J. A. Glover, R. R. Ronning & C. R.
Reynolds (Eds.), Handbook of creativity (pp. 33–52). New York, NY:
Plenum Press. doi:10.1007/978-1-4757-5356-1_2
Mouchiroud, C., & Lubart, T. (2001). Children’s original thinking: An
empirical examination of alternative measures derived from divergent
thinking tasks. The Journal of Genetic Psychology: Research and The-
ory on Human Development, 162, 382– 401. doi:10.1080/
00221320109597491
Nusbaum, E. C., & Silvia, P. J. (2011). Are intelligence and creativity
really so different? Fluid intelligence, executive processes, and strategy
use in divergent thinking. Intelligence, 39, 36 – 45. doi:10.1016/j.intell
.2010.11.002
Piers, E. V., & Kirchner, E. P. (1971). Productivity and uniqueness in
continued word association as a function of subject creativity and
stimulus properties. Journal of Personality, 39(2), 264 –276. doi:
10.1111/j.1467-6494.1971.tb00041.x
Plucker, J. A., Qian, M., & Wang, S. (2011). Is originality in the eye of the
beholder? Comparison of scoring techniques in the assessment of diver-
gent thinking. The Journal of Creative Behavior, 45, 1–22. doi:10.1002/
j.2162-6057.2011.tb01081.x
Reiter-Palmon, R., Illies, M. Y., Cross, L. K., Buboltz, C., & Nimps, T.
(2009). Creativity and domain specificity: The effect of task type on
multiple indexes of creative problem-solving. Psychology of Aesthetics,
Creativity, and the Arts, 3, 73– 80. doi:10.1037/a0013410
Runco, M. A. (1986). Maximal performance on divergent thinking tests
by gifted, talented, and nongifted children. Psychology in the Schools,
23, 308–315. doi:10.1002/1520-6807(198607)23:3<308::AID-PITS2310230313>3.0.CO;2-V
Runco, M. A. (2008). Commentary: Divergent thinking is not synonymous
with creativity. Psychology of Aesthetics, Creativity, and the Arts, 2,
93–96. doi:10.1037/1931-3896.2.2.93
Runco, M. A., & Acar, S. (2012). Divergent thinking as an indicator of
creative potential. Creativity Research Journal, 24, 66 –75. doi:10.1080/
10400419.2012.652929
Runco, M. A., & Albert, R. S. (1985). The reliability and validity of
ideational originality in the divergent thinking of academically gifted
and nongifted children. Educational and Psychological Measurement,
45, 483–501.
Runco, M. A., & Mraz, W. (1992). Scoring divergent thinking tests using
total ideational output and a creativity index. Educational and Psycho-
logical Measurement, 52, 213–221.
Runco, M. A., Okuda, S. M., & Thurston, B. J. (1987). The psychometric
properties of four systems for scoring divergent thinking tests. Journal
of Psychoeducational Assessment, 5, 149 –156. doi:10.1177/
073428298700500206
Runco, M. A., Plucker, J. A., & Lim, W. (2001). Development and
psychometric integrity of a measure of ideational behavior. Creativity
Research Journal, 13, 393– 400. doi:10.1207/S15326934CRJ1334_16
Silvia, P. J. (2008). Discernment and creativity: How well can people
identify their most creative ideas? Psychology of Aesthetics, Creativity,
and the Arts, 2, 139 –146. doi:10.1037/1931-3896.2.3.139
Silvia, P. J. (2011). Subjective scoring of divergent thinking: Examining
the reliability of unusual uses, instances, and consequences tasks. Think-
ing Skills and Creativity, 6, 24 –30. doi:10.1016/j.tsc.2010.06.001
Silvia, P. J., Martin, C., & Nusbaum, E. C. (2009). A snapshot of creativity:
Evaluating a quick and simple method for assessing divergent thinking.
Thinking Skills and Creativity, 4, 79 – 85. doi:10.1016/j.tsc.2009.06.005
Silvia, P. J., Winterstein, B. P., Willse, J. T., Barona, C. M., Cram, J. T.,
Hess, K. I., . . . Richard, C. A. (2008). Assessing creativity with
divergent thinking tasks: Exploring the reliability and validity of new
subjective scoring methods. Psychology of Aesthetics, Creativity, and
the Arts, 2, 68 – 85. doi:10.1037/1931-3896.2.2.68
Smith, S. M., Ward, T. B., & Finke, R. A. (1995). The creative cognition
approach. Cambridge, MA: MIT Press.
Torrance, E. P. (1974). Torrance Tests of Creative Thinking: Norms-
technical manual, verbal forms A and B. Bensenville, IL: Scholastic
Testing Service.
Torrance, E. P. (2008). Torrance Tests of Creative Thinking: Norms-
technical manual, verbal forms A and B. Bensenville, IL: Scholastic
Testing Service.
Velicer, W. F., Eaton, C. A., & Fava, J. L. (2000). Construct explication
through factor or component analysis: A review and evaluation of
alternative procedures for determining the number of factors or compo-
nents. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in
human assessment (pp. 41–71). Boston, MA: Kluwer. doi:10.1007/978-
1-4615-4397-8_3
Received October 22, 2012
Revision received January 28, 2013
Accepted March 4, 2013