Journal of Applied Testing Technology, Vol 18(S1), 32-37, 2017

The Future Value of Serious Games for Assessment: Where Do We Go Now?

Sebastiaan de Klerk1* and Pamela M. Kato2

1University of Twente/eX:plain, Enschede, Netherlands; s.dklerk@explain.nl
2Coventry University, Coventry, United Kingdom
* Author for correspondence

Abstract
Game-based assessments will most likely be an increasing part of testing programs in future generations because they provide promising possibilities for more valid and reliable measurement of students' skills as compared to traditional methods of assessment like paper-and-pencil tests or performance-based assessments. The current status of serious games for assessment has been highlighted from several angles in the previous articles of this special issue. Here, in the first part, we will discuss how the advantages of game-based assessment can play a role in future testing, and in the second part we will address one of the most daring challenges: the psychometrics behind the game. In a short conclusion section we will discuss how research and practice should shape a future generation of game-based assessment.

Keywords: Evidence-Centered Design, Game-Based Assessment, Psychometrics, Serious Game, Training and Assessment
1. Using Serious Games for Assessment
Using serious games as a tool for assessment may both
expand and strengthen the domain of assessment. It is
hypothesized that the domain may be expanded because
serious games have the potential to reveal Knowledge,
Skills, and Attributes (KSAs) of students that are “invisible”
or hard to detect when assessed with more traditional
assessment methods. They may strengthen the domain
because the measurement of particular KSAs can be
improved (i.e., increased validity and reliability) through
the use of technology in serious games as compared to
traditional assessments like paper-and-pencil tests or
performance-based assessments (Iseli, Koenig, Lee, &
Wainess, 2010; Levy, 2013; De Klerk, Veldkamp, & Eggen,
2014; Mislevy et al., 2015).
An example of how serious games can expand
the domain of assessment can be found in a game-
based assessment that has been developed by CRESST
(Iseli, Koenig, Lee, & Wainess, 2010). This game-based
assessment was used to assess marine personnel’s
reactions to emergency situations. In the game, the player
(i.e., the test taker) was required to react to a variety of
emergencies that could occur on a marine vessel (e.g., a
fire) as a virtual character. Through an interactive interface
the marine personnel could indicate which actions they
thought should be taken and which priorities should be
set to achieve goals in the virtual emergency situations.
All of their (re)actions in the serious game were recorded
and analyzed in log files. This game-based assessment
technology, and others like it, confronts test takers with
more, and more realistic, situations in which to measure
behaviors that represent reactions, decisions, planning,
and prioritizing than would be possible in, for example,
a practical performance-based assessment. Serious games
thus have the capability to present users with an expanded
set of situations and contexts in which an expanded set of
behaviors and constructs can also be assessed compared
to traditional self-report methods of assessment.
An example of how serious games can strengthen
assessments can be seen in a game developed for
formative assessment by WestEd, called SimScientists
(Quellmalz, Timms, Silberglitt, & Buckley, 2012). This
game-based assessment was built for students around
the age of 12 and comprises a virtual environment in
which students can engage in science tasks. For example,
students are presented with an animation of the ocean
and are required to draw a "food web" between several
animals and plants by drawing arrows that connect them.
Assessment is strengthened by the fact that it is not only
the final food web that is logged (i.e., the product data)
but also the process data like reaction times, navigation
paths through the game, and the intermediate steps taken
to arrive at the final product. This new information may
have value with regard to the inferences made about the
KSAs of students, and can, for example, also be evaluated
as diagnostic evidence to identify students' misconceptions.
The articles in this special issue demonstrate
that both effects, expansion and strengthening of
measurement, can be achieved when serious game design
principles and assessment design principles are combined.
There is much enthusiasm in the field of education
about game-based assessment (Mislevy et al., 2014)
because the current methods of assessment do not seem
to have the power to fully measure all aspects of students'
KSAs (De Klerk, Eggen, & Veldkamp, 2014). The results
of standardized tests are, to an increasing degree, used
not only to evaluate students, but also to evaluate
the success of schools in teaching their students (Nelson,
Nugent, & Rupp, 2012). Yet, if we cannot even be sure
that standardized tests provide the most valid and reliable
indicators of students' KSAs, then how can we see those
indicators as valid and reliable for evaluating schools? As
the research in this special issue shows, there might be an
important role for game-based assessment in filling the
reliability and validity gap created by the strong emphasis
on standardized testing in education. Considering the
strong improvements in statistical methodology and
technology over recent years, now may be the time
to capitalize on the full potential game-based assessment
may provide. The articles presented in this special issue
provide valuable insights regarding this potential.
With regard to statistical methodology for game-based
assessment, there is much promise in the field of
Educational Data Mining (EDM) (Nelson, Nugent, &
Rupp, 2012; Rupp, Levy, DiCerbo, et al., 2012). EDM is
concerned with finding meaningful relationships in the
big data that are logged by educational applications.
Several techniques and (statistical) methodologies are
grouped under the broader concept of EDM. For example,
cluster analysis can be used to find clusters in the data,
network analysis is concerned with identifying how data
elements are connected, and regression trees can be used
to predict future performance (Kerr & Chung, 2012; Gobert,
Sao Pedro, Baker, Toto, & Montalvo, 2012; Mislevy, Behrens,
& DiCerbo, 2012). EDM techniques are for the most part
exploratory techniques that are applied as a first step to
find meaningful patterns in the data. Advances have also
been made in confirmatory statistical techniques. For
example, Bayesian network methods are often used to
make probabilistic statements about knowledge, skills,
and abilities of students based on their performance in a
game-based assessment (De Klerk et al., 2015; Mislevy et
al., 2014).
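As a concrete illustration of the exploratory side of EDM just described, the sketch below clusters students on a few features derived from game log files. It is a minimal sketch, assuming Python with scikit-learn; the feature names (completion time, retries, help requests) are hypothetical examples of the kind of process data discussed here, not variables from any of the games in this issue.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-student features extracted from game log files:
# [completion time in seconds, number of retries, help requests]
features = np.array([
    [120, 1, 0],
    [300, 5, 4],
    [110, 0, 1],
    [280, 6, 3],
    [150, 2, 0],
])

# Standardize so that no single feature dominates the distance metric
scaled = StandardScaler().fit_transform(features)

# Exploratory step: look for groups of students with similar play behavior
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)  # e.g., fast/independent vs. slow/help-seeking players
```

Whether such clusters mean anything for the KSAs of interest is exactly the evidentiary question taken up below; the clustering itself is only a first, exploratory step.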
A Bayesian Network is a graphical structure for
reasoning under uncertainty and is based on a Bayesian
psychometric modeling framework (Pearl, 1988). The
network consists of nodes (which are observable and
latent variables) that are connected through arcs (arrows)
depicting conditional probabilistic relationships
between the variables (Neapolitan, 2003). The observable
variables can be students' actions in a serious game, while
the latent variables are the KSAs that are the subject of
measurement. Students' actions influence the state of the
KSAs through the conditional probabilities that are defined
in the network. At first, the conditional probabilities can
be based on subject matter expert input, and they can later
be estimated from data. In that way, a Bayesian Network
can be updated and improved continuously.
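To make this concrete, the sketch below builds a toy network with one latent KSA and two observable in-game actions. It is a minimal sketch, assuming Python with the pgmpy library; the node names and all conditional probabilities are illustrative stand-ins for the expert-elicited values described above.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# One latent skill node influencing two observable in-game actions
model = BayesianNetwork([("skill", "action_1"), ("skill", "action_2")])

# Uniform prior on the latent skill (0 = low, 1 = high)
cpd_skill = TabularCPD("skill", 2, [[0.5], [0.5]])

# Conditional probabilities of each action given skill; these are
# hypothetical expert judgments that would later be re-estimated from data
cpd_a1 = TabularCPD("action_1", 2,
                    [[0.8, 0.3],   # P(fail | skill=low), P(fail | skill=high)
                     [0.2, 0.7]],  # P(success | skill=low), P(success | skill=high)
                    evidence=["skill"], evidence_card=[2])
cpd_a2 = TabularCPD("action_2", 2,
                    [[0.9, 0.4],
                     [0.1, 0.6]],
                    evidence=["skill"], evidence_card=[2])
model.add_cpds(cpd_skill, cpd_a1, cpd_a2)

# Posterior over the latent skill after observing one success and one failure
posterior = VariableElimination(model).query(
    ["skill"], evidence={"action_1": 1, "action_2": 0})
print(posterior)
```

Observing a success on one action and a failure on the other shifts the posterior for the latent skill away from the uniform prior, which is how such a network would be updated continuously as log data accumulate.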
Although it is still a time- and cost-intensive process
to build a digital learning or assessment environment,
let alone a fully immersive simulation, the technological
improvements driven by the digital revolution make
development of these environments more feasible. For
example, in the Netherlands, the commercial educational
serious game, Math Garden, was built by the University of
Amsterdam to have children playfully develop their math
skills (Straatemeier, 2014). As digital technology improves
and becomes more accessible, combining commercial
and educational opportunities might be a strategy for
more institutions and companies to start building serious
games for learning and assessment.
Another important benefit of the increased
technological and statistical possibilities is that large
quantities of data can be processed, logged, and analyzed.
Each interaction between student and computer (e.g.,
mouse clicks, keystrokes, virtual location information,
spoken text data, etc.) can be recorded and analyzed,
both ad hoc and over time (Nelson et al., 2012).
However, as DiCerbo excellently discusses in this issue,
'handling' big data is not the ultimate goal. The real
challenge lies in locating and supporting the evidentiary
structures that link serious game performances to
students' KSAs. Answering this challenge can
be regarded as a key task for measurement experts.
The Evidence-Centered Design (ECD) framework
(Mislevy, Steinberg, & Almond, 2003) is an important
tool for building an evidentiary argument in game-based
assessment (DiCerbo, 2017). When strong evidentiary
arguments can be built, game-based assessment may fit
very well into a new paradigm that integrates assessment
and training. The concept of continuous evaluation,
where assessment and training are two sides of the same
coin, goes well with serious games because a typical
student will likely spend more time in a serious game
than on a standardized test. Furthermore, the possibilities
discussed above may provide the opportunity to
continually evaluate students' movements and actions
in the serious game over long periods of time, making
it possible to monitor subtle improvements in students'
knowledge and abilities. The extended sampling, as
compared to a traditional test, and the stronger alignment
between what has been taught and what is tested may
ultimately provide the most reliable and valid inferences
regarding students' learning outcomes. The field
is in need of serious games that do this well and of studies
that document their improved reliability and validity as
assessment tools.
Interestingly, serious games have long been used for
e-learning purposes, as it is broadly recognized that 'fun'
is an important incentive for learning (Clark & Mayer,
2011; Kato, 2010). Generally speaking, learning and
assessment are not yet aligned in this respect, as many
people do not find it very pleasurable to take a test. For many
people taking a (standardized) test is a tense experience, and
so-called test anxiety can have a negative impact on students'
performance on tests (Elliot & McGregor, 1999). Shute
(2011) indicates that 'testing' in a serious game can give
students more of a fun feeling and may challenge them
to perform at their best. When students experience this,
they may "forget" that they are in a testing situation, and
test anxiety may be diminished. This effect has been
labeled stealth assessment by Shute. Besides standardized
testing, this effect can also hold for performance-based
assessment, in which students' practical performance
is observed and evaluated by a rater. Game-based
assessment is less obtrusive because there is no rater
physically present to judge the performance, whose
presence may in itself affect that performance. This may
further increase the validity of game-based assessment
when compared to performance-based assessment. This
potential effect should be investigated in future research.
An important issue is the narrow sampling of learning
objectives in a (standardized) test. The use of game-based
assessment can potentially increase the representativeness
of the assessment in two ways: by increasing the number
of tasks in the assessment, and by creating tasks that
could never be tested in a paper-and-pencil test or
a performance-based assessment. For example, the
serious game presented in the paper by Bauer, Wylie,
Jackson, Mislevy, Hoffman-John, and John (2017), called
Mars Generation One, was developed for teaching and
formative assessment of argumentation. Many different
kinds of argument and argumentation schemes have been
incorporated in the game, and as students progress
through the game they will encounter all types of
argumentation, claims, and rebuttals. How these were
handled and used in the game is recorded and analyzed,
and the results are then used for teaching, both inside
and outside of the game environment. In a traditional
setting, be it a standardized test or a performance-based
assessment, a student could have been tested on only
one or two types of argumentation. However, because
the number of tasks in a serious game assessment can
be increased and broadened significantly, the sampling
of the domain can be broadened as well, thus potentially
improving the representativeness of the measurements
and the validity of the assessment.
2. Psychometric Considerations
One of the biggest advantages of game-based
assessment also poses its greatest challenge: what to do
with, and how to interpret, the enormous quantities of
data that a game-based assessment can produce (Levy,
2013). In contrast to a standardized test, which only
produces product data, a serious game also provides
process data. Product data are the observed values that
students produce by performing in a test (or game) that
give an indication of their performance. In
a standardized test, this is usually the number-correct
score. Process data are the actual log files of collected
data that, when analyzed, can show in great detail
how students have reached their product data. Process
data are mouse clicks, keystrokes, navigational behavior,
time stamps, etc. (Rupp et al., 2012). Performance in
a serious game can produce many pages of log file data
in just a short period of time. The challenge is to find
meaningful relationships between the data presented
in the log files and the constructs to be measured in
real life. As DiCerbo (2017) demonstrates in this issue,
the ECD framework can be an excellent point of departure
for building an evidentiary argument in which game-based
assessment performance data are used as evidence for
understanding students' knowledge, skills, and abilities.
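To illustrate the distinction between product and process data, the sketch below derives both from a toy event log. It is a minimal sketch in plain Python; the event format and field names are hypothetical, not the logging scheme of any system discussed in this issue.

```python
# Hypothetical log events: (timestamp in seconds, event type, payload)
log = [
    (0.0,  "task_start",  "food_web"),
    (3.2,  "click",       "fish"),
    (5.9,  "draw_arrow",  "algae->fish"),
    (9.4,  "draw_arrow",  "fish->shark"),
    (12.1, "task_submit", "correct"),
]

# Product data: the final outcome of the task
product = {"outcome": log[-1][2]}

# Process data: how the student got there
clicks = [e for e in log if e[1] == "click"]
arrows = [e for e in log if e[1] == "draw_arrow"]
process = {
    "time_on_task": log[-1][0] - log[0][0],          # seconds start-to-submit
    "n_clicks": len(clicks),
    "n_intermediate_steps": len(arrows),
    "latency_to_first_action": clicks[0][0] - log[0][0],
}

print(product)   # {'outcome': 'correct'}
print(process)   # {'time_on_task': 12.1, 'n_clicks': 1, ...}
```

A standardized test would retain only something like the product dictionary; the evidentiary question in game-based assessment is which elements of the process dictionary also carry information about the constructs of interest.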
In fact, based on the ECD framework, three components
have to be specified to build a strong evidentiary structure: a
student model, a task model, and an evidence model (Levy,
2013). The student model provides details and specifications
regarding which (combination of) knowledge,
skills, and abilities is being measured. These KSAs are
called Student Model Variables (SMVs) and can
more generally be qualified as latent variables (Mislevy et
al., 2014). These variables cannot be directly observed
and are subject to indirect inference based on a statistical
model (which is part of the evidence model). An example
of such a variable is creativity: we cannot directly measure
creativity in the way we can, for example, measure somebody's
height. As a result, we have to infer a person's level of
creativity from observing their behavior. The challenge
then, of course, is to define which type(s) of behavior(s)
reveal something about a person's ability to be creative, and
then to create situations that would elicit those behaviors
among those who are creative (and fewer of those behaviors
among those who are less creative). Shute, Ventura, Bauer,
and Zapata-Rivera (2009), for example, had students
play a serious game in which tasks and objectives could
be completed in many different ways: some creative and
some not. This indicates that it is important to create tasks
that give students the opportunity to demonstrate their
SMVs.
This challenge can be met with the task model in the
ECD framework. The task model specifies which tasks,
assignments, or objectives students have to complete in the
game. Completing these tasks, of course, has to yield data,
or Observable Variables (OVs), that provide information
about the latent SMVs. In contrast to a standardized test,
the tasks in game-based assessments generally cannot be
operationalized as questions (although questions can be
part of the assessment), but are an integrated part of the
game play. The challenge is to design and develop clear
tasks that can be scored inside the virtual environment
and that provide the necessary information to make valid
inferences about the status of the SMVs.
The most important model in the ECD framework
could well be the evidence model. The evidence model
is where theory and data come together through the two
coherent processes of evidence identification and evidence
accumulation (Levy, 2013). As mentioned above, serious
games provide the opportunity to collect not only product
data, but also process data. The first challenge here, the
evidence identification process, is to identify which
elements in the process data provide meaningful evidence
with regard to the SMVs. The identified elements are the
OVs, which will later serve as input for the psychometric
model. The second process within the evidence model is
evidence accumulation. When the OVs have been identified,
the psychometric model serves to transform them, for each
student, into a unidimensional or multidimensional score
that validly represents the SMVs. In a traditional standardized
test consisting of multiple-choice questions, the OVs are
generally zeros and ones, resulting in relatively simple
psychometric models with one or two item parameters.
However, in a game-based assessment, more variables can
be parameterized (i.e., as predictors) in the psychometric
model. Furthermore, in traditional testing the answers
students give to questions are most often considered to
be independent of each other; that is, it is assumed that
there is no (statistical) relationship between the answers
to questions X and Y. This is generally not the case in a
serious game, because the actions that somebody can
perform at a certain point in time in the game are often
dependent on what has been done before. Mislevy et al.
(2014) refer to this phenomenon as the change state of a
serious game. Finally, the relationship between SMVs and
OVs is quite complex, because multiple variables in the
game can be dependent upon and interact with multiple
(combinations of) students' knowledge, skills, and
abilities in real life. Building an evidentiary
argument for a game-based assessment is a difficult and
laborious process consisting of many iterations and tests.
DiCerbo's article (2017) presents a nice overview of such
a process.
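For reference, the sketch below shows the kind of simple one- or two-item-parameter logistic (IRT) model mentioned above for dichotomous OVs, and hints at how an extra process-data predictor could enter. It is a minimal sketch in Python; the parameter values and the hint-penalty extension are illustrative assumptions, not estimates from any real assessment.

```python
import math

def p_correct_2pl(theta, a=1.0, b=0.0):
    """Two-parameter logistic (2PL) model: probability of a correct
    response (OV = 1) given latent ability theta, item discrimination a,
    and item difficulty b. Fixing a = 1 for all items gives the
    one-parameter (Rasch) special case."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_correct_with_hints(theta, a, b, hints, gamma=0.3):
    """Hypothetical game-based extension: a process-data predictor
    (number of hints used) enters the model with weight gamma."""
    return 1.0 / (1.0 + math.exp(-(a * (theta - b) - gamma * hints)))

print(p_correct_2pl(0.5, a=1.2, b=0.0))             # ability above difficulty
print(p_correct_with_hints(0.5, 1.2, 0.0, hints=2))  # same student, 2 hints
```

The contrast between the two functions mirrors the point made above: in a game-based assessment the model can carry more parameterized predictors than the zeros and ones of a multiple-choice test, at the price of a more complex evidentiary argument.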
3. The Future
We expect that game-based assessment will increasingly
evolve into interactive and immersive virtual environments
in which students can freely wander around to complete
tasks and assignments, with their actions being scored
on the fly and feedback being immediately available.
With digital and technological applications continually
improving, we may also see virtual reality moving more
into the domain of educational assessment. Immersive
virtual reality simulations can also offer solutions for the
measurement of complex practical skills like painting and
welding, or skills from other industrial professions.
Many game-based assessments are still only used for
formative purposes, providing one summary score as an
indicator of a construct. With the psychometric models
improving (e.g., Mislevy et al., 2014), we might also see
game-based assessment being used for a summative or
credentialing purpose in the future. Future research
should therefore focus on investigating the extent
to which serious games can be used in high-stakes
assessment. Maybe formative and summative assessment
in serious games can be more integrated in the future.
For instance, students’ performance can be constantly
monitored and when they reach a certain standard they
automatically do some sort of summative module or they
immediately pass the ‘test’ when enough information has
been collected. It is further critical that these efforts to
develop game-based assessments be evaluated with high
standards of scientific integrity to ensure they are valid
and reliable (Kato, 2012).
4. Conclusions
The articles discussed in this special issue of JATT on
Serious Games in Assessment contribute to the increasing
body of research on game-based assessment. This is an
interesting field of research that is full of promise and
that continues to require rigorous attention from the
perspectives of both research and practice.
5. References
Bauer, M., Wylie, C., Jackson, T., Mislevy, R. J., Hoffman-John, E., & John, E. (2017). Journal of Applied Testing Technology.
Clark, R. C., & Mayer, R. E. (2011). E-learning and the science of instruction. San Francisco: Pfeiffer. https://doi.org/10.1002/9781118255971
De Klerk, S., Eggen, T. J. H. M., & Veldkamp, B. P. (2014). A blending of computer-based assessment and performance-based assessment: Multimedia-Based Performance Assessment (MBPA). The introduction of a new method of assessment in Dutch Vocational Education and Training (VET). Cadmo, 22(1), 39-56. https://doi.org/10.3280/CAD2014-001006
De Klerk, S., Veldkamp, B. P., & Eggen, T. J. H. M. (2015). Psychometric analysis of the performance data of simulation-based assessment: A systematic review and a Bayesian network example. Computers & Education, 85, 23-34. https://doi.org/10.1016/j.compedu.2014.12.020
DiCerbo, K. (2017). Building the evidentiary argument in game-based assessment. Journal of Applied Testing Technology.
Elliot, A. J., & McGregor, H. A. (1999). Test anxiety and the hierarchical model of approach and avoidance achievement motivation. Journal of Personality and Social Psychology, 76, 628-644.
Gobert, J. D., Sao Pedro, M. A., Baker, R. S. J. D., Toto, E., & Montalvo, O. (2012). Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds. Journal of Educational Data Mining, 4, 111-143.
Iseli, M. R., Koenig, A. D., Lee, J. J., & Wainess, R. (2010). Automated assessment of complex task performance in games and simulations (CRESST Research Rep. No. 775). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. Retrieved from http://www.cse.ucla.edu/products/reports/R775.pdf
Kato, P. M. (2012). Evaluating efficacy and validating health games. Games for Health: Research, Development, and Clinical Applications, 1(1), 74-76. https://doi.org/10.1089/g4h.2012.1017
Kato, P. M. (2010). Video games in health care: Closing the gap. Review of General Psychology, 14(2), 113-121. https://doi.org/10.1037/a0019441
Kerr, D., & Chung, G. K. W. K. (2012). Identifying key features of student performance in educational video games and simulations through cluster analysis. Journal of Educational Data Mining, 4(1).
Levy, R. (2013). Psychometric and evidentiary advances, opportunities, and challenges for simulation-based assessment. Educational Assessment, 18(3), 182-207. https://doi.org/10.1080/10627197.2013.814517
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). Focus article: On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62. https://doi.org/10.1207/S15366359MEA0101_02
Mislevy, R. J., Behrens, J. T., DiCerbo, K., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and data mining. Journal of Educational Data Mining, 4, 11-48.
Mislevy, R. J., Oranje, A., Bauer, M., von Davier, A. A., Hao, J., Corrigan, S., Hoffman, E., DiCerbo, K., & John, M. (2014). Psychometric considerations in game-based assessment. New York, NY: Institute of Play.
Neapolitan, R.E. (2003). Learning Bayesian networks. New
York, NY: Prentice-Hall.
Nelson, B., Nugent, B., & Rupp, A. A. (2012). On instructional utility, statistical methodology, and the added value of ECD: Lessons learned from the special issue. Journal of Educational Data Mining, 4(1), 224-230.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, CA: Morgan Kaufmann. https://doi.org/10.1016/B978-0-08-051489-5.50012-6
Quellmalz, E. S., Timms, M. J., Silberglitt, M. D., & Buckley, B. C. (2012). Science assessments for all: Integrating science simulations into balanced state science assessment systems. Journal of Research in Science Teaching, 49(3), 363-393. https://doi.org/10.1002/tea.21005
Rupp, A. A., Levy, R., DiCerbo, K., Sweet, S. J., Crawford, A. V., Calico, T., Benson, M., Fay, D., Kunze, K. L., Mislevy, R. J., & Behrens, J. T. (2012). Putting ECD into practice: The interplay of theory and data in evidence models within a digital learning environment. Journal of Educational Data Mining, 4(1), 49-110.
Shute, V. J., Ventura, M., Bauer, M. I., & Zapata-Rivera, D. (2009). Melding the power of serious games and embedded assessment to monitor and foster learning: Flow and grow. In U. Ritterfeld, M. J. Cody, & P. Vorderer (Eds.), Serious games: Mechanisms and effects (pp. 295-321). Mahwah, NJ: Routledge.
Shute, V. J. (2011). Stealth assessment in computer-based games
to support learning. In S. Tobias and J.D. Fletcher (Eds.),
Computer Games and Instruction (pp. 503-523). Charlotte,
NC: Information Age Publishing.
Straatemeier, M. (2014). Math Garden: A new educational and scientific instrument. Doctoral dissertation, University of Amsterdam, The Netherlands.