
The Rise of the Super Experiment


Abstract

Traditional experimental paradigms have focused on executing experiments in a lab setting and eventually moving successful findings to larger experiments in the field. However, data from field experiments can also be used to inform new lab experiments. Now, with the advent of large student populations using internet-based learning software, online experiments can serve as a third setting for experimental data collection. In this paper, we introduce the Super Experiment Framework (SEF), which describes how internet-scale experiments can inform and be informed by classroom and lab experiments. We apply the framework to a research project implementing learning games for mathematics that is collecting hundreds of thousands of data trials weekly. We show that the framework allows findings from the lab-scale, classroom-scale and internet-scale experiments to inform each other in a rapid complementary feedback loop.
John C. Stamper
Carnegie Mellon University
Pittsburgh, PA
Steve Ritter
Carnegie Learning, Inc.
Pittsburgh, PA
Derek Lomas
Carnegie Mellon University
Pittsburgh, PA
Kenneth R. Koedinger
Carnegie Mellon University
Pittsburgh, PA
Dixie Ching
New York University
New York, NY
Jonathan Steinhart
Vienna, Austria
eScience, experiment design, internet scale
Web-based software is creating an explosive growth in the use of
randomized controlled experiments in education, due to the
relative ease with which users can be randomly assigned to
different experimental conditions. Scientists are beginning to
recognize the coming data surge and developing new ways of
analyzing data at "internet scale." The vastly increased scale of
subject populations online can produce a categorically different
mode of experimentation in education. For this reason, we
propose a new experimental framework that takes advantage of
rapid internet-scale experimentation, while retaining the control of
lab-scale and classroom-scale experiments.
Randomized controlled trials are regularly used to drive design
decisions on the internet. In its simplest form, A/B testing is a
form of experimentation where one of two advertisements is
randomly delivered to each incoming site visitor. This allows
advertisers to determine which advertisement results in improved
outcomes (such as a greater click-through rate) [3]. Multiple tools
exist to support website optimization, including the free Google
Site Optimizer that supports both A/B tests and multi-variable
testing. Recently, free-to-play online game companies, such as
Zynga, have made use of large-scale optimization experiments
with their large number of online players. By randomly assigning
players to hundreds of different game design configurations, they
can optimize the game design to maximize the conversion of
players to paying customers [7].
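The A/B procedure described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the function names, the hashing scheme used for assignment, and the `salt` parameter are assumptions introduced here, and the pooled two-proportion z statistic is one common way to compare click-through rates.

```python
import hashlib
import math


def assign_condition(visitor_id: str, conditions: list, salt: str = "ad-test-1") -> str:
    """Map a visitor to one condition, stably and with roughly equal odds.

    Hashing the (hypothetical) visitor id means a returning visitor
    always sees the same advertisement.
    """
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """Pooled z statistic for the difference between two click-through rates."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / standard_error


if __name__ == "__main__":
    print(assign_condition("visitor-42", ["A", "B"]))
    # Made-up counts: 120 clicks of 1000 for A versus 100 of 1000 for B.
    print(round(two_proportion_z(120, 1000, 100, 1000), 3))
```

A z value beyond roughly ±1.96 would conventionally be read as a difference at the 5% level; with many simultaneous conditions, some correction for multiple comparisons would also be needed.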
2. Internet Scale Research in Education
Internet-scale research introduces new potential methods in
educational research. For instance, optimization experiments, such as
Response Surface Methods, are a common applied research
method for improving industrial process outcomes. These
experimental designs showed early promise for improving
educational outcomes [5], but because the designs would have
required many hundreds of students, they were expensive and
impractical. Internet-scale research can now support these
optimization experiments, along with these other experimental advantages:
Increased number of conditions. With tens of thousands of “user-
subjects,” internet-scale research studies present the opportunity
for researchers to run dozens—even hundreds—of different
experimental conditions simultaneously. This contrasts sharply with
lab or field-scale studies, where available resources and subject
pools typically constrain experimental designs to fewer than 8
experimental conditions. Furthermore, with fewer conditions,
experiments can be conducted within days, rather than months.
Ability to measure “true” task engagement. Internet-scale
research is also uniquely suited for measuring task engagement.
Because the researcher typically lacks control over participants
(they can quit far more easily than in lab or classroom
experiments), the internet is an ideal setting for investigating user
motivation. If players assigned to condition A play significantly
longer than players in condition B (i.e., were engaged in the task
for longer), then condition A can be said to be more engaging than
condition B. The ability to measure and compare engagement
makes it possible to measure how different design elements and
configurations affect player engagement.
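One way to make the comparison between condition A and condition B concrete is a permutation test on play time. The sketch below is an assumption about how such a comparison could be run, not the analysis used in this paper; the function name and the sample durations are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean play time.

    Shuffles the condition labels repeatedly and counts how often the
    shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical seconds of play per player in each condition.
    condition_a = [300, 320, 310, 330, 340, 305, 315, 325]
    condition_b = [100, 120, 110, 130, 140, 105, 115, 125]
    print(permutation_p_value(condition_a, condition_b))
```

A permutation test is used here because play-time distributions online tend to be heavily skewed, which makes distribution-free comparisons a reasonable default.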
Increase in external validity. A third advantage of internet-scale
research is the high external validity—experiments are conducted
with actual “real-world” users. While the lack of control over
subjects can result in noisy data, this noise is useful for preventing
the over-fitting of predictive models that are constructed for use
“in the wild.”
Greater access to all users. A fourth advantage of internet-scale
research is the fact that informed consent is not required if the
users are anonymous. Even with educational exemptions to
informed consent, parental opt-out forms can still pose a barrier to
many field-based educational studies. While researchers could
potentially make use of informed consent (and thus obtain non-
anonymous data), anonymous data collection is likely to remain a
characteristic of most large internet-scale research.
Of course, the lack of information about participants is also a key
drawback of internet-scale research. Broadly speaking, internet
scale studies cannot collect rich information about participants.
Therefore, these studies are unlikely to be suitable when research
questions require demographic data, detailed pre/post tests,
participant observation, talk-aloud protocols, or any kind of
psychophysiological measure. Finally, the lack of participant
control means that internet scale studies may not be appropriate if
repeated participation over time is required.
Given these drawbacks, it is clear that traditional lab based
experiments and structured field trials still provide valuable data
that internet scale experiments cannot. However, there is much to
be gained from internet scale studies. The Super Experiment
Framework (SEF) seeks to illustrate how different scales of
experimentation can productively inform one another. The SEF
framework, seen in Figure 1, is split into three general
experimental parts that are roughly delineated by scale. Lab-Scale
experiments are smaller highly controlled studies that take place
in a lab or single classroom, generally not exceeding 50
participants. School-Scale experiments are formal experiments
that take place in multiple classrooms or schools consisting of
hundreds to thousands of participants. Internet-Scale experiments
are informally delivered online to thousands to millions of participants.
Figure 1. The Super Experiment Framework showing how
each of the component scales informs the others.
In the SEF framework, each component provides an experimental
level that can be used to answer specific questions that might be
difficult or impossible to answer using one of the other
components. Further, the various components can be used to
expand or validate findings of the other components. A feedback
loop can also be used with the framework where internet scale
experiments can identify areas of focus for lab scale experiments,
which can then be validated in school scale experiments. An
overview of each of the SEF components can be seen in Table 1.
School scale and lab scale experiments typically recruit subjects
and then randomly assign them to different experimental
conditions as part of a single experiment. However, internet-scale
research creates situations where multiple experiments are
randomly drawing from the same pool of subjects. Just as a single
experiment contains multiple experimental conditions, the SEF
contains multiple experiments. Because the different experiments
are derived from the same pool of random assignment,
experimental conditions that are not part of the same experiment
may still be compared to one another, if desirable. While there
may be few immediate benefits of this comparison, the super
experiment is a unique characteristic of internet-scale research.
Therefore, the use of the term “super experiment” in the super
experiment framework simply refers to the broad network of
information flow between different scales of experimentation,
from the lab scale, to the school scale and to the internet scale.
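The idea of several experiments drawing on one pool of random assignment can be sketched as a two-stage draw. The experiment names, condition labels, and the choice to weight experiments by their number of conditions are assumptions made for illustration; the paper does not specify the allocation rule.

```python
import random

# Hypothetical experiments sharing one subject pool.
EXPERIMENTS = {
    "guides_engagement": ["none", "midpoint", "tenths"],
    "adaptive_sequencing": ["random", "adaptive"],
}


def assign(rng, experiments):
    """Pick an experiment, then a condition within it.

    Weighting experiments by their number of conditions gives every
    condition the same expected share of incoming players, so conditions
    from different experiments stay comparable.
    """
    names = list(experiments)
    weights = [len(experiments[name]) for name in names]
    experiment = rng.choices(names, weights=weights, k=1)[0]
    condition = rng.choice(experiments[experiment])
    return experiment, condition


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(assign(rng, EXPERIMENTS))
```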
Type | Benefits | Drawbacks
Lab Scale | Rich user data, Formal, Controlled CTA, Talk alouds, Psycho-physiological studies | Effect Size, Replication, Experimenter effects, Threats to external validity
School Scale | Formal, Controlled, Validation, Good randomization, Surveys, Enforced participation | Expensive, Difficult to replicate, Threats to external validity
Internet Scale | Informal, Large data collection, Rapid, High external validity, Decreased Type II error rate, High power | Anonymity, High attrition, Data overload, Threats to internal validity
Table 1. Components of the Super Experiment Framework
The need for the SEF arose from our work in creating online games
for learning. The number of potential experiments was large, and
the opportunity to field the games at each of the scales identified
in the SEF created the need for a feedback loop in which many
experiments are executed at internet scale in order to narrow down
the candidates to test at the more controlled school scale.
“Battleship Numberline” (BSNL), an online educational game,
benefits from the super experiment framework.
Designed to improve number sense among elementary and middle
school students, BSNL provides practice estimating numbers on a
number line within four content domains: whole numbers,
fractions, decimals and measurement [4]. The game narrative
involves defending Numbaland Island from invading robot pirates
by firing projectiles at their ships and submarines. BSNL involves
two basic modes: naming numbers and placing numbers. In the
naming condition, players type a number that corresponds to the
location of an enemy ship that is positioned on a number line
between two marked endpoints. In the placement mode, the player
is given the numeric location of a hidden submarine (e.g.,
“Submarine spotted at 1/3”) and needs to click on the location that
they believe corresponds to the number. After the player has typed
a number or clicked on the number line, a projectile drops
vertically from the top of the screen to the designated location on
the number line. Animation and text-based feedback
communicates the player’s accuracy after every round.
A primary goal of our research has been to understand how
different game design factors affect player learning and
engagement. In order to systematically investigate these factors,
we implement them as flexible XML-based
parameters that are set at game runtime. We are
then able to create online experiments that randomly assign new
players to a set of different game sequences.
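A minimal sketch of this kind of runtime configuration is shown below. The XML schema (the `experiment`, `condition`, and `level` elements and their attributes) is entirely hypothetical and is not BSNL's actual format; it only illustrates how design factors expressed as XML can become experimental conditions.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical configuration: each condition is a sequence of levels.
LEVEL_XML = """
<experiment name="guides_demo">
  <condition id="no_guides">
    <level domain="decimals" tickmarks="none" items="20"/>
  </condition>
  <condition id="tenths">
    <level domain="decimals" tickmarks="tenths" items="20"/>
  </condition>
</experiment>
"""


def load_conditions(xml_text):
    """Parse the XML into {condition id: [level parameter dicts]}."""
    root = ET.fromstring(xml_text.strip())
    return {
        condition.get("id"): [dict(level.attrib) for level in condition.findall("level")]
        for condition in root.findall("condition")
    }


def assign_new_player(conditions, rng):
    """Randomly assign a new player to one configured game sequence."""
    condition_id = rng.choice(sorted(conditions))
    return condition_id, conditions[condition_id]


if __name__ == "__main__":
    conditions = load_conditions(LEVEL_XML)
    print(assign_new_player(conditions, random.Random(7)))
```

Keeping the design factors in data rather than in code means a new condition can be added by editing the configuration, without rebuilding the game.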
During gameplay, BSNL generates an online data log of the task
context (the above xml parameters) along with data describing the
player’s performance on each opportunity. On each item, we log
the player’s reaction time, their accuracy, and a binary field
indicating whether the player was successful or not. Logs are then
imported into the PSLC Datashop [2], which allows for the
secondary analysis of player performance and learning. The hit
rate measure is essential for enabling Datashop to plot learning
curves of error rate over time. By labeling different items in the
game with different knowledge components (e.g., reducible
fractions, unit fractions, etc.), we can plot learning curves for each
knowledge component. Learning curves can also be described
based on fluency [1], where we plot the reduction of reaction time
over opportunities played. In addition to these measures of
learning and performance, we investigate player engagement
through two measures: the total number of items played and the
total amount of time spent playing. These two metrics correspond
with our construct of intrinsic motivation or player engagement.
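The measures described above can be derived from a flat log with very little code. The sketch below assumes a hypothetical row format (`player`, `kc`, `hit`, `reaction_time`) and approximates time spent playing as the sum of reaction times; DataShop's own import format and computations are not reproduced here.

```python
from collections import defaultdict


def learning_curve(log_rows):
    """Return {kc: [error rate at opportunity 1, 2, ...]}.

    Rows must be in chronological order. An opportunity is a player's
    n-th attempt at an item labeled with a given knowledge component.
    """
    seen = defaultdict(int)               # (player, kc) -> attempts so far
    totals = defaultdict(lambda: [0, 0])  # (kc, opportunity) -> [errors, attempts]
    for row in log_rows:
        key = (row["player"], row["kc"])
        seen[key] += 1
        cell = totals[(row["kc"], seen[key])]
        cell[0] += 0 if row["hit"] else 1
        cell[1] += 1
    curves = defaultdict(list)
    for kc, opportunity in sorted(totals):
        errors, attempts = totals[(kc, opportunity)]
        curves[kc].append(errors / attempts)
    return dict(curves)


def engagement(log_rows):
    """Return per-player item counts and summed reaction time in seconds."""
    items = defaultdict(int)
    seconds = defaultdict(float)
    for row in log_rows:
        items[row["player"]] += 1
        seconds[row["player"]] += row["reaction_time"]
    return dict(items), dict(seconds)


if __name__ == "__main__":
    rows = [
        {"player": "p1", "kc": "unit_fraction", "hit": False, "reaction_time": 4.0},
        {"player": "p2", "kc": "unit_fraction", "hit": False, "reaction_time": 5.0},
        {"player": "p1", "kc": "unit_fraction", "hit": True, "reaction_time": 3.0},
        {"player": "p2", "kc": "unit_fraction", "hit": False, "reaction_time": 6.0},
    ]
    print(learning_curve(rows))
    print(engagement(rows))
```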
The number of potential parameter settings in BSNL makes it a
great tool to answer many research questions, but at the same time
the number of possible settings makes it difficult to decide which
settings to test in traditional lab or school settings. For this reason, it is
a perfect candidate for use in the SEF. Next, we show how the
results of different types of experiments at one scale inform new
experiments on a different scale.
Lab Scale informing School Scale. The use of a lab experiment to
inform a field trial at a school is one of the most common types of
experimental design. It is still an important part of the SEF. We
performed a lab scale experiment, which is now being validated at
the school scale. This experiment was conducted at a small
Catholic liberal arts university. Although the college is co-
educational, its focus is on women’s education, and 89% of the
participants were women. Participants were 18 students in an
eight-week first-year seminar course, which met once per week.
Students chose for this seminar period to focus on mathematics
games. Over 5 weeks, we administered a short (typically one
minute) paper-and-pencil pretest, asked students to play a specific
fluency game for approximately one-half hour and then gave a
posttest which was identical in content to the pretest. In all but the
first week, the pretest was preceded by a delayed post-test, which
was a repeat of the posttest from the previous week’s materials.
In four of the five experiments significant improvement was
shown on a delayed post-test, and three of the five showed
immediate results. Effect sizes were also quite large, ranging from
0.4 to 2.4, indicating that these results are not only significant but
substantial. Prior to the first experiment, students were given a
survey about their confidence in mathematics (containing
questions like “I am sure that I can learn math.”) and about test
anxiety (containing questions like “I am so nervous during a test
that I cannot remember facts that I have learned”). The two scales
were mixed in a 16-item form. Students were asked to rate each
statement from 1 (“strongly disagree”) to 5 (“strongly agree”).
Student confidence increased significantly, t(14)=-3.2, p<.01,
d=0.4, but there was no change in test anxiety, t(14)=-3.1, n.s.
Due to the success of this lab scale experiment, a similar school
scale experiment is now being conducted in multiple college
classrooms over an entire semester. Unlike the lab scale, the
researchers are not present in these classrooms, but we expect to
see similar results.
School Scale informing Internet Scale. BSNL was designed based
on an existing body of literature that investigated number line
estimation in the laboratory [6]. The game was playtested with 8
elementary school students, to refine usability issues in the design.
Following this, a school scale study was conducted with 119
students in grades 4-6. Students showed significant improvement
in hit rate from the first to second opportunity (see Figure 2), and
students demonstrated significant improvements in the estimation
of fractions on a number line after 20 minutes of gameplay.
Moreover, 82% of players (74% females, 92% males) reported
that they wanted to play the game again [4]. The data from these
classroom studies was imported into the PSLC Datashop to test
various knowledge component (KC) models. We identified a KC
model based on the various regions of the number line. This
knowledge component model was then used to produce a
Bayesian Knowledge Tracing adaptive sequencing algorithm.
This algorithm was then tested online in comparison with a
randomly sequenced level. Preliminary results suggest that the
BKT adaptive sequence did not result in significantly greater
player engagement than the random sequence.
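For readers unfamiliar with Bayesian Knowledge Tracing, the standard update can be sketched as follows. The parameter values, the mastery threshold, and the rule of selecting the least-mastered knowledge component are illustrative assumptions; the sequencing algorithm actually deployed in BSNL is not specified in this paper.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # Illustrative values only, not fitted to any data.
    p_init: float = 0.2   # P(L0): knowledge before the first opportunity
    p_learn: float = 0.15  # P(T): chance of learning at each opportunity
    p_guess: float = 0.2   # P(G): correct response without knowledge
    p_slip: float = 0.1    # P(S): incorrect response despite knowledge


def bkt_update(p_known, correct, params):
    """Standard BKT step: Bayes update on the response, then learning."""
    if correct:
        numerator = p_known * (1 - params.p_slip)
        denominator = numerator + (1 - p_known) * params.p_guess
    else:
        numerator = p_known * params.p_slip
        denominator = numerator + (1 - p_known) * (1 - params.p_guess)
    posterior = numerator / denominator
    return posterior + (1 - posterior) * params.p_learn


def next_kc(mastery, threshold=0.95):
    """Pick the least-mastered KC below the threshold, or None if all are mastered."""
    unmastered = {kc: p for kc, p in mastery.items() if p < threshold}
    if not unmastered:
        return None
    return min(unmastered, key=unmastered.get)


if __name__ == "__main__":
    params = BKTParams()
    mastery = {"left_region": params.p_init, "middle_region": 0.5}
    mastery["left_region"] = bkt_update(mastery["left_region"], True, params)
    print(mastery, next_kc(mastery))
```

Here the knowledge components are named after regions of the number line to echo the KC model described above; the names themselves are hypothetical.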
Figure 2. Average improvement from the first
opportunity to the second opportunity, by item presented. The
clear patterns of difficulty are used to generate knowledge
component models in Datashop.
Internet Scale informing School Scale or Lab Scale. Internet-scale
experiments can be useful for documenting the difficulty of
different task configurations. This is useful in the field of educational data mining (EDM),
as it allows for the generation of knowledge component models.
Different tasks are said to require different knowledge
components if and only if the tasks result in different performance
rates or learning curves. Therefore, by assessing the difficulty of
instances over a broad task design space, we can understand how
the task design space maps to various KC models.
For example, Rittle-Johnson, Siegler and Alibali found that
tickmarks supported the estimation of decimals on a number line
[6]. In order to replicate this work and extend it, we randomly
assigned online players to 6 different conditions in both the
decimal and whole number domain. Players either encountered
tickmarks dividing the number line into tenths, fourths, thirds,
halves (midpoint), or no tickmarks at all. Finally, an additional
two conditions looked at the interaction of an adaptive sequencing
algorithm with tickmarks at the midpoint. An overview of the
experiments and conditions can be seen in Table 2. Over 80,000
internet users participated in the experiment.
An experiment with this many conditions would be difficult to
replicate in a lab or classroom. This broad investigation of the
effects of guides enabled us to observe two unusual outcomes.
First, there was an apparent interaction effect between our
adaptive sequencing condition (termed “ITS”) and the midpoint
guides. Second, the tenths guides apparently increased
player engagement in the decimal condition, but decreased
engagement in the whole number condition. These insights have
led us to execute similar lab scale experiments to replicate and
better understand these specific results.
Experiment Name Conditions Players
Adaptive Sequencing 15 19,856
Difficulty Sequencing 6 6,302
Difficulty Comparison 6 6,234
Expanded fraction set 4 5,596
Guides Engagement 10 11,386
Guides Learning 20 22,441
Measurement Study 3 10,014
Total 64 81,829
Table 2. List of experiments running concurrently with a total
of 64 conditions.
Technology is forever changing the way we conduct experiments.
The traditional paradigm is no longer the best way to do things.
Data is coming in faster, larger, and more fine grained. Instead of
focusing eScience efforts solely on analysis, we have created a
framework to exploit internet scale experiments, while still
creating valid findings in real classrooms.
The main contribution of this work is the development of the
Super Experiment Framework which incorporates a feedback loop
allowing for experiments of different scales to inform each other.
This has become possible, and even necessary, with the use of the
internet to collect a large amount of experimental data. Internet
scale allows for optimization experiments that would be too
expensive to do at field level. This is truly applied educational
research that, as we have shown, provides insights that can inform
more controlled lab or school scale experiments. We also
explained our initial implementation of the SEF with a large
project with broad scope and many interesting research questions.
Traditional "one-way street" experiments that move from lab to school
are slow to produce findings and are outdated. Our work shows how
utilizing all three scales of experiments leads to rapid findings that
can be turned into implementable insights efficiently.
Making the framework possible is the accessibility of internet
scale experiments. The key barrier to internet scale educational
research is attracting large numbers of users. Research projects
rarely invest in high-quality software design and usability, which
is usually necessary to achieve widespread adoption. However,
once this quality is developed, large numbers of users can be
reached through collaborations with one of many internet portals
that seek to aggregate educational content (e.g.,
Another challenge is instrumenting software for generating data
logs that measure player performance, learning and engagement.
Log files should capture not only correctness information, but also
the amount of time that players spend on an activity and the
number of opportunities attempted, so that these measures can be computed.
A third challenge is the configuration of the software to allow for
experimental designs. This involves the abstraction of design
variables in the software’s design space, such that different
instances of the software can be created quickly. For instance, we
use xml to define game levels at run-time. These configurations
can then serve as different experimental conditions that can be
randomly deployed to online users.
Finally, one unusual new challenge in internet scale research is
the efficiency of subject-pool utilization. While lab or school scale
researchers expend significant effort to recruit a sufficient number
of subjects in order to achieve statistical significance, internet
scale researchers increasingly face the challenge of making use of
tens of thousands of subjects in an efficient manner. Certain types
of experimentation may result in inconsistent user experiences
that reduce overall participation.
Some challenges will be particular to individual experiments. For
instance, in our online experiments we observe strong seasonal
effects of weekends and school holidays, where the number of
players is greatly reduced. This suggests that certain experimental
comparisons should be sensitive to the time period of the study,
not merely the number of subjects.
Many of these challenges can be mitigated by validating the
results of internet scale experiments with controlled classroom
experiments. As shown in the experiment section, we are
continuing to run a number of experiments at each scale based on
findings from the other scales. This feedback loop will continue in
the future as we strive to optimize the games to maximize
learning. We believe this framework will rapidly lead to
significant discoveries that are replicable at each of the scales.
We would like to thank the Pittsburgh Science of Learning
Center, the DataShop staff, the Next Generation Learning
Challenge, Carlow University, and Pellisippi State University for
supporting this research.
[1] Baker, R., Habgood, M., Ainsworth, S., & Corbett, A.
Modeling the acquisition of fluent skill in educational action
games. User Modeling, 4511, (2007), 17-26.
[2] Koedinger, K.R., Baker, R.S.J.d., Cunningham, K.,
Skogsholm, A., Leber, B., Stamper, J. (2011) A Data
Repository for the EDM community: The PSLC DataShop.
Handbook of Educational Data Mining. CRC Press
[3] Kohavi, R., Longbotham, R., Sommerfield, D., and Henne,
R. M. Controlled experiments on the web: survey and
practical guide. Data Mining and Knowledge Discovery,
18(1), (2008), 140-181.
[4] Lomas D., Ching D., Stampfer, E., Sandoval, M., Koedinger,
K. Battleship Numberline: A Digital Game for Improving
Estimation Accuracy on Fraction Number Lines. Conference
of the American Education Research Association (AERA)
[5] Meyer, Donald L. Response Surface Methodology in
Education and Psychology, Journal of Experimental
Education, 31, 4, (1963), 329-336.
[6] Rittle-Johnson, B., Siegler, R. S., and Alibali, M. W.
Developing conceptual understanding and procedural skill in
mathematics: An iterative process, Journal of Educational
Psychology, 93, (2001), 346-362.
[7] Sheffield, B. GDC Canada: Bill Mooney Outlines Zynga’s
Methodology For Success, Gamasutra, May 6, 2010.
Retrieved 2/10/12:
... A large user base has also enabled researchers to conduct automated experiments on a much larger scale in authentic learning settings. The increase in learners using educational platforms has created these two major research opportunities, enabling a plethora of scientific studies to investigate student learning and behaviors in specific educational contexts [24]. This work can be categorized into two broad types of studies: a) A/B studies that conduct experiments on online platforms, and b) secondary data analysis (SDA) on large-scale datasets. ...
... Some platforms like ASSISTments' E-TRIALS [18], and Terracotta [14] have created tools to make the process of setting up and running A/B tests easier for external educational researchers [18]. Beyond this, publicly available large educational datasets with rich fine-grained data have opened avenues for secondary post-hoc analyses by researchers to find meaningful insights on learner processes and performance [24]. ...
... These papers found that A/B papers were cited more often to provide background and context for a study, while SDA papers were cited to use past specific core ideas, theories, and findings in the field. The research questions of this study broadly fall within the area of Scientometrics, the study of the properties of scientific publications using statistical and (more recently) data science methods [16], which has been used in EDM/LAK/AIED to assess the progress and development of a research community [1,6,12,22,27], evaluate contributions to the field [2,7,23,24,27], and to identify the common topics that are published at conferences and conferences' trajectories of evolution over time [21,23,27]. ...
Conference Paper
Full-text available
Online learning platforms have facilitated A/B and secondary data analysis (SDA) studies, which contribute to science differently. This paper compares these types of research within the context of 123 studies conducted in ASSISTments, analyzing how these two types of research differ in research topics, the institution location and affiliation of researchers, citations, and whether these studies serve as a first entry to the field for new researchers or a first opportunity to use new methods. We find all A/B studies are from the USA, while the majority of SDA studies come from China, particularly after 2020. Over half of SDA studies involve Knowledge Tracing (KT), especially in China. In contrast, USA SDA studies involve a broader range of topics. Finally, first-time researchers are more likely to publish SDA than A/B studies, and are more likely to publish at EDM than other conferences.
... Research-based platforms such as intelligent tutoring systems tend to lead to substantial learning benefits, an average of 0.76 standard deviations better than traditional curricula [33]. Even beyond these benefits, AIED learning platforms provide opportunities for enhancing learning through research [32] and can support it by iterative refinement through A/B tests and secondary data analysis. A large number of automated experiments have been conducted on these online learning platforms. ...
... For example, a paper may have been cited because of its author's political power, but that citation may then be justified within the paper in terms of some scientific aspect of the paper, such as category P7 (citations to a paper as an example of some more general category, without further discussion). As such, determining if a citation is author-based probably depends on other forms of data collection such as anonymous surveys [32]. ...
Conference Paper
Full-text available
Recent years have seen a surge in research conducted on intelligent online learning platforms, with a particular expansion of research conducting A/B testing to decide which design to use, and research using secondary platform data in analyses. This scientometric study aims to investigate how scholarship builds on these two different types of research. We collected papers for both categories-A/B testing, and educational data mining (EDM) on log data-in the context of the same learning platform. We then collected a randomized stratified sample of papers citing those A/B and EDM papers, and coded the reason for each citation. On comparing the frequency of citation categories between the two types of papers, we found that A/B test papers were cited more often to provide background and context for a study, whereas the EDM papers were cited to use past specific core ideas, theories, and findings in the field. This paper establishes a method to compare the contribution of different types of research on AIED systems such as interactive learning platforms.
... Learning engineering is an emerging field that uses evidence to inform educational design. For example, researchers developed and disseminated different versions of a game's script [12] and various conditions of difficulty and support [13] to large audiences to determine which versions produced desirable outcomes and inform design theory. ...
... For these reasons alone, online learning has made a positive impact. However, there is another major benefit of online learning: facilitation of basic research on learning (Stamper et al., 2012). Broadly, there have been two primary uses of online learning platforms to support research-making it easier to conduct experimental (or quasi-experimental) research on learning design, and the availability of data that makes secondary analyses possible. ...
Full-text available
Background In recent years, research on online learning platforms has exploded in quantity. More and more researchers are using these platforms to conduct A/B tests on the impact of different designs, and multiple scientific communities have emerged around studying the big data becoming available from these platforms. However, it is not yet fully understood how each type of research influences future scientific discourse within the broader field. To address this gap, this paper presents the first scientometric study on how researchers build on the contributions of these two types of online learning platform research (particularly in STEM education). We selected a pair of papers (one using A/B testing, the other conducting learning analytics (LA), on platform data of an online STEM education platform), published in the same year, by the same research group, at the same conference. We then analyzed each of the papers that cited these two papers, coding from the paper text (with inter-rater reliability checks) the reason for each citation made. Results After statistically comparing the frequency of each category of citation between papers, we found that the A/B test paper was self-cited more and that citing papers built on its work directly more frequently, whereas the LA paper was more often cited without discussion. Conclusions Hence, the A/B test paper appeared to have had a larger impact on future work than the learning analytics (LA) paper, even though the LA paper had a higher count of total citations with a lower degree of self-citation. This paper also established a novel method for understanding how different types of research make different contributions in learning analytics, and the broader online learning research space of STEM education.
... That is, researchers would need to maintain the ability to randomly assign versions of PL that systematically vary in their features to learners and observe their effects. This kind of design is possible in some circumstances when PL involves a single platform that is in sufficient demand that researchers can toggle and study individual features without disrupting users' experience (e.g., ASSISTments Testbed; Ostrow & Heffernan, 2016), or when a broad, systematic study is conducted with thousands of users in a form of super-experiment (Stamper et al., 2012). However, these are uncommon opportunities, and the more common district and school-level designs present a pernicious challenge to the conceptualization and study of PL. ...
Teachers, schools, districts, states, and technology developers endeavor to personalize learning experiences for students, but definitions of personalized learning (PL) vary and designs often span multiple components. Variability in definition and implementation complicates the study of PL and the ways that designs can leverage student characteristics to reliably achieve targeted learning outcomes. We document the diversity of definitions of PL that guide implementation in educational settings and review relevant educational theories that could inform design and implementation. We then report on a systematic review of empirical studies of personalized learning using PRISMA guidelines. We identified 376 unique studies that investigated one or more PL design features and appraised this corpus to determine (1) who studies personalized learning; (2) with whom, and in what contexts; and (3) with focus on what learner characteristics, instructional design approaches, and learning outcomes. Results suggest that PL research is led by researchers in education, computer science, engineering, and other disciplines, and that the focus of their PL designs differs by the learner characteristics and targeted outcomes they prioritize. We further observed that research tends to proceed without a priori theoretical conceptualization, but also that designs often implicitly align to assumptions posed by extant theories of learning. We propose that a theoretically guided approach to the design and study of PL can organize efforts to evaluate the practice, and that forming an explicit theory of change can improve the likelihood that efforts to personalize learning achieve their aims. We propose a theory-guided method for the design of PL and recommend research methods that can parse the effects obtained by individual design features within the “many-to-many-to-many” designs that characterize PL in practice.
Data science techniques, nowadays widespread across all fields, can also be applied to the wealth of information derived from student interactions with serious games. Use of data science techniques can greatly improve the evaluation of games, and allow both teachers and institutions to make evidence-based decisions. This can increase both teacher and institutional confidence regarding the use of serious games in formal education, greatly raising their attractiveness. This paper presents a systematic literature review on how authors have applied data science techniques on game analytics data and learning analytics data from serious games to determine: (1) the purposes for which data science has been applied to game learning analytics data, (2) which algorithms or analysis techniques are commonly used, (3) which stakeholders have been chosen to benefit from this information and (4) which results and conclusions have been drawn from these applications. Based on the categories established after the mapping and the findings of the review, we discuss the limitations of the studies analyzed and propose recommendations for future research in this field.
The authors propose that conceptual and procedural knowledge develop in an iterative fashion and that improved problem representation is 1 mechanism underlying the relations between them. Two experiments were conducted with 5th- and 6th-grade students learning about decimal fractions. In Experiment 1, children's initial conceptual knowledge predicted gains in procedural knowledge, and gains in procedural knowledge predicted improvements in conceptual knowledge. Correct problem representations mediated the relation between initial conceptual knowledge and improved procedural knowledge. In Experiment 2, amount of support for correct problem representation was experimentally manipulated, and the manipulations led to gains in procedural knowledge. Thus, conceptual and procedural knowledge develop iteratively, and improved problem representation is 1 mechanism in this process.
Conference Paper
There has been increasing interest in using games for education, but little investigation of how to model student learning within games (cf. 6). We investigate how existing techniques for modeling the acquisition of fluent skill can be adapted to the context of an educational action game, Zombie Division. We discuss why this adaptation is necessarily different for educational action games than for other types of games, such as turn-based games. We demonstrate that gain in accuracy over time is straightforward to model using exponential learning curves, but that models of gain in speed over time must also take gameplay learning into account.
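As a rough illustration of what an exponential learning curve looks like in practice, the sketch below fits error rate as a function of practice opportunity using a log-linear least-squares fit. The function names, the zero-asymptote form a * exp(-b * t), and the sample data are assumptions made for illustration; this is not the model or code used in the cited work.

```python
import math


def exponential_curve(t, a, b):
    """Predicted error rate after t practice opportunities (hypothetical form)."""
    return a * math.exp(-b * t)


def fit_exponential(error_rates):
    """Fit ln(error) = ln(a) - b * t by ordinary least squares.

    error_rates[t] is the observed error rate at opportunity t = 0, 1, 2, ...
    Requires at least two observations, all strictly positive.
    Returns the pair (a, b).
    """
    ts = list(range(len(error_rates)))
    ys = [math.log(e) for e in error_rates]
    n = len(ys)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic, noise-free observations generated from a = 0.5, b = 0.3.
    observed = [exponential_curve(t, 0.5, 0.3) for t in range(10)]
    a, b = fit_exponential(observed)
    print(f"a = {a:.3f}, b = {b:.3f}")
```

A fit of this kind captures only the accuracy trend; as the abstract notes, modeling gains in speed would also need to account for gameplay learning, which this sketch does not attempt.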
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
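One common way to randomize users in an online controlled experiment is to hash a user identifier together with an experiment identifier, so that each user always sees the same variant without any stored assignment table. The sketch below shows that idea under stated assumptions: the function name, the identifiers, and the choice of SHA-256 are illustrative and are not taken from the guide summarized above.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for one experiment.

    Hashing the experiment name together with the user id keeps assignments
    in different experiments independent of one another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer bucket.
    bucket = int.from_bytes(digest[:8], "big")
    return variants[bucket % len(variants)]


if __name__ == "__main__":
    # Hypothetical user ids and experiment name, used only to check balance.
    counts = {"control": 0, "treatment": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}", "hypothetical-exp-1")] += 1
    print(counts)
```

A cryptographic hash is used here because weaker hash functions can produce correlated assignments across experiments, which is one of the practical pitfalls the abstract alludes to when it says randomization and hashing are not as simple as often assumed.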
Conference Paper
Background / Context: In 2008, the National Mathematics Advisory Panel stated: "the most important foundational skill not presently developed appears to be proficiency with fractions." In response to numerous studies describing the challenges faced by American students in fractions learning, in 2010 the Institute for Education Sciences (IES) released a practice guide for "Developing Effective Fractions Instruction for Kindergarten through 8th Grade." This practice guide strongly advocates the use of number lines for improving students' understanding of fractions (Siegler et al., 2010). Teachers in America tend to use part-whole representations of fractions (e.g., pizza slices) as opposed to number lines, which are a more common instructional tool in Asian countries (Ma, 1999; Moseley, Okamoto, & Ishida, 2007; Watanabe, 2006). Number lines are also used as assessment tools for measuring number sense. Notably, recent work by Siegler, Thompson, and Schneider (in press) shows that the accuracy of number line estimation with fractions correlates with standardized test scores in 6th-8th grade. This finding extends prior research on number line estimation with decimals (Schneider, Grabner, & Paetsch, 2009) and whole numbers (Booth & Siegler, 2008), which found that accuracy predicted standardized test scores in grades K-5.
Koedinger, K.R., Baker, R.S.J.d., Cunningham, K., Skogsholm, A., Leber, B., Stamper, J. (2011) A Data Repository for the EDM community: The PSLC DataShop. Handbook of Educational Data Mining. CRC Press.
Sheffield, B. GDC Canada: Bill Mooney Outlines Zynga's Methodology For Success, Gamasutra, May 6, 2010. Retrieved 2/10/12: