Not whether, but where: Scaling-up how we think about effects and relationships in natural educational contexts

Companion Proceedings 9th International Conference on Learning Analytics & Knowledge (LAK19)
Creative Commons License, Attribution - NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
Benjamin A. Motz
Indiana University
Paulo F. Carvalho
Carnegie Mellon University
ABSTRACT: This paper presents a brief discussion of “effects” and “relationships” in
authentic educational contexts, and endeavors to scale-up our thinking about the meaning of
these constructs. To discover the mere presence of a reliable main effect relating two
variables in natural educational practice is often a feeble pursuit, for any effect might be
observable in variable contexts with a sufficiently narrow analysis plan or with a sufficiently
large sample size. In turn, this paper argues that researchers should place less emphasis on
the mere discovery of relationships, and more emphasis on the analysis of the
generalizability of these relationships, the ways that the relationships under investigation
may interact with educationally-relevant covariates, and the identification of authentic edge
cases where an expected relationship may disappear or reverse.
Keywords: generalizability, interaction effects, meta-analysis
And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!
- From The Blind Men and the Elephant (Saxe, 1873; p. 260)
Under just the right conditions in natural educational settings, it is possible that any variable could
be associated with significant changes, in either direction, for students’ learning outcomes. For
example, research into the duration of inactivity in a course site (Conijn et al., 2017), the access of
assignments after the deadline (Motz et al., 2019), the order of exemplars during study (Carvalho &
Goldstone, 2017), and the immersiveness of instructional examples (Day, Motz, & Goldstone, 2015),
have all found opposing benefits in different contexts. Whether a researcher observes positive
evidence of such an effect, fails to observe a significant effect, or observes the opposite effect, may
be principally determined by the scope of the researcher’s analysis, and not by whether the effect
“exists.” Like the ancient parable of blind men developing opposing theories of a single elephant
(e.g., Saxe, 1873), analytical research on student learning risks a similarly-absurd dispute about the
observation of effects (or lack thereof) in isolated studies, and what these opposing observations
might mean.
The goal of this essay is to recommend a shift in thinking about “effects” and “relationships” as
observed in authentic educational contexts, moving past thinking of these in binary terms (they are
or aren’t observed; or they do or don’t replicate), to thinking of these as existing in varying degrees
in different contexts. There is no single relationship between educationally-relevant variables that
would hold constant across all learners and learning environments. The question for those analyzing
data from authentic educational environments should not be whether such relationships exist, but
instead, where they exist and to what degree. Furthermore, educational research, including learning
analytics, must exist in the context of strong theories and models of learners’ cognition that can
predict and explain why these dependencies exist, toward proposals of interventions that can
leverage dependencies, instead of being hampered by them.
Authentic classrooms are not randomly sampled from the space of all possible educational
dimensions. Curriculum and course structures are engineered by teachers, administrators, faculty
committees, software designers, and textbook publishers to produce positive gains for enrolled
students. Rather than being random points in the multidimensional landscape of educational
contexts, classrooms are architected learning factories; courses are designed in just such a way so
that learning activities, the behaviors of the instructor, the supporting materials, and the
surrounding environment all shuttle the enrolled students in the direction of positive learning
outcomes. For example, teachers who assign weekly graded practice quizzes are crafting
fundamentally different systems than teachers who assign ungraded weekly practice quizzes. The
differences between these classes are not limited only to this single dimension of whether the
weekly quizzes are credited or not. Both could be reasonably beneficial design solutions in different
contexts. Just as the same musical note can elicit different emotions in different chords, any
educationally-relevant variable could be inconsequential, or could be engineered to benefit learning,
in different classrooms.
When one accepts that classes are not randomly drawn instances from some grand educational
roulette wheel, two corollaries follow: (1) Any naturally-occurring variable may be architected in an
educational context so as to produce a larger effect on learning outcomes, β, than the same
variable’s effect in a different context. And thus (2) the measurement of effect β in an authentic
classroom is an interaction between the variable under analysis and the class’s other covariates, not
a main effect that should be expected to generalize across contexts.
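The second corollary can be written out as a minimal linear sketch. The notation here is an assumption made for illustration and is not taken from the cited work: y is a learning outcome, x is the variable under analysis, and c is a single hypothetical contextual covariate describing the class.

```latex
y = \beta_0 + \beta_1 x + \beta_2 c + \beta_3 (x \cdot c) + \varepsilon
\quad\Longrightarrow\quad
\frac{\partial\, \mathbb{E}[y]}{\partial x} = \beta_1 + \beta_3 c
```

Under this simplification, the β measured in any one class corresponds to β1 + β3 c at that class’s value of c, so two classes with different designs can yield different, or even oppositely signed, estimates of the "same" effect.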
Let’s consider an example. Imagine that an intrepid team of researchers aims to examine the
relationship between some variable, perhaps class attendance, and learning outcomes. They
aggregate attendance records and final exam scores for a large course whose data were convenient
to access. If the observed effect of attendance on exam performance is 0%, 0.1%, -1% or 10%, what
might they claim in these scenarios? Surely these are not generalizable estimates of the effect that
attendance could have on learning performance in other classes (what if students had no access to
learning materials outside class? or what if the class activities only involved review of take-home
readings?) as was compellingly demonstrated by Gašević et al. (2016). That any particular main
effect is observed for any limited sample is rather unremarkable, because the estimate of that effect
is determined largely by the context in which it is measured. Indeed, in the context of course design,
if a teacher (not the research team) finds that attendance is not related to learning outcomes in the
intended way, the teacher might change the relative value of attendance marks, increase active and
collaborative problem solving in the classroom, or design other contextual modifications rather than
simply conclude that attendance doesn’t “work.” The intrepid research team should also avoid the
latter conclusion, which would be a severe out-of-sample overgeneralization.
The concerns discussed thus far are sometimes cast as criticisms against the broader research
enterprise of mining and analyzing authentic learning data (for a discussion, see Morrison & van der
Werf, 2016). But just as the intrepid research team should avoid making overgeneralizations about
effects from limited samples, so too should theorists avoid making overgeneralizations about a
complex domain of applied research from its youthful foibles. On the one hand, analyses of a
relationship in a limited sample could be a very fruitful activity when a teacher seeks to engage in
more data-driven design solutions within that precise context (Halverson et al., 2007), or when a
limited sample is highly representative of a conventional instructional system that is theoretically
interesting or practically relevant, perhaps because of its applicability to specific goals of education
(e.g., 9th grade Algebra 1 or Introductory Chemistry recitations as gateways to STEM disciplines).
But on the other hand, the broader activities of learning analytics, educational data mining, and
other forms of education research utilizing big data could probably benefit from a reconsideration of
how effects are analyzed and interpreted (see also Koedinger, Booth, & Klahr, 2013). Such
reconsiderations may involve estimating effects separately for different kinds of courses (Motz et al.,
2018c, 2019), developing new context-dependent theories of learning (Carvalho, 2018), and
expanding the scope of experimental analyses to include a wide pool of independent samples (Motz
et al., 2018b).
In the remaining sections, we attempt to motivate these reconsiderations by expanding on the
possibility that any effect might be observed in some classroom and that, as a result, what may
appear as main effects are more likely to be interaction effects. We then discuss analytical tools that
may scaffold a more robust and scalable perspective on effects and relationships in natural educational
contexts.
When approaching a big dataset of natural behaviors, such as those increasingly available from e-
learning environments, things will get messy. It might be tempting to view a theoretically-
interesting effect or relationship as a needle in a haystack, but a more apt perspective might view
the effect as a needle in a big stack of needles (which may also include some hay). There are no
shortages of possible effects to be “discovered” during the analysis of a natural dataset, leading us
to assert that in such a dataset, any effect might be observed (or might not be observed) in some
Consider the recent work of Silberzahn & Uhlmann (2015; et al. 2017), who recruited 29 different
research teams to answer a single research question from a single dataset: Are soccer referees more
likely to give penalty cards to dark-skin-toned players than light-skin-toned players? The dataset
contained the full history of player/referee interactions for over 2,000 professional soccer players in
four European countries, as well as the players’ demographics, photos, classification of skin tone
(determined by independent raters), and a variety of additional covariates (team, position, etc.).
The research analysts submitted their analytical plans (but withheld their provisional results) to a
round-robin peer review and subsequently had the opportunity to revise their analyses.
Nevertheless, despite this opportunity to converge on analytical approaches, final results varied
widely among the participating researchers: effects ranged from 0.89 to 2.93 in odds ratio units (1.0
indicates no effect), with roughly two thirds of teams observing a significant effect, and one third
finding no significant effect. The differences in outcomes resulted primarily from whether the
analysis was sensitive to covariates and grouping variables present in the data.
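To illustrate how sensitivity to a grouping variable alone can move an odds ratio, the following minimal sketch compares a crude odds ratio with a Mantel-Haenszel estimate stratified on one grouping variable. The counts are invented for illustration; they are not the soccer data, and this is not a reconstruction of any participating team's analysis.

```python
from typing import List, Tuple

# Each stratum: (cards_dark, no_cards_dark, cards_light, no_cards_light).
# Hypothetical counts chosen so the two strata differ in base rate.
Stratum = Tuple[int, int, int, int]
strata: List[Stratum] = [
    (60, 140, 30, 70),   # hypothetical grouping level 1 (high card rate)
    (10, 90, 20, 180),   # hypothetical grouping level 2 (low card rate)
]

def crude_odds_ratio(tables: List[Stratum]) -> float:
    """Odds ratio after collapsing all strata into one 2x2 table."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)

def mantel_haenszel_odds_ratio(tables: List[Stratum]) -> float:
    """Odds ratio pooled across strata (Mantel-Haenszel estimator)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"crude OR = {crude:.2f}, stratified OR = {adjusted:.2f}")
```

In these made-up counts the odds ratio is exactly 1.0 within each stratum, yet the collapsed table suggests an odds ratio near 1.5, so the "finding" depends on whether the analyst attends to the grouping variable.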
While differences in analytical approaches will surely contribute to variability in measured effects,
another factor influencing whether relationships are “discovered” is the size of the dataset. With
increasing class sizes, and correspondingly increasing sample sizes, p-values are more likely to fall
beneath decision thresholds for statistical significance (commonly, the alpha level), even for
spurious results and trivially small effects. For example, when analyzing the characteristics of digital
camera auctions on eBay, Lin, Lucas, and Shmueli (2013) found that the magnitude of p-values in
their analysis became meaninglessly close to zero when n > 700 (in a dataset containing over
300,000 observations). With a large enough sample, any scant difference is enough to claim
statistical significance. In the case where an analyst might contrast two groups, A and B, Tukey
(1991) observed, “The effects of A and B are always different - in some decimal place - for any A and
B.” (p. 100) In this frame, whether someone detects an effect or relationship is really a question of
sample size, and these days behavioral researchers have access to some very large datasets. The
observation of an effect is a fundamentally different issue from the relevance of an effect, leading
many behavioral scientists toward new statistical standards concerned with effect size rather than
effect presence (Serlin & Lapsley, 1985; Cumming, 2014).
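The dependence of significance on sample size can be worked through with a short calculation. The sketch below assumes a two-sample z-test with unit variance and an observed standardized difference that is held fixed at a hypothetical, trivially small value (d = 0.01); the function name and the values of n are illustrative choices, not figures from the studies cited above.

```python
import math

def expected_two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test when the observed
    standardized mean difference is exactly d (unit variance assumed)."""
    z = abs(d) * math.sqrt(n_per_group / 2)
    # For a standard normal, 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

sample_sizes = [100, 1_000, 10_000, 100_000, 1_000_000]
p_values = [expected_two_sided_p(0.01, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>9,}  p = {p:.3g}")
```

The effect size never changes in this sketch; only n does. The same negligible difference moves from clearly non-significant at small n to far below any conventional alpha level at very large n, which is why effect size, rather than effect presence, is the more informative quantity.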
The possibility of observing an “effect” is not only inflated by large samples and analytical variability;
evidence for a spurious effect may also sprout in the soil of atheoretic exploratory analysis
(Anderson et al., 2001). The paucity of theory in some applications of learning analytics and
educational data mining yields fertile grounds for the discovery of effects and relationships that
might be statistically-significant, but have no value for educational practice or for our understanding
of educational systems (Wise & Shaffer, 2015). In the future, researchers will find ever-increasing
opportunities to “discover” something practically meaningless as institutions continue to develop
sprawling data warehouses to support as-yet-undefined future initiatives around learning analytics.
For an analyst who wonders whether an effect can be observed, the answer is surely “Yes.” In the
absence of theory, in the presence of large datasets, and without clear methodological standards
guiding our analytical plans, we should expect to find anything we want to find from natural
educational data. In turn, researchers can benefit from a reconsideration of what is meant by the
word “finding” in authentic learning contexts.
Toward the goal of reconsidering what is meant by a “finding,” one useful tack might be to
reimagine all main effects in our analyses as being interaction effects within educational systems.
For the most part, educational research has embraced the existence of individual differences in
education. It is not controversial that different students will approach learning in different ways,
and benefit differently from interventions. For example, Steyvers & Benjamin (2018) demonstrated
that improvement in online brain training games interacts with the learner’s age, and Kalyuga et al.
(2003) demonstrated that low-knowledge students benefit more from studying worked examples
than high-knowledge students.
However, this embrace of dependencies has not expanded to include the effects that different
contexts (i.e., what is learned, how it is learned) have on the effectiveness of the same learning
approach (Jonassen, 1982). Carvalho (2018) proposed that if we use learning theory to guide
exploration of content-treatment interactions, we can not only gain a deeper understanding of the
learning process, but also how it can be improved in a general and scalable way. Take, for example,
the interleaving effect (see Dunlosky et al., 2013). By using an interaction design approach, Carvalho
& Goldstone (2013) were able to demonstrate that the interleaving effect did not generalize to all
learning materials. Moreover, and perhaps more importantly, their analyses propose a model of
learning over time that can account for content-dependencies and suggest that learning does not
always happen by discrimination, leading to clear predictions of when interleaved study will and will
not improve learning (Carvalho & Goldstone, 2014; 2017).
Theories that embrace content-dependencies have great potential for learning analytics. If, when
approaching a research question, one asks not whether A “works,” but instead whether the effect of A
differs in context X vs. Y, one can learn not only that A works, but also why it works. This is because
interactions help us understand the mechanism by which A works: if A works in X but not in Y, what
is it about X that makes it work? However, it is important to note that interactions (albeit statistically
less likely to be found than main effects, especially with large samples) are not always relevant.
Interaction designs should be used with theory-building in mind, and not to dismiss theory by saying
“it all depends,” which would be a reductio ad absurdum. While every educational effect may depend on a contextual
variable, many dependencies are generalizable and are relevant to practice and theory, which is why
we advocate for a science that systematically examines where these effects exist.
That any effect might exist in some context, and that these effects are context-dependent, may also
be viewed as precipitants of Rossi’s Iron Law of Evaluation (1987): “The expected value of any net
impact assessment of any large-scale social program is zero” (p.4). If an educational intervention’s
relationship with learning outcomes is variable across different classes, at large scales the aggregate
(net) benefit of an intervention will tend toward zero. Just as analysts ought to think critically about
the discovery of an effect, so too should analysts be skeptical when measuring the absence of a
reliable main effect at scale. Favorable conditions for an effect are unlikely to be universally-present
across large samples of classrooms, and identifying the conditions for an effect’s observation is an
important pursuit if we are to make precise predictions about what “works.”
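A small worked example shows how a null main effect at scale can coexist with sizable context-specific effects. All numbers below are hypothetical: x stands for some measured behavior (for instance, sessions attended), y for an exam score, and the two classes are constructed so that the relationship runs in opposite directions.

```python
from typing import List

def ols_slope(x: List[float], y: List[float]) -> float:
    """Least-squares slope of y regressed on x (single predictor)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data: same x values in both classes, opposite relationships.
x_values = [0.0, 1.0, 2.0, 3.0]
class_a_scores = [50.0, 55.0, 60.0, 65.0]  # outcome rises with x
class_b_scores = [65.0, 60.0, 55.0, 50.0]  # outcome falls with x

slope_a = ols_slope(x_values, class_a_scores)
slope_b = ols_slope(x_values, class_b_scores)
# Pooling ignores class membership, as a naive main-effect analysis would.
slope_pooled = ols_slope(x_values + x_values, class_a_scores + class_b_scores)

print(f"class A: {slope_a:+.1f}, class B: {slope_b:+.1f}, pooled: {slope_pooled:+.1f}")
```

Here the pooled slope is zero even though the within-class slopes are +5 and -5 points per unit of x. A pooled null of this kind says little about whether favorable conditions for the effect exist somewhere in the sample.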
Discussion thus far has been occupied with the discovery of effects during the observation of natural
datasets, but another research method bears mentioning: the experimental manipulation of a
variable to produce an effect. In laboratory studies, where the setting is artificial and the
environmental regime is tightly-controlled according to experimental standards, there may be less
risk of variability in outcomes; indeed, laboratory studies are designed precisely so that the observed
effects will replicate if all procedures are repeated with a new sample. But when conducting an
embedded experiment in an authentic educational context (Motz et al., 2018a), the generalizability
of an observed effect is much less certain.
In fairness, it should be noted that effects produced in embedded experiments have important
advantages over effects found during the passive observation of natural datasets (Gordon et al.,
2018). In particular, in an experiment, the context is held constant across experimental treatments,
manipulating only those variables under analysis. However, the observed effect in that controlled
context still may not be expected to generalize to different classes, because the size of any one
measured effect is (as previously discussed) something that interacts with the structure of the class
under observation.
Anecdotally, one of us recently discussed the design of an embedded experiment with another
researcher, who was considering implementing the experiment in two of his sections during an
upcoming semester. The researcher wanted to find a robust effect of his manipulation, so he was
examining how he might structure the sections to facilitate this outcome. These considerations
included: modifying the syllabus to highlight the experimental variable, emphasizing the variable
with a take-home assignment, dedicating class time to a brief discussion of the variable, increasing
the weight of grades more closely associated with the variable… At a certain point, we might
wonder whether the observation of this effect would require an experiment in the first place! If a
class can be architected to facilitate the observation of an effect, why should a researcher bother
with the great effort and difficulty of demonstrating that effect?
For an effect observed in one class to be useful and generalizable, that class must be highly
representative of a conventional instructional system that is theoretically interesting or practically
relevant. Toward this goal, researchers should include documentation of the instructional context
wherein an effect is observed. For example, in postsecondary learning environments, at minimum,
authors should provide copies of class syllabi to accompany published reports from their embedded
experiments, and moreover, they should highlight any course modifications made in support of the
experimental contrast. But in keeping with the theme of this essay, rather than examining whether
an effect is observed in a specific context, it might be more interesting to cast a wider net, examining
where an experimental manipulation has different effects. But what might this “net” look like?
A scalable research model for evaluating experimental effects across a variety of authentic learning
contexts is currently under development, called ManyClasses (Motz et al., 2018b). As with similar
efforts in psychology (Many Labs, Klein et al., 2014; Many Babies, Frank et al., 2017), the core
feature of ManyClasses is that researchers measure an experimental effect across many
independent samples (in this case, across many classes). Rather than conducting an embedded
learning study in just one educational context, a ManyClasses study would examine the same
experimental contrast in dozens of contexts, spanning a range of courses, institutions, formats, and
student populations. By inserting the same experimental manipulation across a diversity of
educational implementations, and then analyzing pooled results, researchers can assess the degree
to which an effect might yield benefits across a range of specific contexts. In addition to
contributing to an estimation of the generalizable effect size of manipulations beyond particular
classroom implementations, a ManyClasses study will also systematically investigate how a
manipulation might be more or less effective for different students in different situations.
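One generic way to analyze pooled results from many independent classes is a random-effects meta-analysis, which estimates both an average effect and how much the effect varies between classes. The sketch below uses the DerSimonian-Laird estimator as a stand-in; the text does not specify the ManyClasses analysis plan, and the per-class effects and sampling variances are hypothetical.

```python
from typing import List, Tuple

def dersimonian_laird(effects: List[float],
                      variances: List[float]) -> Tuple[float, float, float]:
    """Return (random-effects pooled estimate, tau^2, I^2)."""
    k = len(effects)
    w = [1.0 / v for v in variances]                  # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)                     # between-class variance
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0     # share of variance from heterogeneity
    w_re = [1.0 / (v + tau2) for v in variances]      # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, tau2, i2

# Hypothetical standardized effects from five classes, equal sampling variance.
class_effects = [0.30, -0.10, 0.25, 0.00, 0.45]
class_variances = [0.01, 0.01, 0.01, 0.01, 0.01]

pooled, tau2, i2 = dersimonian_laird(class_effects, class_variances)
print(f"pooled = {pooled:.3f}, tau^2 = {tau2:.4f}, I^2 = {i2:.2f}")
```

In this frame the between-class variance (tau^2) is a quantity of interest in its own right, not a nuisance: a large value signals that the question of where the manipulation helps deserves as much attention as the pooled estimate.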
This ManyClasses model shares common ground with a nascent analytical strategy called a
metastudy, also used for analyzing the robustness of an empirical claim across contexts (Baribault et
al., 2018). A metastudy involves the radical randomization of experimental design decisions; rather
than fixing the study context across conditions (which might include the number of trials, properties
of the stimuli, incentives for participating, etc.), these facets are randomly drawn for each
observation. In turn, data obtained from a metastudy goes beyond addressing whether an effect
exists, to directly estimating the contextual dependencies of the observed effect. By embracing the
view that effects will vary across contexts, and directly manipulating and quantifying this variability,
researchers can develop a much more complete understanding of the causal chains under analysis.
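The randomization step of a metastudy can be sketched as follows. The facet names follow the examples given above (number of trials, stimulus properties, incentives), but the specific levels, the two-condition contrast, and the function name are hypothetical choices made for illustration rather than details from Baribault et al. (2018).

```python
import random
from typing import Dict, List

# Hypothetical design facets and levels; a real metastudy would define its own.
FACETS: Dict[str, List[object]] = {
    "n_trials": [10, 20, 40],
    "stimulus_type": ["text", "diagram", "video"],
    "incentive": ["none", "course_credit", "payment"],
}
CONDITIONS = ["control", "treatment"]

def draw_design(rng: random.Random) -> Dict[str, object]:
    """Randomly draw every design facet, plus the condition, for one observation."""
    design = {name: rng.choice(levels) for name, levels in FACETS.items()}
    design["condition"] = rng.choice(CONDITIONS)
    return design

rng = random.Random(2018)  # fixed seed so the draws are reproducible
designs = [draw_design(rng) for _ in range(200)]

print(designs[0])
```

Because every facet varies independently of the condition, the recorded outcomes can later be modeled as a function of condition, facets, and their interactions, which is what allows the contextual dependencies of the effect to be estimated directly.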
So, oft in theologic wars
The disputants, I ween,
Rail on in utter ignorance
Of what each other mean;
And prate about an Elephant
Not one of them has seen!
- Final stanza from The Blind Men and the Elephant (Saxe, 1873)
Instructional technologists are wont to advertise new teaching and learning tools with the confident
certification, “it works!” Data scientists implementing a new technique for predicting academic risk
will claim, “our model works!” Psychologists examining students’ studying behaviors in a real class
will conclude, “the strategy works!” In response, some skeptical and empirically-minded members
of the education research community may scoff, “How do you know?” or “What is your evidence?”
But all of these stances seem like non sequiturs, for any such instrument, activity, modeling
approach, or strategy might “work” or might fail to “work” in different natural learning contexts. In
scaling-up our perspective of these effects, perhaps learning analytics can avoid the dilemma of the
blind men and the elephant, by accepting that different observations will necessarily yield different
effects and relationships, and that these context-dependencies are theoretically-attractive objects of
inquiry. In this paper, we hope to have motivated the view that where an effect exists in a real
classroom, and to what degree, are much more meaningful concerns than whether that effect exists.
Anderson, D.R., Burnham, K.P., Gould, W.R., & Cherry, S. (2001). Concerns about finding effects that
are actually spurious. Wildlife Society Bulletin, 29(1), 311-316.
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., ... &
Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the
National Academy of Sciences, 115(11), 2607-2612.
Carvalho, P.F. (2018). Understanding the dynamics of learning: The case for studying interactions. In
T.T. Rogers, M. Rau, X. Zhu, & C. W. Kalish (Eds.), Proceedings of the 40th Annual Conference
of the Cognitive Science Society (pp. 51-52). Cognitive Science Society.
Carvalho, P.F., & Goldstone, R.L. (2013). How to present exemplars of several categories? Interleave
during active learning and block during passive learning. In M. Knauff, M. Pauen, N. Sebanz,
& I. Wachsmuth (Eds.), Proceedings of the 35th Annual Conference of the Cognitive Science
Society. Cognitive Science Society.
Carvalho, P.F. & Goldstone, R.L. (2014). Effects of interleaved and blocked study on delayed test of
category learning generalization. Frontiers in Psychology, 5(936), 1-10.
Carvalho, P.F. & Goldstone, R.L. (2017). The most efficient sequence of study depends on the type of
test. In G. Gunzelmann, A. Howes, T. Tenbrink, & E. Davelaar (Eds.), Proceedings of the 39th
Annual Conference of the Cognitive Science Society (pp. 198-203). Cognitive Science Society.
Conijn, R., Snijders, C., Kleingeld, A., & Matzat, U. (2017). Predicting student performance from LMS
data: A comparison of 17 blended courses using Moodle. IEEE Transactions on Learning
Technologies, 10(1), 17-29.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.
Day, S. B., Motz, B. A., & Goldstone, R. L. (2015). The cognitive costs of context: The effects of
concreteness and immersiveness in instructional examples. Frontiers in Psychology, 6, 1876.
Dunlosky, J., Rawson, K.A., Marsh, E.J., Nathan, M.J., & Willingham, D.T. (2013). Improving students’
learning with effective learning techniques. Psychological Science in the Public Interest,
14(1), 4-58.
Frank, M., Bergelson, E., Bergmann, C., Cristia, A., Floccia, C., Gervain, J., Lew-Williams, C., Nazzi, T.,
Panneton, R., Rabagliati, H., Soderstrom, M., Sullivan, J., Waxman, S., & Yurovsky, D. (2017).
A collaborative approach to infant research: Promoting reproducibility, best practices, and
theory building. Infancy, 22, 421-435.
Gašević, D., Dawson, S., Rogers, T., & Gasevic, D. (2016). Learning analytics should not promote one
size fits all: The effects of instructional conditions in predicting academic success. The
Internet and Higher Education, 28, 68-84.
Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2018). A comparison of approaches to
advertising measurement: Evidence from big field experiments at Facebook. Forthcoming at
Marketing Science.
Halverson, R., Grigg, J., Prichett, R., & Thomas, C. (2007). The new instructional leadership: Creating
data-driven instructional systems in school. Journal of School Leadership, 17(2), 159.
Jonassen, D. H. (1982). Aptitude-versus content-treatment interactions. Journal of Instructional
Development, 5(4), 15.
Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect. Educational
Psychologist, 38, 23-31.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., ... & Nosek, B. A.
(2014). Investigating variation in replicability: A “many labs” replication project. Social
Psychology, 45(3), 142-152.
Koedinger, K. R., Booth, J. L., & Klahr, D. (2013). Instructional complexity and the science to constrain
it. Science, 342(6161), 935-937.
Lin, M., Lucas Jr., H.C., & Shmueli, G. (2013). Too big to fail: Large samples and the p-value problem.
Information Systems Research, 24(4), 906-917.
Morrison, K., & van der Werf, G. (2016). Large-scale data, “wicked problems,” and “what works” for
educational policy making. Educational Research and Evaluation, 22(5/6), 255-259.
Motz, B. A., Carvalho, P. F., de Leeuw, J. R., & Goldstone, R. L. (2018a). Embedding experiments:
Staking causal inference in authentic educational contexts. Journal of Learning Analytics,
5(2), 47-59.
Motz, B., de Leeuw, J., Carvalho, P., Fyfe, E., & Goldstone, R. (2018b). ManyClasses: A model for
abstracting generalizable research principles from different learning contexts. A Workshop on Large Scale Education Replication. Buffalo, New York.
Motz, B., Busey, T., Rickert, M., & Landy, D. (2018c). Finding topics in enrollment data. Proceedings of
the 11th International Conference on Educational Data Mining. Buffalo, New York.
Motz, B., Quick, J., Schroeder, N., Zook, J., & Gunkel, M. (2019). The validity and utility of activity logs
as a measure of student engagement. In Proceedings of the 9th International Conference on
Learning Analytics and Knowledge. ACM.
Rossi, P. (1987). The iron law of evaluation and other metallic rules. Research in Social Problems and
Public Policy, 4, 3-20.
Saxe, J.G. (1873). The poems of John Godfrey Saxe. Boston: James R Osgood & Company.
Silberzahn, R. & Uhlmann, E.L. (2015). Crowdsourced research: Many hands make tight work.
Nature, 526(7572), 189-191.
Silberzahn, R., Uhlmann, E.L., Martin, D.P., Anselmi, P., Aust, F., Awtrey, E.C., … Nosek, B.A. (2018).
Many analysts, one dataset: Making transparent how variations in analytical choices affect
results (preprint). PsyArXiv.
Serlin, R.C., & Lapsley, D.K. (1985). Rationality in psychological research: The good-enough principle.
American Psychologist, 40(1), 73-83.
Steyvers, M. & Benjamin, A.S. (2018). The joint contribution of participation and performance to
learning functions: Exploring the effects of age in large-scale data sets. Behavior Research Methods.
Tukey, J. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100-116.
Wise, A. F., & Shaffer, D. W. (2015). Why theory matters more than ever in the age of big data.
Journal of Learning Analytics, 2(2), 5-13.