Analyzing Fine-Grained Skill Models Using Bayesian and Mixed-Effects Methods.
ABSTRACT Two modelling methods were employed to answer the same research question: how accurate are the various-grained WPI 1, 5, 39, and 106 skill models at assessing student knowledge in the ASSISTment online tutoring system and at predicting student performance on the 2005 state MCAS test? One method, used by the second author, is mixed-effects statistical modelling. The first author evaluated the problem with a Bayesian-networks machine-learning approach. We compare the two results to identify benefits and drawbacks of either method and to find out whether the two results agree. We report that both methods showed compelling similarity in results, especially with regard to residuals on the test. Our analysis of these residuals and our online skill models allows us to better understand our model, and we conclude with recommendations for improving the tutoring system, as well as implications for state testing programs.

Educational Data Mining
Cecily Heiner (Co-chair)
Neil Heffernan (Co-chair)
Tiffany Barnes (Co-chair)
Supplementary Proceedings of the 13th International
Conference on Artificial Intelligence in Education.
Marina del Rey, CA. USA. July 2007
Workshop of Educational Data Mining
Educational Data Mining Workshop
(http://www.educationaldatamining.org/)
Cecily Heiner; University of Utah; cecily@cs.utah.edu (Co-chair)
Neil Heffernan; Worcester Polytechnic Institute; nth@wpi.edu (Co-chair)
Tiffany Barnes; University of North Carolina at Charlotte; tbarnes2@uncc.edu (Co-chair)
Introduction
Educational data mining is the process of converting raw data from educational systems to useful
information that can be used to inform design decisions and answer research questions. Data mining
encompasses a wide range of research techniques that includes more traditional options such as database
queries and simple automatic logging as well as more recent developments in machine learning and
language technology.
Educational data mining techniques are now being used in ITS and AIED research worldwide. For
example, researchers have used educational data mining to:
o Detect affect and disengagement
o Detect attempts to circumvent learning, known as "gaming the system"
o Guide student learning efforts
o Develop or refine student models
o Measure the effect of individual interventions
o Improve teaching support
o Predict student performance and behavior
However, these techniques could achieve greater use and bring wider benefits to the ITS and AIED
communities. We need to develop standard data formats, so that researchers can more easily share data
and conduct meta-analysis across tutoring systems, and we need to determine which data mining
techniques are most appropriate for the specific features of educational data, and how these techniques
can be used on a wide scale. The workshop will provide a forum to present preliminary but promising
results that can advance our knowledge of how to appropriately conduct educational data mining and
extend the field in new directions.
Topics
They include, but are not limited to:
o What new discoveries does data mining enable?
o What techniques are especially useful for data mining?
o How can we integrate data mining and existing educational theories?
o How can data mining improve teacher support?
o How can data mining build better student models?
o How can data mining dynamically alter instruction more effectively?
o How can data mining improve systems and interventions evaluations?
o How do these evaluations lead to system and intervention improvements?
o How is data mining fundamentally different from other research methods?
TABLE OF CONTENTS
Evaluating problem difficulty rankings using sparse student data …………1
    Ari Bader-Natal, Jordan Pollack
Toward the extraction of production rules for solving logic proofs …………11
    Tiffany Barnes, John Stamper
Difficulties in inferring student knowledge from observations (and why you should care) …………21
    Joseph E. Beck
What’s in a word? Extending learning factors analysis to modeling reading transfer …………31
    James M. Leszczenski, Joseph E. Beck
Predicting student engagement in intelligent tutoring systems using teacher expert knowledge …………40
    Nicholas Lloyd, Neil Heffernan, Carolina Ruiz
Analyzing fine-grained skill models using Bayesian and mixed effects methods …………50
    Zachary Pardos, Mingyu Feng, Neil Heffernan, Cristina Heffernan, Carolina Ruiz
Mining learners’ traces from an online collaboration tool …………60
    Dilhan Perera, Judy Kay, Kalina Yacef, Irena Koprinska
Mining online discussions: Assessing technical quality for student scaffolding and classifying messages for participation profiling …………70
    Sujith Ravi, Jihie Kim, Erin Shaw
All in the (word) family: Using learning decomposition to estimate transfer between skills in a reading tutor that listens …………80
    Xiaonan Zhang, Jack Mostow, Joseph Beck
Evaluating Problem Difficulty Rankings
Using Sparse Student Response Data
Ari BADER-NATAL 1, Jordan POLLACK
DEMO Lab, Brandeis University
Abstract. Problem difficulty estimates play important roles in a wide variety of educational systems, including determining the sequence of problems presented to students and the interpretation of the resulting responses. The accuracy of these metrics is therefore important, as they can determine the relevance of an educational experience. For systems that record large quantities of raw data, these observations can be used to test the predictive accuracy of an existing difficulty metric. In this paper, we examine how well one rigorously developed – but potentially outdated – difficulty scale for American-English spelling fits the data collected from seventeen thousand students using our SpellBEE peer-tutoring system. We then attempt to construct alternate metrics that use collected data to achieve a better fit. The domain-independent techniques presented here are applicable when the matrix of available student-response data is sparsely populated or non-randomly sampled. We find that while the original metric fits the data relatively well, the data-driven metrics provide approximately 10% improvement in predictive accuracy. Using these techniques, a difficulty metric can be periodically or continuously recalibrated to ensure the relevance of the educational experience for the student.
1. Introduction
Estimates of student proficiency and problem difficulty play central roles in Item Response Theory (IRT) [11]. Several current educational systems make use of this theory, including our own BEEweb peer-tutoring activities [2,8,9,13]. IRT-based analysis often focuses on estimating student proficiency in the task domain, but the challenge of estimating problem difficulty should not be overlooked. While student proficiency estimates can inform assessment, problem difficulty estimates can be used to refine instruction: these metrics can affect the selection and ordering of problems posed and can influence the interpretation of the resulting responses [6]. It is therefore important to choose a good difficulty metric initially and to periodically evaluate the accuracy of a chosen metric with respect to available student data. In this paper, we examine how accurately one rigorously developed – but potentially outdated – difficulty scale for the domain of American-English spelling predicts the data collected from students using our SpellBEE system [1]. The defining challenge in providing this assessment lies in the nature of the data. As SpellBEE is a peer-tutoring system, the challenges posed to students are determined by other students, resulting in data that is neither random nor complete. In this work, we rely on a pairwise comparison technique designed to be robust to data with these characteristics. After assessing the relevance of this existing metric (in terms of predictive accuracy), we will examine some related techniques for initially constructing a difficulty metric based on non-random, incomplete samples of observed student data.

1 Correspondence to: Ari Bader-Natal, Brandeis University, Computer Science Department – MS 018, Waltham, MA 02454, USA. Tel.: +1 781 736 3366; Fax: +1 781 736 2741; E-mail: ari@cs.brandeis.edu.
2. American-English spelling: A sample task domain
The educational system examined here, SpellBEE, was designed to address the task domain of American-English spelling [1]. SpellBEE is the oldest of a growing suite of web-based reciprocal tutoring systems using the Teacher’s Dilemma as a motivational mechanism [2]. For the purposes of this paper, however, the mechanisms for motivation and interaction can be ignored, and the SpellBEE system and the difficulty metric used by it can be specifically recharacterized for an educational data mining audience.
2.1. Relevant characteristics of the SpellBEE system
Students access SpellBEE online at SpellBEE.org from their homes or schools. As of May 2007, over 17,000 students have actively participated. After creating a user account, a student is able to log in, choose a partner, and begin the activity.2 During the activity, students take turns posing and solving spelling problems. When posing a problem, the student selects from a short list of words randomly drawn from the database of word challenges. This database comprises 3,129 words drawn from Greene’s New Iowa Spelling Scale (NISS), which will be discussed in the next section [12].3 When responding to a problem, the student types in what they believe to be the correct spelling of the challenge word. The accuracy of the response is assessed to be either correct or incorrect. Figure 1 presents a list of the relevant data stored in the SpellBEE server logs.
To date, we have observed over 64,000 unique (case-insensitive) responses to the challenges posed,4 distributed across over 22,000 completed games consisting of seven questions attempted per student. Student participation, measured in games completed, has not been uniform, however. Of the challenges in the space, most students have only attempted a very small fraction. In fact, when examining the response matrix of every student by every challenge, less than 1% of the matrix data is known. An important characteristic of the SpellBEE data, then, is that the response matrix is remarkably sparse. Given that the students acting as tutors are able to – and systemically motivated to – express their preferences and hunches through the problems that they select, another important characteristic of the SpellBEE data is that the data present in the student-challenge response matrix is also biased. The effects of this bias can be found in the following example: 16% of student attempts to spell the word “file” were correct, while 66% of attempts to spell the word “official” were correct. The average grade level among
the first set of students was 3.9, while for the second set it was 6.4. In Section 3.2 we will present techniques designed to draw more meaningful difficulty information from this type of data.

2 In the newer BEEweb activities, if no one else is present, a student can practice alone on problems randomly drawn from the database of challenges posed in the past.
3 In SpellBEE, the word challenges are presented in the context of a sentence, and so of the words in Greene’s list, we only use those found in the seven public-domain books that we parsed for sentences.
4 Of these, 17,391 were observed more than once. In this paper, we restrict the set of responses that we consider to this subset. See Footnote 7 for the rationale behind this.

Figure 1. The SpellBEE server logs data about each turn taken by each student, as shown in the first list. The data in the first list is sufficient to generate the data included in the second list.
1. time: a timestamp allows responses to be ordered
2. game: identifies the game in which this turn occurred
3. tutor: identifies the student acting as the tutor in this turn
4. tutee: identifies the student acting as the tutee in this turn
5. challenge: identifies the challenge posed by the tutor
6. response: identifies the response offered by the tutee
1. difficulty: the difficulty rating of the challenge posed by the tutor
2. accuracy: the accuracy rating of the response offered by the tutee
2.2. Origin, use, and application of the problem difficulty metric
When trying to define a measure of problem difficulty for the well-studied domain of American-English spelling, we were able to benefit from earlier research in the field. Greene’s “New Iowa Spelling Scale” provides a rich source of data on word spelling difficulty, drawn from a vast study published in 1954. Greene first developed a methodology for selecting words for his list (5,507 were eventually used). Approximately 230,000 students from 8,800 classrooms (grades 2 through 8) around the United States participated in the study, totaling over 23 million spelling responses [12]. From these, Greene calculated the percentage of correct responses for each word for each grade. This table of success rates is used in SpellBEE to calculate the difficulty of each spelling problem for students, whose grade levels are known.
3. Techniques for assessing relative challenge difficulty
The research questions addressed in this paper focus on the fit of the difficulty model based on the NISS data to the observed SpellBEE student data. Two different techniques are involved in calculating this fit. The first converts the graded NISS data to a linear scale. The second identifies, from the observed student data, a difficulty ordering over pairs of problems, in a manner appropriate for a sparse and biased data matrix. Both will be employed to address the research questions in the following sections.
3.1. Linearly ordering challenges using the difficulty metric
Many subsequent studies have explored various aspects of Greene’s study and the data that it produced. Cahen, Craun, and Johnson [5] and, later, Wilson and Bock [14] explore the degree to which various combinations of domain-specific predictors could account for Greene’s data. Initially starting with 20 predictors, Wilson and Bock work down to a regression model with an adjusted R2 value of 0.854.5 Here, we are not interested in predicting the NISS results, but instead in assessing the fit (or predictive power) of the 1954 NISS results to observations made of students using SpellBEE over 50 years later. We will draw upon one statistic used by Wilson and Bock: the one-dimensional flattening of the seven-graded NISS data. This statistic, which they refer to as the “location” of the word, is the (fractional) grade level at which 50% of the NISS students correctly spell word w.6 We denote this as I50(w). Figure 2 illustrates the graded difficulty data used to derive this statistic for two different words. The value of this statistic is that it provides a single grade-independent difficulty value for a word that can be compared directly to that of other words.

5 The most influential of these being the length of the word.

Figure 2. Difficulty data for two words from the NISS study are plotted (percentage of students with correct responses, by grade level 2 to 8), and the I50 statistics are calculated: I50("above") = 3.7 and I50("acknowledge") = 7.875.
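As footnote 6 notes, the I50 location is obtained here by linear interpolation of the discrete grade-level success rates. The following is our illustrative sketch of that calculation, not the authors' code, and the success percentages in the example are invented:

```python
def i50(grades, pct_correct):
    """Estimate the (fractional) grade level at which 50% of students
    spell a word correctly, by linear interpolation between discrete
    grade-level success rates (the paper's I50 'location' statistic)."""
    points = list(zip(grades, pct_correct))
    for (g0, p0), (g1, p1) in zip(points, points[1:]):
        if p0 < 50.0 <= p1:
            # interpolate within the grade interval straddling 50%
            return g0 + (50.0 - p0) * (g1 - g0) / (p1 - p0)
    # the word never crosses 50%: clamp to an end of the scale
    return grades[0] if pct_correct[0] >= 50.0 else grades[-1]

# Hypothetical success rates rising with grade, as in Figure 2:
print(i50([2, 3, 4, 5, 6, 7, 8], [10, 30, 50, 70, 85, 92, 96]))  # -> 4.0
```

The clamping to the ends of the scale mirrors the ceiling and floor effects (I50(w) = 2.0 or 8.0) that Section 4.1 later filters out.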
3.2. Identifying pairwise difficulty orderings using observed student data
Given the characteristics of the data collected from the SpellBEE system, identifying the more difficult of a pair of problems based on this data is not trivial. The percentage of correct responses to a challenge, the calculation used to generate the NISS data, is not appropriate here, as the assignment of challenges to students was done in a biased, non-random manner (recall the “file”/“official” example from Section 2.1). Tutors, in fact, are motivated to base their challenge selection on the response accuracies that they anticipate. A more appropriate measure, rooted in several different literatures, is to assess pairwise problem difficulties on distinctions indirectly indicated by the students. In the statistics literature, McNemar’s test provides a statistic based on this concept [10]; in the IRT literature, this is used as a data reduction strategy for Rasch model parameter estimation [7]; and the machine learning literature includes various approaches to learning rankings based on pairwise preferences [4]. Assume that for some specific pair of problems, such as the spelling of the words “about” and “acknowledge”, we first identify all students in the SpellBEE database who have attempted both words. Given that response accuracy is dichotomous, there are only four possible configurations of a student’s response accuracy to the pair of challenges. In the cases where the student responds to both correctly or incorrectly, no distinction is made between the pair. But in the cases where the student correctly responds to one but incorrectly to the other, we classify this as a distinction indicating a difficulty ordering between the two problems.7

6 Wilson and Bock calculate the 50% threshold based on a logistic model fit to the discrete grade-level data, while we calculate the threshold slightly differently, based on a linear interpolation of the grade-level data.

Table 1. While the I50 metric flattens the grade-specific NISS data to a single dimension, the relative difficulty ordering of most word pairs based on the graded NISS data is the same as when based on the I50 scale. In this table, we quantify the amount of agreement between I50 and each set of grade-specific NISS data using Spearman’s rank correlation coefficient. The strong correlations observed suggest that the unidimensional scale sufficiently captures the relative difficulty information from the original NISS dataset. (The number of words, N, varies by grade, as the NISS study did not show several of the harder words to the younger students.)

Grade   N      Spearman’s ρ
2       2218   0.751
3       3059   0.933
4       3126   0.977
5       3129   0.974
6       3129   0.960
7       3129   0.935
8       3129   0.915
It is also worth stating that in this study, we assume a “static student” model, so we are not concerned with the order of these two responses. At the cost of some data loss, one could instead assume a “learning student” model, for which only a correct response on one problem followed by an incorrect response on the other would define a distinction. Had the incorrect response been observed first, we could not rule out the possibility that the difference was due to a change in the student’s abilities over time, and not necessarily an indication of a difference in problem difficulties.8
An example may clarify. When counting the number of distinctions made in both directions by all students (e.g. 12 students in SpellBEE spelled “about” correctly and “acknowledge” incorrectly, while 2 students spelled “about” incorrectly and “acknowledge” correctly), we have a strong indication of relative problem difficulty. McNemar’s test assigns a significance to this pair of distinction counts. In this work, we more closely follow the IRT approach, relying on only the relative size of the two counts (and not the significance). Thus, since 12 distinctions were found in one direction and only 2 in the other, we say that we observed the word “about” to be easier than the word “acknowledge” based on collected SpellBEE student data.
on collected SpellBEE student data. If distinctions were available for every problem pair,
7We recognize that some distinctions are spurious, for which the incorrect response was not reflective of
the student’s abilities. Here we take a simplistic approach of identifying and ignoring nonresponses (in which
the student typed nothing) and globallyunique responses (which no other student ever responded, to any chal
lenge.) Globallyunique responses encompass responses from students who don’t yet understand the activity,
responses from students who did not hear the audio recording, responses from student attempting to use the
response field as a chat interface, and responses from students making no effort to engage in the activity.
8Another possible model is a “dynamic student” model, for which student abilities may get better or worse
over time. Under this model, no distinctions can be definitively attributed to difference in problem difficulty.
Workshop of Educational Data Mining
5
Page 9
a total of 3,129 × 3,128 = 9,787,512 pairwise problem orderings could be expressed. In
our collected data so far, we have 3,349,602 of these problem pairs for which we have
distinctions recorded. In the subsequent sections, we measure the fitness of a predictive
model (like I50) based on how many of these pairwise orderings are satisfied.9
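The distinction-counting rule of this section can be sketched compactly. The following is our illustration, not the SpellBEE implementation; the response dictionary and student IDs are invented, and the spurious-response filtering of footnote 7 is omitted:

```python
from collections import defaultdict

def pairwise_distinctions(responses):
    """Count directional distinctions between challenge pairs.
    `responses` maps student -> {challenge: correct? (True/False)}.
    A student who answered a correctly and b incorrectly contributes one
    distinction that a is easier than b; under the "static student" model
    the order of the two responses is ignored."""
    counts = defaultdict(int)               # (easier, harder) -> count
    for answers in responses.values():
        attempted = sorted(answers.items())
        for i, (a, a_ok) in enumerate(attempted):
            for b, b_ok in attempted[i + 1:]:
                if a_ok and not b_ok:
                    counts[(a, b)] += 1     # a easier than b
                elif b_ok and not a_ok:
                    counts[(b, a)] += 1     # b easier than a
    return counts

# Hypothetical responses: two students distinguish the pair, one does not.
data = {
    "s1": {"about": True, "acknowledge": False},
    "s2": {"about": True, "acknowledge": False},
    "s3": {"about": True, "acknowledge": True},
}
print(pairwise_distinctions(data)[("about", "acknowledge")])  # -> 2
```

Students who answer both words the same way, like s3 above, contribute no distinction, exactly as the text describes.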
4. Assessing the fit of the NISS-based I50 model to the SpellBEE student data
Given the NISS-based I50 model of problem difficulty and the data-driven technique for turning observed distinctions recorded in the SpellBEE database into pairwise difficulty orderings, we can now explore various methods to assess the applicability of the model to the data.
4.1. Assessing fit with a regression model
The first method is to construct a regression model that uses I50 to predict observed difficulty. Since observed difficulty is currently available only in pairwise form, this requires an additional step in which we flatten these pairwise orderings into one total ordering over all problems. As this is a highly non-trivial step, the results should be interpreted tentatively. Here, we accomplish a flattening by calculating, for each challenge, the percentage of available pairwise orderings for which the given challenge was the more difficult of the pair. So if 100 pairwise orderings involve the challenge word “acknowledge”, and 72 of these found “acknowledge” to be the harder of the pair, we would mark “acknowledge” as harder than 72% of other words. A regression model was then built on this, using I50 as a predictor of the pairwise-derived percentage. The model, after filtering out data points causing ceiling and floor effects (i.e. I50(w) = 2.0 or I50(w) = 8.0), had an adjusted R2 value of 0.337 (p < 0.001 for the model). The corresponding scatterplot is shown in Figure 3.10 The relatively low adjusted R2 value is likely at least partially a result of the flattening step (rather than solely due to poor fit). Had we flattened the data differently, this value would clearly change. In order to obtain a more reliable measure of model fitness, we seek to avoid any unnecessary processing of the mined data.
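The flattening step can be sketched as follows; this is our illustration with invented orderings, not the authors' code:

```python
from collections import Counter

def percent_harder(orderings):
    """Flatten pairwise (easier, harder) orderings into one score per
    challenge: the fraction of its available orderings in which it was
    the harder member of the pair."""
    harder, total = Counter(), Counter()
    for easy, hard in orderings:
        harder[hard] += 1
        total[easy] += 1
        total[hard] += 1
    return {c: harder[c] / total[c] for c in total}

# Toy data: "acknowledge" is the harder word in 3 of its 4 orderings.
obs = [("about", "acknowledge"), ("file", "acknowledge"),
       ("above", "acknowledge"), ("acknowledge", "arithmetic")]
print(percent_harder(obs)["acknowledge"])  # -> 0.75
```

The resulting percentages play the role of the response variable in the regression on I50 described above.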
4.2. Assessing fit as the percentage of agreements on pairwise difficulties
The second method that we explore provides a more direct comparison, without any further flattening of the student data. Here, we simply calculate the percentage of observed pairwise difficulty orderings (across all challenges) for which the I50 model correctly predicts the observed pairwise difficulty ordering. When we do this across all of the 3,349,602 difficulty orderings that we have constructed from the student data, we find that the I50 model correctly predicts 2,534,228 of these pairwise orderings, providing a 75.66% agreement with known pairwise orderings from the mined data. Remarkably, we found that the predictive accuracy of the I50 model did not significantly change as the quantity of student data used for the distinction varied: 75.1% of predictions based on one distinction were accurate, while 74.7% of predictions based on 25 distinctions were accurate (intermediate values ranged from 71.0% to 77.6%). This flat relationship suggests that pairwise difficulty orderings constructed from a minimal amount of observed data may be just as accurate, in the aggregate, as those orderings constructed when additional data is available.

9 Note that it is not possible to achieve a 100% fit, as some cycles exist among these pairwise orderings.
10 The outliers in this plot mark the problems that are ranked most differently by the two measures. The word “arithmetic”, for example, was found to be difficult by SpellBEE students, but was not found to be particularly difficult for the students in the NISS study. Variations like this one may reflect changes in the teaching or in the frequency of usage since the NISS study was performed 50 years ago.

Figure 3. Words are plotted by their difficulty on the I50 scale and by the percentage of other words for which the observed pairwise orderings found the word to be the harder of the pair. An adjusted R2 value of 0.490 was calculated for this model. (When ignoring the words affected by a ceiling or floor effect in either variable, the adjusted R2 value drops to 0.377.)
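The agreement statistic used here is simply the fraction of observed (easier, harder) pairs that a scalar metric orders correctly. A minimal sketch, with invented difficulty values and orderings (the third ordering deliberately contradicts the metric):

```python
def pairwise_agreement(difficulty, orderings):
    """Share of observed (easier, harder) orderings whose direction a
    scalar difficulty metric reproduces (the statistic behind the
    75.66% agreement figure)."""
    hits = sum(1 for easy, hard in orderings
               if difficulty[easy] < difficulty[hard])
    return hits / len(orderings)

# Hypothetical I50-style values; the word list and orderings are invented.
scale = {"above": 3.7, "file": 4.1, "acknowledge": 7.875}
observed = [("above", "acknowledge"), ("file", "acknowledge"),
            ("acknowledge", "file")]
print(round(pairwise_agreement(scale, observed), 2))  # -> 0.67
```

Because the observed orderings can contain cycles (footnote 9), no scalar metric can drive this statistic to 100%.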
5. Incorporating SpellBEE student data in a revised difficulty model
We now know that there is a 75.66% agreement in pairwise difficulty orderings between the I50 difficulty metric derived from the NISS data and the observed pairwise preferences mined from the SpellBEE database. Can we improve upon this? We will present an approach that iteratively updates the I50 problem difficulty estimates using the mined data and a logistic regression model. Rather than producing a single predictive model, we construct one logistic model for each challenge, and use these fitted models to update our estimates of the problem difficulty. Applied iteratively, we hope to converge on a problem difficulty metric that better fits the observed data. This process is inspired by the parameter estimation procedures for Rasch models [11], which may not be directly applicable due to the large size of our problem space.
For a given challenge c1 (e.g. “acknowledge”), we can first generate the list of all other challenges for which SpellBEE students have expressed distinctions (in either direction). In Section 3.2, we chose to censor these distinctions in order to generate a binary value representing the difficulty ordering. Here we will make use of the actual distinction counts in each direction. For each challenge c2 with which pairwise distinctions for c1 are available, we note our current-best estimate of the difficulty of c2 (initially, using I50 values), and note the number of distinctions indicating that c1 is the more difficult challenge. We can then regress the grouped distinction data on the problem difficulty estimate data to construct a logistic model relating the two. For some c1, if the relationship is statistically significant, we can use it to generate a revised estimate for the difficulty of that challenge. By solving the regression equation for the c2 problem difficulty value at which 50% of distinctions find c1 harder, we can calculate the difficulty of a problem for which relative-difficulty distinctions are equally likely in either direction. This provides a revised estimate for the difficulty of the original problem, c1. We use this procedure to calculate revised estimates for every challenge in the space (unless the resulting logistic regression model is statistically non-significant, in which case we retain our previous difficulty estimate). This process can be iteratively repeated, using the revised difficulty estimates as the basis of the new regression models. Figure 4 plots this data for one word, using the difficulty estimates resulting from the third iteration of the estimation.

Figure 4. A logistic regression model is used to estimate the difficulty of the word “abandon.” At left, the first estimate is based on the original I50 difficulty values. At right, the third iteration of the estimate is constructed based on data from the previous best estimate. The point estimate dropped from 8.0 (from I50) to 7.06 (from iteration 1) to 6.81 (from iteration 3). (Both panels plot the percentage of distinctions finding the word harder against the difficulty estimate.)
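One update step of this procedure, for a single challenge, can be sketched as follows. This is our illustrative reconstruction, not the authors' implementation: we fit the logistic model by plain gradient ascent, center the difficulty values for numerical stability, omit the significance test, and invent all the input numbers.

```python
import math

def revised_difficulty(opp_difficulty, harder_counts, totals, lr=0.05, steps=5000):
    """One Section-5 update for a single challenge c1.  For each opponent
    challenge c2 we have: c2's current difficulty estimate, the number of
    distinctions finding c1 harder than c2, and the total distinctions for
    the pair.  Fit p(c1 harder) = sigmoid(w0 + w1*(d - mean)) to the grouped
    counts by gradient ascent on the binomial log-likelihood, then return
    the difficulty d at which the fitted probability is 50%."""
    m = sum(opp_difficulty) / len(opp_difficulty)
    xs = [d - m for d in opp_difficulty]        # center for a stable fit
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, k, n in zip(xs, harder_counts, totals):
            p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
            g0 += k - n * p                     # d(log-likelihood)/d(w0)
            g1 += (k - n * p) * x               # d(log-likelihood)/d(w1)
        w0 += lr * g0 / len(xs)
        w1 += lr * g1 / len(xs)
    return m - w0 / w1                          # solve w0 + w1*(d - m) = 0

# Invented counts for one challenge against five opponents whose current
# difficulty estimates straddle the answer: the revised estimate is the
# difficulty at which "c1 harder" distinctions reach 50%.
print(round(revised_difficulty([3, 4, 5, 6, 7], [2, 5, 10, 15, 18],
                               [20, 20, 20, 20, 20]), 2))  # -> 5.0
```

Applied to every challenge and iterated, revised estimates like this one produce the I50 rev. 1 and rev. 3 metrics discussed below.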
A second approach towards incorporating observed distinction data into a unified problem difficulty scale is briefly introduced and compared to the other metrics. Here, we recast the estimation problem as a sorting problem, and use a probabilistic variant of the bubble-sort algorithm to reorder consecutive challenges based on available distinction data. Initially ordering the challenge words alphabetically, we repeatedly step through the list, reordering challenges at indices i and i + 1 with a probability based on the proportion of distinctions finding the first challenge harder than the second.11 After “bubbling” through the ordered list of challenges 200,000 times, we interpret the rank order of each challenge as a difficulty index. These indices provide a metric of difficulty (which we refer to as ProbBubble), and a means for predicting the relative difficulty of any pair of challenges (based on index ordering).

11 If distinctions have been observed in both directions, the challenges are reordered with a probability determined by the proportion of distinctions in that direction. If no distinctions in either direction have been observed, the challenges are reordered with a probability of p = 0.5. If distinctions have been observed in one direction but not the other, the challenges are reordered with a fixed minimal probability (p = 0.1).
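Under our reading of footnote 11, the ProbBubble procedure can be sketched as follows. This is an illustrative reimplementation with invented words, counts, and a fixed seed; for one-sided evidence we swap with probability 0.9 or 0.1 depending on direction, which is one interpretation of the "fixed minimal probability":

```python
import random

def prob_bubble(words, distinctions, passes=1000, seed=0):
    """Probabilistic bubble sort over challenges (a sketch of the paper's
    ProbBubble metric).  `distinctions[(x, y)]` counts students who got x
    right and y wrong, i.e. evidence that x is easier than y.  Returns a
    rank index per word (0 = easiest position after sorting)."""
    rng = random.Random(seed)
    order = sorted(words)                       # start alphabetically
    for _ in range(passes):
        for i in range(len(order) - 1):
            a, b = order[i], order[i + 1]
            na = distinctions.get((b, a), 0)    # evidence a is harder
            nb = distinctions.get((a, b), 0)    # evidence b is harder
            if na == 0 and nb == 0:
                p = 0.5                         # no data: coin flip
            elif nb == 0:
                p = 0.9                         # one-sided: a looks harder
            elif na == 0:
                p = 0.1                         # one-sided: fixed minimal prob
            else:
                p = na / (na + nb)              # proportion of distinctions
            if rng.random() < p:                # harder word drifts later
                order[i], order[i + 1] = b, a
    return {w: rank for rank, w in enumerate(order)}

# Overwhelming invented evidence that "rhythm" is harder than "cat".
d = {("cat", "rhythm"): 999999, ("rhythm", "cat"): 1}
ranks = prob_bubble(["rhythm", "cat"], d)
print(ranks["cat"] < ranks["rhythm"])  # -> True
```

Because each comparison is stochastic, the resulting ranking is itself a sample; the large number of passes (200,000 in the paper) lets the list settle near the ordering the distinction data supports.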
Table 2. Summary table for the predictive accuracy of various difficulty metrics. For each metric, the percentage of accurate predictions of pairwise difficulty orderings is noted. The accuracy of the I50 metric is measured against all of the 3,349,602 pairwise orderings identified by student distinctions. The accuracies of the data-driven metrics (I50 rev. 1 and ProbBubble) are based on the average results from a 5-fold cross-validation, in which the metrics are constructed or trained on a subset of the pairwise distinction data and are evaluated on a different set of pairwise data (the remaining portion).

Difficulty Model   Predictive Accuracy
I50                75.66%
I50 rev. 1         84.79%
ProbBubble         84.98%
Table 3. Spearman's rank correlation coefficient between pairs of problem difficulty rank-orderings (N =
3129, p < 0.01, two-tailed).

Metric 1      Metric 2      Spearman's ρ
I50           I50 rev.3     0.677
I50           ProbBubble    0.673
ProbBubble    I50 rev.3     0.908
Given the pairwise technique used in Section 4.2 for analyzing the fit of a difficulty
metric against a set of pairwise difficulty orderings, we can examine how these two
data-driven models compare to the original I50 difficulty metric. Table 2 summarizes our
findings. We observe that the data-driven approaches provide an improvement of
almost 10% accuracy with regard to the prediction of pairwise difficulty orderings. As
was noted earlier, cycles in the observed pairwise difficulty orderings prevent any linear
metric from achieving 100% prediction accuracy, and the maximum achievable accuracy
for the SpellBEE student data is not known. We do note that the two different data-driven
approaches, logistic regression-based iterative estimation and probabilistic sorting,
arrived at very similar levels of predictive accuracy. Table 3 uses Spearman's rank
correlation coefficient to quantitatively compare the three metrics. One notable finding
is the extremely high rank correlation between the ProbBubble and I50 rev.3
data-driven metrics.
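Both evaluations can be reproduced in outline with a short script. The sketch below is illustrative rather than the authors' code: `distinctions` (a list of (harder, easier) word pairs) and the difficulty mappings are hypothetical stand-ins for the mined data, and the Spearman formula assumes untied ranks.

```python
def pairwise_accuracy(difficulty, distinctions):
    """Fraction of observed pairwise orderings that a difficulty
    metric predicts correctly (the quantity reported in Table 2).

    distinctions is a list of (harder, easier) word pairs;
    difficulty maps each word to its estimated difficulty score.
    """
    correct = sum(1 for harder, easier in distinctions
                  if difficulty[harder] > difficulty[easier])
    return correct / len(distinctions)

def spearman_rho(xs, ys):
    """Spearman's rank correlation for two untied score lists."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Here `pairwise_accuracy` corresponds to the percentages in Table 2, and `spearman_rho` to the coefficients in Table 3.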
6. Conclusion
The findings from the research questions posed here are both reassuring and revealing.
Although the NISS study was done over 50 years ago, much of its value has been
retained. The NISS-based I50 difficulty metric correctly predicted
76% of the pairwise difficulty orderings mined from SpellBEE student data. Many of
the challenges for which the difficulty metric achieved low predictive accuracy
corresponded to words whose cultural relevance or prominence has changed over the past
few decades. The data-driven techniques presented in Section 5 offer a means for
incorporating these changes back into a difficulty metric. After doing so, we found that
predictive accuracy increased by approximately 10%, to the 85% agreement level.
The key technique used here to enable the assessment and improvement of problem
difficulty estimates works even when not all students have attempted all challenges or
when the selection of challenges for students is highly biased. It is data-driven, based
on identifying and counting pairwise distinctions indicated indirectly through observations
of student behavior over the duration of use of an educational system. The pairwise
distinction-based techniques for estimating problem difficulty explored here
are part of a larger campaign to develop methods for constructing educational systems
that require a minimal amount of expert domain knowledge and model-building. Our
BEEweb model is one such approach, the Q-matrix method is another [3], and most of
the IRT-based systems discussed in the introduction are as well. Designing BEEweb
activities only requires domain knowledge in the form of a problem difficulty function and a
response accuracy function. The latter can usually be created without expertise, and the
former can now be approached, even when collected data is sparse and biased, using the
techniques discussed in this paper.
References
[1] Ari Bader-Natal and Jordan B. Pollack. Motivating appropriate challenges in a reciprocal tutoring system.
In C.-K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, editors, Proceedings of the 12th International
Conference on Artificial Intelligence in Education (AIED 2005), pages 49–56, Amsterdam, July
2005. IOS Press.
[2] Ari Bader-Natal and Jordan B. Pollack. BEEweb: A multi-domain platform for reciprocal peer-driven
tutoring systems. In M. Ikeda, K. Ashley, and T.-W. Chan, editors, Proceedings of the 8th International
Conference on Intelligent Tutoring Systems (ITS 2006), pages 698–700. Springer-Verlag, June 2006.
[3] Tiffany Barnes. The Q-matrix method: Mining student response data for knowledge. Technical Report
WS-05-02, AAAI-05 Workshop on Educational Data Mining, Pittsburgh, 2005.
[4] Klaus Brinker, Johannes Fürnkranz, and Eyke Hüllermeier. Label ranking by learning pairwise preferences.
Journal of Machine Learning Research, 2005.
[5] Leonard S. Cahen, Marlys J. Craun, and Susan K. Johnson. Spelling difficulty – a survey of the research.
Review of Educational Research, 41(4):281–301, October 1971.
[6] Chih-Ming Chen, Chao-Yu Liu, and Mei-Hui Chang. Personalized curriculum sequencing utilizing
modified item response theory for web-based instruction. Expert Systems with Applications, 30, 2006.
[7] Bruce Choppin. A fully conditional estimation procedure for Rasch model parameters. CSE Report 196,
Center for the Study of Evaluation, University of California, Los Angeles, 1983.
[8] Ricardo Conejo, Eduardo Guzmán, Eva Millán, Mónica Trella, José Luis Pérez-De-La-Cruz, and
Antonia Ríos. SIETTE: A web-based tool for adaptive testing. International Journal of Artificial Intelligence
in Education, 14:29–61, 2004.
[9] Michel C. Desmarais, Shunkai Fu, and Xiaoming Pu. Tradeoff analysis between knowledge assessment
approaches. In C.-K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, editors, Proceedings of the 12th
International Conference on Artificial Intelligence in Education (AIED 2005). IOS Press, 2005.
[10] B. S. Everitt. The Analysis of Contingency Tables. Chapman and Hall, 1977.
[11] Gerhard H. Fischer and Ivo W. Molenaar, editors. Rasch Models: Foundations, Recent Developments,
and Applications. Springer-Verlag, New York, 1995.
[12] Harry A. Greene. New Iowa Spelling Scale. State University of Iowa, Iowa City, 1954.
[13] Jeff Johns, Sridhar Mahadevan, and Beverly Woolf. Estimating student proficiency using an item
response theory model. In M. Ikeda, K. Ashley, and T.-W. Chan, editors, Proceedings of the 8th International
Conference on Intelligent Tutoring Systems (ITS 2006), pages 473–480, 2006.
[14] Mark Wilson and R. Darrell Bock. Spellability: A linearly ordered content domain. American Educational
Research Journal, 22(2):297–307, Summer 1985.
Toward the extraction of production rules
for solving logic proofs
Tiffany Barnes, John Stamper
Department of Computer Science, University of North Carolina at Charlotte
tbarnes2@uncc.edu, jcstampe@uncc.edu
Abstract: In building intelligent tutoring systems, it is critical to be able to
understand and diagnose student responses in interactive problem solving.
However, building this understanding into the tutor is a time-intensive process
usually conducted by subject experts. Much of this time is spent in building
production rules that model all the ways a student might solve a problem. We
propose a novel application of Markov decision processes (MDPs), a
reinforcement learning technique, to automatically extract production rules for an
intelligent tutor that learns. We demonstrate the feasibility of this approach by
extracting MDPs from student solutions in a logic proof tutor, and using these to
analyze and visualize student work. Our results indicate that extracted MDPs
contain many production rules generated by domain experts and reveal errors that
experts do not always predict. These MDPs also help us identify areas for
improvement in the tutor.
Keywords: educational data mining, Markov decision processes
1. Introduction
According to the ACM computing curriculum, discrete mathematics is a core course in
computer science, and an important topic in this course is solving formal logic proofs.
However, this topic is particularly difficult for students, who are unfamiliar with
logic rules and with manipulating symbols. To give students extra practice and help in
writing logic proofs, we are building an intelligent tutoring system on top of our
existing proof-verifying program. Our experience in teaching discrete math, together with
student surveys, indicates that students particularly need feedback when they get stuck.
The problem of offering individualized help and feedback is not unique to logic
proofs. Through adaptation to individual learners, intelligent tutoring systems (ITSs)
can have significant effects on learning [1]. However, building one hour of adaptive
instruction takes between 100 and 1,000 hours of work by subject experts, instructional
designers, and programmers [2], and a large part of this time is spent developing the
production rules used to model student behavior and progress. A variety of
approaches have been used to reduce the development time for ITSs, including ITS
authoring tools (such as ASSERT and CTAT) and building constraint-based student
models instead of production rule systems. ASSERT is an ITS authoring system that
uses theory refinement to learn student models from an existing knowledge base and
student data [3]. Constraint-based tutors, which look for violations of problem
constraints, require less time to construct and have been favorably compared to
cognitive tutors, particularly for problems that are not heavily procedural [4].
Some systems, including RIDES, DIAG, and CTAT, use teacher-authored or
demonstrated examples to develop ITS production rules. RIDES is a "Tutor in a Box"
system used to build training systems for military equipment usage, while DIAG was
built as an expert diagnostic system that generates context-specific feedback for
students [2]. These systems cannot easily be generalized, however, to learn from
student data. CTAT has been used to develop "pseudo-tutors" for subjects including
genetics, Java, and truth tables [5]. This system has also been used with data to build
initial models for an ITS, in an approach called Bootstrapping Novice Data (BND) [6].
Similar to the goal of BND, we seek to use student data to directly create student
models for an ITS. However, instead of feeding student behavior data into CTAT to
build a production rule system, we propose to generate Markov decision processes
(MDPs) that represent all student approaches to a particular problem, and to use these
MDPs directly to generate feedback. We believe one of the most important contributions
of this work is the ability to generate feedback based on frequent, low-error student solutions.
We propose a method of automatically generating production rules from previous
student data, reducing the expert knowledge needed to generate intelligent,
context-dependent feedback. The system we propose is capable of continued refinement as
new data are provided. We illustrate our approach by applying MDPs to analyze student
work in solving formal logic proofs. This example is meant to demonstrate the
applicability of using MDPs to collect and model student behavior and to generate a
graph of student responses that can serve as the basis for a production rule system.
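The two steps just described, collecting student responses into a state-transition graph and scoring that graph so it can drive feedback, can be sketched as follows. This is a simplified illustration, not the authors' system: the (state, action) log format, the reward values, and the discount factor are all assumptions made for the example, and value iteration here scores each state by its best discounted route to the goal.

```python
from collections import defaultdict

def build_mdp(solution_paths):
    """Tally state transitions from logged student solution paths.

    Each path is a list of (state, action) pairs ending at 'GOAL';
    this log format is assumed for the example.
    """
    transitions = defaultdict(lambda: defaultdict(int))
    for path in solution_paths:
        for (state, action), (next_state, _) in zip(path, path[1:]):
            transitions[state][(action, next_state)] += 1
    return transitions

def value_iteration(transitions, goal='GOAL', gamma=0.9,
                    step_cost=-1.0, goal_reward=100.0, sweeps=100):
    """Score each state by its best discounted path to the goal, so a
    stuck student can be pointed toward a high-value next step."""
    values = defaultdict(float)
    values[goal] = goal_reward
    for _ in range(sweeps):
        for state, outs in transitions.items():
            # greedy backup over the actions observed in student data
            values[state] = max(step_cost + gamma * values[nxt]
                                for (_, nxt) in outs)
    return dict(values)
```

Transition counts from `build_mdp` could also weight the backup toward frequent, low-error routes; the uniform greedy backup above is the simplest variant.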
2. Background and Proofs Tutorial Context
Several computer-based teaching systems, including Deep Thought [7], CPT [8], and
the Logic-ITA [9], have been built to support the teaching and learning of logic proofs. Of
these, the Logic-ITA is the most intelligent, verifying proof statements as a student
enters them and providing feedback on student performance after the proof is complete.
The Logic-ITA also provides considerable logging and teacher feedback facilities to
support exploration of student performance [9], but it does not offer students
help in planning their work. In this research, we propose to augment our own existing
Proofs Tutorial with a cognitive architecture, derived using educational data mining,
that can help students avoid error-prone solutions, find optimal
solutions, and learn about other students' approaches.
In [10], the first author applied educational data mining to analyze completed
formal proof solutions for automatic feedback generation. However, that work did not
take student errors into account and could only provide general indications of student
approaches, as opposed to feedback tailored to a student's current progress. In this
work, we explore all student attempts at proof solutions, including partial proofs and
incorrect rule applications, and use visualization tools to learn how this work can be
extended to automatically extract a production rule system to add to our logic proof
tutorial. In [11], the second author performed a pilot study extracting Markov decision
processes for a simple proof from three semesters of student data from Deep Thought,
and verified that the rules extracted by the MDP conformed with expert-derived rules
and generated buggy rules that surprised the experts. In this work, we apply that
technique, extended with visualization tools, to new data from the Proofs Tutorial.
The Proofs Tutorial is a computer-aided learning tool implemented on NovaNET
(http://www.pearsondigital.com/novanet/). This program has been used for practice and
feedback in writing proofs in university discrete mathematics courses taught by the
Workshop of Educational Data Mining
12