Educational Data Mining

Cecily Heiner (Co-chair)

Neil Heffernan (Co-chair)

Tiffany Barnes (Co-chair)

Supplementary Proceedings of the 13th International

Conference of Artificial Intelligence in Education.

Marina del Rey, CA. USA. July 2007

Workshop of Educational Data Mining

Educational Data Mining Workshop

(http://www.educationaldatamining.org/)

Cecily Heiner; University of Utah; cecily@cs.utah.edu (Co-chair)

Neil Heffernan; Worcester Polytechnic Institute; nth@wpi.edu (Co-chair)

Tiffany Barnes; University of North Carolina at Charlotte, tbarnes2@uncc.edu (Co-chair)

Introduction

Educational data mining is the process of converting raw data from educational systems to useful

information that can be used to inform design decisions and answer research questions. Data mining

encompasses a wide range of research techniques that includes more traditional options such as database

queries and simple automatic logging as well as more recent developments in machine learning and

language technology.

Educational data mining techniques are now being used in ITS and AIED research worldwide. For

example, researchers have used educational data mining to:

o Detect affect and disengagement

o Detect attempts to circumvent learning called "gaming the system"

o Guide student learning efforts

o Develop or refine student models

o Measure the effect of individual interventions

o Improve teaching support

o Predict student performance and behavior

However, these techniques could achieve greater use and bring wider benefits to the ITS and AIED

communities. We need to develop standard data formats, so that researchers can more easily share data

and conduct meta-analysis across tutoring systems, and we need to determine which data mining

techniques are most appropriate for the specific features of educational data, and how these techniques

can be used on a wide scale. The workshop will provide a forum to present preliminary but promising

results that can advance our knowledge of how to appropriately conduct educational data mining and

extend the field in new directions.

Topics

Topics include, but are not limited to:

o What new discoveries does data mining enable?

o What techniques are especially useful for data mining?

o How can we integrate data mining and existing educational theories?

o How can data mining improve teacher support?

o How can data mining build better student models?

o How can data mining dynamically alter instruction more effectively?

o How can data mining improve system and intervention evaluations?

o How do these evaluations lead to system and intervention improvements?

o How is data mining fundamentally different from other research methods?

TABLE OF CONTENTS

Evaluating problem difficulty rankings using sparse student data
Ari Bader-Natal, Jordan Pollack

Toward the extraction of production rules for solving logic proofs
Tiffany Barnes, John Stamper

Difficulties in inferring student knowledge from observations (and why you should care)
Joseph E. Beck

What’s in a word? Extending learning factors analysis to modeling reading transfer
James M. Leszczenski, Joseph E. Beck

Predicting student engagement in intelligent tutoring systems using teacher expert knowledge
Nicholas Lloyd, Neil Heffernan, Carolina Ruiz

Analyzing fine-grained skill models using Bayesian and mixed effects methods
Zachary Pardos, Mingyu Feng, Neil Heffernan, Cristina Heffernan, Carolina Ruiz

Mining learners’ traces from an online collaboration tool
Dilhan Perera, Judy Kay, Kalina Yacef, Irena Koprinska

Mining on-line discussions: Assessing technical quality for student scaffolding and classifying messages for participation profiling
Sujith Ravi, Jihie Kim, Erin Shaw

All in the (word) family: Using learning decomposition to estimate transfer between skills in a reading tutor that listens
Xiaonan Zhang, Jack Mostow, Joseph Beck

Evaluating Problem Difficulty Rankings

Using Sparse Student Response Data

Ari BADER-NATAL¹, Jordan POLLACK

DEMO Lab, Brandeis University

Abstract. Problem difficulty estimates play important roles in a wide variety of

educational systems, including determining the sequence of problems presented to

students and the interpretation of the resulting responses. The accuracy of these

metrics is therefore important, as they can determine the relevance of an educational

experience. For systems that record large quantities of raw data, these observations

can be used to test the predictive accuracy of an existing difficulty metric.

In this paper, we examine how well one rigorously developed – but potentially out-

dated – difficulty scale for American-English spelling fits the data collected from

seventeen thousand students using our SpellBEE peer-tutoring system. We then at-

tempt to construct alternate metrics that use collected data to achieve a better fit.

The domain-independent techniques presented here are applicable when the ma-

trix of available student-response data is sparsely populated or non-randomly sam-

pled. We find that while the original metric fits the data relatively well, the data-

driven metrics provide approximately 10% improvement in predictive accuracy.

Using these techniques, a difficulty metric can be periodically or continuously re-

calibrated to ensure the relevance of the educational experience for the student.

1. Introduction

Estimates of student proficiency and problem difficulty play central roles in Item Re-

sponse Theory (IRT) [11]. Several current educational systems make use of this theory,

including our own BEEweb peer-tutoring activities [2,8,9,13]. IRT-based analysis often

focuses on estimating student proficiency in the task domain, but the challenge of esti-

mating problem difficulty should not be overlooked. While student proficiency estimates

can inform assessment, problem difficulty estimates can be used to refine instruction:

these metrics can affect the selection and ordering of problems posed and can influ-

ence the interpretation of the resulting responses [6]. It is therefore important to choose

a good difficulty metric initially and to periodically evaluate the accuracy of a chosen

metric with respect to available student data. In this paper, we examine how accurately

one rigorously developed – but potentially outdated – difficulty scale for the domain of

American-English spelling predicts the data collected from students using our SpellBEE

system [1]. The defining challenge in providing this assessment lies in the nature of the

data. As SpellBEE is a peer-tutoring system, the challenges posed to students are deter-

mined by other students, resulting in data that is neither random nor complete. In this

¹ Correspondence to: Ari Bader-Natal, Brandeis University, Computer Science Department – MS 018.

Waltham, MA 02454. USA. Tel.: +1 781 736 3366; Fax: +1 781 736 2741; E-mail: ari@cs.brandeis.edu.

work, we rely on a pairwise comparison technique designed to be robust to data with

these characteristics. After assessing the relevance of this existing metric (in terms of

predictive accuracy), we will examine some related techniques for initially constructing

a difficulty metric based on non-random, incomplete samples of observed student data.

2. American-English spelling: A sample task domain

The educational system examined here, SpellBEE, was designed to address the task do-

main of American-English spelling [1]. SpellBEE is the oldest of a growing suite of

web-based reciprocal tutoring systems using the Teacher’s Dilemma as a motivational

mechanism [2]. For the purposes of this paper, however, the mechanisms for motivation

and interaction can be ignored, and the SpellBEE system and the difficulty metric used

by it can be specifically re-characterized for an educational data mining audience.

2.1. Relevant characteristics of the SpellBEE system

Students access SpellBEE online at SpellBEE.org from their homes or schools. As of

May 2007, over 17,000 students have actively participated. After creating a user account,

a student is able to log in, choose a partner, and begin the activity.² During the activity,

students take turns posing and solving spelling problems. When posing a problem, the

student selects from a short list of words randomly drawn from the database of word-

challenges. This database is comprised of 3,129 words drawn from Greene’s New Iowa

Spelling Scale (NISS), which will be discussed in the next section [12].³ When responding

to a problem, the student types in what they believe to be the correct spelling of the

challenge word. The accuracy of the response is assessed to be either correct or incorrect.

Figure 1 presents a list of the relevant data stored in the SpellBEE server logs.

To date, we have observed over 64,000 unique (case-insensitive) responses to the

challenges posed,⁴ distributed across over 22,000 completed games consisting of seven

questions attempted per student. Student participation, measured in games completed,

has not been uniform, however. Of the challenges in the space, most students have only

attempted a very small fraction. In fact, when examining the response matrix of every

student by every challenge, less than 1% of the matrix data is known. An important

characteristic of the SpellBEE data, then, is that the response matrix is remarkably sparse.

Given that the students acting as tutors are able to – and systemically motivated to –

express their preferences and hunches through the problems that they select, another

important characteristic of the SpellBEE data is that the data present in the student-

challenge response matrix is also biased. The effects of this bias can be found in the

following example: 16% of student attempts to spell the word “file” were correct, while

66% of attempts to spell the word “official” were correct. The average grade level among

the first set of students was 3.9, while for the second set it was 6.4. In Section 3.2 we

² In the newer BEEweb activities, if no one else is present, a student can practice alone on problems randomly

drawn from the database of challenges posed in the past.

³ In SpellBEE, the word-challenges are presented in the context of a sentence, and so of the words in Greene’s

list, we only use those found in the seven public-domain books that we parsed for sentences.

⁴ Of these, 17,391 were observed more than once. In this paper, we restrict the set of responses that we

consider to this subset. See Footnote 7 for the rationale behind this.

Figure 1. The SpellBEE server logs data about each turn taken by each student, as shown in the first list. The

data in the first list is sufficient to generate the data included in the second list.

1. time : a time-stamp allows responses to be ordered

2. game : identifies the game in which this turn occurred

3. tutor : identifies the student acting as the tutor in this turn

4. tutee : identifies the student acting as the tutee in this turn

5. challenge : identifies the challenge posed by the tutor

6. response : identifies the response offered by the tutee

1. difficulty : the difficulty rating of the challenge posed by the tutor

2. accuracy : the accuracy rating of the response offered by the tutee

will present techniques designed to draw more meaningful difficulty information from

this type of data.
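The log fields listed in Figure 1 could be held in a small record type. The sketch below is illustrative only: the `Turn` name, the field types, and the case-insensitive exact-match rule used to derive accuracy are assumptions, not details given in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One logged turn, mirroring the first list in Figure 1 (types are assumed)."""
    time: float      # time-stamp, used only to order responses
    game: int        # identifier of the game in which the turn occurred
    tutor: str       # student acting as tutor
    tutee: str       # student acting as tutee
    challenge: str   # word posed by the tutor
    response: str    # spelling typed by the tutee

    @property
    def accuracy(self) -> bool:
        # Assumed rule: a response is correct when it matches the challenge,
        # ignoring case and surrounding whitespace.
        return self.response.strip().lower() == self.challenge.strip().lower()


turn = Turn(time=1.0, game=7, tutor="s1", tutee="s2",
            challenge="official", response="Oficial")
print(turn.accuracy)  # False
```

The derived difficulty field is omitted here because it also depends on the student's grade level and on the NISS table.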

2.2. Origin, use, and application of the problem difficulty metric

When trying to define a measure of problem difficulty for the well-studied domain of

American-English spelling, we were able to benefit from earlier research in the field.

Greene’s “New Iowa Spelling Scale” provides a rich source of data on word spelling dif-

ficulty, drawn from a vast study published in 1954. Greene first developed a methodol-

ogy for selecting words for his list (5,507 were eventually used.) Approximately 230,000

students from 8,800 classrooms (grades 2 through 8) around the United States participated

in the study, totaling over 23 million spelling responses [12]. From these, Greene

calculated the percentage of correct responses for each word for each grade. This table

of success rates is used in SpellBEE to calculate the difficulty of each spelling problem

for students, whose grade level is known.

3. Techniques for assessing relative challenge difficulty

The research questions addressed in this paper focus on the fit of the difficulty model

based on the NISS data to the observed SpellBEE student data. Two different techniques

are involved in calculating this fit. The first converts the graded NISS data to a linear

scale. The second identifies from the observed student data a difficulty ordering over

pairs of problems, in a manner appropriate for a sparse and biased data matrix. Both will

be employed to address the research questions in the following sections.

3.1. Linearly ordering challenges using the difficulty metric

Many subsequent studies have explored various aspects of Greene’s study and the data

that it produced. Cahen, Craun, and Johnson [5] and, later, Wilson and Bock [14] explore

the degree to which various combinations of domain-specific predictors could account

for Greene’s data. Initially starting with 20 predictors, Wilson and Bock work down

to a regression model with an adjusted R² value of 0.854.⁵ Here, we are not interested in

⁵ The most influential of these predictors was the length of the word.

Figure 2. Difficulty data for two words from the NISS study are plotted, and the I50 statistics are calculated.
[Plot: percentage of students with correct responses (0% to 100%) against student grade level (2 to 8) for the words "above" and "acknowledge"; I50("above") = 3.7, I50("acknowledge") = 7.875.]

predicting the NISS results, but instead are interested in assessing the fit (or predictive

power) of the 1954 NISS results to observations made of students using SpellBEE over

50 years later. We will draw upon one statistic used by Wilson and Bock: the one-dimensional

flattening of the seven-graded NISS data. This statistic, which they refer to

as the “location” of the word, is the (fractional) grade level at which 50% of the NISS

students correctly spell word w.⁶ We denote this as I50(w). Figure 2 illustrates how the

graded difficulty data is used to derive this statistic for two different words. The

value of this statistic is that it provides a single grade-independent difficulty value for a

word that can be compared directly to that of other words.
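The I50 statistic can be sketched in a few lines, using the linear interpolation between grade levels that the authors describe in their footnote on the 50% threshold. The function name, the input layout, and the clamping to grade 2 or grade 8 when the curve never crosses 50% are assumptions of this sketch; the success rates in the example are hypothetical.

```python
from typing import Sequence

GRADES = (2, 3, 4, 5, 6, 7, 8)


def i50(pct_correct: Sequence[float], grades: Sequence[int] = GRADES) -> float:
    """Fractional grade at which 50% of students spell the word correctly.

    pct_correct[i] is the percentage correct at grades[i]. Uses linear
    interpolation between adjacent grades. Clamping to the lowest or highest
    grade when the curve never crosses 50% is an assumption of this sketch.
    """
    if pct_correct[0] >= 50.0:
        return float(grades[0])          # already at or above 50% in the lowest grade
    for i in range(1, len(grades)):
        lo, hi = pct_correct[i - 1], pct_correct[i]
        if lo < 50.0 <= hi:
            # Fraction of the way from grades[i-1] to grades[i] at which 50% is reached.
            frac = (50.0 - lo) / (hi - lo)
            return grades[i - 1] + frac * (grades[i] - grades[i - 1])
    return float(grades[-1])             # 50% never reached in the graded data


# Hypothetical success rates for one word, grades 2 through 8.
print(i50([10, 30, 60, 80, 90, 95, 98]))  # crosses 50% between grades 3 and 4
```

Because the result is a single number per word, two words can be compared directly without reference to a particular grade.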

3.2. Identifying pairwise difficulty orderings using observed student data

Given the characteristics of the data collected from the SpellBEE system, identifying the

more difficult of a pair of problems based on this data is not trivial. The percentage of

correct responses to a challenge, the calculation used to generate the NISS data, is not

appropriate here, as the assignment of challenges to students was done in a biased, non-

random manner (recall the “file”/“official” example from Section 2.1.) Tutors, in fact,

are motivated to base their challenge selection on the response accuracies that they an-

ticipate. A more appropriate measure, rooted in several different literatures, is to assess

pairwise problem difficulties on distinctions indirectly indicated by the students. In the

statistics literature, McNemar’s test provides a statistic based on this concept [10]; in the

IRT literature, this is used as a data reduction strategy for Rasch model parameter esti-

mation [7], and the Machine Learning literature includes various approaches to learning

⁶ Wilson and Bock calculate the 50% threshold based on a logistic model fit to the discrete grade-level data,

while we calculate the threshold slightly differently, based on a linear interpolation of the grade-level data.

Table 1. While the I50 metric flattens the grade-specific NISS data to a single dimension, the relative difficulty
ordering of most word-pairs based on the graded NISS data is the same as when based on the I50 scale. In
this table, we quantify the amount of agreement between I50 and each set of grade-specific NISS data using
Spearman’s rank correlation coefficient. The strong correlations observed suggest that the unidimensional scale
sufficiently captures the relative difficulty information from the original NISS dataset. (The number of words,
N, varies by grade, as the NISS study did not show several of the harder words to the younger students.)

Grade    N       Spearman’s ρ
2        2218    0.751
3        3059    0.933
4        3126    0.977
5        3129    0.974
6        3129    0.960
7        3129    0.935
8        3129    0.915

rankings based on pairwise preferences [4]. Assume that for some specific pair of prob-

lems, such as the spelling of the words “about” and “acknowledge”, we first identify all

students in the SpellBEE database who have attempted both words. Given that response

accuracy is dichotomous, there are only four possible configurations of a student’s re-

sponse accuracy to the pair of challenges. In the cases where the student responds to both

correctly or incorrectly, no distinction is made between the pair. But in the cases where

the student correctly responds to one but incorrectly to the other, we classify this as a

distinction indicating a difficulty ordering between the two problems.⁷
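The counting of distinctions described above can be sketched as follows. The record layout, the function name, and the choice to keep only a student's last recorded attempt at a word are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Tuple

# (student, word, correct) records; names and layout are illustrative.
Response = Tuple[str, str, bool]


def count_distinctions(responses: Iterable[Response]) -> Dict[Tuple[str, str], int]:
    """Return counts[(easier, harder)] = number of students who spelled
    `easier` correctly and `harder` incorrectly ("static student" model)."""
    by_student: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for student, word, correct in responses:
        # Assumption: if a student attempts a word more than once, keep the last record.
        by_student[student][word] = correct

    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for attempts in by_student.values():
        for a, b in combinations(sorted(attempts), 2):
            if attempts[a] and not attempts[b]:
                counts[(a, b)] += 1      # a looked easier than b
            elif attempts[b] and not attempts[a]:
                counts[(b, a)] += 1      # b looked easier than a
            # both correct or both incorrect: no distinction
    return dict(counts)


log = [
    ("s1", "about", True), ("s1", "acknowledge", False),
    ("s2", "about", True), ("s2", "acknowledge", True),
    ("s3", "about", False), ("s3", "acknowledge", True),
]
print(count_distinctions(log))
# {('about', 'acknowledge'): 1, ('acknowledge', 'about'): 1}
```

Only students who attempted both words contribute, which is what makes the comparison usable on a sparse, non-randomly sampled response matrix.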

It is also worth stating that in this study, we assume a “static student” model, so we

are not concerned with the order of these two responses. At the cost of some data loss,

one could instead assume a “learning student” model, for which only a correct response

on one problem followed by an incorrect response on the other would define a distinction.

Had the incorrect response been observed first, we could not rule out the possibility that

the difference was due to a change in the student’s abilities over time, and not necessarily

an indication of difference in problem difficulties.⁸
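For comparison, the stricter "learning student" rule can be sketched as a variant that looks at response order. This is not the model used in the study; the record layout and the assumption that each student attempts each word at most once are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (time, student, word, correct); the record layout is illustrative.
TimedResponse = Tuple[float, str, str, bool]


def count_ordered_distinctions(
    responses: Iterable[TimedResponse],
) -> Dict[Tuple[str, str], int]:
    """counts[(easier, harder)] under a "learning student" assumption: a
    distinction is kept only when the correct response came first."""
    by_student: Dict[str, List[Tuple[float, str, bool]]] = defaultdict(list)
    for time, student, word, correct in responses:
        by_student[student].append((time, word, correct))

    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for turns in by_student.values():
        turns.sort()                                  # chronological order
        for i, (_, first_word, first_ok) in enumerate(turns):
            if not first_ok:
                continue                              # must start with a correct response
            for _, later_word, later_ok in turns[i + 1:]:
                if not later_ok and later_word != first_word:
                    counts[(first_word, later_word)] += 1
    return dict(counts)


timed_log = [
    (1.0, "s1", "about", True), (2.0, "s1", "acknowledge", False),
    (1.0, "s2", "acknowledge", False), (2.0, "s2", "about", True),
]
print(count_ordered_distinctions(timed_log))
# {('about', 'acknowledge'): 1}
```

Student s2's responses are discarded because the incorrect response came first, which illustrates the data loss the text mentions.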

An example may clarify. If we count the number of distinctions in both directions

made by all students (e.g. 12 students in SpellBEE spelled “about” correctly and “ac-

knowledge” incorrectly, while 2 students spelled “about” incorrectly and “acknowledge”

correctly), we have a strong indication of relative problem difficulty. McNemar’s test assigns

a significance to this pair of distinction counts. In this work, we more closely follow

the IRT approach, relying only on the relative size of the two counts (and not the significance).

Thus, since 12 distinctions were found in one direction and only 2 in the other, we

say that we observed the word “about” to be easier than the word “acknowledge” based

on collected SpellBEE student data. If distinctions were available for every problem pair,

⁷ We recognize that some distinctions are spurious, for which the incorrect response was not reflective of

the student’s abilities. Here we take a simplistic approach of identifying and ignoring non-responses (in which

the student typed nothing) and globally-unique responses (which no other student ever gave in response to any challenge).

Globally-unique responses encompass responses from students who don’t yet understand the activity,

responses from students who did not hear the audio recording, responses from students attempting to use the

response field as a chat interface, and responses from students making no effort to engage in the activity.

⁸ Another possible model is a “dynamic student” model, for which student abilities may get better or worse

over time. Under this model, no distinctions can be definitively attributed to difference in problem difficulty.

a total of 3,129 × 3,128 = 9,787,512 pairwise problem orderings could be expressed. In

our collected data so far, we have 3,349,602 of these problem pairs for which we have

distinctions recorded. In the subsequent sections, we measure the fitness of a predictive

model (like I50) based on how many of these pairwise orderings are satisfied.⁹
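The 12-versus-2 example can be worked through in code. The sketch below shows the relative-size rule the text describes, alongside the significance that an exact (binomial) form of McNemar's test would assign to the same counts. The exact form is an assumption chosen for simplicity, returning `None` on a tie is also an assumption, and the significance value plays no part in the authors' method.

```python
from math import comb
from typing import Optional


def mcnemar_exact_p(n_ab: int, n_ba: int) -> float:
    """Two-sided exact McNemar p-value for discordant counts n_ab and n_ba.

    Under the null hypothesis each distinction is equally likely to point in
    either direction, so the smaller count follows Binomial(n, 0.5).
    """
    n = n_ab + n_ba
    if n == 0:
        return 1.0
    k = min(n_ab, n_ba)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def easier_word(word_a: str, word_b: str, n_ab: int, n_ba: int) -> Optional[str]:
    """Relative-size rule: n_ab counts students who got word_a right and
    word_b wrong; n_ba counts the reverse. Returns None on a tie (assumed)."""
    if n_ab == n_ba:
        return None
    return word_a if n_ab > n_ba else word_b


print(easier_word("about", "acknowledge", 12, 2))   # about
print(round(mcnemar_exact_p(12, 2), 4))             # 0.0129
```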

4. Assessing the fit of the NISS-based I50 model to the SpellBEE student data

Given the NISS-based I50 model of problem difficulty and the data-driven technique

for turning observed distinctions recorded in the SpellBEE database into pairwise

difficulty orderings, we can now explore various methods to assess the applicability of

the model to the data.

4.1. Assessing fit with a regression model

The first method is to construct a regression model that uses I50 to predict observed difficulty.

Since observed difficulty is currently available only in pairwise form, this requires

an additional step in which we flatten these pairwise orderings into one total ordering

over all problems. As this is a highly non-trivial step, the results should be interpreted

tentatively. Here, we accomplish a flattening by calculating, for each challenge, the per-

centage of available pairwise orderings for which the given challenge was the more diffi-

cult of the pair. So if 100 pairwise orderings involve the challenge word “acknowledge”,

and 72 of these found “acknowledge” to be the harder of the pair, we would mark “ac-

knowledge” as harder than 72% of other words. A regression model was then built on

this, using I50 as a predictor of the pairwise-derived percentage. The model, after filtering

out data points causing ceiling and floor effects (i.e. I50(w) = 2.0 or I50(w) = 8.0), had

an adjusted R² value of 0.337 (p < 0.001 for the model). The corresponding scatterplot

is shown in Figure 3.¹⁰ The relatively low adjusted R² value is likely at least partially a

result of the flattening step (rather than solely due to poor fit.) Had we flattened the data

differently, this value would clearly change. In order to obtain a more reliable measure

of model fitness, we seek to avoid any unnecessary processing of the mined data.
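The flattening step used for the regression can be sketched as below. The input is assumed to be one (easier, harder) pair per problem pair for which an ordering was observed; the function name and data layout are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def percent_harder(orderings: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """orderings holds (easier, harder) pairs, one per observed problem pair.
    Returns, for each word, the percentage of its pairwise orderings in
    which it was the harder word."""
    harder = defaultdict(int)
    total = defaultdict(int)
    for easier_w, harder_w in orderings:
        total[easier_w] += 1
        total[harder_w] += 1
        harder[harder_w] += 1
    return {w: 100.0 * harder[w] / total[w] for w in total}


# Hypothetical orderings over four words.
pairs = [("about", "acknowledge"), ("file", "acknowledge"),
         ("acknowledge", "official"), ("about", "file")]
print(percent_harder(pairs))
```

A word involved in 100 orderings and found harder in 72 of them would receive the value 72.0, matching the "acknowledge" illustration in the text.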

4.2. Assessing fit as the percentage of agreements on pairwise difficulties

The second method that we explore provides a more direct comparison, without any fur-

ther flattening of the student data. Here, we simply calculate the percentage of observed

pairwise difficulty orderings (across all challenges) for which the I50 model correctly

predicts the observed pairwise difficulty ordering. When we do this across all of the

3,349,602 difficulty orderings that we have constructed from the student data, we find

that the I50 model correctly predicts 2,534,228 of these pairwise orderings, providing a

75.66% agreement with known pairwise orderings from the mined data. Remarkably, we

found that the predictive accuracy of the I50 model did not significantly change as the

⁹ Note that it is not possible to achieve a 100% fit, as some cycles exist among these pairwise orderings.

¹⁰ The outliers in this plot mark the problems that are ranked most differently by the two measures. The word

“arithmetic”, for example, was found to be difficult by SpellBEE students, but was not found to be particularly

difficult for the students in the NISS study. Variations like this one may reflect changes in the teaching or in the

frequency of usage since the NISS study was performed 50 years ago.

Figure 3. Words are plotted by their difficulty on the I50 scale and by the percentage of other words for which
the observed pairwise orderings found the word to be the harder of the pair. An adjusted R² value of 0.490 was
calculated for this model. (When ignoring the words affected by a ceiling or floor effect in either variable, the
adjusted R² value drops to 0.377.)
[Scatterplot: percentage of comparisons finding word harder (0% to 100%) against I50 difficulty value (2 to 8).]

quantity of student data used for the distinction varied. 75.1% of predictions based on one

distinction were accurate, while 74.7% of predictions based on 25 distinctions were ac-

curate (intermediate values ranged from 71.0% to 77.6%). This flat relationship suggests

that pairwise difficulty orderings constructed from a minimal amount of observed data

may be just as accurate, in the aggregate, as those orderings constructed when additional

data is available.
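The pairwise-agreement measure can be sketched as follows. The function name and data layout are illustrative, the difficulty values in the example are hypothetical, and counting ties in the metric as misses is an assumption. The last line simply recomputes the reported percentage from the two counts given in the text.

```python
from typing import Dict, Iterable, Tuple


def pairwise_agreement(
    difficulty: Dict[str, float],
    orderings: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of observed (easier, harder) orderings that the metric predicts.

    A prediction counts as correct only when the metric rates `harder`
    strictly above `easier`; treating ties as misses is an assumption.
    """
    orderings = list(orderings)
    if not orderings:
        raise ValueError("no pairwise orderings to score")
    hits = sum(1 for easier, harder in orderings
               if difficulty[harder] > difficulty[easier])
    return hits / len(orderings)


# Hypothetical I50-style values and observed orderings.
i50_values = {"about": 3.1, "file": 4.0, "official": 6.2, "acknowledge": 7.9}
observed = [("about", "acknowledge"), ("official", "file"), ("about", "official")]
print(pairwise_agreement(i50_values, observed))   # 2 of 3 orderings predicted

# The reported figure follows from the counts given in the text.
print(round(100 * 2_534_228 / 3_349_602, 2))      # 75.66
```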

5. Incorporating SpellBEE student data in a revised difficulty model

We now know that there is a 75.66% agreement in pairwise difficulty orderings between

the I50 difficulty metric derived from the NISS data and the observed pairwise preferences

mined from the SpellBEE database. Can we improve upon this? We will present

an approach that iteratively updates the I50 problem difficulty estimates using the mined

data and a logistic regression model. Rather than producing a single predictive model, we

construct one logistic model for each challenge, and use these fitted models to update our

estimates of the problem difficulty. By applying this iteratively, we hope to converge on a problem

difficulty metric that better fits the observed data. This process is inspired by the parameter

estimation procedures for Rasch models [11], which may not be directly applicable

due to the large size of our problem space.

For a given challenge c1 (e.g. “acknowledge”), we can first generate the list of all
other challenges for which SpellBEE students have expressed distinctions (in either direction).
In Section 3.2, we chose to censor these distinctions in order to generate a binary
value representing the difficulty ordering. Here we will make use of the actual distinction
counts in each direction. For each challenge c2 with which pairwise distinctions for
c1 are available, we note our current-best estimate of the difficulty of c2 (initially, using
I50 values), and note the number of distinctions indicating that c1 is the more difficult
challenge. We can then regress the grouped distinction data on the problem difficulty estimate
data to construct a logistic model relating the two. For some c1, if the relationship
is statistically significant, we can use it to generate a revised estimate for the difficulty of
that challenge. By solving the regression equation for the c2 problem difficulty value for
which 50% of distinctions find c1 harder, we can calculate the difficulty of a problem for
which relative-difficulty distinctions are equally likely in either direction. This provides
a revised estimate for the difficulty of the original problem, c1. We use this procedure to
calculate revised estimates for every challenge in the space (unless the resulting logistic
regression model is not statistically significant, in which case we retain our previous
difficulty estimate). This process can be repeated iteratively, using the revised difficulty
estimates as the basis of the new regression models. Figure 4 plots this data for one word,
using the difficulty estimates resulting from the third iteration of the estimation.

Figure 4. A logistic regression model is used to estimate the difficulty of the word “abandon.” At left, the first
estimate is based on the original I50 difficulty values. At right, the third iteration of the estimate is constructed
based on data from the previous best estimate. The point estimate dropped from 8.0 (from I50) to 7.06 (from
iteration 1) to 6.81 (from iteration 3).
[Two plots: percentage of distinctions finding word harder (0% to 100%) against difficulty estimate (2 to 8).]
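One revision step for a single challenge c1 can be sketched in plain Python. The sketch fits a two-parameter logistic model to grouped counts with Newton-Raphson and solves for the difficulty value at which the fitted probability is 50%, which is -b0/b1. The function names, the fitting routine, and the example counts are assumptions; the paper does not say how the models were fitted. The significance test described in the text is omitted, and a simple sign check on the slope stands in for it.

```python
import math
from typing import List, Optional, Tuple

# (difficulty estimate of c2, distinctions finding c1 harder, total distinctions)
Row = Tuple[float, int, int]


def fit_logistic(rows: List[Row], iters: int = 50, tol: float = 1e-10) -> Tuple[float, float]:
    """Fit P(c1 harder than c2) = sigmoid(b0 + b1 * d2) to grouped counts
    by Newton-Raphson on the binomial log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d2, harder, total in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * d2)))
            resid = harder - total * p          # observed minus expected count
            w = total * p * (1.0 - p)           # binomial variance weight
            g0 += resid
            g1 += resid * d2
            h00 += w
            h01 += w * d2
            h11 += w * d2 * d2
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("degenerate data: cannot fit a slope")
        # Solve the 2x2 Newton system H * step = g.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1


def revised_difficulty(rows: List[Row]) -> Optional[float]:
    """Difficulty value at which distinctions are equally likely either way."""
    b0, b1 = fit_logistic(rows)
    if b1 >= 0.0:
        # Stand-in for the significance check: c1 should be found harder less
        # often as d2 grows. Otherwise the caller keeps the previous estimate.
        return None
    return -b0 / b1


# Hypothetical grouped data for one challenge c1.
rows = [(2, 9, 10), (3, 8, 10), (4, 7, 10), (5, 5, 10), (6, 3, 10), (7, 2, 10), (8, 1, 10)]
print(revised_difficulty(rows))   # close to 5.0 for this symmetric example
```

Repeating this for every challenge, then feeding the revised values back in as the c2 difficulties, gives the iteration described above.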

A second approach towards incorporating observed distinction data into a unified

problem difficulty scale is briefly introduced and compared to the other metrics. Here,

we recast the estimation problem as a sorting problem, and use a probabilistic variant

of the bubble-sort algorithm to reorder consecutive challenges based on available dis-

tinction data. Initially ordering the challenge words alphabetically, we repeatedly step

through the list, reordering challenges at indices i and i + 1 with a probability based on

the proportion of distinctions finding the first challenge harder than the second.¹¹ After

“bubbling” through the ordered list of challenges 200,000 times, we interpret the rank-

order of each challenge as a difficulty index. These indices provide a metric of difficulty

(which we refer to as ProbBubble), and a means for predicting the relative difficulty of

any pair of challenges (based on index ordering.)

¹¹ If distinctions have been observed in both directions, the challenges are reordered with a probability determined

by the proportion of distinctions in that direction. If no distinctions in either direction have been observed,

the challenges are reordered with a probability of p = 0.5. If distinctions have been observed in one

direction but not the other, the challenges are reordered with a fixed minimal probability (p = 0.1).
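The probabilistic bubble-sort idea can be sketched as below. All names are illustrative. The handling of one-sided evidence is one possible reading of the footnote: the swap probability is 0.1 when the only distinctions support the current order, and 0.9 (an assumption) when they all favor a swap. The pass count is a parameter rather than the 200,000 passes reported.

```python
import random
from typing import Dict, List, Tuple


def swap_probability(first: str, second: str,
                     counts: Dict[Tuple[str, str], int],
                     floor: float = 0.1) -> float:
    """Probability of swapping adjacent words so that `first` moves later
    (toward the harder end). counts[(easier, harder)] are distinction counts."""
    first_harder = counts.get((second, first), 0)
    second_harder = counts.get((first, second), 0)
    total = first_harder + second_harder
    if total == 0:
        return 0.5                       # no evidence either way
    if first_harder == 0:
        return floor                     # evidence only supports the current order
    if second_harder == 0:
        return 1.0 - floor               # assumed mirror case: evidence only favors a swap
    return first_harder / total          # evidence in both directions


def prob_bubble(words: List[str], counts: Dict[Tuple[str, str], int],
                passes: int, seed: int = 0) -> List[str]:
    """Return words ordered from (estimated) easiest to hardest."""
    rng = random.Random(seed)
    order = sorted(words)                # initial alphabetical ordering
    for _ in range(passes):
        for i in range(len(order) - 1):
            if rng.random() < swap_probability(order[i], order[i + 1], counts):
                order[i], order[i + 1] = order[i + 1], order[i]
    return order


counts = {("about", "acknowledge"): 12, ("acknowledge", "about"): 2,
          ("about", "official"): 5}
ranking = prob_bubble(["official", "about", "acknowledge"], counts, passes=1000)
print(ranking)   # a permutation of the three words; the order depends on the seed
```

The position of each word in the returned list plays the role of its difficulty index.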

Table 2. Summary table for the predictive accuracy of various difficulty metrics. For each metric, the percentage
of accurate predictions of pairwise difficulty orderings is noted. The accuracy of the I50 metric is measured
against all of the 3,349,602 pairwise orderings identified by student distinctions. The accuracy of the data-driven
metrics (I50 rev.1 and ProbBubble) is based on the average results from a 5-fold cross-validation, in
which the metrics are constructed or trained on a subset of the pairwise distinction data and are evaluated on a
different set of pairwise data (the remaining portion).

Difficulty Model    Predictive Accuracy
I50                 75.66%
I50 rev.1           84.79%
ProbBubble          84.98%

Table 3. Spearman’s rank correlation coefficient between pairs of problem difficulty rank-orderings (N =
3129, p < 0.01, two-tailed).

Metric 1      Metric 2      Spearman’s ρ
I50           I50 rev.3     0.677
I50           ProbBubble    0.673
ProbBubble    I50 rev.3     0.908

Given the pairwise technique used in Section 4.2 for analyzing the fit of a diffi-

culty metric for a set of pairwise difficulty orderings, we can examine how these two

data-driven models compare to the original I50 difficulty metric. Table 2 summarizes our

findings. Here we observe that the data-driven approaches provide an improvement of

almost 10% accuracy with regard to the prediction of pairwise difficulty orderings. As

was noted earlier, cycles in the observed pairwise difficulty orderings prevent any linear

metric from achieving 100% prediction accuracy, and the maximum achievable accuracy

for the SpellBEE student data is not known. We do note that two different data-driven

approaches, logistic regression-based iterative estimation and the probabilistic sorting,

arrived at very similar levels of predictive accuracy. Table 3 uses Spearman’s rank correlation

coefficient as a tool to quantitatively compare the three metrics. One notable finding

here is the extremely high rank correlation between the ProbBubble and I50 rev.3

data-driven metrics.

6. Conclusion

The findings from the research questions posed here are both reassuring and revealing.

Although the NISS study was done over 50 years ago, much of its value seems to have

been retained. The NISS-based I50 difficulty metric was observed to correctly predict

76% of the pairwise difficulty orderings mined from SpellBEE student data. Many of

the challenges for which the difficulty metric achieved low predictive accuracies corre-

sponded with words whose cultural relevance or prominence has changed over the past

few decades. The data-driven techniques presented in Section 5 offer a means for incorporating

these changes back into a difficulty metric. After doing so, we found the

predictive accuracy increased approximately 10%, to the 85% agreement level.

The key technique used here to enable the assessment and improvement of problem

difficulty estimates works even when not all students have attempted all challenges or


when the selection of challenges for students is highly biased. It is data-driven, based on identifying and counting pairwise distinctions indicated indirectly through observations of student behavior over the duration of use of an educational system. The pairwise distinction-based techniques for estimating problem difficulty explored here are part of a larger campaign to develop methods for constructing educational systems that require a minimal amount of expert domain knowledge and model-building. Our BEEweb model is but one such approach; the Q-matrix method is another [3], as are most of the IRT-based systems discussed in the introduction. Designing BEEweb activities only requires domain knowledge in the form of a problem difficulty function and a response accuracy function. The latter can usually be created without expertise, and the former can now be approached, even when collected data is sparse and biased, using the techniques discussed in this paper.
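
The counting of pairwise distinctions can be sketched as follows. This is a minimal Python illustration under a simplifying assumption that is not necessarily the exact procedure of Section 4.2: whenever one student answers one problem correctly and another incorrectly, that counts as one vote that the second problem is the harder of the two. The function names, the observation format, and the example data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairwise_distinctions(observations):
    """Count (easier, harder) votes from (student, problem, correct) tuples.

    Each student only needs to have attempted some of the problems.
    If a student attempts a problem more than once, the last result is kept.
    """
    by_student = {}
    for student, problem, correct in observations:
        by_student.setdefault(student, {})[problem] = correct

    votes = Counter()
    for results in by_student.values():
        for a, b in combinations(sorted(results), 2):
            if results[a] and not results[b]:
                votes[(a, b)] += 1  # a looks easier than b
            elif results[b] and not results[a]:
                votes[(b, a)] += 1  # b looks easier than a
    return votes


def pairwise_accuracy(votes, difficulty):
    """Fraction of majority-vote orderings that a difficulty metric predicts."""
    agree = total = 0
    seen = set()
    for easier, harder in list(votes):
        pair = frozenset((easier, harder))
        if pair in seen:
            continue
        seen.add(pair)
        forward, backward = votes[(easier, harder)], votes[(harder, easier)]
        if forward == backward:
            continue  # no clear majority for this pair
        if backward > forward:
            easier, harder = harder, easier
        total += 1
        if difficulty[easier] < difficulty[harder]:
            agree += 1
    if total == 0:
        raise ValueError("no pairwise distinctions with a clear majority")
    return agree / total


# Hypothetical, sparse observations: no student attempts every word.
observations = [
    ("s1", "cat", True), ("s1", "rhythm", False),
    ("s2", "cat", True), ("s2", "necessary", False),
    ("s3", "necessary", True), ("s3", "rhythm", False),
    ("s4", "rhythm", True), ("s4", "cat", False),
    ("s5", "cat", True), ("s5", "rhythm", False),
]
votes = count_pairwise_distinctions(observations)
metric = {"cat": 0.1, "rhythm": 0.5, "necessary": 0.9}  # hypothetical scores
print(pairwise_accuracy(votes, metric))
```

Because votes are accumulated per pair rather than per problem, the count remains meaningful when problem selection is sparse or biased, although pairs with few votes are correspondingly noisy.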

References

[1] Ari Bader-Natal and Jordan B. Pollack. Motivating appropriate challenges in a reciprocal tutoring system. In C.-K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, editors, Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED-2005), pages 49–56, Amsterdam, July 2005. IOS Press.

[2] Ari Bader-Natal and Jordan B. Pollack. BEEweb: A multi-domain platform for reciprocal peer-driven tutoring systems. In M. Ikeda, K. Ashley, and T.-W. Chan, editors, Proceedings of the 8th International Conference on Intelligent Tutoring Systems (ITS-2006), pages 698–700. Springer-Verlag, June 2006.

[3] Tiffany Barnes. The Q-matrix method: Mining student response data for knowledge. Technical Report WS-05-02, AAAI-05 Workshop on Educational Data Mining, Pittsburgh, 2005.

[4] Klaus Brinker, Johannes Fürnkranz, and Eyke Hüllermeier. Label ranking by learning pairwise preferences. Journal of Machine Learning Research, 2005.

[5] Leonard S. Cahen, Marlys J. Craun, and Susan K. Johnson. Spelling difficulty – a survey of the research. Review of Educational Research, 41(4):281–301, October 1971.

[6] Chih-Ming Chen, Chao-Yu Liu, and Mei-Hui Chang. Personalized curriculum sequencing utilizing modified item response theory for web-based instruction. Expert Systems with Applications, 30, 2006.

[7] Bruce Choppin. A fully conditional estimation procedure for Rasch model parameters. CSE Report 196, Center for the Study of Evaluation, University of California, Los Angeles, 1983.

[8] Ricardo Conejo, Eduardo Guzmán, Eva Millán, Mónica Trella, José Luis Pérez-De-La-Cruz, and Antonia Ríos. Siette: A web-based tool for adaptive testing. International Journal of Artificial Intelligence in Education, 14:29–61, 2004.

[9] Michel C. Desmarais, Shunkai Fu, and Xiaoming Pu. Tradeoff analysis between knowledge assessment approaches. In C.-K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, editors, Proceedings of the 12th International Conference on Artificial Intelligence in Education (AIED-2005). IOS Press, 2005.

[10] B. S. Everitt. The Analysis of Contingency Tables. Chapman and Hall, 1977.

[11] Gerhard H. Fischer and Ivo W. Molenaar, editors. Rasch Models: Foundations, Recent Developments, and Applications. Springer-Verlag, New York, 1995.

[12] Harry A. Greene. New Iowa Spelling Scale. State University of Iowa, Iowa City, 1954.

[13] Jeff Johns, Sridhar Mahadevan, and Beverly Woolf. Estimating student proficiency using an item response theory model. In M. Ikeda, K. Ashley, and T.-W. Chan, editors, Proceedings of the 8th International Conference on Intelligent Tutoring Systems (ITS-2006), pages 473–480, 2006.

[14] Mark Wilson and R. Darrell Bock. Spellability: A linearly ordered content domain. American Educational Research Journal, 22(2):297–307, Summer 1985.


Toward the extraction of production rules

for solving logic proofs

Tiffany Barnes, John Stamper

Department of Computer Science, University of North Carolina at Charlotte

tbarnes2@uncc.edu, jcstampe@uncc.edu

Abstract: In building intelligent tutoring systems, it is critical to be able to

understand and diagnose student responses in interactive problem solving.

However, building this understanding into the tutor is a time-intensive process

usually conducted by subject experts. Much of this time is spent in building

production rules that model all the ways a student might solve a problem. We

propose a novel application of Markov decision processes (MDPs), a

reinforcement learning technique, to automatically extract production rules for an

intelligent tutor that learns. We demonstrate the feasibility of this approach by

extracting MDPs from student solutions in a logic proof tutor, and using these to

analyze and visualize student work. Our results indicate that extracted MDPs

contain many production rules generated by domain experts and reveal errors that

experts do not always predict. These MDPs also help us identify areas for

improvement in the tutor.

Keywords: educational data mining, Markov decision processes

1. Introduction

According to the ACM computing curriculum, discrete mathematics is a core course in

computer science, and an important topic in this course is solving formal logic proofs.

However, this topic is of particular difficulty for students, who are unfamiliar with

logic rules and manipulating symbols. To allow students extra practice and help in

writing logic proofs, we are building an intelligent tutoring system on top of our

existing proof-verifying program. Our experience in teaching discrete math, together with

student surveys, indicates that students particularly need feedback when they get stuck.

The problem of offering individualized help and feedback is not unique to logic

proofs. Through adaptation to individual learners, intelligent tutoring systems (ITS)

can have significant effects on learning [1]. However, building one hour of adaptive

instruction takes between 100 and 1000 hours of work by subject experts, instructional

designers, and programmers [2], and a large part of this time is used in developing

production rules that are used to model student behavior and progress. A variety of

approaches have been used to reduce the development time for ITSs, including ITS

authoring tools (such as ASSERT and CTAT), or building constraint-based student

models instead of production rule systems. ASSERT is an ITS authoring system that

uses theory refinement to learn student models from an existing knowledge base and

student data [3]. Constraint-based tutors, which look for violations of problem

constraints, require less time to construct and have been favorably compared to

cognitive tutors, particularly for problems that may not be heavily procedural [4].


Some systems, including RIDES, DIAG, and CTAT, use teacher-authored or

demonstrated examples to develop ITS production rules. RIDES is a “Tutor in a Box”

system used to build training systems for military equipment usage, while DIAG was

built as an expert diagnostic system that generates context-specific feedback for

students [2]. These systems cannot be easily generalized, however, to learn from

student data. CTAT has been used to develop “pseudo-tutors” for subjects including

genetics, Java, and truth tables [5]. This system has also been used with data to build

initial models for an ITS, in an approach called Bootstrapping Novice Data (BND) [6].

Similar to the goal of BND, we seek to use student data to directly create student

models for an ITS. However, instead of feeding student behavior data into CTAT to

build a production rule system, we propose to generate Markov decision processes (MDPs) that

represent all student approaches to a particular problem, and use these MDPs directly

to generate feedback. We believe one of the most important contributions of this work

is the ability to generate feedback based on frequent, low-error student solutions.

We propose a method of automatically generating production rules using previous

student data to reduce the expert knowledge needed to generate intelligent, context-

dependent feedback. The system we propose is capable of continued refinement as

new data is provided. We illustrate our approach by applying MDPs to analyze student

work in solving formal logic proofs. This example is meant to demonstrate the

applicability of using MDPs to collect and model student behavior and generate a

graph of student responses that can be used as the basis for a production rule system.
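
To make the idea concrete, the following Python sketch builds a graph of observed proof states from student attempts and scores each state by value iteration, so that the highest-valued successor of a state can serve as a candidate hint. It is only an illustration under stated assumptions: states are represented as opaque labels, each observed state-to-state step is treated as a deterministic action, and the reward values, discount factor, function names, and example attempts are all hypothetical rather than those used in the tutor.

```python
from collections import defaultdict


def build_graph(attempts):
    """Return {state: {next_state: count}} from lists of visited states."""
    edges = defaultdict(lambda: defaultdict(int))
    for states in attempts:
        for current, following in zip(states, states[1:]):
            edges[current][following] += 1
    return {state: dict(successors) for state, successors in edges.items()}


def value_iteration(edges, goal_states, error_states, goal_reward=100.0,
                    error_penalty=-10.0, step_cost=-1.0, gamma=0.9, sweeps=100):
    """Score states; all reward parameters here are illustrative assumptions."""
    states = set(edges)
    for successors in edges.values():
        states.update(successors)

    def reward(next_state):
        bonus = 0.0
        if next_state in goal_states:
            bonus = goal_reward
        elif next_state in error_states:
            bonus = error_penalty
        return step_cost + bonus

    values = {state: 0.0 for state in states}
    for _ in range(sweeps):
        for state in states:
            successors = edges.get(state)
            if successors:  # states with no observed successor keep value 0
                values[state] = max(
                    reward(nxt) + gamma * values[nxt] for nxt in successors
                )
    return values


def best_next_step(edges, values, state):
    """Observed successor with the highest value, or None if there is none."""
    successors = edges.get(state)
    if not successors:
        return None
    return max(successors, key=lambda nxt: values[nxt])


# Hypothetical attempts at one proof; "err" marks an incorrect rule application.
attempts = [
    ["start", "A", "goal"],
    ["start", "A", "goal"],
    ["start", "B", "err", "B", "C", "goal"],
    ["start", "B", "C"],  # abandoned before reaching the goal
]
edges = build_graph(attempts)
values = value_iteration(edges, goal_states={"goal"}, error_states={"err"})
print(best_next_step(edges, values, "start"))
```

The edge counts kept by `build_graph` are not used by the scoring step in this sketch; they could be used to prefer frequently observed steps or to estimate transition probabilities if steps were modeled as stochastic. Adding new attempts and re-running the two functions updates the values, which is one simple way to realize continued refinement as new data arrives.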

2. Background and Proofs Tutorial Context

Several computer-based teaching systems, including Deep Thought [7], CPT [8], and

the Logic-ITA [9], have been built to support teaching and learning of logic proofs. Of

these, the Logic-ITA is the most intelligent, verifying proof statements as a student

enters them, and providing feedback on student performance after the proof is

complete. Logic-ITA also has facilities for considerable logging and teacher

feedback to support exploration of student performance [9], but does not offer students

help in planning their work. In this research, we propose to augment our own existing

Proofs Tutorial with a cognitive architecture, derived using educational data mining,

that can give students feedback to help them avoid error-prone solutions, find optimal

solutions, and learn about other students’ approaches.

In [10], the first author applied educational data mining to analyze completed

formal proof solutions for automatic feedback generation. However, this work did not

take into account student errors, and could only provide general indications of student

approaches, as opposed to feedback tailored to a student’s current progress. In this

work, we explore all student attempts at proof solutions, including partial proofs and

incorrect rule applications, and use visualization tools to learn how this work can be

extended to automatically extract a production rule system to add to our logic proof

tutorial. In [11], the second author performed a pilot study to extract Markov decision

processes for a simple proof from three semesters of student data from Deep Thought,

and verified that the rules extracted by the MDP conformed to expert-derived rules

and that the MDP also generated buggy rules that surprised experts. In this work, we apply the technique

and extend it with visualization tools to new data from the Proofs Tutorial.

The Proofs Tutorial is a computer-aided learning tool implemented on NovaNET

(http://www.pearsondigital.com/novanet/). This program has been used for practice and

feedback in writing proofs in university discrete mathematics courses taught by the
