
Insights into Learning Competence through

Probabilistic Graphical Models*

Abstract. One-digit multiplication is one of the major fields in learning mathematics at the primary school level and has been studied over and over. However, the majority of related work focuses on descriptive statistics on data from multiple surveys. The goal of our research is to gain insights into multiplication misconceptions by applying machine learning techniques. To reach this goal, we trained a probabilistic graphical model of the students' misconceptions on data from an application for learning multiplication. The use of this model facilitates the exploration of insights into human learning competence and the personalization of tutoring according to each individual learner's knowledge state. The detection of all relevant causal factors of the students' erroneous answers, as well as their corresponding relative weights, is a valuable insight for teachers. Furthermore, the similarity between different multiplication problems - according to the students' behavior - is quantified and used to group them into clusters. Overall, the proposed model facilitates real-time learning insights that lead to more informed decisions.

Keywords: Bayesian networks · Probabilistic graphical models · Learning analytics

* The authors declare that there are no conflicts of interest and no ethical issues; no particular funding was received.

Draft - originally published in: Saranti, A., Taraghi, B., Ebner, M., Holzinger, A. (2019) Insights into

Learning Competence Through Probabilistic Graphical Models. In: Machine Learning and Knowledge Extraction.

Third IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2019,

Canterbury, UK, August 26–29, 2019, Proceedings. pp. 250-271


1 Introduction

1.1 Previous Work

The field of learning analytics is of increasing importance for educational research [44], [17], [20]. It aims to assist the learning process by providing teachers with deeper insight into learning processes and learning results. Teachers play an essential role because they are responsible for intervening in a pedagogically adequate manner. Recently, the field has made heavy use of statistical machine learning [40]. Whilst educational data mining targets the automation of learning activities, learning analytics supports educators in their daily routine.

The so-called "1x1 trainer"⁶ is a learning analytics application developed by the department of Educational Technology of Graz University of Technology, Austria. It uses the benefits of both fields, learning analytics as well as educational data mining [18], [19], [42].

The application poses exercises to students from the multiplication table with one-digit operands. The algorithm of the "1x1 trainer" adapts the sequence of posed questions individually to each student's answers, in order to improve individual learning progress. Technically, this requires reacting adaptively to changes in a student's learning progress, so that each student is supported according to his or her distinct learning progress over the whole learning period. This underlying personalized adaptive learning algorithm shall discover weak mathematical knowledge of individual students and alert teachers just in time to intervene adequately.

This work is based on previous research that used data gathered by the "1x1 trainer". Firstly, different mathematical questions were roughly classified according to the learners' answers. Questions were considered more difficult than others when students required more attempts to answer them [47], [50]. Some specific questions could be identified as difficult for the majority of the users. The next step was to analyze the explanation of the errors made. Therefore, error types were assigned to falsely answered questions, corresponding to the innate cognitive and conceptual learning shortcomings of the users [49]. The "relative difficulty" of those questions - 2 × 3 seems to be simpler than 7 × 8 - played no role in identifying the error types. More explanation of error types follows in section 2.

1.2 Bayesian Student Models

There is a plethora of learning applications that use probabilistic graphical models (also called Bayesian models/networks [38]) to model students' knowledge. These models have started making an impact on research in causal learning and inference generally, and there are good arguments that even children's causal learning could be modeled in this way [11].

Most applications belong to the category of intelligent tutoring systems (ITS)

[14] or adaptive educational systems (AES) [4]. The main goal of an ITS is to

⁶ https://schule.learninglab.tugraz.at/einmaleins/, last accessed 25 August 2018


provide personalized recommendations according to different learning styles, whereas an AES adapts the learning content as well as its sequence according to the student's profile [41]. As explained in the literature review by Chrysafiadi & Virvou [12], there are two classes of intelligent tutoring systems: systems that diagnose the student's knowledge, misconceptions, learning style or cognitive state, and systems that use this diagnosis to plan a personalized strategy for each learner individually. Student modelling is considered a subproblem of user modelling and is of central importance to ITS, since otherwise each student is treated the same [36].

Primarily, we are interested in Bayesian networks because of their ability to model uncertainty [35] and, at the same time, to support a decision-making process. The user modelling goal of a Bayesian network for knowledge modelling is mainly to obtain an adaptive estimation of the knowledge itself, since it may increase or decrease during the learning process [5]. Since scalar models and fuzzy logic approaches [15] have lower precision, structural models are built under the assumption that the knowledge is composed mainly of independent parts. Bug/perturbation models [12], on the other hand, represent errors and misconceptions of the student. In this case, the Bayesian network is used to find the error that most probably caused the observable behavior (also called evidence) [35], which is known as the credit/blame assignment problem [37]. Bayesian networks can model the assumption that a wrongly answered question with two potential causes was most probably caused by the one that is more prevalent, according to the data provided so far. Sometimes, random slips or guesses are also included in the model: a wrongly answered question does not necessarily mean that the student does not know the concept at all, and a correctly answered one may have been a guess. The structure in both cases constitutes the qualitative model; its definition uses domain knowledge and (optionally) data. The parametrization is learned from the data during a training phase and constitutes the quantitative model.

In some cases, the reason for creating the model is to assist the teachers of large classes that suffer from a high dropout rate [51]. A model recognizes the student's knowledge faster and more accurately [35], which is primarily beneficial when a class has a large number of students. In other cases, summarized in [6], the goal is to provide a personalized optimal sequence of the learning material or even to sequence the curriculum according to the student's individual needs. Yet further cases [46], [45] show that a learning application based on such a model provides long-term learning effects compared to traditional methods. This was studied by a post-test carried out several weeks after the learning sessions.

The issue of defining the prior beliefs, which constitute the starting parameterization, is often coupled with user clustering; demographics, longitudinal data [46] and pre-tests [25], [13] are used to define the prior beliefs as well as the starting groupings of the learners [5]. In other cases, the teacher sets the prior beliefs from his or her experience [36], or a uniform prior is used [13], [37]. Another common characteristic is the definition of hidden structural elements that represent unobservable entities, which must be estimated from the observed ones. The design of the structure must take correct assumptions into account, based on a solid theoretical background, otherwise the model will not work correctly [35].

In the work of Eva Millán et al. [32], the researchers draw a parallel between medical diagnosis systems and the diagnosis of a student's knowledge [33]. This is an important comparison, as the development of clinical reasoning and decision-making skills is very similar [3].

The student answers a set of questions that can only be answered correctly when several concepts are known. In this case the knowledge of the concepts is the cause of the answer. The noise in the process - for instance, when a student knows the concept but answers wrongly or, vice versa, guesses correctly - is also modelled. The initialization of the model parameters is made by teachers-in-the-loop; afterwards the parameters are learned from the data. The model is used to efficiently determine the concepts the student knows least and to derive the proposal of the next question.

The "eTeacher" is a web-based education system [22], [41] that recognizes the learning style of a student according to performance in exams as well as email, chat and messaging usage. The number of different learning strategies and their characteristics is the "domain knowledge" defining the structure of the Bayesian network. The initialization of the parameters uses in some cases uniform priors and in others priors defined by experts. After that initial phase, the parameters are continuously learned from the behaviour of the students. After identifying the learning style, a recommendation engine proposes different ways to learn the same material to each student according to his or her learning style.

"ANDES" is an ITS developed by Conati et al. [14], which mainly focuses on knowledge tracing but also on recognizing the learning plan of the user. The students solve Newtonian physics problems with different possible solution paths that define the Bayesian network's structure. Since each action may belong to different solution paths and the user does not provide their reasoning explicitly, the credit assignment problem is to find and quantify the most likely solution an action belongs to. This triggers personalized help instructions and hints in two cases: when a wrong answer is given or when the model predicts that the answer might be wrong. The parameters of the network change in an online manner while the student is solving the problems. Firstly, the evaluation was made by simulating students with different knowledge profiles and measuring the accuracy of the predictions made by the model. In a second step, a post-test was carried out to compare real students who had used "ANDES" to students who had not. Regression analysis was used to recognize the correlation between the use of the program and the learning gain [7].

Specifically for mathematical problems there are several approaches that specialize in dealing with decimal misconceptions. In the work of Stacey et al. [46], [45] the misconceptions that define the structure of the model are provided by two main factors: the domain knowledge and data of a Decimal Conception Test (DCT) that students had to go through. Wrongly answered questions provided by the students depend on their misconceptions. The researchers determined the distinct misconception by computing which of them has the highest probability according to the data. Although the model drives different question-sequencing strategies, some of the misconceptions were not correctly recognized. Therefore, the researchers decided that the teacher and not the system should provide instructions.

The research work of Goguadze also concentrates on the modelling of decimal misconceptions [25], [24]. The "AdaptErrEx" project selected the most frequently occurring misconceptions and ordered them in a taxonomy (higher- and lower-level misconceptions), which is reflected in the dependencies of the Bayesian network. As in the previous application, a wrong answer may be caused by different misconceptions. The prior beliefs are defined by a pre-test; the researchers assert that sufficient training data diminish the role of the prior in the computation of the posterior. This prior defines the typical/average student, and then each user's parameters can be updated and individualized accordingly. One aspect that has not been considered in this model yet is the difficulty of each question: easy questions are more likely to be answered correctly than difficult ones, even if there is a high probability of misconception.

Several student models track the progress of knowledge through time with Dynamic Bayesian Networks (DBN). The knowledge of the learner at each time point can be considered dependent on the knowledge and (optionally) the observed result of the interaction at the previous time point [34]. The project "LISTEN" [10] represents the hidden knowledge state of the student at each time point. The observable entities are the tutor interventions and the student's performance, which are used to infer the knowledge state. The work of Käser et al. [27] provides an overview and comparison of Bayesian Knowledge Tracing (BKT), a student modelling technique based on a Hidden Markov Model (HMM), and DBNs for various learning topics, such as number representation, mathematical operations, physics, algebra and spelling. An HMM is a special case of a DBN and, according to the researchers, cannot represent dependencies that would lead to hierarchies of skills; in such cases DBNs create more adequate models.

All applications described above have a Bayesian network of the student model at the core of their architecture. There are a number of other components that either support the teacher or the student. One of them, for example, is the visualization of the model in the "VisMod" application [52], which is displayed (among other things) through color and arrow intensity instead of number-filled tables. This increases the readability of the model and enhances the tutor's understanding. Gamification elements can also be found in "Maths Garden" [28], an application that lets users gain and lose points and coins depending on answering correctly or wrongly. A coaching component that provides feedback and hints to refresh the memory can be found in the architecture of "ANDES" [7]. An overview of the design and architectural elements of intelligent tutoring systems that have a Bayesian network as user model is provided in the work of Gamboa et al. [21].

A detailed overview of intelligent techniques other than Bayesian networks, such as recommender systems for the computation of the learning path as well as clustering and classification for learner profiles used in e-learning systems, is provided in [31]. Specifically, in [5] the demand for proposing the most appropriate activity - neither too easy nor too difficult - can only be fulfilled if the model used is both accurate and adaptive.

1.3 Research Question

The main objective of this research work is to answer the research question whether Bayesian networks can quantify the defined misconceptions of one-digit multiplication problems. In order to answer this question, the "1x1 trainer" application is taken as the underlying data provider. The application focuses on the recognition of the current learning status. Learning-aware applications, however, maintain an adaptive learning model that represents the knowledge of the learner/user with regard to the learned topic. The application is expected to support individual learning needs and abilities as well as to consider common characteristics in the learning process of different persons. The progress of the learning model itself will be used to transform the learning application into an adaptive one that may constantly change the content and sequence of assessments to improve the learning process and maximize the learning effect.

The "1x1 trainer" application currently has an overall report that is accessible to teachers. It contains information about the actual number as well as the proportion of correct and wrong answers for each posed question. It uses a color encoding that helps distinguish four sets of questions with similar proportions of correct and wrong answers. The implementation of the Bayesian model provides further insights into detailed cognitive information that enriches the information content of the current report. Furthermore, the new report can concentrate on the individualized learning status, considering the causes of the wrong answers, and can be updated in real time after each action of the student.

1.4 Outline

This research work proposes a Bayesian model for the learning competence of students using the "1x1 trainer" application. The first step is to specify the error types that are relevant for this research; their detailed description is given in section 2. Data analysis (specifically descriptive statistics) is used to guide the necessary assumptions about the modelled entities and their independencies. Based on this information, the structure of the model and its parametrization are defined. The personalized model of each student and the method by which it adapts its parameters to new data are described in section 3. The usage of the model and the insights that are provided to the teachers in the form of an enhanced report are explained in section 4. Finally, a conclusion about future research and improvement possibilities is given in section 5.


2 Error types of one-digit multiplication and descriptive

statistics

2.1 Error types of one-digit multiplication problems

The bug library [12] of the proposed learning competence model contains six error types: operand, intrusion, consistency, off-by, add/sub/div, and pattern. Any false answer that does not belong to one of those six categories is assigned to the unclassified category. The error types are explained in detail in [48]; a brief description follows here:

1. Operand error: It occurs when the student mistakes at least one operand for one of its neighbours [8]. In the implementation, only a neighbourhood of overall absolute distance of 2 from the correct operands was considered. One example is the answer 48 to the question 7 × 8, since the user may have mistakenly multiplied 6 × 8. Research shows that this is the most frequently occurring error, but it occurs with a different proportion in each posed question [9].

2. Operand intrusion (abbreviated intrusion) error: It happens when the decades digit and/or the unit digit of the result equals one of the two operands of the posed question, for example 7 × 8 = 78. It is argued by [8] that the two operands of the multiplication question are perceived as one number by the student (the first operand corresponding to the decades digit and the second to the unit digit).

3. Consistency: The student's answer has either the unit digit or the decades digit of the correct answer [43], [16]. For example, the answer 46 to the question 7 × 8 indicates that the unit digit is correct, but the decades digit is false.

4. Off-by-±1, Off-by-±2: It occurs when the answer of the student deviates from the correct one by ±1 or ±2; for example, when the answer to the question 5 × 8 is one of the following: {38, 39, 41, 42}.

5. Add/Sub/Div: The student confuses the operation itself and performs, for example, an addition instead of a multiplication; in that case the answer to 7 × 8 is 15.

6. Pattern: The student mistakes the order of the digits of the result; for example, question 7 × 8 gets the answer 65 (the decades digit and the unit digit are permuted).

7. Unclassified: Any answer that cannot be matched to one of the above error types.

All questions whose correct answer is smaller than 10 cannot have a consistency error. These are: 1×1, 1×2, 2×1, 1×3, 3×1, 1×4, 4×1, 1×5, 5×1, 1×6, 6×1, 1×7, 7×1, 1×8, 8×1, 1×9, 9×1, 2×2, 2×3, 2×4, 3×2, 3×3, 4×2.
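These 23 questions can be enumerated programmatically; a minimal sketch, assuming the question range 1×1 up to 10×9 stated in section 2.2:

```python
# Questions whose correct answer is below 10 cannot exhibit a consistency error.
# Question range 1x1 .. 10x9, as posed by the "1x1 trainer".
no_consistency = [(a, b)
                  for a in range(1, 11)   # first operand 1..10
                  for b in range(1, 10)   # second operand 1..9
                  if a * b < 10]

print(len(no_consistency))  # 23, matching the list above
```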

One of the main reasons to use a probabilistic graphical model is the fact that a specific false answer can be classified into multiple error types. The identification of the most probable error type causing a wrong answer is called credit assignment. Table 1 shows the possible false answers for the question 7 × 8. One can see, for example, that the answer 72 could occur because of an operand or an intrusion error.

Table 1. Answers for question 7 × 8 listed by error types

Error Type            Answers
operand               40, 42, 48, 49, 54, 63, 64, 72
intrusion             18, 28, 38, 48, 58, 68, 71, 72, 73, 74, 75, 76, 77, 78, 79, 88, 98
consistency           16, 26, 36, 46, 51, 52, 53, 54, 55, 57, 58, 59, 66, 76, 86, 96
off-by-±1/off-by-±2   54, 55, 57, 58
add/sub/div           1, 15
pattern               65
unclassified          4, 5, 6, 7, 8, 9, 10, 11, 12, 19, 20, 21, 22, 23, 24, 25, 27, 29, 30, 31, 32, 33, 34, 35, 37, 39, 41, 43, 44, 45, 47, 50, 60, 61, 62, 67, 69, 80, 81, 82, 83, 84, 85, 87, 89, 90, 91, 92, 93, 94, 95, 97, 99
correct               56
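The classification rules behind Table 1 can be sketched in code. The following is our reading of the definitions in this section, not the application's actual implementation; borderline cases (for example the exact operand neighbourhood bounds, assumed here to stay in 1..9) are assumptions:

```python
def error_types(a, b, answer):
    """Return the set of error types (section 2) that an answer to a x b matches.
    A sketch of one reading of the definitions; an answer may match several
    types - resolving that ambiguity is the credit assignment problem."""
    correct = a * b
    if answer == correct:
        return {"correct"}
    types = set()
    # 1. Operand: a neighbouring operand pair (overall absolute distance <= 2,
    #    operands assumed to stay in 1..9) yields the given answer.
    for da in range(-2, 3):
        for db in range(-2, 3):
            na, nb = a + da, b + db
            if (da, db) != (0, 0) and abs(da) + abs(db) <= 2 \
                    and 1 <= na <= 9 and 1 <= nb <= 9 and na * nb == answer:
                types.add("operand")
    tens, units = divmod(answer, 10)
    # 2. Intrusion: first operand as decades digit and/or second as unit digit.
    if tens == a or units == b:
        types.add("intrusion")
    # 3. Consistency: exactly one digit (decades or unit) agrees with the result.
    if answer >= 10 and (tens == correct // 10) != (units == correct % 10):
        types.add("consistency")
    # 4. Off-by-1/2: deviates from the correct result by 1 or 2.
    if abs(answer - correct) in (1, 2):
        types.add("off-by")
    # 5. Add/Sub/Div: another operation was performed instead.
    if answer in {a + b, abs(a - b)} or (a % b == 0 and answer == a // b):
        types.add("add/sub/div")
    # 6. Pattern: digits of the correct result permuted.
    if correct >= 10 and answer == (correct % 10) * 10 + correct // 10:
        types.add("pattern")
    return types or {"unclassified"}

print(sorted(error_types(7, 8, 72)))  # ['intrusion', 'operand'], as in Table 1
```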

2.2 Data of the “1x1 trainer”

The data used for building the model were provided by the "1x1 trainer" application. The application is for both students and teachers. For this work it was also used for a preliminary categorization of learners. Users of this application are confronted with multiplication questions with both multiplicands being one-digit integers. The possible questions range from 1 × 1 up to 10 × 9 (a total of 90 questions) and are posed in a pre-specified order. The application does not provide any means of help or hints to the students so far; the only feedback users get is whether their answer is correct or not. It is expected that by repeated use of the application the students will learn and get better through exercise. However, there is no individualization that takes care of the personal needs regarding the learning style and knowledge level of the users. Furthermore, personal information such as age, gender, demographics, and educational level was not collected.

The data were cleaned in the preprocessing phase. Answers that did not lie in the interval [0, 100) were considered invalid and were removed. Overall there were 1,179,720 question-answer pairs, of which 1,164,786 were valid. The number of unique users that gave at least one valid answer is 9,058.
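The validity filter can be sketched as follows (the pairs below are toy data, not the actual dataset):

```python
# Keep only answers in [0, 100); everything else is treated as invalid.
raw_pairs = [("7x8", 56), ("7x8", -3), ("3x4", 112), ("3x4", 12), ("5x8", 40)]
valid_pairs = [(q, a) for q, a in raw_pairs if 0 <= a < 100]

print(len(valid_pairs))  # 3 of the 5 toy pairs survive the filter
```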

3 Probabilistic Graphical Model of Learning Competence

A learning-aware, data-driven application cannot assume that the user's learning competence remains unchanged. Simple statistical descriptions are not practical for representing a continuous change and do not effectively capture the differences between the learning processes of the users. Furthermore, the purpose is to choose intelligent actions (also called "actionable information" [1]) based on the data, and this is not possible with a single rigid and non-adaptive analysis of the data.

The choice of a probabilistic graphical model has several benefits. Firstly, it allows the representation of conditional dependencies (and, correspondingly, independencies) in the graphical representation of the model of the data. Those are assumed to be the same for all users and stay stable over the course of application usage. Secondly, its parameters (which can be thought of as a configuration or instance) are adaptive and change with each new data sample that is observed. They may be a temporary snapshot description that characterizes the learning competence, but unlike plain statistics there is an effective way to adapt them rather than recompute them from scratch each time the model confronts new data. Thirdly, probabilistic graphical models have already been used extensively for decision problems [1], [29], which are at the forefront of reinforcement learning algorithms.

3.1 Introduction to Probabilistic Graphical Models

Probabilistic Graphical Models are representations of joint probability distributions over random variables whose probabilistic relationships are expressed through a graph. The random variables involved can be discrete, with categorical values, or continuous, with real values. The set of possible values that a random variable can take - sometimes also referred to as the possible outcomes of the experiment described by the random variable - is its domain. The random variables can be either visible or hidden. The visible ones have outcomes that can be directly observed, and their values are contained in the dataset. The hidden variables are defined by human experts using the domain knowledge of the problem, but their outcomes are not directly accessible. They usually represent latent causes of visible random variables and can improve the accuracy and the interpretability of the model [30].

To specify the dependencies of the variables in general, one needs to specify their direction, type and intensity. This is done with the use of graphs, which provide the terminology and theory for understanding and reasoning about Probabilistic Graphical Models. The nodes (also called vertices) of the graph represent the random variables, and the edges represent their dependencies, which can be directed or undirected. Undirected models - also called Markov networks - represent symmetric probabilistic interactions where there is no dependency with a direction, only factors that represent the strength of the connection. In the case where the dependencies are directed, the graph must be a directed acyclic graph (DAG), otherwise circular reasoning would be possible. These two categories are used in different applications.

3.2 Model Structure

Domain knowledge about the error types encountered in one-digit multiplication, as described in section 2, was used to define the model. This is in accordance with the data-driven approach of model construction [30], where the structure of the model is specified by the designer and the parameters are learned from the data.

A question is either answered correctly or incorrectly. The student can make one of the following errors: operand, intrusion, consistency, off-by-±1 and off-by-±2, pattern, confusion with addition, subtraction or division, or an unclassified error (meaning none of the above). Therefore a multinomial random variable called Learning State_q - individual for each question q - was chosen to represent the proportion of each of these misconceptions of the user when he or she is answering a one-digit multiplication question. The variable follows the categorical distribution; in this case Learning State_q has eight possible outcomes and the domain of this random variable is Val(Learning State_q) = {operand, intrusion, consistency, off-by, confusion, pattern, unclassified, correct} (meaning that 1 is the operand error, 2 the intrusion error and so on). The Learning State_q of a specific user can be described, for example, as 5% operand error, 4% consistency error and 91% correct answering (the remaining possible outcomes have 0%). This parametrization must be learned from the data.

In the previous section it was shown that a specific faulty answer may be classified into more than one error type. Although in reality the model does not assume that more than one error type created a particular answer, the model cannot know a priori which error type was more prevalent and played the decisive role in choosing the wrong answer. The Learning State_q is hidden, and the percentage of each error type is expected to be learned from the provided answers. Thereby, a dominant error type (for a specific user) can still be discovered and weaken the belief that multiple error types played a role for a specific faulty answer. In section 4 the inference of the most probable error type (credit assignment problem) for a specific wrong answer is made after the learning of the parameters is completed.
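Credit assignment can be illustrated with a plain Bayes-rule computation. All numbers below are purely hypothetical; the likelihoods assume, for illustration only, a uniform distribution over each error type's possible answers for 7 × 8 (8 operand answers, 17 intrusion answers, cf. Table 1):

```python
# Hypothetical posterior over error types for the observed wrong answer 72 to 7x8.
prior = {"operand": 0.5, "intrusion": 0.3, "off-by": 0.2}            # P(error), toy values
likelihood = {"operand": 1 / 8, "intrusion": 1 / 17, "off-by": 0.0}  # P(72 | error)

unnormalized = {e: prior[e] * likelihood[e] for e in prior}
z = sum(unnormalized.values())
posterior = {e: p / z for e, p in unnormalized.items()}

# The dominant error type "explains away" the other candidates.
print(max(posterior, key=posterior.get))  # operand
```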

The proportion of correct and false answers is different for each question. Each question is not posed the same number of times, and the belief about the probability of correctly answering each question differs; this was also taken into account, meaning that the probability of answering correctly is not assumed to be the same across questions. Therefore, there are 90 random variables called Correctness_1×1 to Correctness_10×9 (abbreviated Correctness_q) that each have two possible outcomes. The Bernoulli distribution was chosen, which is equivalent to a categorical distribution with a domain of two values.

Each question has a distinct random variable, named accordingly Answers_1×1 to Answers_10×9 (abbreviated Answers_q), which is a child of the Learning State_q random variable. The arrow from Learning State_q to its child reflects the dependency of the answer to a question on the misconception or correct understanding of the user.

The conditional independence property of each Learning Competence model is expressed by the following equation:

Answers_q ⊥ Correctness_q | Learning State_q    (1)


Fig. 1. The structure of all Probabilistic Graphical Models for Learning Competence. The shaded Answers_q nodes are the ones that are observed, whereas the Correctness_q and Learning State_q random variables remain unobserved.

The joint probability distribution for each question q has the following factorization:

P(Correctness_q, Learning State_q, Answers_q) =
    P(Correctness_q) P(Learning State_q | Correctness_q) P(Answers_q | Learning State_q)    (2)

Each error type can only produce a specific subset of answers, so the others will have zero probability of occurring given this particular error type. Every row of the conditional probability table of the Answers_q random variable has values that sum to one, and the last row has a single entry with probability 1.0 at the column of the correct answer and 0.0 everywhere else. Figure 1 depicts the described structure of the Learning Competence models.
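Equation (2) and the constraints on the conditional probability tables can be sketched numerically. All parameter values below are illustrative placeholders, not learned values, and the answer domain is shrunk to four values for readability:

```python
import itertools

states = ["operand", "intrusion", "consistency", "off-by",
          "confusion", "pattern", "unclassified", "correct"]
answers = ["40", "78", "55", "56"]          # toy 4-answer domain; 56 is correct

# P(Correctness_q): index 0 = wrong, 1 = correct (toy values).
p_corr = [0.2, 0.8]
# P(Learning_State_q | Correctness_q): one row per correctness outcome.
p_state = [
    [0.30, 0.15, 0.15, 0.10, 0.10, 0.05, 0.15, 0.00],  # given a wrong answer
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],  # given a correct answer
]
# P(Answers_q | Learning_State_q): every row sums to one; the "correct" row
# puts all probability mass on the correct answer, as stated above.
p_ans = [[0.25] * 4 for _ in range(7)] + [[0.0, 0.0, 0.0, 1.0]]

# Joint per equation (2): P(C, S, A) = P(C) * P(S | C) * P(A | S)
def joint(c, s, a):
    return p_corr[c] * p_state[c][s] * p_ans[s][a]

total = sum(joint(c, s, a)
            for c, s, a in itertools.product(range(2), range(8), range(4)))
print(round(total, 9))  # 1.0 - the factorization defines a valid distribution
```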

The model needs to express the following procedure: first, one needs to know whether the question is answered correctly; this is provided by the Correctness_q random variable. If it is, there are no more steps to follow. If the answer is false, there must have been an error, which belongs to the hidden Learning State_q. One of the possible answers of this error, as quantified by Answers_q, will be the actual answer of the user.

The model reflects our belief about the overall learning competence of the user. Its structure is considered to be the same for all users, but the conditional probability values (entries in the conditional probability tables) will differ for each individual user. Nevertheless, the model can also reveal similarities between users, meaning, at this stage, models that have similar parameter values.

3.3 Learning the Model’s Parameters

The students' answers that have already been gathered comprise the data set, denoted by D. The goal of parameter learning is the estimation of the densities of all random variables in the model. The joint probability distribution P_M defined by the model M with parameters Θ is expressed by equation 2. The goal of parameter learning is to increase the likelihood of the data given the model, P(D|M), or equivalently the log-likelihood, log P(D|M), with respect to the set of parameters Θ of the model. The likelihood expresses the probability of the data given a particular model; a model that assigns a higher likelihood to the data D approximates the true distribution (the one that has generated the data) better.

The algorithm used to estimate the parameters when some of the variables are hidden is expectation-maximization (EM). Since the latent variables Correctness_q and Learning State_q are not observed, direct maximization of the likelihood is not possible. The EM algorithm initializes all model parameters randomly and iteratively increases the likelihood by stepwise maximizing the expected value of the currently estimated parameters [2]. If the likelihood's increase or the parameters' change is not significant compared to the previous iteration, the algorithm can be stopped. Updating the log-likelihood in this manner is shown to guarantee convergence to a stationary point, which can be a local minimum, local maximum or saddle point. Fortunately, by initializing the iterations from different starting parameters and injecting small changes into the parameters, local minima and saddle points can be avoided [30].

The models are simple enough that the EM algorithm's update steps have a straightforward analytical solution. The available data were divided into a training and a test set, using only data from users that have answered all questions at least once (2218 users answered all questions exactly once). The model's parameters are computed by the EM algorithm on the training data over 4 iterations. After 4 iterations the likelihood of the training set still increases, but the likelihood of the test set decreases, which is an indication of overfitting. The diagram in figure 2 describes the main computational blocks of this process.
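A minimal sketch of such a training loop for a single question, under stated assumptions: the error types, their fixed answer-emission tables and the synthetic data are hypothetical, only P(Learning State | wrong) is re-estimated by EM, and the held-out likelihood is monitored to stop before overfitting:

```python
import math

CORRECT = 42  # correct answer to the single question in this toy setup

# Hypothesized emission tables P(Answer | Learning State), kept fixed here
# so that EM only estimates P(Learning State | wrong). Values illustrative.
emissions = {
    "operand": {40: 0.4, 48: 0.6},
    "off-by": {40: 0.5, 41: 0.5},
    "unclassified": {40: 0.2, 41: 0.2, 48: 0.2, 13: 0.4},
}

def log_likelihood(data, p_wrong, p_state):
    """log P(D | M) for a list of observed answers."""
    ll = 0.0
    for a in data:
        if a == CORRECT:
            ll += math.log(1.0 - p_wrong)
        else:
            ll += math.log(p_wrong * sum(
                p_state[ls] * emissions[ls].get(a, 0.0) for ls in p_state))
    return ll

def em(train, test, iterations=10):
    states = list(emissions)
    p_state = {ls: 1.0 / len(states) for ls in states}  # uniform init
    # Correctness is observed from the answer itself, so this is closed form:
    p_wrong = sum(a != CORRECT for a in train) / len(train)
    prev_test_ll = -math.inf
    for _ in range(iterations):
        # E-step: responsibility of each error type for each wrong answer.
        counts = {ls: 0.0 for ls in states}
        for a in train:
            if a == CORRECT:
                continue
            unnorm = {ls: p_state[ls] * emissions[ls].get(a, 0.0)
                      for ls in states}
            z = sum(unnorm.values())
            for ls in states:
                counts[ls] += unnorm[ls] / z
        # M-step: re-estimate P(Learning State | wrong) from expected counts.
        total = sum(counts.values())
        p_state = {ls: counts[ls] / total for ls in states}
        # A decreasing test likelihood indicates overfitting: stop early.
        test_ll = log_likelihood(test, p_wrong, p_state)
        if test_ll < prev_test_ll:
            break
        prev_test_ll = test_ll
    return p_wrong, p_state

train = [42] * 80 + [40] * 8 + [41] * 6 + [48] * 4 + [13] * 2
test = [42] * 20 + [40] * 2 + [41] * 1 + [48] * 1
p_wrong, p_state = em(train, test)
print(p_wrong, p_state)
```

The early-stopping test mirrors the observation above: the training likelihood never decreases under EM, so the held-out set is what signals when to stop.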

Fig. 2. Computational blocks diagram: data preprocessing, splitting into training and test sets, and computation of the learned model.

Insights into Learning Competence through Probabilistic Graphical Models 13

4 Insights

The model of a particular student is learned by using the informed prior as a starting point and the answers he or she has given so far as evidence. The better and more accurately the model captures the learning competence of the student, the better the performance of the predictions of the answers will be. In some probabilistic modelling frameworks such as Figaro⁷, the parameter learning part is performed by an offline component and the probability queries by an online component.

There are three types of reasoning one can make with probabilistic graphical models: causal, evidential and explaining away. Causal reasoning (also called prediction) consists of statements that start with the knowledge of the causes as evidence and provide information about the effects. In our model this would be possible if Correctness_q and Learning State_q were known: the answer to a posed question could then be accurately determined. The direction of causal reasoning in directed graphical models generally goes from parent to child variables ("downstream") and is used to predict future events.

Evidential reasoning (also called explanation), on the other hand, has the opposite direction and involves situations where effects lead to the specification of causes. This is the most important reasoning type in our case, because the answers of the students provide the information to do evidential reasoning and learn the hidden variables Correctness_q and Learning State_q, which in turn can be used for causal reasoning to predict the future answers of each student. The difference between causal and evidential reasoning can be understood by considering the direction of time: evidential reasoning infers the past probability distribution from the current set of data, whereas causal reasoning makes a prediction for the future given the data. The great benefit of graphical models over statistics is that the same model is used for both backward and forward reasoning (with respect to the perception of time).
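That both reasoning directions come from the same model can be illustrated on one set of conditional probability tables (the values below are made up for the example, with 42 as the correct answer):

```python
# Illustrative CPTs for one question; all probability values are invented.
p_c = {"correct": 0.8, "wrong": 0.2}
p_ls = {"correct": {"none": 1.0},
        "wrong": {"operand": 0.6, "off-by": 0.4}}
p_a = {"none": {42: 1.0},
       "operand": {40: 0.4, 48: 0.6},
       "off-by": {40: 0.5, 41: 0.5}}

def predict_answer(a):
    """Causal (forward) reasoning: P(A = a), marginalizing over the causes."""
    return sum(p_c[c] * p_ls[c][ls] * p_a[ls].get(a, 0.0)
               for c in p_c for ls in p_ls[c])

def explain_answer(a):
    """Evidential (backward) reasoning: P(C, LS | A = a) via the Bayes rule."""
    joint = {(c, ls): p_c[c] * p_ls[c][ls] * p_a[ls].get(a, 0.0)
             for c in p_c for ls in p_ls[c]}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items() if v > 0}

print(predict_answer(40))  # forward: probability of observing the answer 40
print(explain_answer(40))  # backward: posterior over the causes of answer 40
```

The same three tables serve both queries; only the direction of the computation changes.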

Intercausal reasoning occurs when one random variable depends on two or more parents. In this case, the observation of the value of one parent influences the belief about the value of the other(s), either strengthening or weakening it. In this situation it is said that one reason explains away the other. The Learning Competence model's structure does not contain such cases; further discussion of this reasoning type can be found in [26], [30], [2].

The upcoming sections proceed with an analytical implementation of probabilistic queries that is specific to the designed Learning Competence models. Personalized insights, computed from the latent explanations of each student's wrong answers, are made possible by exact and efficient inference, as described in section 4.2.

4.1 Probability Queries

A conditional probability query P(Y | E = e), also called probabilistic inference, computes the posterior of the subset of random variables Y (the target of the query) given observations e of the subset of evidence variables E; of course there may be a subset of variables Z in the model not belonging to either of these two subsets. By using the Bayes rule, the conditional probability is written as:

⁷ https://www.cra.com/work/case-studies/figaro, Last accessed 25 August 2018

P(Y | E = e) = P(Y, e) / P(e)    (3)

The MAP query, which is also called most probable explanation (MPE) [30], [39], maximizes the posterior of the joint distribution of a subset of random variables Y:

MAP(Y | E = e) = argmax_y P(y, e)    (4)

In the case of the MAP query the whole set of random variables is X = {Y, E}. In other words, the MPE, after observing (clamping) a subset of variables, computes the most likely values of all the remaining ones jointly.

A slightly different query is the marginal MAP, which is written as follows:

Marginal MAP(Y | E = e) = argmax_y P(y | e) = argmax_Y Σ_Z P(Y, Z | E = e)    (5)

which directly follows from the fact that X = {Y, E, Z}.
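The difference between the two queries can be seen on a small, purely illustrative joint distribution, chosen so that the MPE and the marginal MAP over Y disagree:

```python
# Toy joint distribution P(Y, Z) with no evidence; the numbers are chosen
# so that the most probable joint assignment and the Y-marginal disagree.
joint = {
    ("y0", "z0"): 0.35,
    ("y0", "z1"): 0.05,
    ("y1", "z0"): 0.30,
    ("y1", "z1"): 0.30,
}

# MPE / MAP query: the single most probable joint assignment.
mpe = max(joint, key=joint.get)

# Marginal MAP over Y: maximize after summing Z out.
marginal_y = {}
for (y, _z), p in joint.items():
    marginal_y[y] = marginal_y.get(y, 0.0) + p
marginal_map_y = max(marginal_y, key=marginal_y.get)

print(mpe)             # ('y0', 'z0'): joint probability 0.35
print(marginal_map_y)  # 'y1': marginal probability 0.60
```

The most probable joint assignment contains y0, yet y1 has the higher marginal probability (0.60 versus 0.40), which is why the two queries must be kept apart.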

The computation of the query result can be made with the variable elimination algorithm, which is described in the following section 4.2; in this case the exact value of equation 3 is computed by dividing P(y, e) = Σ_w P(y, e, w) by P(e) = Σ_y P(y, e). Alternatively, normalizing a vector containing all P(y_k, e) (where y_k are all possible outcomes of the variables Y) so that its sum equals one also provides the desired result. For more complex Bayesian networks approximate inference algorithms are applied, since exact inference is NP-hard [29], but even those can be NP-hard in the worst case [30]. The Learning Competence models are simple, and the variable elimination algorithm is fast enough.

4.2 Variable elimination in the Learning Competence Model

The probability query that is of high relevance for teachers is the probability of the error types regarded as causes of a specific wrong answer. The sum-of-products expression in equation 6 computes the distribution of Learning State_q by means of the joint distribution P(C_q, LS_q, A_q):

P(LS_q) = Σ_{C_q, A_q} P(C_q, LS_q, A_q) = Σ_{C_q} P(C_q) P(LS_q | C_q) Σ_{A_q} P(A_q | LS_q)    (6)
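The sum-pushing in equation 6 can be sketched directly: eliminating A_q first reduces each row of P(A_q | LS_q) to one, and eliminating C_q then yields the marginal over learning states. The factor values below are illustrative, not learned parameters:

```python
# Factors of the model (illustrative values, same structure as equation 2).
p_c = {"correct": 0.8, "wrong": 0.2}
p_ls_given_c = {"correct": {"none": 1.0},
                "wrong": {"operand": 0.6, "off-by": 0.4}}
p_a_given_ls = {"none": {42: 1.0},
                "operand": {40: 0.4, 48: 0.6},
                "off-by": {40: 0.5, 41: 0.5}}

# Eliminate A_q first: every row of P(A_q | LS_q) sums to one, so the
# resulting factor tau_A(LS_q) equals 1 for every learning state.
tau_a = {ls: sum(p_a_given_ls[ls].values()) for ls in p_a_given_ls}

# Eliminate C_q next: P(LS_q) = sum_C P(C) P(LS | C) * tau_A(LS).
p_ls = {}
for c, dist in p_ls_given_c.items():
    for ls, p in dist.items():
        p_ls[ls] = p_ls.get(ls, 0.0) + p_c[c] * p * tau_a[ls]

print(p_ls)  # none 0.8, operand 0.12, off-by 0.08 (up to float rounding)
```

The elimination order matters for efficiency in general; here the model is small enough that any order works.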


Fig. 3. Parameters of the Learning Competence model of question 6 × 7 that are relevant to the computation of the MAP query when the answer is 40

The first step of the variable elimination algorithm, in case evidence exists, is to compute the unnormalized joint distribution P(C_{6×7}, LS_{6×7}, A_{6×7} = 40). For example, the faulty answer 40 for the question 6 × 7 eliminates all cases for which the answer is not equal to 40; it can belong only to two potential error types: operand and off-by. The remaining rows of the joint distribution, those containing an unnormalized proportion unequal to 0, are listed in table 2. The computations use the corresponding parameters of the Learning Competence model of question 6 × 7 depicted in figure 3.

Table 2. Unnormalized joint distribution P(C_{6×7}, LS_{6×7}, A_{6×7} = 40)

C_{6×7}  LS_{6×7}  A_{6×7}  unnormalized proportions
wrong    operand   40       0.158 · 0.336 · 0.035 = 1.85 · 10^-3
wrong    off-by    40       0.158 · 0.103 · 0.202 = 3.28 · 10^-3

The sum of the unnormalized proportions, 1.85 · 10^-3 + 3.28 · 10^-3 = 5.14 · 10^-3 (which is the value of P(A_{6×7} = 40)), can be used to compute the normalized probabilities of the causes of answer 40, as depicted in table 3.

Table 3. Normalized joint distribution P(C_{6×7}, LS_{6×7}, A_{6×7} = 40)

C_{6×7}  LS_{6×7}  A_{6×7}  normalized probabilities
wrong    operand   40       1.85 · 10^-3 / 5.14 · 10^-3 = 0.36
wrong    off-by    40       3.28 · 10^-3 / 5.14 · 10^-3 = 0.64


The process eventually performs the following computation in equation 7, which is in accordance with equation 3:

P(C_{6×7}, LS_{6×7} | A_{6×7} = 40) = P(C_{6×7}, LS_{6×7}, A_{6×7} = 40) / P(A_{6×7} = 40)    (7)

The distributions of Correctness_{6×7} and Learning State_{6×7} in the Learning Competence model are as follows:

Table 4. Correctness_{6×7} and Learning State_{6×7} distributions in question 6 × 7 before the user answers 40

wrong  correct
0.158  0.842

operand  intrusion  consistency  off-by  add/sub/div  pattern  unclassified
0.336    0.079      0.163        0.103   0.0014       0.072    0.243

After observing 40, the explanation probability distributions are as follows:

Table 5. Explanations distribution of wrong answers in question 6 × 7 after the user answers 40

wrong
1.0

operand  off-by
0.36     0.64

The result of the MAP query (most probable explanation) is the joint assignment MAP(Correctness_{6×7}, Learning State_{6×7}) = (wrong, off-by). The result of the marginal MAP query over Learning State_{6×7} alone states that the most probable cause of the answer is the off-by error, as seen in figure 4.
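The whole worked example can be reproduced from the parameter values of figure 3 and tables 2-5; the computation below is a direct transcription for illustration, not part of the original implementation:

```python
P_WRONG = 0.158  # P(Correctness = wrong) for question 6 x 7, from table 4

# P(Learning State | wrong) and P(Answer = 40 | Learning State) for the
# two error types that can produce the answer 40 (values from table 2).
params = {
    "operand": (0.336, 0.035),
    "off-by": (0.103, 0.202),
}

# Unnormalized joint P(wrong, ls, A = 40), as in table 2.
unnorm = {ls: P_WRONG * p_ls * p_a for ls, (p_ls, p_a) in params.items()}

# Normalizing constant P(A = 40) and posterior over causes, as in table 3.
evidence = sum(unnorm.values())
posterior = {ls: p / evidence for ls, p in unnorm.items()}

# The MAP query / most probable explanation of the observed answer 40.
map_cause = max(posterior, key=posterior.get)

print({ls: round(p, 2) for ls, p in posterior.items()})  # operand 0.36, off-by 0.64
print(map_cause)  # 'off-by'
```

Note how the off-by error wins the posterior even though the operand error has the higher prior probability under P(Learning State | wrong); the evidence 40 is simply much more likely under the off-by error type.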

This is an example of a case where one error type has a higher probability than another in the P(Learning State_q | Correctness_q = wrong) distribution, but the probability query can nevertheless state that the most probable cause of a particular answer is the second one.

The results of the probability queries depend on the parameters of the model, which in turn are influenced by the prior distribution and the number of EM iterations.

5 Future Work

The learned probabilistic model can be used in a generative scheme where the learning application samples the model to predict the answer of the student. There are several algorithms that compute samples from such models, with different characteristics [39], [30]. Particularly for this model, where the dataset is highly unbalanced and the number of correctly answered questions is predominant, the metric used to measure prediction performance should take this fact into account. Although this feature does not provide an insight per se, it can be a starting point for other informative learning aspects. One aspect is explainable AI, which combines Bayesian learning approaches with classic logical approaches and ontologies, thereby making use of re-traceability and transparency [23].

Fig. 4. Learning State_{6×7} and explanations distribution of question 6 × 7 before and after the user answers 40

Even though the proposed research extends the capabilities of the current learning application considerably, it cannot answer the fundamental question of which question would be the most appropriate to pose to the student next. After the different learning competences are derived, the handling is delegated to the teacher, not to the application itself. Further considerations apply to whether the models of learning competences that could be grouped together are the ones where the students will follow the same learning path until they have learned to answer all questions correctly. The goal of this learning-aware application is not to group the learning competences by similarity of their parameters (which express the current situation), but to find which ones will lead to similar optimal learning paths. This learning-aware application could benefit from an answer prediction component that accurately simulates students' learning paths.

References

1. Barga, R., Fontama, V., Tok, W.H., Cabrera-Cordon, L.: Predictive analytics with

Microsoft Azure machine learning. Springer (2015)

2. Bishop, C.: Pattern recognition and machine learning. Springer (2006)


3. Bloice, M., Simonic, K.M., Holzinger, A.: On the usage of health records for the

teaching of decision-making to students of medicine. In: Huang, R., Kinshuk, Chen,

N.S. (eds.) The New Development of Technology Enhanced Learning, pp. 185–201.

Springer Berlin Heidelberg (2014). https://doi.org/10.1007/978-3-642-38291-8_11

4. Brusilovsky, P., Millan, E.: User models for adaptive hypermedia and adaptive

educational systems. In: The adaptive web, pp. 3–53. Springer, Heidelberg (2007).

https://doi.org/10.1007/978-3-540-72079-9_1

5. Brusilovsky, P., Mill´an, E.: User models for adaptive hypermedia and adaptive

educational systems. In: The adaptive web, pp. 3–53. Springer (2007)

6. Brusilovsky, P., Peylo, C.: Adaptive and intelligent web-based educational systems.

International Journal of Artiﬁcial Intelligence in Education (IJAIED) 13(2-4), 159–

172 (2003)

7. Bunt, A., Conati, C.: Probabilistic student modelling to improve exploratory be-

haviour. User Modeling and User-Adapted Interaction 13(3), 269–309 (2003)

8. Campbell, J.I.: Mechanisms of simple addition and multiplication: A modiﬁed

network-interference theory and simulation. Mathematical cognition 1(2), 121–164

(1995)

9. Campbell, J.I.: On the relation between skilled performance of simple division

and multiplication. Journal of Experimental Psychology: Learning, Memory, and

Cognition 23(5), 1140–1159 (1997)

10. Chang, K.m., Beck, J., Mostow, J., Corbett, A.: A bayes net toolkit for student

modeling in intelligent tutoring systems. In: Proceedings of the 8th International

Conference on Intelligent Tutoring Systems. pp. 104–113. Springer (2006)

11. Chater, N., Tenenbaum, J.B., Yuille, A.: Probabilistic models of cognition: Con-

ceptual foundations. Trends in cognitive sciences 10(7), 287–291 (2006)

12. Chrysaﬁadi, K., Virvou, M.: Student modeling approaches: A literature review for

the last decade. Expert Systems with Applications 40(11), 4715–4729 (2013)

13. Conati, C., Gertner, A., Vanlehn, K.: Using bayesian networks to manage uncer-

tainty in student modeling. User modeling and user-adapted interaction 12(4),

371–417 (2002)

14. Conati, C., Gertner, A.S., VanLehn, K., Druzdzel, M.J.: On-line student modeling

for coached problem solving using bayesian networks. In: User Modeling UM 97.

pp. 231–242. Springer (1997). https://doi.org/10.1007/978-3-7091-2670-7_24

15. Danaparamita, M., Gaol, F.L.: Comparing student model accuracy with bayesian

network and fuzzy logic in predicting student knowledge level. International Jour-

nal of Multimedia and Ubiquitous Engineering 9(4), 109–120 (2014)

16. Domahs, F., Delazer, M., Nuerk, H.C.: What makes multiplication facts diﬃcult:

Problem size or neighborhood consistency? Experimental Psychology 53(4), 275–

282 (2006)

17. Ebner, M., Neuhold, B., Sch¨on, M.: Learning analytics–wie datenanalyse helfen

kann, das lernen gezielt zu verbessern. In: Wilbers, K., Hohenstein, A. (eds.)

Handbuch E-Learning-Expertenwissen aus Wissenschaft und Praxis-Strategie, In-

strumente, Fallstudien, pp. 1–20. Deutscher Wirtschaftsdienst (Wolters Kluwer

Deutschland), 48, erg.-lfg edn. (2013)

18. Ebner, M., Sch¨on, M.: Why learning analytics in primary education matters. Bul-

letin of the Technical Committee on Learning Technology 15(2), 14–17 (2013)

19. Ebner, M., Sch¨on, M., Taraghi, B., Steyre, M.: Teachers little helper: Multi-math-

coach. International Association for Development of the Information Society (2013)

20. Ebner, M., Taraghi, B., Saranti, A., Sch¨on, S.: Seven features of smart learn-

ing analytics-lessons learned from four years of research with learning analytics.

eLearning papers 40, 51–55 (2015)


21. Gamboa, H., Fred, A.: Designing intelligent tutoring systems: a bayesian approach.

Enterprise information systems 3, 452–458 (2002)

22. Garc´ıa, P., Amandi, A., Schiaﬃno, S., Campo, M.: Evaluating bayesian networks

precision for detecting students learning styles. Computers & Education 49(3),

794–808 (2007)

23. Goebel, R., Chander, A., Holzinger, K., Lecue, F., Akata, Z., Stumpf, S.,

Kieseberg, P., Holzinger, A.: Explainable ai: the new 42? In: Springer Lec-

ture Notes in Computer Science LNCS 11015. pp. 295–303. Springer (2018).

https://doi.org/10.1007/978-3-319-99740-7_21

24. Goguadze, G., Sosnovsky, S., Isotani, S., McLaren, B.M.: Towards a bayesian stu-

dent model for detecting decimal misconceptions. In: Proceedings of the 19th Inter-

national Conference on Computers in Education. pp. 34–41. Chiang Mai, Thailand

(2011)

25. Goguadze, G., Sosnovsky, S.A., Isotani, S., McLaren, B.M.: Evaluating a bayesian

student model of decimal misconceptions. In: Proceedings of the 4th International

Conference on Educational Data Mining. pp. 301–306. Citeseer (2011)

26. Karkera, K.R.: Building probabilistic graphical models with Python. Packt Pub-

lishing Ltd (2014)

27. K¨aser, T., Klingler, S., Schwing, A.G., Gross, M.: Dynamic bayesian networks for

student modeling. IEEE Transactions on Learning Technologies 10(4), 450–462

(2017)

28. Klinkenberg, S., Straatemeier, M., van der Maas, H.L.: Computer adaptive practice

of maths ability using a new item response model for on the ﬂy ability and diﬃculty

estimation. Computers & Education 57(2), 1813–1824 (2011)

29. Kochenderfer, M.J.: Decision making under uncertainty: theory and application.

MIT Press (2015)

30. Koller, D., Friedman, N.: Probabilistic graphical models: principles and techniques.

MIT Press (2009)

31. Markowska-Kaczmar, U., Kwasnicka, H., Paradowski, M.: Intelligent techniques in

personalization of learning in e-learning systems. In: Computational Intelligence

for Technology Enhanced Learning, pp. 1–23. Springer (2010)

32. Mill´an, E., Agosta, J.M., P´erez de la Cruz, J.L.: Bayesian student modeling and

the problem of parameter speciﬁcation. British Journal of Educational Technology

32(2), 171–181 (2001)

33. Mill´an, E., Loboda, T., P´erez-De-La-Cruz, J.L.: Bayesian networks for student

model engineering. Computers & Education 55(4), 1663–1683 (2010)

34. Mill´an, E., P´erez-De-La-Cruz, J.L.: A bayesian diagnostic algorithm for student

modeling and its evaluation. User Modeling and User-Adapted Interaction 12(2-

3), 281–330 (2002)

35. Mill´an, E., Trella, M., P´erez-de-la Cruz, J.L., Conejo, R.: Using bayesian networks

in computerized adaptive tests. In: Computers and Education in the 21st Century,

pp. 217–228. Springer (2000)

36. Nouh, Y., Karthikeyani, P., Nadarajan, R.: Intelligent tutoring system-bayesian

student model. In: 1st International Conference on Digital Information Manage-

ment. pp. 257–262. IEEE (2006)

37. Pardos, Z.A., Heﬀernan, N.T., Anderson, B., Heﬀernan, C.L.: Using ﬁne-grained

skill models to ﬁt student performance with bayesian networks. Handbook of edu-

cational data mining pp. 417–426 (2010)

38. Pearl, J.: Embracing causality in default reasoning. Artiﬁcial Intelligence 35(2),

259–271 (1988)


39. Pfeﬀer, A.: Practical Probabilistic Programming. Manning Publications (2016)

40. Romero, C., Ventura, S.: Educational data mining: a review of the state of the art.

IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and

Reviews) 40(6), 601–618 (2010)

41. Schiaﬃno, S., Garcia, P., Amandi, A.: eteacher: Providing personalized assistance

to e-learning students. Computers & Education 51(4), 1744–1754 (2008)

42. Sch¨on, M., Ebner, M., Kothmeier, G.: It’s just about learning the multiplication

table. In: Buckingham Shum, S., Gasevic, D., Ferguson, R. (eds.) Proceedings of

the 2nd international conference on learning analytics and knowledge. pp. 73–81.

ACM, New York, NY, USA (2012)

43. Seidenberg, M.S., McClelland, J.L.: A distributed, developmental model of word

recognition and naming. Psychological review 96(4), 523–568 (1989)

44. Siemens, G., d Baker, R.S.: Learning analytics and educational data mining: to-

wards communication and collaboration. In: Proceedings of the 2nd international

conference on learning analytics and knowledge. pp. 252–254. ACM (2012)

45. Stacey, K., Flynn, J.: Evaluating an adaptive computer system for teaching about

decimals: Two case studies. In: AI-ED2003 Supplementary Proceedings of the 11th

International Conference on Artiﬁcial Intelligence in Education. pp. 454–460. Cite-

seer (2003)

46. Stacey, K., Sonenberg, E., Nicholson, A., Boneh, T., Steinle, V.: A teaching model

exploiting cognitive conﬂict driven by a bayesian network. In: International Con-

ference on User Modeling. pp. 352–362. Springer (2003)

47. Taraghi, B., Ebner, M., Saranti, A., Sch¨on, M.: On using markov chain to evi-

dence the learning structures and diﬃculty levels of one digit multiplication. In:

Proceedings of the Fourth International Conference on Learning Analytics And

Knowledge. pp. 68–72. ACM (2014)

48. Taraghi, B., Frey, M., Saranti, A., Ebner, M., M¨uller, V., Großmann, A.: Determin-

ing the causing factors of errors for multiplication problems. In: European Summit

on Immersive Education. pp. 27–38. Springer (2014)

49. Taraghi, B., Saranti, A., Ebner, M., Mueller, V., Grossmann, A.: Towards a

learning-aware application guided by hierarchical classiﬁcation of learner proﬁles.

J. UCS 21(1), 93–109 (2015)

50. Taraghi, B., Saranti, A., Ebner, M., Sch¨on, M.: Markov chain and classiﬁcation

of diﬃculty levels enhances the learning path in one digit multiplication. In: In-

ternational Conference on Learning and Collaboration Technologies. pp. 322–333.

Springer (2014)

51. Xenos, M.: Prediction and assessment of student behaviour in open and distance

education in computers using bayesian networks. Computers & Education 43(4),

345–359 (2004)

52. Zapata-Rivera, J.D., Greer, J.E.: Interacting with inspectable bayesian student

models. International Journal of Artiﬁcial Intelligence in Education 14(2), 127–

163 (2004)