Sim-GAIL: A Generative Adversarial
Imitation Learning Approach of Student
Modelling for Intelligent Tutoring Systems
Zhaoxing Li1, Lei Shi1,2*, Jindi Wang1, Alexandra I. Cristea1
and Yunzhan Zhou1
1Department of Computer Science, Durham University, Durham,
UK.
2Open Lab, School of Computing, Newcastle University,
Newcastle Upon Tyne, UK.
*Corresponding author(s). E-mail(s): lei.shi@newcastle.ac.uk;
Contributing authors: zhaoxing.li2@durham.ac.uk;
jindi.wang@durham.ac.uk; alexandra.i.cristea@durham.ac.uk;
yunzhan.zhou@durham.ac.uk;
Abstract
The continuous application of Artificial Intelligence (AI) technologies in
online education has led to significant progress, especially in the field
of Intelligent Tutoring Systems (ITS), online courses and learning man-
agement systems (LMS). An important research direction of the field
is to provide students with customised learning trajectories via student
modelling. Previous studies have shown that customisation of learn-
ing trajectories could effectively improve students’ learning experiences
and outcomes. However, training an ITS that can customise students'
learning trajectories suffers from the cold-start problem and is time-consuming,
labour-intensive, and costly. One feasible approach is to simu-
late real students’ behaviour trajectories through algorithms, to generate
data that could be used to train the ITS. Nonetheless, implementing
high-accuracy student modelling methods that effectively address these
issues remains an ongoing challenge. Traditional simulation methods, in
particular, encounter difficulties in ensuring the quality and diversity of
the generated data, thereby limiting their capacity to provide ITS with
high-fidelity and diverse training data. We
thus propose Sim-GAIL, a novel student modelling method based on
Generative Adversarial Imitation Learning (GAIL). To the best of our
knowledge, it is the first method using GAIL to address the challenge
of lacking training data, resulting from the issues mentioned above. We
analyse and compare the performance of Sim-GAIL with two traditional
Reinforcement Learning-based and Imitation Learning-based methods
using action distribution evaluation, cumulative reward evaluation, and
offline-policy evaluation. The experiments demonstrate that our method
outperforms traditional ones on most metrics. Moreover, we apply our
method to a domain plagued by the cold start problem, Knowledge Trac-
ing (KT), and the results show that our novel method could effectively
improve the KT model’s prediction accuracy in a cold-start scenario.
Keywords: Student Modelling, Generative Adversarial Imitation Learning,
Intelligent Tutoring Systems.
1 Introduction
Intelligent Tutoring Systems (ITS) are increasingly incorporating Artificial
Intelligence (AI) technologies, including machine learning and deep learning,
which could effectively offer customised learning trajectories for each student
based on their prior knowledge and learning activities [1]. Research in cognitive
science has shown that there is a strong relationship between, amongst others,
the sequence of learning materials and learning outcomes [2]. In a traditional
online learning platform, only a single, static, linear learning trajec-
tory is provided to students. In this one-size-fits-all approach, students may lose
their motivation and even drop out of the course, due to anxiety or boredom
encountered in the learning process [3]. Research on customised learning tra-
jectories for students has been emerging in the ITS field. However, developing
an ITS that can provide students with customised learning trajectories requires
a large amount of data for training the system, which is time-consuming and
costly [4], and has long been known to require a large amount of manual labour from
education providers (instructors, authors, etc.) [5]. Although many mature
ITSs have sufficient data to train algorithms, a large number of emerging ITSs
are still suffering from a lack of training data in the early stages of development,
also known as the cold start problem [6].
To tackle these challenges, previous studies have proposed various methods
for simulating student learning trajectories (i.e., generating massive student
learning behavioural data) that can be used to train an ITS. Early simulated
student behaviour proposals stemmed from the aim of automatically validating
educational interventions via a sandbox method [7]. More recently, Jar-
boui et al. attempted to model student trajectory sequences as a Markov
Decision Process [8], but in real educational scenarios, only a few ITS can pro-
vide all the feature data consistent with a Markov Decision Process (e.g., the
reward function of the ITS agent). Zimmer et al. defined reward functions to
build reinforcement learning agents [9] to generate student trajectories, but
this method requires building different reward functions for different datasets,
which makes it difficult to generalise. Besides, humans’ psychological responses
to learning trajectories and reward mechanisms are difficult to simulate. This
leads to circumstances where student simulation methods may not be able to
simulate student learning trajectories sufficiently. Anderson et al. proposed a
student simulation method based on Behavioural Cloning (BC) [10], the sim-
plest form of Imitation Learning, which aims to solve the abovementioned
problems where the reward is sparse and hard to define [11]. Whilst promis-
ing, BC-based methods only learn from the few features collected in student
data, and the actions that algorithms are able to model can be very limited.
Motivated by the gap in prior literature identified above, the research
question of this paper is: How to build an efficient student simulation
method that can generate massive student learning data, which can
be used for ITS training?
To answer this research question, we propose Sim-GAIL, a Generative
Adversarial Imitation Learning (GAIL) approach to student modelling. Our
Sim-GAIL method can be used to generate simulated student data to solve
the lack of data and cold start problems in ITS training.
Furthermore, to showcase its efficiency, we compare our Sim-GAIL with the
two main student modelling methods used in the ITS field, the RL-based and
the BC-based student modelling approach, using data from the very recent and
largest ITS dataset, EdNet [12]. We extract action and state features to train
the models. We analyse and compare performance using action distribution
evaluation, cumulative reward evaluation (CRE), and two offline-policy eval-
uation (OPE) methods, which include Importance Sampling (IS) and Fitted
Q Evaluation (FQE). Moreover, we apply our method’s generated data in an
ITS cold-start scenario. The experimental results show that our method out-
performs the two traditional RL-based and BC-based baseline methods and
could improve the training efficiency of the ITS in a cold-start scenario.
The main contributions of this work lie in the following three aspects:
1. We propose Sim-GAIL, a student modelling approach, to generate simula-
tion data for ITS training.
2. It is the first method, to the best of our knowledge, that uses Generative
Adversarial Imitation Learning (GAIL) to implement student modelling to
address the challenge of lacking training data and the cold start problem.
3. The experiments demonstrate that a trained Sim-GAIL could simulate real
student learning trajectories very well. Our method outperforms traditional
RL-based and BC-based methods on most metrics and can improve the
training efficiency in cold start scenarios.
Thus, the advantages of Sim-GAIL include its ability to effectively gen-
erate data resembling real student behaviours, address the cold-start problem,
demonstrate superior performance on various metrics, efficiently converge to
an optimal policy, and offer scalability and generality across different datasets
and applications.
This paper is structured as follows. Section 2 introduces the background of
reinforcement learning, imitation learning (including behavioural cloning), and
student modelling. Section 3 describes the dataset, data pre-processing,
and model architecture. Section 4 outlines the experiments and baseline mod-
els. Section 5 discusses the evaluation methods and the experimental results
based on action distribution, Offline Policy (OP) evaluation, Expected Cumu-
lative Rewards (ECR) evaluation, and Knowledge Tracing (KT). Section 6
discusses our findings and future works. Section 7 draws conclusions.
2 Background and Literature Review
Before analysing existing alternatives to the student modelling approach for
generating training data presented in this paper, we review the current state of
the underlying methodologies: the Markov decision process, reinforcement learn-
ing, imitation learning, and finally, the method at the basis of our proposal,
generative adversarial imitation learning.
2.1 Markov Decision Process & Reinforcement Learning
The Markov Decision Process (MDP) is the standard method for sequential
decision-making (SDM) [13]. Sequential decision-making models can gen-
erally be seen as instances of the Markov Decision Process. Reinforcement
learning is also typically formalised as an MDP [14]. Therefore, in this section,
we introduce MDP and then reinforcement learning.
2.1.1 Markov Decision Process (MDP)
MDP is a mathematical model of sequential decision-making used to generate stochas-
tic policies and rewards, achievable by an agent in an environment where the
system state exhibits Markov properties [15]. MDPs are represented as a set of
interacting objects, namely agents and environments, with components includ-
ing states, actions, policies, and rewards. In an MDP model, the agent observes
the present state of the environment and takes actions on the environment
in accordance with the policy, thereby changing the state of the environment
and getting rewards. The ultimate goal of the agent is to reach the maximum
cumulative reward, which is achieved using a reward function [16]. Figure 1
shows the structure of the MDP.
Fig. 1 Framework of the Markov Decision Process.
2.1.2 Reinforcement Learning (RL)
RL is a type of machine learning method that enables an agent to learn a
policy by taking different actions in an interactive environment, in order to
maximise cumulative rewards. It can be defined as the tuple (S, A, P, R),
where S denotes the state space of the environment, A represents the actions of
the agent, P : S × A × S → [0, 1] represents the transition probabilities
from the current state to the next state, and R : S × A × S → ℝ
denotes the reward function. The goal of an RL agent is to achieve the maximum
cumulative reward. However, the drawback of traditional RL methods lies in
their computational overhead, brought about by repeated interactions between the
agent and the environment.
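To make the (S, A, P, R) formulation concrete, the following minimal sketch runs tabular Q-learning on a toy MDP; the state and action space sizes, the simplified reward R(s, a) (rather than R(s, a, s')), and all numerical values are illustrative assumptions, not part of Sim-GAIL or EdNet.

```python
import numpy as np

# Toy MDP matching the (S, A, P, R) tuple described above (sizes and values are illustrative).
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state probabilities
R = rng.normal(size=(n_states, n_actions))                        # simplified reward R(s, a)

gamma, alpha, epsilon = 0.95, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

# Tabular Q-learning: the agent learns a policy that maximises cumulative reward
# by repeatedly interacting with the environment (the costly part noted above).
s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())  # epsilon-greedy
    s_next = rng.choice(n_states, p=P[s, a])
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])                # TD update
    s = s_next

print("Greedy policy per state:", Q.argmax(axis=1))
```

The repeated environment interactions in the loop above are exactly the source of the computational overhead mentioned in the text.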
2.2 Imitation Learning (IL)
Different from RL, where the agent learns by interacting with the environment
to obtain the maximum rewards, IL is a method of learning a policy by
emulating the behaviour demonstrated in experts' trajectories [17], instead of leveraging an
explicit reward function as in RL.
2.2.1 Behavioural Cloning (BC)
BC considers the learning of policy under supervised learning settings, lever-
aging state-action pairs [18,19]. Albeit simple and effective, BC suffers from
the heavy reliance on extremely large amounts of data [20,21], without which
a distributional mismatch, often referred to as covariate shift [22,23], would
occur, due to compounding errors and stochasticity in the environment during
test time.
2.2.2 Apprenticeship Learning (AL)
Different from BC, AL instead tries to identify features of the expert’s trajec-
tories that are more generalisable and to find a policy that matches the same
feature expectations with respect to the expert [24]. Its goal is to find a pol-
icy that performs no worse than the expert across a class of cost functions.
The main limitation of AL is that it cannot imitate the expert trajectory well,
due to the restricted class of cost functions. Specifically, when the true cost
function does not lie within the cost function classes, the agent cannot be
guaranteed to outperform the expert.
2.3 Generative Adversarial Imitation Learning (GAIL)
GAIL addresses the drawbacks of RL and AL effectively [20], by borrowing the
idea of Generative Adversarial Networks (GANs) [25]. It is derived from a type
of Imitation Learning, called Maximum Causal Entropy Inverse Reinforcement
Learning (MaxEntIRL) [26].
Figure 2 shows the mechanism of GAIL. Integrating GANs into imitation
learning allows for the Generator never to be exposed to real-world examples,
enabling agents to learn only from experts’ demonstrations. In GAIL, the
Discriminator is trained with the objective of distinguishing the generated
trajectories from real trajectories, while the Generator, on the other hand,
attempts to imitate the real trajectories, to fool the Discriminator into thinking
it is actually one of them.
Fig. 2 Mechanism of Generative Adversarial Imitation Learning.
2.4 Student Modelling
As the traditional one-size-fits-all approach can no longer satisfy students' learn-
ing needs, the demand for customised learning has increased [27,28].
Various student modelling methods have been proposed, which are gener-
ally classified as expert knowledge-based or data-driven methods
[29,30]. Knowledge-based methods refer to utilising human knowledge to
address issues that would normally require human intelligence [7,31]. Data-
driven methods simulate students’ learning trajectories through massive
student learning records data [6,32,33].
2.4.1 Expert Knowledge-based Methods
The majority of the studies in this field involve building different forms of stu-
dent models, to train a reinforcement learning (RL) agent [34]. Iglesias et al.
proposed a Markov Decision Process based on expert knowledge, to train stu-
dent models [35]. Doroudi et al. suggested an RL-based agent method rooted
in cognitive theory, to optimise the sequencing of the knowledge components
(KCs) [34]. The reward function of this method is based on pre- and post-test
scores, taken as a metric, and termed Normalised Learning Gain (NLG). How-
ever, this metric needs evaluation by human participants, which is excessively
human resource-intensive. Yudelson et al. proposed a ‘Student Simulation’
method based on Bayesian Knowledge Tracing (BKT), which could train a ‘sim
student’ to imitate real students’ mastery of different knowledge [36]. Segal et
al. suggested a student simulation method based on Item Response Theory
(IRT) [37], which could produce different reactions to courses at different
difficulty levels [38]. Azhar et al. [39] introduced an application of Reinforce-
ment Learning (RL) for optimising the learning sequence modelling of online
learning materials, which is an end-to-end pipeline to automatically derive
and evaluate robust representations of students’ interactions and policies for
content sequencing in online education.
2.4.2 Data-driven methods
Compared with expert knowledge-based methods, data-driven
methods could better simulate real students’ learning trajectories and more
effectively reduce biases [13]. There have been some studies [40–42] aiming to
build student simulation methods based on data-driven MDP approaches. For
example, Beck et al. proposed a Population Student Model (PSM) based on a
linear regression model that could simulate the probability of the student’s cor-
rect response [43]. However, this method requires a high-quality dataset from
real ITS platforms. Limited by the scarcity of high-quality datasets, previ-
ous data-driven models struggled to keep up with the expanding requirements
of ITS development. Li et al. proposed a student behaviour simulation method
based on a Decision Transformer, to generate student behaviour data for ITS
training [6,33]. Emond et al. [44] proposed an Adaptive Instructional System
(AIS) as a self-improvement system. It presented a methodological approach
that incorporates three concurrent research activities: Bayesian networks mod-
elling of learning processes, knowledge elicitation from expert instructors, and
the use of simulated learners and tutors for exploring AIS design options. On
the other hand, with the further development of ITS research, more and more
high-quality datasets, such as EdNet [12], have been published in recent years,
which can be used to achieve high-quality data-driven student simulations.
However, collecting data like the EdNet dataset is extremely time-consuming
and labour-intensive. How to improve the effectiveness of ITS with small data
volumes or in a cold-start scenario is still a problem that needs to be addressed.
3 Method
In this section, we introduce the methodology for the research described in this
paper. First, we describe the EdNet dataset we use, in Section 3.1. In Section
3.2, we show how we preprocess the data in EdNet, to obtain the features we
need. We then articulate the framework of our Sim-GAIL method in Section
3.3.
3.1 Dataset
We adopt EdNet [12], the largest dataset in the ITS field, for our experiments.
This dataset comprises students’ interaction log data with an ITS, which can be
used to extract the state and action representation. EdNet is a massive bench-
mark dataset of interactions between students and a MOOC learning platform
called SANTA (https://www.aitutorsanta.com). SANTA is a TOEIC (Test of English for International Com-
munication) learning platform in South Korea, and the EdNet dataset was
collected by Riiid! AI Research (https://www.riiid.co). There are 131,417,236 interaction logs col-
lected from 784,309 students in 13,169 exercises over two years, as shown in
Table 1. The interaction logs for each student are recorded in an indepen-
dent CSV (Comma-Separated Values) file. EdNet is a four-layer hierarchical
dataset, structured from KT1 to KT4, according to the granularity of inter-
active actions. KT1 only contains simple information, such as question and
answer pairs and elapsed time. Based on the information in KT1, to provide
correlation information between student behaviour and question-and-answer
sequences, EdNet adds detailed action records to KT2, such as watching video
lectures and reading articles. In KT3, actions such as choosing response options
and reviewing explanations are added to KT2, which can be used to infer the
influence of different learning activities on students’ knowledge states. KT4
includes the finest detailed action information, such as purchasing courses,
and pausing and playing video lectures, which could be used to investigate the
impact of sparse key actions on overall learning outcomes.
Table 1 Statistics of the EdNet dataset.
Number of Interactions 131,417,236
Number of Students 784,309
Number of Exercises 13,169
3.2 Data Preprocessing
Problems involving sequential decision-making are generally formulated as MDPs
[8] (see Section 2.1.1). In this experiment, we view the students'
sequential decision-making trajectories as a Markov Decision Process. Extract-
ing the action space and state space of the real students’ data is essential for
building an effective student simulation method using MDP. Next, we show
how we explore the data and extract the action space and state space.
3.2.1 Action Space
There are 13,169 questions, 1,021 lectures, and 293 kinds of skills in EdNet
[12]. However, there are no criteria for separating these courses into different
parts. Bassen et al. [4] proposed a method to group knowledge concepts, based
on the assumption that each part was grouped by domain experts’ experience.
Inspired by this method, we divide the lectures and questions space of the
agent into 7 groups. However, as a division into 7 groups alone is too coarse-
grained for the action space, we further use the method proposed in [38],
and rank the difficulty of the questions from 1 to 4 by the answer correctness
rate, obtained by comparing the students' answer logs with the correct answers.
Some lectures lack a difficulty ranking and are therefore assigned a default dif-
ficulty value of 0. Hence, all action spaces are divided into 5 difficulty levels,
with 7 groups, and thus 35 action types in total. Figure 3 shows the distri-
bution of the 35 types of actions in EdNet. In each group, the action types
include 4 questions from difficulty levels 1 to 4, and 1 lecture. Taking Group
1, for example, actions 1 to 4 correspond to questions with different difficulty
levels, and action 5 corresponds to lectures where the difficulty level cannot be
defined, which is set as 0. As shown in Figure 3, the rest of the groups follow
this pattern.
Fig. 3 Analysis of the action distribution of the EdNet dataset.
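To illustrate the grouping described above, the sketch below maps a (group, difficulty) pair to one of the 35 action ids: within each group, actions 1-4 are questions of increasing difficulty and action 5 is the lecture (difficulty 0). The correctness-rate thresholds used here to assign difficulty are hypothetical.

```python
def difficulty_level(correct_rate):
    """Assign a difficulty level 1-4 from a question's answer-correctness rate
    (thresholds are hypothetical); lectures, which have no correctness rate, get level 0."""
    if correct_rate is None:
        return 0
    return min(4, int((1.0 - correct_rate) * 4) + 1)

def action_id(group, level):
    """Encode (group 1-7, difficulty 0-4) into one of the 35 action types:
    within each group, actions 1-4 are questions and action 5 is the lecture."""
    assert 1 <= group <= 7 and 0 <= level <= 4
    return (group - 1) * 5 + (level if level > 0 else 5)

# Example: the group-5 lecture maps to action 25, matching the description of Figure 3.
print(action_id(5, 3), action_id(5, 0))   # -> 23, 25
```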
Fig. 4 The Sim-GAIL Pipeline
3.2.2 State Space
EdNet records the interaction data for each student with the system, in sep-
arate CSV files, via UNIX timestamps. Therefore, most of the state features
obtained from EdNet are longitudinal and temporal. Previous works have
shown that different state feature choices could make a large difference in
the performance of the algorithms [40,45]. We select state features that are
widely chosen in similar simulated student works [4,35,40,42]. Transitions
between these selected states represent students’ learning trajectories. Table 2
shows the features we select from EdNet: ‘correct so far’ is the proportion of
the correct answer to the number of all activities attempted; ‘av time’ is the
cumulative average of the elapsed time spent on each action; ‘av fam’ denotes
the average familiarity of the 7 groups; ‘topic fam’ denotes the familiarity with
the current group; ‘prev correct’ denotes the number of correct answers in the
previous group; and ‘steps in part’ counts student learning steps in the cur-
rent group. Compared to previous works [4,40], we select more state features,
which could potentially simulate the students’ trajectories in real situations
more effectively.
Table 2 State feature representation
State Feature Description
‘correct so far’ The ratio of correct responses
‘av time’ The cumulative average of the elapsed time
‘av fam’ Average familiarity of all parts
‘topic fam’ Familiarity with the current part
‘prev correct’ Numbers of correct answers in previous responses
‘steps in part’ Counts of student learning steps
‘lects consumed’ Numbers of lectures a student has learnt
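A rough sketch of how the Table 2 features could be derived from a single student's chronologically ordered interaction log is shown below; the input column names ('correct', 'elapsed_time', 'group', 'is_lecture') and the count-based proxies for familiarity are assumptions made for illustration, not the paper's exact feature engineering.

```python
import pandas as pd

def build_state_features(log: pd.DataFrame) -> pd.DataFrame:
    """Derive the Table 2 state features from one student's chronologically ordered log.
    Assumed input columns: 'correct' (0/1), 'elapsed_time', 'group' (1-7), 'is_lecture' (0/1)."""
    df = log.copy()
    df['correct_so_far'] = df['correct'].expanding().mean()            # ratio of correct responses
    df['av_time'] = df['elapsed_time'].expanding().mean()              # cumulative average elapsed time
    df['steps_in_part'] = df.groupby('group').cumcount() + 1           # steps in the current group
    df['topic_fam'] = df['steps_in_part']                              # familiarity with current group (proxy)
    df['av_fam'] = df['topic_fam'].expanding().mean()                  # average familiarity over groups
    df['prev_correct'] = df['correct'].cumsum().shift(1, fill_value=0) # correct answers so far (simplified)
    df['lects_consumed'] = df['is_lecture'].cumsum()                   # lectures consumed so far
    return df[['correct_so_far', 'av_time', 'av_fam', 'topic_fam',
               'prev_correct', 'steps_in_part', 'lects_consumed']]
```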
3.3 Sim-GAIL Model Architecture
Our Sim-GAIL model is built upon Generative Adversarial Imitation Learn-
ing (GAIL) [20], which aims to address the difficulty that Imitation Learning
has in dealing with constant regularisation and in matching
occupancy measures in large environments. Equation 1 gives
the optimal negative log loss for distinguishing between state-action pairs drawn
from the policy π and from the expert policy π_E:
\[
\psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S}\times\mathcal{A}}} \mathbb{E}_\pi[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))], \quad (1)
\]
where ψ*_GA is the convex conjugate of the GAIL cost regulariser and D is the discrimi-
native classifier. Using the causal entropy H as the policy regulariser, the following
objective can be derived:
\[
\operatorname*{minimise}_{\pi} \; \psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) - \lambda H(\pi) = D_{JS}(\rho_\pi, \rho_{\pi_E}) - \lambda H(\pi). \quad (2)
\]
This equation combines Imitation Learning (IL) and Generative Adversarial
Networks (GAN) [25]. The Generator generates trajectories that are passed to the
Discriminator D. The Generator’s goal is to make it less likely for the Dis-
criminator to differentiate the real trajectories and those generated by the
Generator, whilst the Discriminator’s goal is to distinguish between them. The
Generator achieves the best learning effect when the Discriminator fails to
recognise the generated trajectories. Lastly, ρ_{π_E} in Equation 1 is the occupancy
measure of the real trajectories. The resulting saddle-point objective is
\[
\mathbb{E}_\pi[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi). \quad (3)
\]
Function approximators are used for both π and D. TRPO [46] is used to find a
saddle point (π, D) of Expression 3. To decrease the
expected cost, we use the cost function c(s, a) = log D(s, a): as the
Discriminator is trained, this cost drives the generated trajectories toward regions
of the state-action space that resemble the real trajectories.
Figure 4 shows the pipeline of Sim-GAIL. Real student data from EdNet
is processed by the methods introduced in Section 3.2 and fed into the GAIL
module (middle part) to create a simulation policy that can be used for train-
ing the 'sim student' (right part). The middle part is described in Algorithm 1.
We start by initialising the policy parameters θ and the Discriminator D. At each iteration,
we sample real student trajectories from the dataset and update the Discrimi-
nator parameters using the Adam gradient rule [47]. Then, we take a policy update
step using the TRPO rule, i.e., a KL-constrained natural gradient step, to decrease
the expected cost [46].
Algorithm 1 Sim-GAIL.
Require: Real student trajectories τ_E ∼ π_E; initial policy parameters θ_0 and Discriminator parameters w_0
1: for i = 0, 1, 2, ... do
2:     Sample student trajectories τ_i ∼ π_{θ_i}
3:     Update the Discriminator parameters from w_i to w_{i+1} with the gradient
           Ê_{τ_i}[∇_w log(D_w(s, a))] + Ê_{τ_E}[∇_w log(1 − D_w(s, a))]
4:     Take a policy step from θ_i to θ_{i+1} (TRPO rule) with cost function log D_{w_{i+1}}(s, a), using the gradient
           Ê_{τ_i}[∇_θ log π_θ(a|s) Q(s, a)] − λ∇_θ H(π_θ),
       where Q(s̄, ā) = Ê_{τ_i}[log D_{w_{i+1}}(s, a) | s_0 = s̄, a_0 = ā]
5: end for
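To ground step 3 of Algorithm 1, the following PyTorch sketch performs one Adam update of the Discriminator on a batch of expert and generated state-action pairs; the network architecture, feature dimensions, and learning rate are illustrative assumptions, and the TRPO policy step is only indicated in a comment.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 7, 35   # 7 state features, 35 one-hot encoded actions (assumed encoding)

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as generated (label 1) or expert/real (label 0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
            nn.Linear(64, 1))                      # outputs the logit of D(s, a)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

disc = Discriminator()
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_s, expert_a, gen_s, gen_a):
    """One Adam update of the Discriminator (cf. step 3 of Algorithm 1)."""
    gen_logits = disc(gen_s, gen_a)
    exp_logits = disc(expert_s, expert_a)
    loss = bce(gen_logits, torch.ones_like(gen_logits)) + \
           bce(exp_logits, torch.zeros_like(exp_logits))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    # The policy (Generator) is then updated with TRPO using cost c(s, a) = log D(s, a).
    return loss.item()
```

Labelling generated pairs as 1 and expert pairs as 0 matches the convention c(s, a) = log D(s, a) used above, so that low Discriminator outputs correspond to expert-like behaviour.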
4 Experiments
In this section, we introduce the experimental setup of our Sim-GAIL method
and the two baseline methods that serve as comparators.
4.1 Sim-GAIL
In order to simulate the real student learning behaviour on a real platform,
we build a simulator, to play back the real student learning trajectories from
EdNet, selected using a stochastic policy. Specifically, we first sample the real
student trajectories from EdNet. The state includes 'correct so far', 'av time',
'av fam', 'topic fam', 'prev correct', 'steps in part', and 'lects consumed'. Then,
a subset of the trajectories is randomly picked and controlled with the pol-
icy. After that, for each student's trajectory, a set of action-state pairs is
extracted under the observation policy. The policy outputs a student action
in response to the state features at each timestamp. In this way, we create
the simulation that represents the ground-truth policy, on which the other
methods are trained.
For the experimental setup, we use an auto-encoder to process the data.
Sim-GAIL is implemented using the PyTorch framework. We train the model
on the seven features mentioned above, using 1,000 students' interaction
logs.
4.2 Baseline Models
Among the few studies that could be selected as baseline methods, the
current top performers are the Behavioural Cloning-based
method proposed by Torabi [48] and the Reinforcement Learning-based
method proposed by Kumar [49]. Therefore, we use these two methods as the
baselines for the experiments.
4.2.1 Behavioural Cloning (BC)
The first baseline is the Behavioural Cloning (BC) based method, proposed by
Torabi [48]. This model has shown good performance in the task of simulating
users' behaviour from observations. Similarly, we employ a Mixture Regression
(MR) approach [50], which models a Gaussian mixture over the actions and states,
to process the data features. For fairness of comparison, we use the same
action-state pair data to train the Sim-GAIL and BC methods, with data
extracted from EdNet (see Section 4.1). The policy is trained with supervised
learning using Adam optimisation [47], with a batch size of 128.
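A minimal supervised training loop for such a behavioural-cloning policy, with a batch size of 128 and Adam optimisation, could look as follows; the MLP architecture and learning rate are assumptions, and the Mixture Regression feature processing is not shown.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

STATE_DIM, N_ACTIONS = 7, 35   # assumed encoding of the EdNet features/actions

# A behavioural-cloning policy: a supervised mapping from state features to actions.
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_bc(states: torch.Tensor, actions: torch.Tensor, epochs: int = 10):
    """Fit the policy on expert state-action pairs; `actions` is a LongTensor of class indices."""
    loader = DataLoader(TensorDataset(states, actions), batch_size=128, shuffle=True)
    for _ in range(epochs):
        for s, a in loader:
            loss = loss_fn(policy(s), a)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return policy
```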
4.2.2 Reinforcement Learning (RL)
The second baseline is the Reinforcement Learning (RL) based method pro-
posed by Kumar [49], which uses the Conservative Q-learning (CQL) approach.
EdNet does not contain any students’ prior- or post-test scores. Hence, we use
the method proposed by Azhar et al. [51] to build a reward function, based on
the historical logs of students’ scores. More specifically, we use the correctness
of the students’ responses as the reward function. If a student’s response is
correct, a positive reward will be given; otherwise, a negative reward will be
provided. Moreover, we integrate the difficulty levels of the questions. We set
the rewards from 1 to 4, based on the difficulty level of the activity. Thus, if
the student's responses match the correct answers, they get a positive reward
of 1 to 4; if not, they receive a negative reward of -1 to -4. The Dynamic
Programming (DP) [52] method is used to train the model. More specifically,
we utilise a Policy Iteration (PI) method to train the agent. This process could
be separated into two alternating stages: the first is evaluating the value of every
state in the finite MDP according to the current policy; the second is using
the Bellman Optimality equation [53] to improve the policy based on
the current value estimates.
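The sketch below shows the difficulty-scaled reward and a tabular policy-iteration loop of the kind described here, under the simplifying assumption of a small finite MDP with known transition probabilities; it is illustrative rather than the exact implementation.

```python
import numpy as np

def reward(response_correct: bool, difficulty: int) -> int:
    """Difficulty-scaled reward: +1..+4 for a correct answer, -1..-4 for an incorrect one;
    lectures (difficulty 0) yield 0 under this scheme."""
    assert 0 <= difficulty <= 4
    return difficulty if response_correct else -difficulty

def policy_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular policy iteration for a finite MDP.
    P[s][a] is a list of (probability, next_state) pairs and R[s][a] the immediate reward."""
    n_s, n_a = len(P), len(P[0])
    pi, V = np.zeros(n_s, dtype=int), np.zeros(n_s)
    while True:
        # Policy evaluation: estimate the value of every state under the current policy.
        while True:
            V_new = np.array([R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                              for s in range(n_s)])
            if np.abs(V_new - V).max() < tol:
                V = V_new
                break
            V = V_new
        # Policy improvement via the Bellman optimality equation.
        pi_new = np.array([int(np.argmax([R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                                          for a in range(n_a)])) for s in range(n_s)])
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```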
5 Evaluation
Our evaluation includes two parts: The first part compares the Sim-GAIL with
the two baseline models, and the second part uses Knowledge Tracing models
to evaluate the effect of the Sim-GAIL.
In the first part of the evaluation, to better evaluate Sim-GAIL and its
performance relative to traditional models, we develop our own comprehensive
evaluation framework. Since the most critical elements for a Markov Decision
Process are action, reward, and policy, as shown in Figure 1, we build a novel
framework, to evaluate the efficiency of Sim-GAIL and two baseline models
from these three aspects, respectively. In particular, we use action distri-
bution to evaluate the action, expected cumulative rewards to evaluate the
reward, and offline policy evaluation to evaluate the policy. The first metric, the action dis-
tribution, is the similarity of distributions between the generated actions and
the real actions from the historical (ground-truth) data. We compare this met-
ric amongst Sim-GAIL, the BC-based method, and the RL-based method with
the original data, by using the Kullback–Leibler divergence method, which is
generally used to measure the difference between two distributions [54]. Sec-
ond, we compare the Expected Cumulative Rewards (ECR) for each of these
three methods. Third, we use two Off-line Policy Evaluation (OPE) meth-
ods, including Importance Sampling (IS) and Fitted Q Evaluation (FQE), to
compare the policy of these three methods. Our comprehensive and nuanced
evaluation framework is aimed at delivering a more detailed and informative
assessment of Sim-GAIL and its performance relative to traditional models.
In the second part of the evaluation, we use three state-of-the-art Knowl-
edge Tracing models to evaluate Sim-GAIL, to test whether our method could
be efficaciously applied in a real-world cold-start scenario. We apply the gen-
erated data to a widely used ITS technique called knowledge tracing (KT) to
verify the effectiveness of our model. KT could be used to predict the students’
next actions, based on their historical behavioural trajectories [6]. We apply
the generated data in three state-of-the-art KT models, i.e., SSAKT, SAINT,
and LTMTL, to test if the generated data mixed with the original data could
improve their accuracy, when training on only a small set of student data.
5.1 Action Distribution Evaluation
As mentioned in Section 3.2, we obtain the action distribution of EdNet by
allocating the 35 actions into seven groups, resulting in five actions per group,
as shown in Figure 3. We can observe that actions 21, 22, 23, and 24 have
higher frequencies than other actions. This pattern also appears in the action
distribution generated by Sim-GAIL. The major difference in action distribu-
tions between the real data from EdNet and those generated by Sim-GAIL is
that action 25 (i.e., one of the lecture actions) in the latter is not close to the
average value of 0. In addition, action 26 in Sim-GAIL also exhibits a higher
frequency. Figure 6 shows the action distribution of the simulated students
generated by the RL-based method. The highest frequencies fall into groups
5 and 6, while group 6 contains most of the high-frequency actions. Unlike
the action distribution of real data, the clustering of each group cannot be
clearly identified in the action distribution of the RL-based method. Figure
7 shows the action distribution of the simulated students generated by the
Behavioural Cloning (BC) based method. Within this distribution, actions in
group 6 exhibit the highest frequencies. Figure 8 compares the action distributions amongst
the data generated by these three different student simulation methods. We
can see that the BC-based method outperforms the RL-based method in this
metric, and the action distribution of Sim-GAIL generated data is closest to
the real data’s distribution.
Moreover, we use the Kullback–Leibler divergence (KL) method to measure
how closely the action distributions generated by these three methods conform
to the real action distribution from EdNet. Table 3 shows the KL values between
the distribution of the actions generated by these three methods and that of
the real actions, respectively. The KL value between the action distribution of
the data generated by Sim-GAIL and the action distribution of the real data
(ground truth) is the lowest (0.297), which suggests that the action distribution
generated by Sim-GAIL is the closest to the real action distribution. Thus,
it performs the best in this metric. The result also shows that the BC-based
method (0.391) performs worse than Sim-GAIL but better than the RL-based
method (0.408) in this metric.
Table 3 Kullback–Leibler divergence of action distribution.
Model Sim-GAIL RL BC
KL value 0.297 0.408 0.391
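For reference, this KL comparison can be computed as in the following sketch; the additive smoothing constant and the direction of the divergence (real distribution relative to generated) are assumptions, since the exact convention is not stated.

```python
import numpy as np
from scipy.stats import entropy

def action_distribution(actions, n_actions=35):
    """Empirical distribution over the 35 action types (assuming 0-indexed action ids),
    with additive smoothing to avoid zero probabilities."""
    counts = np.bincount(actions, minlength=n_actions) + 1e-8
    return counts / counts.sum()

def kl_to_real(real_actions, generated_actions):
    """KL divergence D_KL(real || generated) between action distributions (direction assumed)."""
    p = action_distribution(real_actions)
    q = action_distribution(generated_actions)
    return entropy(p, q)   # scipy computes sum(p * log(p / q))

# Lower values indicate the generated distribution is closer to the real one (cf. Table 3).
```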
Fig. 5 Action distribution of the Sim-GAIL model.
The state ‘topic fam’ represents a student’s familiarity with the current
topic. It is an important indicator that can reflect a student’s mastery of knowl-
edge. We compare the action distribution of the state value ‘topic fam’ from
simulated students generated by three different methods, which is shown in
Figure 9. From left to right is the distribution of simulated student actions in
the state of ‘topic-fam’ generated by Sim-GAIL, RL-based method, and BC-
based method. It can be seen that data generated by the RL-based method
is the most concentrated on the highest difficulty-level actions (the darkest bar in
each figure).
Fig. 6 Action distribution of the Reinforcement Learning-based model.
Under the policy generated by the RL-based method, the agent could obtain the
highest rewards in the short term. However, the distribution of actions in the
lecture (the orange bar) is minimal. Such a distribution does not match the real
learning trajectories of students, because students need to learn new knowl-
edge through attending lectures. The BC-based method has a more even
distribution of actions across all difficulty levels. However, the distributions
of lecture actions are unstable, which is also inconsistent with the real stu-
dents' learning trajectories. The action distribution of the simulated students
generated by Sim-GAIL is the most in line with the action distribution of the real students'
trajectories, and the balance of students' actions between lectures
and questions is relatively stable. This indicates that the simulated students
generated by the Sim-GAIL method can balance the data distribution and
optimal policy to achieve a better simulation effect.
5.2 Expected Cumulative Rewards Evaluation
Expected Cumulative Rewards (ECR) represents the average of the expected
cumulative rewards under a given policy [55]. ECR could effectively reflect the
cumulative reward obtained by the method, which is a crucial indicator of the
effect of the method. The equation for computing ECR is:
\[
\mathrm{ECR} = \mathbb{E}_{s_0 \sim D,\, \pi^*}\!\left[ Q(s_0, \pi^*(s_0)) \right], \quad (4)
\]
where the Q(s_0, a) function is the 'action value' of the action a selected by
the policy π* in the initial state s_0. In this experimental setting, we set ECR to be
simply equal to the unique initial state value, ECR = V_{π*}(s_0). We calculate
the cumulative rewards for 100 rounds over 1,000 steps, starting from the initial
state. The results of the expected cumulative rewards evaluation are shown in
Figure 10, and a higher ECR indicates better performance.
Fig. 7 Action distribution of the Behavioural Cloning-based model.
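A Monte Carlo version of this ECR computation is sketched below; the env_step simulator interface and the policy callable are hypothetical names used for illustration.

```python
import numpy as np

def expected_cumulative_reward(env_step, policy, initial_state,
                               n_rounds=100, n_steps=1000):
    """Monte Carlo estimate of ECR: average return of `policy` over `n_rounds` rollouts
    of `n_steps`, starting from the initial state (as in Section 5.2).
    `env_step(state, action) -> (next_state, reward)` is an assumed simulator interface."""
    returns = []
    for _ in range(n_rounds):
        state, total = initial_state, 0.0
        for _ in range(n_steps):
            action = policy(state)
            state, reward = env_step(state, action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```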
The ECR of Sim-GAIL grows the fastest among the three methods, sug-
gesting its superior ability to accumulate rewards in the early stages of the
simulation. This rapid growth could be attributed to the generative nature
of the GAIL algorithm, which enables efficient exploration and exploitation
of the simulation environment, leading to higher rewards. After 200 steps,
Sim-GAIL’s ECR reaches a plateau at around 400, indicating that the model
has learned an optimal policy and further exploration does not significantly
increase the total rewards. This illustrates the model’s ability to converge to
an optimal solution quickly, a key advantage in scenarios where computational
resources or time are limited.
The RL method exhibits a slower ECR growth rate compared to Sim-
GAIL. This could be due to the inherent challenge in reinforcement learning
of balancing exploration and exploitation. RL eventually stabilises
at a cumulative reward of approximately 290 after 500 steps, indicating its
lower efficiency compared to Sim-GAIL. BC displays the slowest ECR growth
rate, stabilising at around 240 after 400 steps. This slower growth and lower
final ECR compared to Sim-GAIL and RL reflect the limitations of the BC
method, which may not fully capture the complex dynamics of the simulation
environment.
Fig. 8 Comparison of different models’ actions distribution.
These observations indicate that Sim-GAIL outperforms the traditional
RL and BC methods in terms of ECR growth rate and final ECR value, high-
lighting the effectiveness of the GAIL approach in this context. This superior
performance underscores the novelty and potential of our proposed Sim-GAIL
as a powerful tool for generating simulated student data for ITS training.
5.3 Offline Policy Evaluation
As a robust policy evaluation method that does not require human par-
ticipation, the Offline Policy Evaluation (OPE) is often used to evaluate
Reinforcement Learning (RL), which has shown great potential in decision-
making tasks, such as robotics [56] and games [57]. In these tasks, RL optimal
strategies could be evaluated in either the environment or the simulator.
There are various ways of evaluation, such as the maximum cumulative reward,
the optimal policy, or the score achieved in games, where a higher score
indicates better performance [58]. How-
ever, in human-participating tasks, evaluation becomes very difficult. First,
human subjectivity may lead to bias in the results. Second, the simulator can-
not consider every feature in a complex environment. Finally, experiments,
where humans are involved, may make the evaluation process expensive, time-
consuming, and resource-intensive. The OPE methods [59] were proposed to
address these problems, where the evaluation of the policy is only based on
the collected historical offline log data. They are mainly applied in scenar-
ios where online interactions involve high-risk and expensive settings, such
as stock trading, medical recommendation, and educational systems [60]. In
this paper, we employed a combination of two OPE methods: Importance
Sampling (including three variants: OIS, WIS, and PDIS) [61] and the Fitted
Q Evaluation method [62], which allows for testing the policy performance of
the three models.
Fig. 9 Action distribution of the state feature 'topic fam' from simulated students gener-
ated by three different methods. The horizontal axis is the value of 'topic fam' from 1 to 4, the
vertical axis is the normalised counts of the actions, the orange bar represents the lecture
consumption, and the blue bars represent questions, from easy to difficult. The difficulty is
represented by hue strength.
5.3.1 Importance Sampling
As one of the OPE methods, Importance Sampling (IS) is used in situations
where it is difficult to sample directly from the original data distribution. It
is a method that uses a simple and collectable distribution to calculate the
expected value of the desired distribution [61]. There are many works using IS
to evaluate a target policy (the policy derived from the RL algorithms) with
data collected under a behaviour policy (the policy used to gather the data) when dealing with
MDPs [63,64]. However, the basic IS method may suffer from high variance,
due to the huge difference between those two policies.
Fig. 10 Expected Cumulative Rewards evaluation.
In our experiment, we
used three IS methods: the general IS (i.e., Ordinary Importance Sampling
(OIS)) and two variants of the general IS, including Weighted Importance
Sampling (WIS) and Per-Decision Importance Sampling (PDIS). WIS employs
a weighted average to mitigate the variance [65]. Per-Decision Importance
Sampling modifies the sampling ratio so that each reward depends only
on the actions taken up to that timestamp [62]. The combination of the three
methods can better observe the policy distribution of the generated data.
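The three estimators can be written compactly as follows, assuming per-trajectory (or per-step) importance ratios between the target and behaviour policies are available; these are the standard textbook forms, not necessarily the exact implementation used here.

```python
import numpy as np

def ois(rhos, returns):
    """Ordinary Importance Sampling: rhos[i] is the product of per-step ratios
    pi_target(a|s) / pi_behaviour(a|s) for trajectory i, returns[i] its observed return."""
    rhos, returns = np.asarray(rhos), np.asarray(returns)
    return float(np.mean(rhos * returns))

def wis(rhos, returns):
    """Weighted Importance Sampling: normalises by the sum of the ratios to reduce variance."""
    rhos, returns = np.asarray(rhos), np.asarray(returns)
    return float(np.sum(rhos * returns) / (np.sum(rhos) + 1e-12))

def pdis(stepwise_rhos, stepwise_rewards, gamma=1.0):
    """Per-Decision IS: each reward is weighted only by the ratios of the actions taken up to
    (and including) that step. Both arguments are lists of per-trajectory, per-step arrays."""
    estimates = []
    for rhos_t, rewards_t in zip(stepwise_rhos, stepwise_rewards):
        cum_rho = np.cumprod(rhos_t)
        discounts = gamma ** np.arange(len(rewards_t))
        estimates.append(np.sum(discounts * cum_rho * rewards_t))
    return float(np.mean(estimates))
```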
Table 4 shows the results of the Importance Sampling evaluation. On the
OIS criterion, the BC-based method outperforms the RL-based method but is
worse than Sim-GAIL. On the PDIS criterion, the Sim-GAIL method out-
performs both the RL-based and BC-based methods, and the BC-based method
performs better than the RL-based method. On the WIS criterion, Sim-GAIL outperforms the other
two baseline models, and the RL-based method performs better than the BC-
based method. In summary, Sim-GAIL outperforms the
other baseline models on every criterion.
Table 4 Importance Sampling Evaluation results.
Model OIS PDIS WIS
Behavioural Cloning 6.59E+01 3.96E+01 0.970
Reinforcement Learning 3.86E-02 3.25E+05 3.841
Sim-GAIL 7.35E-02 8.07E+03 4.753
5.3.2 Fitted Q Evaluation
The FQE algorithm treats policy evaluation in an MDP as a supervised learning problem. This
method uses a function approximator to fit the Q function under a specified
policy, based on the observed transitions in the dataset [62].
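A minimal FQE sketch in this spirit is given below; the gradient-boosting regressor, discount factor, and iteration count are illustrative choices rather than the paper's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_evaluation(transitions, policy, n_iters=20, gamma=0.95):
    """Fitted Q Evaluation: regress Q_{k+1}(s, a) onto r + gamma * Q_k(s', policy(s')).
    `transitions` is a list of (s, a, r, s') tuples with vector states and integer actions."""
    S = np.array([t[0] for t in transitions])
    A = np.array([t[1] for t in transitions]).reshape(-1, 1)
    R = np.array([t[2] for t in transitions])
    S2 = np.array([t[3] for t in transitions])
    A2 = np.array([policy(s) for s in S2]).reshape(-1, 1)   # actions the evaluated policy would take

    X, X_next = np.hstack([S, A]), np.hstack([S2, A2])
    model, targets = None, R.copy()
    for _ in range(n_iters):
        model = GradientBoostingRegressor().fit(X, targets)
        targets = R + gamma * model.predict(X_next)          # Bellman backup under the policy
    return model   # Q-hat; evaluate at (s_0, policy(s_0)) for the initial state value
```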
Figure 11 shows the Fitted Q Evaluation results on the initial state. Sim-
GAIL outshines the other two methods, affirming its superior performance.
This is likely due to the strengths of the GAIL approach, which efficiently
captures the complex dynamics of the environment and generates more robust
policies. Sim-GAIL’s ISV peaks in the third epoch, indicating rapid learning
and optimisation. Despite subsequent oscillations, Sim-GAIL’s performance
consistently surpasses that of RL and BC methods, showcasing its robust-
ness and stability. The RL method exhibits better ISV performance than the
BC method. Both methods show a steady increase, with their maximum ISV
reached in the 9th epoch. However, their peak performance still falls short of
Sim-GAIL’s average level, underscoring the superior efficiency and effective-
ness of Sim-GAIL. In summary, Figure 11 highlights the efficacy of Sim-GAIL
in terms of policy quality and learning speed, as evidenced by its superior Fit-
ted Q Evaluation results. This underscores the potential of Sim-GAIL as an
efficient and robust approach for generating simulated student data for ITS
training.
Figure 12 shows the FQE loss of the three methods. Sim-GAIL’s FQE
loss increases rapidly, peaking in the third and fourth epochs. It then swiftly
declines but starts to ascend again after the fifth epoch. This rapid fluctuation
reflects the model’s active learning and adaptation process. In contrast, RL and
BC methods exhibit relatively stable, slower FQE loss growth. In particular,
RL shows moderate growth, while BC displays the slowest growth. This slower
and more stable growth could be indicative of a more conservative learning
process compared to Sim-GAIL.
Despite generating the highest Q(s0, π(s0)) values, Sim-GAIL also incurs
higher and less stable validation loss compared to the RL and BC methods.
This suggests that while Sim-GAIL is efficient in learning and optimising the
policy, it may overfit the training data, leading to higher validation loss. While
Sim-GAIL outperforms RL and BC methods overall, the results also indicate
a need for parameter tuning to reduce the loss, highlighting an area for further
improvement in Sim-GAIL’s implementation.
In summary, Figure 12 underscores the dynamic and efficient learning capa-
bility of Sim-GAIL, as well as the need for further tuning to optimise its
performance. Despite the higher and less stable validation loss, Sim-GAIL’s
overall superiority in generating higher Q-values reaffirms its potential as
a robust tool for generating simulated student data for Intelligent Tutoring
System training.
Fig. 11 Initial State Value Estimate of the FQE.
Fig. 12 The FQE-loss.
5.4 Evaluation using Knowledge Tracing (KT) Models
Knowledge Tracing (KT) is an emerging research direction and has been widely
applied in intelligent educational applications, where students’ historical tra-
jectories are used to model and predict their knowledge states [31]. However,
the lack of student interaction data in the early stage of using a system, known
as the cold-start problem, limits the performance of KT models. It has been
a massive obstacle to the development and application of KT. In this experi-
ment, we applied the original data and the data generated from the Sim-GAIL
method to the state-of-the-art KT models to test whether our model could
improve the performance of KT models in a cold-start scenario. This in turn
demonstrates the ability of our proposed Sim-GAIL method to simulate
and generate students' historical trajectory data.
In the KT research area, there is a Riiid Answer Correctness Prediction
Competition on Kaggle (https://www.kaggle.com/code/datakite/riiid-answer-correctness), which compares the state-of-the-art KT models using
the EdNet dataset. The current top three models in this competition are
SAINT, SSAKT, and LTMTL (see the EdNet leaderboard at http://ednet-leaderboard.s3-website-ap-northeast-1.amazonaws.com). The prediction competition provides a dataset
of 2,500 students to train the KT model. We therefore assume that a volume
of 2,500 students is sufficient for KT models to achieve good prediction perfor-
mance. Thus, in our experiments, we considered data sizes of
no more than 2,500 and selected datasets of 500, 1,000, 1,500,
2,000, and 2,500 student records. Each student record contains the student's
sequence of discrete learning actions. In our experiment, we first used Sim-
GAIL to generate simulated data whose size is equal to the original data size,
and then we mixed it with the original real data to build a new dataset. After
that, we fed this mixed dataset into the 3 KT models, respectively. For exam-
ple, in the case of the original data size being equal to 500, we input the 500
student records to Sim-GAIL, which generated equally-sized (i.e., 500) sim-
ulated student records. Then, we mixed these 500 generated student records
with the original 500 student records, to build a new dataset of size 1,000. This
new mixed dataset was finally used to train the KT models. We compared the
performance of the KT models between using this mixed dataset and using
only the original data. The metric we used here is AUC.
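The mixing-and-evaluation protocol just described can be summarised by the following sketch, in which kt_model.fit, kt_model.predict_proba, and sim_gail.generate are hypothetical interfaces standing in for the actual KT and Sim-GAIL implementations.

```python
from sklearn.metrics import roc_auc_score

def evaluate_with_mixed_data(kt_model, real_records, sim_gail, test_records):
    """Cold-start evaluation protocol of Section 5.4 (sketched): generate as many simulated
    student records as there are real ones, train the KT model on the mixed set, report AUC."""
    simulated = sim_gail.generate(n_students=len(real_records))
    mixed = real_records + simulated                 # e.g., 500 real + 500 simulated = 1,000
    kt_model.fit(mixed)
    y_true = [r['correct'] for r in test_records]    # ground-truth response correctness
    y_pred = kt_model.predict_proba(test_records)    # predicted probability of a correct response
    return roc_auc_score(y_true, y_pred)
```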
Figure 13 shows the pairwise AUC comparisons of the three KT models
trained on only the original students’ data (SAINT, SSAKT, and LTMTL; in
grey) and trained on the mixed dataset (SAINT*, SSAKT*, and LTMTL*;
in red). The curves of SSAKT* and LTMTL* are consistently higher than the
curves of SSAKT and LTMTL in all the cases, i.e., 1,000, 2,000, 3,000, 4,000,
and 5,000 sizes of the mixed dataset. The curve of SAINT* is higher than the
curve of SAINT in the cases of 1,000, 2,000, and 3,000 sizes of data. Although
the curve of SAINT* is very close to SAINT in the cases of 5,000 sizes of data,
the former still outperforms the latter. In all those three pairwise comparisons,
especially in the cases of smaller data sizes (1,000, 2,000, and 3,000), obviously,
training on mixed data (a combination of the original and generated data)
could improve the KT models. The graphical representation of these results
shows an upward trend for all KT models, demonstrating that the
accuracy of the KT models improves with more data. The curves
representing training on mixed data lie above those of
the original KT models, indicating our method's superior performance. This
suggests that the data generated by our Sim-GAIL method can help improve
the KT models, especially in cold-start scenarios, where the size of the available
data is small.
Fig. 13 Pairwise AUC comparisons of the three KT models trained on only original stu-
dents’ data (SAINT, SSAKT, LTMTL, in grey) and trained on the mixed dataset (SAINT*,
SSAKT*, LTMTL*, in red). On the horizontal axis, 500, 1,000,...,2,500 indicate that the
grey curve model uses the original dataset, and (1,000),(2,000),...,(5,000) indicate that the
red curve model uses the mixed dataset.
6 Discussions and Future Work
6.1 Result Analysis
From the results of the experiment, we observe that Sim-GAIL outperforms the
baseline methods on the metrics of Action Distribution Evaluation, Expected
Cumulative Rewards Evaluation, and Offline Policy Evaluation. The good
simulation fit may come from the fact that there is no need to define a
reward function for Sim-GAIL, compared with the other baseline models. Defining
reward functions manually that fit the real student trajec-
tories may be too complex; thus, a reward function built by algorithms instead of humans might
result in a better policy [20]. The results of the evaluation using the KT models
show that Sim-GAIL could be applied to real-world educational scenarios and
improve the efficiency of current educational technologies. More specifically,
our method could effectively alleviate the cold-start problem of KT models.
Our Sim-GAIL method outperforms the baseline models on every metric.
The RL-based method outperforms the BC-based method in terms of offline
policy evaluation. This indicates that a suitable setting of the reward function
could generate better policies. This result is also reflected in the distribution
of ‘topic fam’ actions. The policy generated by the RL-based method places
more emphasis on high-difficulty and high-reward actions. Such a policy works
well for obtaining higher cumulative rewards, but it does not match the action
distribution of real students’ trajectories. Besides, the distribution of ‘lecture’
actions whose default reward value is 0, is very small and unstable. Thus, the
action distribution generated by the RL-based method is inconsistent with the
action distribution of real students’ trajectories. The BC-based method out-
performs the RL-based method in action distribution, but is worse in offline
policy evaluation. This suggests that, although the BC-based method can ren-
der the action distribution more aligned with the real action distribution, it
is difficult to obtain a better learning policy. Therefore, Sim-GAIL is a more
advanced student simulation method than those two traditional ones. Besides,
as Sim-GAIL does not require a dedicated reward function to fit different
datasets, compared with traditional student simulation methods, our method
could be easily transferred and applied to another ITS.
In the evaluation using KT models, we apply our method to three differ-
ent state-of-the-art KT models. The results indicate that our method could
improve training efficiency in cold-start scenarios. In Figure 13, every KT
model trained on the mixed data (a combination of the original data and the
data generated by our Sim-GAIL method) performs better in each group. The
results suggest that it could improve training efficiency in small-sized data sce-
narios, proving that it could alleviate the cold-start problem in the early stages
of ITS development. For instance, in the above experiments, every KT* model
performs better when the original data size is smaller than 2,000. After the data
size is larger than 2,000, the performance of using the original dataset (KT) is
close to that of using a mixed dataset (KT*), but the KT* still outperforms
the KT.
6.2 Advantages
Our proposed model, Sim-GAIL, brings several significant advantages to the
field of student modelling for Intelligent Tutoring Systems (ITS). A fundamen-
tal strength of Sim-GAIL lies in its underlying mechanism, that of Generative
Adversarial Imitation Learning (GAIL), which endows the model with the
capacity to generate new data instances that closely resemble actual student
behaviour data. This generative modelling capability of Sim-GAIL is crucial
for creating a rich, diverse dataset needed for effective ITS training. Addition-
ally, Sim-GAIL offers a solution to a common issue encountered in the early
stages of ITS development: the cold start problem. The ability to generate
simulated student data allows Sim-GAIL to effectively tackle this problem,
accelerating the training process of ITS.
In terms of performance, Sim-GAIL has demonstrated superiority over tra-
ditional Reinforcement Learning (RL) and Behavioural Cloning (BC) based
methods across various metrics, including action distribution evaluation,
cumulative reward evaluation, and offline-policy evaluation. This implies that
Sim-GAIL can simulate student behaviours with higher accuracy and effec-
tiveness. Furthermore, the efficiency of Sim-GAIL is evident from the rapid
convergence to an optimal policy whilst simulating real student learning tra-
jectories, providing a significant advantage in scenarios where computational
resources or time are limited.
Beyond these, the scalability and generality of Sim-GAIL further enhance
its appeal. As a data-driven model, Sim-GAIL does not rely on expert knowl-
edge for defining the reward function, which contrasts with some RL-based
methods. This attribute allows Sim-GAIL to scale and generalise across
different datasets and applications, seamlessly.
In essence, Sim-GAIL represents a novel, effective, and efficient approach
to student modelling. By offering a promising tool for generating simulated
student data, Sim-GAIL contributes to enhancing the efficacy of ITS training.
6.3 Limitations
The limitations of this work mainly lie in the following aspects. First, our work
adopts a general state representation method from other studies [4,51], with which
Sim-GAIL outperforms the other baseline methods on most metrics. As discussed
in Section 3.2, the selection of state representation may impact the models’
performance. However, the experimental design of our work does not consider
the potential impact of different state combinations on various methods. Sec-
ond, in the evaluation using KT models, when a KT model
is beyond the cold-start stage and has sufficient data, increasing the amount
of simulated data may lead to a decrease in the prediction accuracy of the KT
model, which may be due to bias caused by Sim-GAIL not considering all the
features of student actions.
6.4 Future Work
While our proposed Sim-GAIL method shows promising results in student
simulation for Intelligent Tutoring Systems (ITS), there are several avenues
for future exploration and improvement.
Fine-grained Simulations: In our current implementation, Sim-GAIL
focuses on generating simulated student behaviour data at a coarse level.
Future work can explore methods to capture more fine-grained details, such
as students’ cognitive processes, metacognitive strategies, and affective states.
Incorporating these aspects could lead to more accurate and comprehensive
student modelling.
Adaptive Simulation: Currently, Sim-GAIL generates simulated student
data based on predefined models. Future research can investigate methods to
make the simulation adaptive, allowing simulated students to learn and evolve based
on feedback from the ITS. This adaptive simulation approach can provide more
dynamic and personalised student trajectories.
Transfer Learning and Generalisation: Sim-GAIL has been evaluated
on the EdNet dataset, but its generalisability to other domains and datasets
remains an open question. Future work can explore transfer learning techniques
to enhance the model’s ability to generalise across different educational con-
texts and datasets, enabling wider applicability of Sim-GAIL in various ITS
settings.
Human-In-The-Loop Simulations: Although Sim-GAIL offers an effi-
cient alternative to collecting real student data, it is crucial to acknowledge
the limitations of fully replacing human students with simulated students. Future research can investigate human-in-the-loop simulation methods, in which simulated students are combined with real students' interaction data, allowing for iterative refinement and validation of the simulated trajectories.
By pursuing these future research directions, we can further enhance Sim-
GAIL’s capabilities and contribute to the advancement of student modelling
techniques in the field of Intelligent Tutoring Systems.
7 Conclusion
In this study, we have introduced Sim-GAIL, a novel student simulation method based on the Generative Adversarial Imitation Learning (GAIL) algorithm. To the best of our knowledge, it is the first GAIL-based method for training an ITS with simulated student behaviour data, addressing both the high cost and labour intensity of collecting real student data and the cold-start problem encountered during early-stage ITS training. Sim-GAIL outperforms traditional Reinforcement Learning-based and Imitation Learning-based methods, marking a significant advance in state-of-the-art student modelling for Intelligent Tutoring Systems.
Our student simulation method, Sim-GAIL, leverages the EdNet dataset
and outperforms the baseline methods: a Reinforcement Learning method
based on Conservative Q-learning and an Imitation Learning method based
on Behavioural Cloning. We have evaluated our method from four aspects: action distribution discrepancy based on the Kullback-Leibler divergence, the reward function using Expected Cumulative Reward (ECR), and two Offline Policy Evaluation (OPE) methods, namely Importance Sampling and Fitted Q Evaluation. Our results demonstrate that Sim-GAIL outperforms the baseline models across these aspects.
Further, we have applied Sim-GAIL to state-of-the-art knowledge tracing
models and observed a noticeable improvement in their performance, especially
in cold-start scenarios. This underlines Sim-GAIL’s efficiency in simulating and
generating students’ historical trajectory data, further emphasising its novelty
and potential to contribute to the field of student modelling for Intelligent
Tutoring Systems.
Moving forward, research can explore fine-grained simulations, adaptive
simulation techniques, transfer learning and generalisation, and human-in-the-
loop simulations, to enhance Sim-GAIL’s capabilities in student modelling even
further, as discussed in Section 6. This study paves the way for these future
endeavours by providing a robust, novel method for generating simulated
student data for ITS training.
8 Declarations
8.1 Conflict of Interest
The authors declare that they have no conflicts of interest in this work.
8.2 Data Availability
The datasets analysed during the current study are available in the
EdNet repository doi.org/10.48550/arXiv.1912.03072 [12]. These datasets were
derived from the following public domain resources: github.com/riiid/ednet#properties-of-ednet.
References
[1] Zhu, X.: Machine teaching: An inverse problem to machine learning and
an approach toward optimal education. In: Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 29 (2015)
[2] Ritter, F.E., Nerb, J., Lehtinen, E., O’Shea, T.M.: In Order to Learn: How
the Sequence of Topics Influences Learning. Oxford University Press (2007)
[3] Shi, L., Cristea, A.I., Awan, M.S.K., Hendrix, M., Stewart, C.: Towards
understanding learning behavior patterns in social adaptive personalized
e-learning systems. Association for Information Systems (2013)
[4] Bassen, J., Balaji, B., Schaarschmidt, M., Thille, C., Painter, J., Zim-
maro, D., Games, A., Fast, E., Mitchell, J.C.: Reinforcement learning for
the adaptive scheduling of educational activities. In: Proceedings of the
2020 CHI Conference on Human Factors in Computing Systems, pp. 1–12
(2020)
[5] Stash, N.V., Cristea, A.I., De Bra, P.M.: Authoring of learning styles in
adaptive hypermedia : problems and solutions. In: Proceedings of the 13th
International World Wide Web Conference on Alternate Track Papers &
Posters, pp. 114–123. ACM, New York, NY, USA (2004). https://doi.org/10.1145/1013367.1013387
[6] Li, Z., Shi, L., Cristea, A., Zhou, Y., Xiao, C., Pan, Z.: SimStu-
Transformer: A transformer-based approach to simulating student
behaviour. In: International Conference on Artificial Intelligence in Edu-
cation, pp. 348–351 (2022). Springer
[7] Cristea, A.I., Okamoto, T.: Considering automatic educational validation
of computerized educational systems. In: Proceedings IEEE International
Conference on Advanced Learning Technologies, pp. 415–417. IEEE,
Madison, WI, USA (2001). https://doi.org/10.1109/ICALT.2001.943962
[8] Jarboui, F., Gruson-Daniel, C., Durmus, A., Rocchisani, V.,
Goulet Ebongue, S.-H., Depoux, A., Kirschenmann, W., Perchet,
V.: Markov decision process for MOOC users behavioral inference. In:
European MOOCs Stakeholders Summit, pp. 70–80 (2019). Springer
[9] Zimmer, M., Viappiani, P., Weng, P.: Teacher-student framework: a rein-
forcement learning approach. In: AAMAS Workshop Autonomous Robots
and Multirobot Systems (2014)
[10] Anderson, C.W., Draper, B.A., Peterson, D.A.: Behavioral cloning of
student pilots with modular neural networks. In: ICML, pp. 25–32 (2000)
[11] Schaal, S.: Is imitation learning the route to humanoid robots? Trends in
cognitive sciences 3(6), 233–242 (1999)
[12] Choi, Y., Lee, Y., Shin, D., Cho, J., Park, S., Lee, S., Baek, J., Bae, C.,
Kim, B., Heo, J.: EdNet: A large-scale hierarchical dataset in education.
In: International Conference on Artificial Intelligence in Education, pp.
69–73 (2020). Springer
[13] Shen, S., Chi, M.: Reinforcement learning: the sooner the better, or the
later the better? In: Proceedings of the 2016 Conference on User Modeling
Adaptation and Personalization, pp. 37–44 (2016)
[14] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT
press, Cambridge, MA (2018)
[15] Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for
learning dialogue strategies. In: Proceedings of the 1998 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing, ICASSP’98
(Cat. No. 98CH36181), vol. 1, pp. 201–204 (1998). IEEE
[16] Li, Z., Shi, L., Cristea, A.I., Zhou, Y.: A survey of collaborative reinforce-
ment learning: Interactive methods and design patterns. In: Designing
Interactive Systems Conference 2021, pp. 1579–1590 (2021)
[17] Hussein, A., Gaber, M.M., Elyan, E., Jayne, C.: Imitation learning: A
survey of learning methods. ACM Computing Surveys (CSUR) 50(2),
1–35 (2017)
[18] Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural
network. Advances in neural information processing systems 1 (1988)
[19] Pomerleau, D.A.: Efficient training of artificial neural networks for
autonomous navigation. Neural computation 3(1), 88–97 (1991)
[20] Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in
neural information processing systems 29 (2016)
[21] Bhattacharyya, R., Wulfe, B., Phillips, D., Kuefler, A., Morton, J.,
Senanayake, R., Kochenderfer, M.: Modeling human driving behav-
ior through generative adversarial imitation learning. arXiv preprint
arXiv:2006.06412 (2020)
[22] Ross, S., Bagnell, D.: Efficient reductions for imitation learning. In:
Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, pp. 661–668 (2010). JMLR Workshop and
Conference Proceedings
[23] Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning
and structured prediction to no-regret online learning. In: Proceed-
ings of the Fourteenth International Conference on Artificial Intelligence
and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference
Proceedings
[24] Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement
learning. In: Proceedings of the Twenty-first International Conference on
Machine Learning, p. 1 (2004)
[25] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D.,
Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances
in neural information processing systems 27 (2014)
[26] Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement
learning. In: ICML, vol. 1, p. 2 (2000)
[27] Brusilovsky, P.: Adaptive hypermedia for education and training. Adap-
tive technologies for training and education 46, 46–68 (2012)
[28] Shi, L., Al Qudah, D., Qaffas, A., Cristea, A.I.: Topolor: A social personal-
ized adaptive e-learning system. In: Carberry, S., Weibelzahl, S., Micarelli,
A., Semeraro, G. (eds.) User Modeling, Adaptation, and Personalization,
pp. 338–340. Springer, Berlin, Heidelberg (2013)
[29] Shi, L., Cristea, A.I.: Learners thrive using multifaceted open social
learner modeling. IEEE MultiMedia 23(1), 36–47 (2016). https://doi.org/10.1109/MMUL.2015.93
[30] Shi, L., Cristea, A.I., Toda, A.M., Oliveira, W.: Exploring navigation
styles in a FutureLearn MOOC. In: Kumar, V., Troussas, C. (eds.) Intelligent
Tutoring Systems, pp. 45–55. Springer, Cham (2020)
[31] Liu, Q., Shen, S., Huang, Z., Chen, E., Zheng, Y.: A survey of knowledge
tracing. arXiv preprint arXiv:2105.15106 (2021)
[32] Alharbi, K., Cristea, A.I., Okamoto, T.: Agent-based classroom environ-
ment simulation: The effect of disruptive schoolchildren’s behaviour versus
teacher control over neighbours. In: Artificial Intelligence in Education.
AIED 2021. Lecture Notes in Computer Science. Springer, Cham. (2021).
https://doi.org/10.1007/978-3-030-78270-2_8
[33] Li, Z., Shi, L., Zhou, Y., Wang, J.: Towards student behaviour simulation:
A decision transformer based approach. In: International Conference on
Intelligent Tutoring Systems, pp. 553–562 (2023). Springer
[34] Doroudi, S., Aleven, V., Brunskill, E.: Where’s the reward? International
Journal of Artificial Intelligence in Education 29(4), 568–620 (2019)
[35] Iglesias, A., Martínez, P., Aler, R., Fernández, F.: Reinforcement learning
of pedagogical policies in adaptive and intelligent educational systems.
Knowledge-Based Systems 22(4), 266–270 (2009)
[36] Yudelson, M.V., Koedinger, K.R., Gordon, G.J.: Individualized Bayesian
knowledge tracing models. In: International Conference on Artificial
Intelligence in Education, pp. 171–180 (2013). Springer
[37] Hambleton, R.K., Swaminathan, H., Rogers, H.J.: Fundamentals of Item
Response Theory vol. 2. Sage, Newbury Park, London, New Delhi (1991)
[38] Segal, A., David, Y.B., Williams, J.J., Gal, K., Shalom, Y.: Combining
difficulty ranking with multi-armed bandits to sequence educational con-
tent. In: International Conference on Artificial Intelligence in Education,
pp. 317–321 (2018). Springer
[39] Azhar, A.Z., Segal, A., Gal, K.: Optimizing representations and poli-
cies for question sequencing using reinforcement learning. International
Educational Data Mining Society (2022)
[40] Tetreault, J.R., Litman, D.J.: A reinforcement learning approach to
evaluating state representations in spoken dialogue systems. Speech
Communication 50(8-9), 683–696 (2008)
[41] Rowe, J., Pokorny, B., Goldberg, B., Mott, B., Lester, J.: Toward simu-
lated students for reinforcement learning-driven tutorial planning in GIFT.
In: Sottilare, R. (ed.) Proceedings of the 5th Annual GIFT Users Symposium,
Orlando, FL (2017)
[42] Chi, M., VanLehn, K., Litman, D.: Do micro-level tutorial decisions
matter: Applying reinforcement learning to induce pedagogical tutorial
tactics. In: International Conference on Intelligent Tutoring Systems, pp.
224–234 (2010). Springer
[43] Beck, J., Woolf, B.P., Beal, C.R.: Advisor: A machine learning architec-
ture for intelligent tutor construction. In: AAAI/IAAI, pp. 552–557 (2000)
[44] Emond, B., Smith, J., Musharraf, M., Torbati, R.Z., Billard, R., Barnes,
J., Veitch, B.: Development of ais using simulated learners, bayesian net-
works and knowledge elicitation methods. In: International Conference on
Human-Computer Interaction, pp. 143–158 (2022). Springer
[45] Shen, S., Chi, M.: Aim low: Correlation-based feature selection for model-
based reinforcement learning. International Educational Data Mining
Society (2016)
[46] Ho, J., Gupta, J., Ermon, S.: Model-free imitation learning with pol-
icy optimization. In: International Conference on Machine Learning, pp.
2760–2769 (2016). PMLR
[47] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980 (2014)
[48] Torabi, F., Warnell, G., Stone, P.: Behavioral cloning from observation.
arXiv preprint arXiv:1805.01954 (2018)
[49] Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative q-learning for
offline reinforcement learning. Advances in Neural Information Processing
Systems 33, 1179–1191 (2020)
[50] Lefèvre, S., Sun, C., Bajcsy, R., Laugier, C.: Comparison of paramet-
ric and non-parametric approaches for vehicle speed prediction. In: 2014
American Control Conference, pp. 3494–3499 (2014). IEEE
[51] Azhar, Z.A.Z.: Designing an offline reinforcement learning based peda-
gogical agent with a large scale educational dataset. Master of Science
Thesis, Data Science, University of Edinburgh (2021)
[52] Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement
Learning and Dynamic Programming Using Function Approximators.
CRC Press, Boca Raton, FL (2010)
[53] Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspec-
tive on reinforcement learning. In: International Conference on Machine
Learning, pp. 449–458 (2017). PMLR
[54] Hershey, J.R., Olsen, P.A.: Approximating the kullback leibler divergence
between gaussian mixture models. In: 2007 IEEE International Conference
on Acoustics, Speech and Signal Processing-ICASSP’07, vol. 4, p. 317
(2007). IEEE
[55] Voloshin, C., Le, H.M., Jiang, N., Yue, Y.: Empirical study of off-
policy policy evaluation for reinforcement learning. arXiv preprint
arXiv:1911.06854 (2019)
[56] Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea,
J.A., Solowjow, E., Levine, S.: Residual reinforcement learning for robot
control. In: 2019 International Conference on Robotics and Automation
(ICRA), pp. 6023–6029 (2019). IEEE
[57] Lapan, M.: Deep Reinforcement Learning Hands-On: Apply Modern
RL Methods, with Deep Q-networks, Value Iteration, Policy Gradients,
TRPO, AlphaGo Zero and More. Packt Publishing, Ltd. (2018). https://doi.org/10.5555/3279266
[58] Weaver, L., Tao, N.: The optimal reward baseline for gradient-based
reinforcement learning. arXiv preprint arXiv:1301.2315 (2013)
[59] Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., Popovic, Z.: Offline policy
evaluation across representations with applications to educational games.
In: AAMAS, vol. 1077 (2014)
[60] Saito, Y., Udagawa, T., Kiyohara, H., Mogi, K., Narita, Y., Tateno, K.:
Evaluating the robustness of off-policy evaluation. In: Fifteenth ACM
Conference on Recommender Systems, pp. 114–123 (2021)
[61] Tokdar, S.T., Kass, R.E.: Importance sampling: a review. Wiley Interdis-
ciplinary Reviews: Computational Statistics 2(1), 54–60 (2010)
[62] Tirinzoni, A., Salvini, M., Restelli, M.: Transfer of samples in policy
search via multiple importance sampling. In: International Conference on
Machine Learning, pp. 6264–6274 (2019). PMLR
[63] Shelton, C.R.: Importance sampling for reinforcement learning with
multiple objectives (2001)
[64] Ju, S., Shen, S., Azizsoltani, H., Barnes, T., Chi, M.: Importance sampling
to identify empirically valid policies and their critical decisions. In: EDM
(Workshops), pp. 69–78 (2019)
[65] Mahmood, A.R., Van Hasselt, H.P., Sutton, R.S.: Weighted impor-
tance sampling for off-policy learning with linear function approximation.
Advances in Neural Information Processing Systems 27 (2014)