

Sim-GAIL: A Generative Adversarial Imitation Learning Approach of Student Modelling for Intelligent Tutoring Systems

Zhaoxing Li1, Lei Shi1,2*, Jindi Wang1, Alexandra I. Cristea1 and Yunzhan Zhou1

1Department of Computer Science, Durham University, Durham, UK.
2Open Lab, School of Computing, Newcastle University, Newcastle Upon Tyne, UK.

*Corresponding author(s). E-mail(s): lei.shi@newcastle.ac.uk;
Contributing authors: zhaoxing.li2@durham.ac.uk; jindi.wang@durham.ac.uk; alexandra.i.cristea@durham.ac.uk; yunzhan.zhou@durham.ac.uk;

Abstract

The continuous application of Artificial Intelligence (AI) technologies in online education has led to significant progress, especially in the field of Intelligent Tutoring Systems (ITS), online courses and learning management systems (LMS). An important research direction of the field is to provide students with customised learning trajectories via student modelling. Previous studies have shown that customisation of learning trajectories could effectively improve students' learning experiences and outcomes. However, training an ITS that can customise students' learning trajectories suffers from cold-start, time-consumption, labour-intensity, and cost problems. One feasible approach is to simulate real students' behaviour trajectories through algorithms, to generate data that could be used to train the ITS. Nonetheless, implementing high-accuracy student modelling methods that effectively address these issues remains an ongoing challenge. Traditional simulation methods, in particular, encounter difficulties in ensuring the quality and diversity of the generated data, thereby limiting their capacity to provide ITS with high-fidelity and diverse training data. We thus propose Sim-GAIL, a novel student modelling method based on Generative Adversarial Imitation Learning (GAIL). To the best of our knowledge, it is the first method using GAIL to address the challenge of lacking training data, resulting from the issues mentioned above. We analyse and compare the performance of Sim-GAIL with two traditional Reinforcement Learning-based and Imitation Learning-based methods using action distribution evaluation, cumulative reward evaluation, and offline-policy evaluation. The experiments demonstrate that our method outperforms traditional ones on most metrics. Moreover, we apply our method to a domain plagued by the cold-start problem, Knowledge Tracing (KT), and the results show that our novel method could effectively improve the KT model's prediction accuracy in a cold-start scenario.

Keywords: Student Modelling, Generative Adversarial Imitation Learning, Intelligent Tutoring Systems.

1 Introduction

Intelligent Tutoring Systems (ITS) are increasingly incorporating Artiﬁcial

Intelligence (AI) technologies, including machine learning and deep learning,

which could eﬀectively oﬀer customised learning trajectories for each student

based on their prior knowledge and learning activities [1]. Research in cognitive

science has shown that there is a strong relationship between, amongst others,

the sequence of learning materials and learning outcomes [2]. In a traditional

online learning platform, there is only one single static linear learning trajec-

tory provided to students. In this one-size-ﬁts-all approach, students may lose

their motivation and even drop out of the course, due to anxiety or boredom

encountered in the learning process [3]. Research on customised learning tra-

jectories for students has been emerging in the ITS ﬁeld. However, developing

an ITS that can provide students with customised learning trajectories requires

a large amount of data for training the system, which is time-consuming and costly [4], and has long been known to require a large amount of manual labour from education providers (instructors, authors, etc.) [5]. Although many mature

ITSs have suﬃcient data to train algorithms, a large number of emerging ITSs

are still suﬀering from a lack of training data in the early stages of development,

also known as the cold start problem [6].

To tackle these challenges, previous studies have proposed various methods

for simulating student learning trajectories (i.e., generating massive student

learning behavioural data) that can be used to train an ITS. Early simulated

student behaviour proposals stemmed from the aim of automatic validation

of educational interventions via a sandbox method [7]. More recently, Jar-

boui et al. attempted to model student trajectory sequences into a Markov

Decision Process [8], but in real educational scenarios, only a few ITS can pro-

vide all the feature data consistent with a Markov Decision Process (e.g., the

reward function of the ITS agent). Zimmer et al. deﬁned reward functions to

build reinforcement learning agents [9] to generate student trajectories, but


this method requires building diﬀerent reward functions for diﬀerent datasets,

which makes it diﬃcult to generalise. Besides, humans’ psychological responses

to learning trajectories and reward mechanisms are diﬃcult to simulate. This

leads to circumstances where student simulation methods may not be able to

simulate student learning trajectories suﬃciently. Anderson et al. proposed a

student simulation method based on Behavioural Cloning (BC) [10], the sim-

plest form of Imitation Learning, which aims to solve the abovementioned

problems where the reward is sparse and hard to deﬁne [11]. Whilst promis-

ing, BC-based methods only learn from the few features collected in student

data, and the actions that algorithms are able to model can be very limited.

Motivated by the gap in prior literature identiﬁed above, the research

question of this paper is: How to build an eﬃcient student simulation

method that can generate massive student learning data, which can

be used for ITS training?

To answer this research question, we propose Sim-GAIL, a Generative

Adversarial Imitation Learning (GAIL) approach to student modelling. Our

Sim-GAIL method can be used to generate simulated student data to solve

the lack of data and cold start problems in ITS training.

Furthermore, to showcase its eﬃciency, we compare our Sim-GAIL with the

two main student modelling methods used in the ITS ﬁeld, the RL-based and

the BC-based student modelling approaches, using data from the very recent and

largest ITS dataset, EdNet [12]. We extract action and state features to train

the models. We analyse and compare performance using action distribution

evaluation, cumulative reward evaluation (CRE), and two oﬄine-policy eval-

uation (OPE) methods, which include Importance Sampling (IS) and Fitted

Q Evaluation (FQE). Moreover, we apply our method’s generated data in an

ITS cold-start scenario. The experimental results show that our method out-

performs the two traditional RL-based and BC-based baseline methods and

could improve the training eﬃciency of the ITS in a cold-start scenario.

The main contributions of this work lie in the following three aspects:

1. We propose Sim-GAIL, a student modelling approach, to generate simula-

tion data for ITS training.

2. It is the ﬁrst method, to the best of our knowledge, that uses Generative

Adversarial Imitation Learning (GAIL) to implement student modelling to

address the challenge of lacking training data and the cold start problem.

3. The experiments demonstrate that a trained Sim-GAIL could simulate real

student learning trajectories very well. Our method outperforms traditional

RL-based and BC-based methods on most metrics and can improve the

training eﬃciency in cold start scenarios.

Thus, the advantages of Sim-GAIL include its ability to eﬀectively gen-

erate data resembling real student behaviours, address the cold-start problem,

demonstrate superior performance on various metrics, eﬃciently converge to

an optimal policy, and oﬀer scalability and generality across diﬀerent datasets

and applications.


This paper is structured as follows. Section 2 introduces the background of reinforcement learning, imitation learning (including behavioural cloning), and student modelling. Section 3 demonstrates the dataset, data pre-processing, and model architecture. Section 4 outlines the experiments and baseline models. Section 5 discusses the evaluation methods and the experimental results based on action distribution, Offline Policy (OP) evaluation, Expected Cumulative Rewards (ECR) evaluation, and Knowledge Tracing (KT). Section 6 discusses our findings and future works. Section 7 draws conclusions.

2 Background and Literature Review

Before analysing current competitors of the proposal for student modelling for

generating training data presented in this paper, we show the current state of

the underlying methodologies: Markov decision process, reinforcement learn-

ing, imitation learning, and ﬁnally, the method at the basis of our proposal,

generative adversarial imitation learning.

2.1 Markov Decision Process & Reinforcement Learning

The Markov Decision Process (MDP) is the standard method for sequential

decision-making (SDM) [13]. The sequential decision-making models can gen-

erally be seen as an instance of the Markov decision process. Reinforcement

learning is also typically regarded as an MDP [14]. Therefore, in this section,

we introduce MDP and then reinforcement learning.

2.1.1 Markov Decision Process (MDP)

MDP is a mathematical model of sequential decision used to generate stochas-

tic policies and rewards, achievable by an agent in an environment where the

system state exhibits Markov properties [15]. MDPs are represented as a set of

interacting objects, namely agents and environments, with components includ-

ing states, actions, policies, and rewards. In an MDP model, the agent observes

the present state of the environment and takes actions on the environment

in accordance with the policy, thereby changing the state of the environment

and getting rewards. The ultimate goal of the agent is to reach the maximum

cumulative reward, which is achieved using a reward function [16]. Figure 1 shows the structure of the MDP.

Fig. 1 Framework of the Markov Decision Process.

2.1.2 Reinforcement Learning (RL)

RL is a type of machine learning method that enables an agent to learn a

policy by taking diﬀerent actions in an interactive environment, in order to

maximise cumulative rewards. It could be defined as the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ is defined as the state of the environment, $\mathcal{A}$ represents actions of the agent, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ represents the transition probabilities of actions from the current state to the next state, and $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ denotes the reward function. The goal of an RL agent is to achieve maximum cumulative rewards. However, the drawback of traditional RL methods lies in their computational overhead, brought by repeated interactions between the agent and the environment.
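As a concrete illustration of this tuple, the following minimal Python sketch (a toy environment of our own, not the paper's simulator) runs the agent-environment loop of Figure 1:

import random

class ToyMDP:
    """A three-state toy environment: P and R follow the (S, A, P, R) tuple above."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        # P : S x A x S -> [0, 1], here a uniform transition regardless of action
        next_state = random.randint(0, 2)
        # R : S x A x S -> R, here a reward only for reaching state 2
        reward = 1.0 if next_state == 2 else 0.0
        self.state = next_state
        return next_state, reward

env = ToyMDP()
total = 0.0
for t in range(100):                    # repeated agent-environment interaction
    action = random.choice([0, 1])      # a trivial random policy pi(a|s)
    state, reward = env.step(action)
    total += reward                     # the agent's objective: cumulative reward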

2.2 Imitation Learning (IL)

Diﬀerent from RL, where the agent learns by interacting with the environment

to obtain the maximum rewards, IL is a method of learning policy that involves

emulating the behaviour of experts’ trajectories [17], instead of leveraging an

explicit reward function as in RL.

2.2.1 Behavioural Cloning (BC)

BC considers the learning of policy under supervised learning settings, lever-

aging state-action pairs [18,19]. Albeit simple and eﬀective, BC suﬀers from

the heavy reliance on extremely large amounts of data [20,21], without which

a distributional mismatch, often referred to as covariate shift [22,23], would

occur, due to compounding errors and stochasticity in the environment during

test time.
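To make the supervised-learning view of BC concrete, here is a minimal PyTorch sketch; the expert state-action pairs are random placeholders rather than real demonstration data:

import torch
import torch.nn as nn

expert_states = torch.randn(128, 7)            # placeholder expert states
expert_actions = torch.randint(0, 35, (128,))  # placeholder expert actions
policy = nn.Linear(7, 35)                      # maps a state to action logits

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
# BC reduces imitation to classification: fit pi(a|s) to the expert's choices
loss = nn.functional.cross_entropy(policy(expert_states), expert_actions)
opt.zero_grad(); loss.backward(); opt.step()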

2.2.2 Apprenticeship Learning (AL)

Diﬀerent from BC, AL instead tries to identify features of the expert’s trajec-

tories that are more generalisable and to ﬁnd a policy that matches the same

feature expectations with respect to the expert [24]. Its goal is to ﬁnd a pol-

icy that performs no worse than the expert across a class of cost functions.

The main limitation of AL is that it cannot imitate the expert trajectory well,

due to the restricted class of cost functions. Speciﬁcally, when the true cost

function does not lie within the cost function classes, the agent cannot be

guaranteed to outperform the expert.


2.3 Generative Adversarial Imitation Learning (GAIL)

GAIL addresses the drawbacks of RL and AL eﬀectively [20], by borrowing the

idea of Generative Adversarial Networks (GANs) [25]. It is derived from a type

of Imitation Learning, called Maximum Causal Entropy Inverse Reinforcement

Learning (MaxEntIRL) [26].

Figure 2 shows the mechanism of GAIL. Integrating GANs into imitation

learning allows for the Generator never to be exposed to real-world examples,

enabling agents to learn only from experts’ demonstrations. In GAIL, the

Discriminator is trained with the objective of distinguishing the generated

trajectories from real trajectories, while the Generator, on the other hand,

attempts to imitate the real trajectories, to fool the Discriminator into thinking

it is actually one of them.

Fig. 2 Mechanism of Generative Adversarial Imitation Learning.

2.4 Student Modelling

As the traditional one-size-ﬁts-all approach can no longer satisfy student learn-

ing needs, it leads to increased demands for customised learning [27,28].

Various student modelling methods have been proposed, which are gener-

ally classiﬁed as integrating expert knowledge-based or data-driven methods

[29,30]. Knowledge-based methods refer to utilising human knowledge to

address issues that would normally require human intelligence [7,31]. Data-

driven methods simulate students’ learning trajectories through massive

student learning records data [6,32,33].


2.4.1 Expert Knowledge-based Methods

The majority of the studies in this ﬁeld involve building diﬀerent forms of stu-

dent models, to train a reinforcement learning (RL) agent [34]. Iglesias et al.

proposed a Markov Decision Process based on expert knowledge, to train stu-

dent models [35]. Doroudi et al. suggested an RL-based agent method rooted

in cognitive theory, to optimise the sequencing of the knowledge components

(KCs) [34]. The reward function of this method is based on pre- and post-test

scores, taken as a metric, and termed Normalised Learning Gain (NLG). How-

ever, this metric needs evaluation by human participants, which is excessively

human resource-intensive. Yudelson et al. proposed a ‘Student Simulation’

method based on Bayesian Knowledge Tracing (BKT), which could train a ‘sim

student’ to imitate real students’ mastery of diﬀerent knowledge [36]. Segal et

al. suggested a student simulation method based on the Item Response Theory

(IRT) [37], which could respond to diﬀerent reactions to courses at diﬀerent

diﬃculty levels [38]. Azhar et al. [39] introduced an application of Reinforce-

ment Learning (RL) for optimising the learning sequence modelling of online

learning materials, which is an end-to-end pipeline to automatically derive

and evaluate robust representations of students’ interactions and policies for

content sequencing in online education.

2.4.2 Data-driven methods

Compared with integrating expert knowledge-based methods, data-driven

methods could better simulate real students’ learning trajectories and more

eﬀectively reduce biases [13]. There have been some studies [40–42] aiming to

build student simulation methods based on data-driven MDP approaches. For

example, Beck et al. proposed a Population Student Model (PSM) based on a

linear regression model that could simulate the probability of the student’s cor-

rect response [43]. However, this method requires a high-quality dataset from

real ITS platforms. Limited by the quantity of high-quality datasets, the previ-

ous data-driven model struggled to keep up with the expanding requirements

of ITS development. Li et al. proposed a student behaviour simulation method

based on a Decision Transformer, to generate student behaviour data for ITS

training [6,33]. Emond et al. [44] proposed an Adaptive Instructional System

(AIS) as a self-improvement system. It presented a methodological approach

that incorporates three concurrent research activities: Bayesian networks mod-

elling of learning processes, knowledge elicitation from expert instructors, and

the use of simulated learners and tutors for exploring AIS design options. On

the other hand, with the further development of ITS research, more and more

high-quality datasets, such as EdNet [12], have been published in recent years,

which can be used to achieve high-quality data-driven student simulations.

However, collecting data like the EdNet dataset is extremely time-consuming

and labour-intensive. How to improve the eﬀectiveness of ITS with small data

volumes or in a cold-start scenario is still a problem that needs to be addressed.


3 Method

In this section, we introduce the methodology for the research described in this

paper. First, we describe the EdNet dataset we use, in Section 3.1. In Section

3.2, we show how we preprocess the data in EdNet, to obtain the features we

need. We then articulate the framework of our Sim-GAIL method in Section 3.3.

3.1 Dataset

We adopt EdNet [12], the largest dataset in the ITS ﬁeld, for our experiments.

This dataset comprises students’ interaction log data with an ITS, which can be

used to extract the state and action representation. EdNet is a massive bench-

mark dataset of interactions between students and a MOOC learning platform

called SANTA1. SANTA is a TOEIC (Test of English for International Com-

munication) learning platform in South Korea, and the EdNet dataset was

collected by Riiid! AI Research2. There are 131,417,236 interaction logs col-

lected from 784,309 students in 13,169 exercises over two years, as shown in

Table 1. The interaction logs for each student are recorded in an indepen-

dent CSV (Comma-Separated Values) ﬁle. EdNet is a four-layer hierarchical

dataset, structured from KT1 to KT4, according to the granularity of inter-

active actions. KT1 only contains simple information, such as question and

answer pairs and elapsed time. Based on the information in KT1, to provide

correlation information between student behaviour and question-and-answer

sequences, EdNet adds detailed action records to KT2, such as watching video

lectures and reading articles. In KT3, actions such as choosing response options

and reviewing explanations are added to KT2, which can be used to infer the

inﬂuence of diﬀerent learning activities on students’ knowledge states. KT4

includes the ﬁnest detailed action information, such as purchasing courses,

and pausing and playing video lectures, which could be used to investigate the

impact of sparse key actions on overall learning outcomes.

Table 1 Statistics of the EdNet

Number of Interactions 131,417,236

Number of Students 784,309

Number of Exercises 13,169

3.2 Data Preprocessing

The problems involving decision-making processes are transformed into MDPs

in general [8] (see section 2.1.1). In this experiment, we view the students’

sequential decision-making trajectories as a Markov Decision Process. Extract-

ing the action space and state space of the real students’ data is essential for

1https://www.aitutorsanta.com

2https://www.riiid.co


building an eﬀective student simulation method using MDP. Next, we show

how we explore the data and extract the action space and state space.

3.2.1 Action Space

There are 13,169 questions, 1,021 lectures, and 293 kinds of skills in EdNet

[12]. However, there are no criteria for separating these courses into diﬀerent

parts. Bassen et al. [4] proposed a method to group knowledge concepts, based

on the assumption that each part was grouped by domain experts’ experience.

Inspired by this method, we divide the lectures and questions space of the

agent into 7 groups. However, as the division into 7 groups is of a too coarse

granularity for the action space, we further use the method proposed in [38],

and divide the difficulty of the questions from 1 to 4 by the answer correctness rate, obtained by comparing the students' answer logs and the correct answers.

Some lectures lack a diﬃculty ranking and are therefore assigned a default dif-

ﬁculty value of 0. Hence, all action spaces are divided into 5 diﬃculty levels,

with 7 groups, and thus 35 action types in total. Figure 3 shows the distribution of the 35 types of actions in EdNet. In each group, the action types

include 4 questions from diﬃculty levels 1 to 4, and 1 lecture. Taking Group

1, for example, actions 1 to 4 correspond to questions with diﬀerent diﬃculty

levels, and action 5 corresponds to lectures where the diﬃculty level cannot be

deﬁned, which is set as 0. As shown in Figure 3, the rest of the groups follow

this pattern.
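The resulting encoding can be written down directly. The sketch below (the function name is ours, for illustration) maps a (group, difficulty) pair to one of the 35 action indices:

def action_id(group: int, difficulty: int) -> int:
    """group in 1..7; difficulty in 1..4 for questions, 0 for lectures.
    Each group occupies five consecutive ids: four questions, then one lecture."""
    assert 1 <= group <= 7 and 0 <= difficulty <= 4
    slot = 5 if difficulty == 0 else difficulty   # the lecture is the 5th slot
    return (group - 1) * 5 + slot

# Group 1: actions 1-4 are questions of difficulty 1-4, action 5 is its lecture
assert action_id(1, 1) == 1 and action_id(1, 0) == 5 and action_id(7, 4) == 34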

Fig. 3 Analysis of the action distribution of the EdNet dataset.


Fig. 4 The Sim-GAIL Pipeline

3.2.2 State Space

EdNet records the interaction data for each student with the system, in sep-

arate CSV ﬁles, via UNIX timestamps. Therefore, most of the state features

obtained from EdNet are longitudinal and temporal. Previous works have

shown that diﬀerent state feature choices could make a large diﬀerence in

the performance of the algorithms [40,45]. We select state features that are

widely chosen in similar simulated student works [4,35,40,42]. Transitions

between these selected states represent students’ learning trajectories. Table 2

shows the features we select from EdNet: ‘correct so far’ is the proportion of

the correct answer to the number of all activities attempted; ‘av time’ is the

cumulative average of the elapsed time spent on each action; ‘av fam’ denotes

the average familiarity of the 7 groups; ‘topic fam’ denotes the familiarity with

the current group; ‘prev correct’ denotes the number of correct answers in the

previous group; and ‘steps in part’ counts student learning steps in the cur-

rent group. Compared to previous works [4,40], we select more state features,

which could potentially simulate the students’ trajectories in real situations

more eﬀectively.

Table 2 State feature representation

State Feature Description

‘correct so far’ The ratio of correct responses

‘av time’ The cumulative average of the elapsed time

‘av fam’ Average familiarity of all parts

‘topic fam’ Familiarity with the current part

‘prev correct’ Numbers of correct answers in previous responses

‘steps in part’ Counts of student learning steps

‘lects consumed’ Numbers of lectures a student has learnt
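As an illustration of how these features might be derived from a student's interaction log, the following pandas sketch computes them; the column names are assumptions for illustration and do not reproduce EdNet's exact schema:

import pandas as pd

def state_features(log: pd.DataFrame, current_group: int) -> dict:
    """log: one student's interactions so far, ordered by UNIX timestamp."""
    questions = log[log["type"] == "question"]
    in_group = log[log["group"] == current_group]
    return {
        "correct_so_far": questions["correct"].mean(),
        "av_time": log["elapsed_ms"].mean(),               # cumulative average elapsed time
        "av_fam": log.groupby("group").size().reindex(range(1, 8), fill_value=0).mean(),
        "topic_fam": len(in_group) / max(len(log), 1),     # share of activity in current group
        "prev_correct": int(questions[questions["group"] != current_group]["correct"].sum()),
        "steps_in_part": len(in_group),
        "lects_consumed": int((log["type"] == "lecture").sum()),
    }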


3.3 Sim-GAIL Model Architecture

Our Sim-GAIL model is built upon Generative Adversarial Imitation Learn-

ing (GAIL) [20], which aims to solve the problem of Imitation Learning of

having diﬃculty in dealing with constant regularisation and not being able to

match occupancy measures in large environments. Equation 1demonstrates

the optimal negative log loss, distinguishing between the pair: state πand

action πE.

ψ∗

GA (ρπ−ρπE) = max

D∈(0,1)S×A

Eπ[log(D(s, a))] + EπE[log(1 −D(s, a))],(1)

where $\psi^*_{GA}$ is the average of the real trajectories' data, and $D$ is the discriminative classifier. Using causal entropy $H$ as the policy regulariser, the following procedure can be derived:

$$\operatorname*{minimise}_{\pi} \; \psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) - \lambda H(\pi) = D_{JS}(\rho_\pi, \rho_{\pi_E}) - \lambda H(\pi). \tag{2}$$

This equation combines Imitation Learning (IL) and Generative Adversarial

Networks (GAN) [25]. Generator $S$ generates trajectories that are passed to

Discriminator D. The Generator’s goal is to make it less likely for the Dis-

criminator to diﬀerentiate the real trajectories and those generated by the

Generator, whilst the Discriminator’s goal is to distinguish between them. The

Generator achieves the best learning eﬀect when the Discriminator fails to

recognise the generated trajectories. Lastly, $\rho_{\pi_E}$ in Equation 1 is the occupancy measure of the real trajectories.

$$\mathbb{E}_{\pi}[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi) \tag{3}$$

There is a function approximation of $\pi$ and $D$. TRPO [46] is used to find a saddle point $(\pi, D)$, which decreases the value of Expression 3. To decrease the expected cost, we use the cost function $c(s, a) = \log D(s, a)$. As classified by the Discriminator, the cost function will move toward real-trajectory-like regions of the state-action space, to achieve the training goal of the Discriminator.

Figure 4 shows the pipeline of Sim-GAIL. Real student data from EdNet is processed by the methods introduced in Section 3.2 and fed into the GAIL module (middle part) to create a simulation policy that could be used for training the 'sim student' (right part). The middle part is described in Algorithm 1. We start by initialising the policy $\theta$ and Discriminator $D$. At each iteration,

we sample real student trajectories from the dataset and update the Discrimi-

nator parameters using the Adam gradient [47]. Then, we take a policy update

step using the TRPO rule, to decrease the expected cost [46]. At last, we take

a KL-constrained natural gradient step, to train the Discriminator.


Algorithm 1 Algorithm of Sim-GAIL.
Require: Real student trajectories $\tau_E \sim \pi_E$; initialise the policy $\theta$ and Discriminator $D$
1: for each $i = 0, 1, 2, \ldots$ do
2:    Sample student trajectories $\tau_i \sim \pi_{\theta_i}$
3:    Update the Discriminator parameters from $w_i$ to $w_{i+1}$ with the gradient
4:        $\hat{\mathbb{E}}_{\tau_i}[\nabla_w \log(D_w(s, a))] + \hat{\mathbb{E}}_{\tau_E}[\nabla_w \log(1 - D_w(s, a))]$
5:    Take a policy step from $\theta_i$ to $\theta_{i+1}$ with cost function $\log D_{w_{i+1}}(s, a)$, using the gradient
6:        $\hat{\mathbb{E}}_{\tau_i}[\nabla_\theta \log \pi_\theta(a|s)\, Q(s, a)] - \lambda \nabla_\theta H(\pi_\theta)$
7:        where $Q(\bar{s}, \bar{a}) = \hat{\mathbb{E}}_{\tau_i}[\log D_{w_{i+1}}(s, a) \mid s_0 = \bar{s}, a_0 = \bar{a}]$
8: end for
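A compact PyTorch rendering of this loop is sketched below. It is a simplified stand-in: placeholder tensors replace the EdNet simulator rollouts, and a vanilla policy-gradient (REINFORCE) step replaces the TRPO update used in the paper.

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, BATCH = 7, 35, 64      # 7 state features, 35 actions (Sec. 3.2)

disc = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACTION_DIM))
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
p_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

for i in range(100):
    # Sample tau_i ~ pi_theta_i (placeholder states stand in for simulator rollouts)
    s_f = torch.randn(BATCH, STATE_DIM)
    a_f = torch.distributions.Categorical(logits=policy(s_f)).sample()
    a1h_f = nn.functional.one_hot(a_f, ACTION_DIM).float()
    # Sample tau_E ~ pi_E (placeholders for real EdNet state-action pairs)
    s_r = torch.randn(BATCH, STATE_DIM)
    a1h_r = nn.functional.one_hot(torch.randint(0, ACTION_DIM, (BATCH,)), ACTION_DIM).float()

    # Discriminator update (lines 3-4): D -> 1 on generated pairs, D -> 0 on real pairs
    d_loss = bce(disc(torch.cat([s_f, a1h_f], 1)), torch.ones(BATCH, 1)) + \
             bce(disc(torch.cat([s_r, a1h_r], 1)), torch.zeros(BATCH, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Policy update (lines 5-7) with surrogate cost c(s, a) = log D(s, a)
    with torch.no_grad():
        cost = torch.log(torch.sigmoid(disc(torch.cat([s_f, a1h_f], 1))) + 1e-8).squeeze(1)
    logp = torch.distributions.Categorical(logits=policy(s_f)).log_prob(a_f)
    p_loss = (logp * cost).mean()           # REINFORCE step on the cost; TRPO in the paper
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()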

4 Experiments

In this section, we introduce the experimental setup of our Sim-GAIL method and the two baseline methods that serve as comparators.

4.1 Sim-GAIL

In order to simulate the real student learning behaviour on a real platform,

we build a simulator, to play back the real student learning trajectories from

EdNet, selected using a stochastic policy. Speciﬁcally, we ﬁrst sample the real

student trajectories from EdNet. The state includes 'correct so far', 'av time', 'av fam', 'topic fam', 'prev correct', 'steps in part', and 'lects consumed'. Then, a subset of the trajectories is randomly picked and controlled with the policy. After that, for each student's trajectory, a set of action-state pairs is extracted from the observation policy. The policy outputs a student action,

responding to the state feature at each timestamp. In this way, we created

the simulation that represents the ground-truth policy, which is used to train the other methods.

For the experimental setup, we use an auto-encoder to process the data.

Sim-GAIL is implemented using the PyTorch framework. We train the model

on the seven features mentioned before, using 1,000 students' interaction logs.

4.2 Baseline Models

Among the few studies that could be selected as baseline methods, the

current state-of-the-art top performers so far are the Behavioural Cloning

based method proposed by Torabi [48] and the Reinforcement Learning-based

method proposed by Kumar [49]. Therefore, we use these two methods as the

baselines for the experiments.


4.2.1 Behavioural Cloning (BC)

The ﬁrst baseline is the Behavioural Cloning (BC) based method, proposed by

Torabi [48]. This model has shown good performance in the task of simulating

users’ behaviour from observations. Similarly, we employ a Mixture Regression

(MR) approach [50], which is a Gaussian mixture of the actions and states,

to process the data features. For fairness of the comparison, we use the same

action-state pair data to train the Sim-GAIL and BC-methods, with data

extracted from EdNet (see section 4.1). The supervised learning method is

applied to train the policy, with Adam optimisation [47] and a batch size of 128.

4.2.2 Reinforcement Learning (RL)

The second baseline is the Reinforcement Learning (RL) based method pro-

posed by Kumar [49], which uses the Conservative Q-learning (CQL) approach.

EdNet does not contain any students’ prior- or post-test scores. Hence, we use

the method proposed by Azhar et al. [51] to build a reward function, based on

the historical logs of students’ scores. More speciﬁcally, we use the correctness

of the students’ responses as the reward function. If a student’s response is

correct, a positive reward will be given; otherwise, a negative reward will be

provided. Moreover, we integrate the diﬃculty levels of the questions. We set

the rewards from 1 to 4, based on the diﬃculty level of the activity. Thus, if

the student’s responses match the correct answers, they get a positive reward

of 1 to 4; and if not, they receive a negative reward of -1 to -4. The Dynamic

Programming (DP) [52] method is used to train the model. More speciﬁcally,

we utilise a Policy Iteration (PI) method to train the agent. This process could

be separated into two repeated stages: the ﬁrst is evaluating the value of every

state in the finite MDP according to the current policy. The second is using the Bellman Optimality equation [53] to perform the policy improvement step based on the current value estimates.
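The difficulty-weighted reward described above reduces to a few lines; this sketch is our own illustrative rendering of the rule:

def reward(correct: bool, difficulty: int) -> int:
    """+1..+4 for a correct response, -1..-4 otherwise, scaled by difficulty."""
    assert 1 <= difficulty <= 4
    return difficulty if correct else -difficulty

assert reward(True, 3) == 3 and reward(False, 2) == -2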

5 Evaluation

Our evaluation includes two parts: The ﬁrst part compares the Sim-GAIL with

the two baseline models, and the second part uses Knowledge Tracing models

to evaluate the eﬀect of the Sim-GAIL.

In the ﬁrst part of the evaluation, to better evaluate Sim-GAIL and its

performance relative to traditional models, we develop our own comprehensive

evaluation framework. Since the most critical elements for a Markov Decision

Process are action, reward, and policy, as shown in Figure 1, we build a novel

framework, to evaluate the eﬃciency of Sim-GAIL and two baseline models

from these three aspects, respectively. In particular, we identify action distri-

bution, to evaluate the action, expected cumulative rewards, to evaluate the

reward, and oﬄine policy, to evaluate the policy. The ﬁrst metric, the action dis-

tribution, is the similarity of distributions between the generated actions and


the real actions from the historical (ground-truth) data. We compare this met-

ric amongst Sim-GAIL, the BC-based method, and the RL-based method with

the original data, by using the Kullback–Leibler divergence method, which is

generally used to measure the diﬀerence between two distributions [54]. Sec-

ond, we compare the Expected Cumulative Rewards (ECR) for each of these

three methods. Third, we use two Oﬀ-line Policy Evaluation (OPE) meth-

ods, including Importance Sampling (IS) and Fitted Q Evaluation (FQE), to

compare the policy of these three methods. Our comprehensive and nuanced

evaluation framework is aimed at delivering a more detailed and informative

assessment of Sim-GAIL and its performance relative to traditional models.

In the second part of the evaluation, we use three state-of-the-art Knowl-

edge Tracing models to evaluate Sim-GAIL, to test whether our method could

be eﬃcaciously applied in a real-world cold-start scenario. We apply the gen-

erated data to a widely used ITS technique called knowledge tracing (KT) to

verify the eﬀectiveness of our model. KT could be used to predict the students’

next actions, based on their historical behavioural trajectories [6]. We apply

the generated data in three state-of-the-art KT models, i.e., SSAKT, SAINT,

and LTMTL, to test if the generated data mixed with the original data could

improve their accuracy, when training on only a small set of student data.

5.1 Action Distribution Evaluation

As mentioned in Section 3.2, we obtain the action distribution of EdNet by

allocating the 35 actions into seven groups, resulting in ﬁve actions per group,

as shown in Figure 3. We can observe that actions 21, 22, 23, and 24 have

higher frequencies than other actions. This pattern also appears in the action

distribution generated by Sim-GAIL. The major diﬀerence in action distribu-

tions between the real data from EdNet and those generated by Sim-GAIL is

that action 25 (i.e., one of the lecture actions) in the latter is not close to the

average value of 0. In addition, action 26 in Sim-GAIL also exhibits a higher

frequency. Figure 6 shows the action distribution of the simulated students

generated by the RL-based method. The highest frequencies fall into groups

5 and 6, while group 6 contains most of the high-frequency actions. Unlike

the action distribution of real data, the clustering of each group cannot be clearly identified in the action distribution of the RL-based method. Figure 7 shows the action distribution of the simulated students generated by the

Behavioural Cloning (BC) based method. Within this distribution, actions in

group 6 exhibit the highest frequencies. Figure 8 compares the action distributions amongst

the data generated by these three diﬀerent student simulation methods. We

can see that the BC-based method outperforms the RL-based method in this

metric, and the action distribution of Sim-GAIL generated data is closest to

the real data’s distribution.

Moreover, we use the Kullback–Leibler divergence (KL) method to measure

whether the action distribution generated by these three methods conforms

to the real action distribution from EdNet. Table 3 shows the KL values of


the distribution of the actions generated by these three methods and that of

the real actions, respectively. The KL value between the action distribution of

the data generated by Sim-GAIL and the action distribution of the real data

(ground truth) is the lowest (0.297), which suggests that the action distribution

generated by Sim-GAIL is the closest to the real action distribution. Thus,

it performs the best in this metric. The result also shows that the BC-based

method (0.391) performs worse than Sim-GAIL but better than the RL-based

method (0.408) in this metric.

Table 3 Kullback–Leibler divergence of action distribution.

Model Sim-GAIL RL BC

KL value 0.297 0.408 0.391
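The KL values in Table 3 can in principle be computed as in the following sketch; the smoothing constant and the direction $D_{KL}(\text{real} \| \text{generated})$ are our assumptions:

import numpy as np
from scipy.stats import entropy

def action_kl(real_counts, generated_counts, eps=1e-8):
    """KL divergence between two 35-bin action frequency distributions."""
    p = np.asarray(real_counts, dtype=float) + eps    # smooth empty bins
    q = np.asarray(generated_counts, dtype=float) + eps
    return float(entropy(p / p.sum(), q / q.sum()))   # sum_i p_i log(p_i / q_i)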

Fig. 5 Action distribution of the Sim-GAIL model.

Fig. 6 Action distribution of the Reinforcement Learning-based model.

The state ‘topic fam’ represents a student’s familiarity with the current

topic. It is an important indicator that can reﬂect a student’s mastery of knowl-

edge. We compare the action distribution of the state value ‘topic fam’ from

simulated students generated by three diﬀerent methods, which is shown in

Figure 9. From left to right are the distributions of simulated student actions in the state of 'topic fam' generated by the Sim-GAIL, RL-based, and BC-based methods. It can be seen that the data generated by the RL-based method is concentrated most heavily in the highest difficulty-level actions (the darkest bar in each figure). Within this policy generated by RL, the method could obtain the

highest rewards in the short term. However, the distribution of actions in the

lecture (the orange bar) is minimal. Such a distribution does not match the real

learning trajectories of students, because students need to learn new knowl-

edge through attending lectures. The BC-based method has a more even distribution of actions across all difficulty levels. However, the distributions

of lecture actions are unstable, which is also inconsistent with the real stu-

dents’ learning trajectories. The action distribution of the simulated student

method based on Sim-GAIL is the most in line with the real students' trajectory action distribution, and the counts of students' actions between lectures

and questions are relatively stable. This indicates that the simulated students

generated by the Sim-GAIL method can balance the data distribution and

optimal policy to achieve a better simulation eﬀect.

5.2 Expected Cumulative Rewards Evaluation

Expected Cumulative Rewards (ECR) represents the average of the expected

cumulative rewards under a given policy [55]. ECR could eﬀectively reﬂect the

cumulative reward obtained by the method, which is a crucial indicator of the

eﬀect of the method. The equation for computing ECR is:

$$ECR = \mathbb{E}_{s_0 \sim D,\, \pi^*}\left[Q(s_0, \pi^*(s_0))\right], \tag{4}$$

where the $Q(s_0, a)$ function is the 'action value' of the action $a$ selected by policy $\pi$ in the initial state $s_0$. In this experimental setting, we set ECR to be simply equal to the unique initial state value $ECR = V^{\pi^*}(s_0)$. We calculate the cumulative rewards for 100 rounds over 1,000 steps starting from the initial state. The results of the expected cumulative rewards evaluation are shown in Figure 10, and a higher ECR indicates better performance.

Fig. 7 Action distribution of the Behavioural Cloning-based model.

Fig. 10 Expected Cumulative Rewards evaluation.
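A Monte-Carlo estimate of Equation 4, following the 100-round, 1,000-step protocol above, could look like the following sketch, where env and policy are placeholders for the simulator and the evaluated policy:

def expected_cumulative_reward(env, policy, rounds=100, steps=1000):
    """Average cumulative reward of `rounds` rollouts from the initial state s0."""
    totals = []
    for _ in range(rounds):
        state, total = env.reset(), 0.0
        for _ in range(steps):
            state, reward = env.step(policy(state))
            total += reward
        totals.append(total)
    return sum(totals) / rounds          # ECR ~ E[Q(s0, pi*(s0))]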

The ECR of Sim-GAIL grows the fastest among the three methods, sug-

gesting its superior ability to accumulate rewards in the early stages of the

simulation. This rapid growth could be attributed to the generative nature

of the GAIL algorithm, which enables eﬃcient exploration and exploitation

of the simulation environment, leading to higher rewards. After 200 steps,

Sim-GAIL’s ECR reaches a plateau at around 400, indicating that the model

has learned an optimal policy and further exploration does not signiﬁcantly

increase the total rewards. This illustrates the model’s ability to converge to

an optimal solution quickly, a key advantage in scenarios where computational

resources or time are limited.

The RL method exhibits a slower ECR growth rate compared to Sim-

GAIL. This could be due to the inherent challenge in reinforcement learning

of balancing exploration and exploitation. Although RL eventually stabilises

at a cumulative reward of approximately 290 after 500 steps, this indicates its

lower eﬃciency compared to Sim-GAIL. BC displays the slowest ECR growth

rate, stabilising at around 240 after 400 steps. This slower growth and lower

ﬁnal ECR compared to Sim-GAIL and RL reﬂect the limitations of the BC

method, which may not fully capture the complex dynamics of the simulation

environment.


Fig. 8 Comparison of different models' action distributions.

These observations indicate that Sim-GAIL outperforms the traditional

RL and BC methods in terms of ECR growth rate and ﬁnal ECR value, high-

lighting the eﬀectiveness of the GAIL approach in this context. This superior

performance underscores the novelty and potential of our proposed Sim-GAIL

as a powerful tool for generating simulated student data for ITS training.

5.3 Oﬄine Policy Evaluation

As a robust policy evaluation method that does not require human par-

ticipation, the Oﬄine Policy Evaluation (OPE) is often used to evaluate

Reinforcement Learning (RL), which has shown great potential in decision-

making tasks, such as robotics [56] and games [57]. In these tasks, RL optimal

strategies could be evaluated in either the environment or the simulator.

There are various ways of evaluation, such as maximum cumulative reward, optimal policy, and the score achieved in games, where a higher score indicates better performance [58]. How-

ever, in human-participating tasks, evaluation becomes very diﬃcult. First,

human subjectivity may lead to bias in the results. Second, the simulator can-

not consider every feature in a complex environment. Finally, experiments,

where humans are involved, may make the evaluation process expensive, time-

consuming, and resource-intensive. The OPE methods [59] were proposed to

address these problems, where the evaluation of the policy is only based on

the collected historical oﬄine log data. They are mainly applied in scenar-

ios where online interactions involve high-risk and expensive settings, such

as stock trading, medical recommendation, and educational systems [60]. In

this paper, we employed a combination of two OPE methods: the Importance


Sampling (including three variants, OIS, WIS, and PDIS) [61] and the Fitted Q Evaluation method [62], which allows for testing the policy performance of the three models.

Fig. 9 Action distribution of the state feature 'topic fam' from simulated students generated by three different methods. The horizontal axis is the value of 'topic fam', from 1 to 4; the vertical axis is the normalised counts of the actions; the orange bar represents lecture consumption, and the blue bars represent questions, from easy to difficult. The difficulty is represented by hue strength.

5.3.1 Importance Sampling

As one of the OPE methods, Importance Sampling (IS) is used in situations

where it is diﬃcult to sample directly from the original data distribution. It

is a method that uses a simple and collectable distribution to calculate the

expected value of the desired distribution [61]. There are many works using IS

to evaluate the target policy (the policy derived from the RL algorithms) and

the behaviour policy (the policy used to gather the data) when dealing with

MDPs [63,64]. However, the basic IS method may suﬀer from high variance,

due to the huge difference between those two policies. In our experiment, we

used three IS methods: the general IS (i.e., Ordinary Importance Sampling

(OIS)) and two variants of the general IS, including Weighted Importance

Sampling (WIS) and Per-Decision Importance Sampling (PDIS). WIS employs

a weighted average to mitigate the variance [65]. The Per-Decision Importance

Sampling modiﬁes the sampling ratio and makes the reward dependent only

upon the previous action in each timestamp [62]. The combination of the three

methods can better observe the policy distribution of the generated data.
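The three estimators can be summarised in a few lines each. In the sketch below, a trajectory is assumed to be a list of (pe, pb, r) triples: the target-policy and behaviour-policy probabilities of the logged action, and the reward.

import numpy as np

def ois(trajs):
    """Ordinary IS: weight each trajectory's return by its full likelihood ratio."""
    return float(np.mean([np.prod([pe / pb for pe, pb, _ in t]) *
                          sum(r for _, _, r in t) for t in trajs]))

def wis(trajs):
    """Weighted IS: normalise by the sum of ratios to reduce variance."""
    w = [np.prod([pe / pb for pe, pb, _ in t]) for t in trajs]
    g = [sum(r for _, _, r in t) for t in trajs]
    return float(np.dot(w, g) / (np.sum(w) + 1e-12))

def pdis(trajs):
    """Per-decision IS: the reward at step t is weighted by ratios up to t only."""
    vals = []
    for t in trajs:
        rho, total = 1.0, 0.0
        for pe, pb, r in t:
            rho *= pe / pb
            total += rho * r
        vals.append(total)
    return float(np.mean(vals))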

Table 4 shows the results of the Importance Sampling evaluation. On the

OIS criteria, the BC-based method outperforms the RL-based method but is

worse than Sim-GAIL. On the PDIS criteria, the Sim-GAIL method out-

performs both RL-based and BC-based methods and the BC-based method

performs better than the RL-based method. Sim-GAIL outperforms the other

two baseline models, and the RL-based method performs better than the BC-

based method on the WIS criteria. In summary, Sim-GAIL outperforms the

other baseline models on every criterion.

Table 4 Importance Sampling Evaluation results.

Model OIS PDIS WIS

Behavioural Cloning 6.59E+01 3.96E+01 0.970

Reinforcement Learning 3.86E-02 3.25E+05 3.841

Sim-GAIL 7.35E-02 8.07E+03 4.753


5.3.2 Fitted Q Evaluation

The FQE algorithm regards the MDP as a supervised learning problem. This

method uses a function approximator to ﬁt the Q function under a speciﬁed

policy, based on the observation of the dataset [62].
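In outline, FQE alternates regression steps toward a bootstrapped target, as in the following sketch; q_net (a network taking a state-action pair) and pi (the evaluated policy) are placeholders:

import torch
import torch.nn as nn

def fitted_q_evaluation(q_net, transitions, pi, epochs=10, gamma=0.99):
    """transitions: logged (s, a, r, s_next) tensors from the offline dataset."""
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    for _ in range(epochs):
        for s, a, r, s_next in transitions:
            with torch.no_grad():                        # bootstrapped regression target
                target = r + gamma * q_net(s_next, pi(s_next))
            loss = nn.functional.mse_loss(q_net(s, a), target)
            opt.zero_grad(); loss.backward(); opt.step()
    return q_net                  # read off the initial state value as q_net(s0, pi(s0))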

Figure 11 shows the Fitted Q Evaluation results on the initial state. Sim-

GAIL outshines the other two methods, aﬃrming its superior performance.

This is likely due to the strengths of the GAIL approach, which eﬃciently

captures the complex dynamics of the environment and generates more robust

policies. Sim-GAIL’s ISV peaks in the third epoch, indicating rapid learning

and optimisation. Despite subsequent oscillations, Sim-GAIL’s performance

consistently surpasses that of RL and BC methods, showcasing its robust-

ness and stability. The RL method exhibits better ISV performance than the

BC method. Both methods show a steady increase, with their maximum ISV

reached in the 9th epoch. However, their peak performance still falls short of

Sim-GAIL’s average level, underscoring the superior eﬃciency and eﬀective-

ness of Sim-GAIL. In summary, Figure 11 highlights the eﬃcacy of Sim-GAIL

in terms of policy quality and learning speed, as evidenced by its superior Fit-

ted Q Evaluation results. This underscores the potential of Sim-GAIL as an

eﬃcient and robust approach for generating simulated student data for ITS

training.

Figure 12 shows the FQE loss of the three methods. Sim-GAIL’s FQE

loss increases rapidly, peaking in the third and fourth epochs. It then swiftly

declines but starts to ascend again after the ﬁfth epoch. This rapid ﬂuctuation

reﬂects the model’s active learning and adaptation process. In contrast, RL and

BC methods exhibit relatively stable, slower FQE loss growth. In particular,

RL shows moderate growth, while BC displays the slowest growth. This slower

and more stable growth could be indicative of a more conservative learning

process compared to Sim-GAIL.

Despite generating the highest $Q(s_0, \pi(s_0))$ values, Sim-GAIL also incurs

higher and less stable validation loss compared to the RL and BC methods.

This suggests that while Sim-GAIL is eﬃcient in learning and optimising the

policy, it may overﬁt the training data, leading to higher validation loss. While

Sim-GAIL outperforms RL and BC methods overall, the results also indicate

a need for parameter tuning to reduce the loss, highlighting an area for further

improvement in Sim-GAIL’s implementation.

In summary, Figure 12 underscores the dynamic and eﬃcient learning capa-

bility of Sim-GAIL, as well as the need for further tuning to optimise its

performance. Despite the higher and less stable validation loss, Sim-GAIL’s

overall superiority in generating higher Q-values reaﬃrms its potential as

a robust tool for generating simulated student data for Intelligent Tutoring

System training.


Fig. 11 Initial State Value Estimate of the FQE.

Fig. 12 The FQE-loss.

5.4 Evaluation using Knowledge Tracing (KT) Models

Knowledge Tracing (KT) is an emerging research direction and has been widely

applied in intelligent educational applications, where students’ historical tra-

jectories are used to model and predict their knowledge states [31]. However,

the lack of student interaction data in the early stage of using a system, known

as the cold-start problem, limits the performance of KT models. It has been

one massive obstacle to the development and application of KT. In this experi-

ment, we applied the original data and the data generated from the Sim-GAIL

method to the state-of-the-art KT models to test whether our model could


improve the performance of KT models in a cold-start scenario. This, in turn, demonstrates the ability of our proposed Sim-GAIL method to simulate and generate students' historical trajectory data.

In the KT research area, there is a Riiid Answer Correctness Prediction

Competition on Kaggle3, which compares the state-of-the-art KT models using

the EdNet dataset. The current top three models in this competition are

SAINT, SSAKT, and LTMTL4. The prediction competition provides a dataset

of 2,500 students to train the KT model. We therefore assume that a volume of 2,500 students is sufficient for KT models to achieve good prediction performance, and in our experiments we considered data sizes of no more than 2,500. We selected datasets of sizes 500, 1,000, 1,500,

2,000, and 2,500 student records. Each student record contains the student’s

sequence of discrete learning actions. In our experiment, we ﬁrst used Sim-

GAIL to generate simulated data whose size is equal to the original data size,

and then we mixed it with the original real data to build a new dataset. After

that, we fed this mixed dataset into the 3 KT models, respectively. For exam-

ple, in the case of the original data size being equal to 500, we input the 500

student records to Sim-GAIL, which generated equally-sized (i.e., 500) sim-

ulated student records. Then, we mixed these 500 generated student records

with the original 500 student records, to build a new dataset of size 1,000. This

new mixed dataset was ﬁnally used to train the KT models. We compared the

performance of the KT models between using this mixed dataset and using

only the original data. The metric we used here is AUC.
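The mixing protocol can be summarised as follows; every identifier here (the sampling, generation, and training helpers and the model handles) is a placeholder stand-in for illustration rather than an API from the paper:

# Placeholder stand-ins; a real run would use EdNet records, Sim-GAIL, and KT models.
def sample_students(dataset, n): return dataset[:n]
def generate_simulated(seed_data, n): return [f"sim_{i}" for i in range(n)]
def train_and_eval(model_name, records): return 0.5   # would return a real AUC

ednet = [f"student_{i}" for i in range(2500)]          # stand-in for real records

for n in [500, 1000, 1500, 2000, 2500]:
    real = sample_students(ednet, n)                   # n original student records
    simulated = generate_simulated(real, n)            # equally sized Sim-GAIL output
    mixed = real + simulated                           # e.g. 500 real -> 1,000 mixed
    for kt_model in ["SAINT", "SSAKT", "LTMTL"]:
        auc_original = train_and_eval(kt_model, real)  # grey curves in Fig. 13
        auc_mixed = train_and_eval(kt_model, mixed)    # red (starred) curves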

Figure 13 shows the pairwise AUC comparisons of the three KT models

trained on only the original students’ data (SAINT, SSAKT, and LTMTL; in

grey) and trained on the mixed dataset (SAINT*, SSAKT*, and LTMTL*;

in red). The curves of SSAKT* and LTMTL* are consistently higher than the

curves of SSAKT and LTMTL in all the cases, i.e., 1,000, 2,000, 3,000, 4,000,

and 5,000 sizes of the mixed dataset. The curve of SAINT* is higher than the

curve of SAINT in the cases of the 1,000, 2,000, and 3,000 data sizes. Although the curve of SAINT* is very close to that of SAINT at the 5,000 data size,

the former still outperforms the latter. In all those three pairwise comparisons,

especially in the cases of smaller data sizes (1,000, 2,000, and 3,000), obviously,

training on mixed data (a combination of the original and generated data)

could improve the KT models. The graphical representation of these results shows an upward trend for all KT models, demonstrating that the accuracy of the KT models improves with more data. The lines representing training on mixed data lie above those of the original KT models, indicating our method's superior performance. This

suggests that the data generated by our Sim-GAIL method can help improve

the KT models, especially in cold-start scenarios, where the size of the available

data is small.

3https://www.kaggle.com/code/datakite/riiid-answer-correctness

4http://ednet-leaderboard.s3-website-ap-northeast-1.amazonaws.com


Fig. 13 Pairwise AUC comparisons of the three KT models trained on only original stu-

dents’ data (SAINT, SSAKT, LTMTL, in grey) and trained on the mixed dataset (SAINT*,

SSAKT*, LTMTL*, in red). On the horizontal axis, 500, 1,000, ..., 2,500 indicate that the grey curve model uses the original dataset, and (1,000), (2,000), ..., (5,000) indicate that the

red curve model uses the mixed dataset.

6 Discussions and Future Work

6.1 Result Analysis

From the results of the experiment, we observe that Sim-GAIL outperforms the

baseline methods on the metrics of Action Distribution Evaluation, Expected Cumulative Rewards Evaluation, and Offline Policy Evaluation. The satisfying simulation results may come from the fact that there is no need to define a reward function for Sim-GAIL, compared with the other baseline models. Defining


reward functions manually may be too complex to fit the real student trajectories; thus, a reward function built by algorithms instead of humans might

result in a better policy [20]. The results of the evaluation using the KT models

show that Sim-GAIL could be applied to real-world educational scenarios and

improve the eﬃciency of current educational technologies. More speciﬁcally,

our method could eﬀectively alleviate the cold-start problem of KT models.

Our Sim-GAIL method outperforms the baseline models on every metric.

The RL-based method outperforms the BC-based method in terms of oﬄine

policy evaluation. This indicates that a suitable setting of the reward function

could generate better policies. This result is also reﬂected in the distribution

of ‘topic fam’ actions. The policy generated by the RL-based method places

more emphasis on high-diﬃculty and high-reward actions. Such a policy works

well for obtaining higher cumulative rewards, but it does not match the action

distribution of real students’ trajectories. Besides, the distribution of ‘lecture’

actions whose default reward value is 0, is very small and unstable. Thus, the

action distribution generated by the RL-based method is inconsistent with the

action distribution of real students’ trajectories. The BC-based method out-

performs the RL-based method in action distribution, but is worse in oﬄine

policy evaluation. This suggests that, although the BC-based method can ren-

der the action distribution more aligned with the real action distribution, it

is diﬃcult to obtain a better learning policy. Therefore, Sim-GAIL is a more

advanced student simulation method than those two traditional ones. Besides,

as Sim-GAIL does not require a dedicated reward function to ﬁt diﬀerent

datasets, compared with traditional student simulation methods, our method

could be easily transferred and applied to another ITS.

In the evaluation using KT models, we apply our method to three diﬀer-

ent state-of-the-art KT models. The results indicate that our method could

improve training eﬃciency in cold-start scenarios. In Figure 13, every KT

model trained on the mixed data (a combination of the original data and the

data generated by our Sim-GAIL method) performs better in each group. The

results suggest that it could improve training eﬃciency in small-sized data sce-

narios, proving that it could alleviate the cold-start problem in the early stages

of ITS development. For instance, in the above experiments, every KT* model

performs better when the original data size is smaller than 2,000. After the data

size is larger than 2,000, the performance of using the original dataset (KT) is

close to that of using a mixed dataset (KT*), but the KT* still outperforms

the KT.

6.2 Advantages

Our proposed model, Sim-GAIL, brings several signiﬁcant advantages to the

ﬁeld of student modelling for Intelligent Tutoring Systems (ITS). A fundamen-

tal strength of Sim-GAIL lies in its underlying mechanism, that of Generative

Adversarial Imitation Learning (GAIL), which endows the model with the

capacity to generate new data instances that closely resemble actual student

behaviour data. This generative modelling capability of Sim-GAIL is crucial


for creating a rich, diverse dataset needed for eﬀective ITS training. Addition-

ally, Sim-GAIL oﬀers a solution to a common issue encountered in the early

stages of ITS development: the cold-start problem. The ability to generate

simulated student data allows Sim-GAIL to eﬀectively tackle this problem,

accelerating the training process of ITS.

In terms of performance, Sim-GAIL has demonstrated superiority over tra-

ditional Reinforcement Learning (RL) and Behavioural Cloning (BC) based

methods across various metrics, including action distribution evaluation,

cumulative reward evaluation, and oﬄine-policy evaluation. This implies that

Sim-GAIL can simulate student behaviours with higher accuracy and eﬀec-

tiveness. Furthermore, the eﬃciency of Sim-GAIL is evident from the rapid

convergence to an optimal policy whilst simulating real student learning tra-

jectories, providing a signiﬁcant advantage in scenarios where computational

resources or time are limited.

Beyond these, the scalability and generality of Sim-GAIL further enhance

its appeal. As a data-driven model, Sim-GAIL does not rely on expert knowl-

edge for deﬁning the reward function, which contrasts with some RL-based

methods. This attribute allows Sim-GAIL to scale and generalise seamlessly across different datasets and applications.

In essence, Sim-GAIL represents a novel, eﬀective, and eﬃcient approach

to student modelling. By oﬀering a promising tool for generating simulated

student data, Sim-GAIL contributes to enhancing the eﬃcacy of ITS training.

6.3 Limitations

The limitations of this work mainly lie in the following aspects. First, our work adopts a general state representation method from other studies [4, 51], under which Sim-GAIL outperforms the baseline methods on most metrics. As discussed in Section 3.2, the choice of state representation may affect the models' performance; however, our experimental design does not consider the potential impact of different state combinations on the various methods. Second, in the experiments evaluating with KT models, when a KT model is beyond the cold-start stage and has sufficient data, increasing the amount of simulated data may decrease the KT model's prediction accuracy, which may be due to bias introduced by Sim-GAIL not modelling all features of student actions.

6.4 Future Work

While our proposed Sim-GAIL method shows promising results in student simulation for Intelligent Tutoring Systems (ITS), there are several avenues for future exploration and improvement.

Fine-grained Simulations: In our current implementation, Sim-GAIL focuses on generating simulated student behaviour data at a coarse level. Future work can explore methods to capture more fine-grained details, such as students' cognitive processes, metacognitive strategies, and affective states. Incorporating these aspects could lead to more accurate and comprehensive student modelling.

Adaptive Simulation: Currently, Sim-GAIL generates simulated student data based on predefined models. Future research can investigate methods to make the simulation adaptive, allowing simulated students to learn and evolve based on feedback from the ITS. Such an adaptive simulation approach could provide more dynamic and personalised student trajectories.

Transfer Learning and Generalisation: Sim-GAIL has been evaluated on the EdNet dataset, but its generalisability to other domains and datasets remains an open question. Future work can explore transfer learning techniques to enhance the model's ability to generalise across different educational contexts and datasets, enabling wider applicability of Sim-GAIL in various ITS settings.

Human-In-The-Loop Simulations: Although Sim-GAIL offers an efficient alternative to collecting real student data, it is crucial to acknowledge the limitations of fully replacing human students with simulated ones. Future research can investigate human-in-the-loop simulation methods, where simulated students are combined with real student interaction data, allowing for iterative refinement and validation of the simulated trajectories.

By pursuing these future research directions, we can further enhance Sim-GAIL's capabilities and contribute to the advancement of student modelling techniques in the field of Intelligent Tutoring Systems.

7 Conclusion

In this study, we have introduced Sim-GAIL, a pioneering student simulation method founded on the Generative Adversarial Imitation Learning (GAIL) algorithm. It is the first method of its kind to train an ITS using simulated student behaviour data, effectively addressing the challenges of costly, resource-intensive real student data collection and the cold-start problem encountered during early-stage ITS training. Sim-GAIL demonstrates superior performance in comparison with traditional Reinforcement Learning-based and Imitation Learning-based methods, marking a significant advancement in state-of-the-art student modelling for Intelligent Tutoring Systems.

Our student simulation method, Sim-GAIL, leverages the EdNet dataset and outperforms the baseline methods: a Reinforcement Learning method based on Conservative Q-learning and an Imitation Learning method based on Behavioural Cloning. We have thoroughly evaluated our method from four aspects: action distribution discrepancy based on the Kullback-Leibler divergence, reward evaluation using Expected Cumulative Reward (ECR), and two Offline Policy Evaluation (OPE) methods, Importance Sampling and Fitted Q Evaluation. Our results convincingly demonstrate that Sim-GAIL outperforms the baseline models in all these aspects.
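As a toy illustration of two of these quantities (the numbers below are made up, not results from the paper): the KL divergence compares the simulated action distribution with the real one, and per-trajectory Importance Sampling reweights logged returns by the ratio of target-policy to behaviour-policy action probabilities.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

# Action distribution discrepancy: lower KL means the simulated students'
# action frequencies are closer to the real students' (illustrative values).
p_real = np.array([0.40, 0.25, 0.20, 0.15])
p_sim = np.array([0.38, 0.27, 0.19, 0.16])
kl = entropy(p_real, p_sim)

# Per-trajectory Importance Sampling estimate of the target policy's return:
# V_IS = (prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)) * sum_t gamma^t * r_t
def is_estimate(pi_e, pi_b, rewards, gamma=0.99):
    ratio = np.prod(np.asarray(pi_e) / np.asarray(pi_b))
    discount = gamma ** np.arange(len(rewards))
    return ratio * float(np.dot(discount, rewards))

print(kl, is_estimate(pi_e=[0.9, 0.8], pi_b=[0.7, 0.6], rewards=[1.0, 0.5]))
```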

Further, we have applied Sim-GAIL to state-of-the-art knowledge tracing models and observed a noticeable improvement in their performance, especially in cold-start scenarios. This underlines Sim-GAIL's efficiency in simulating and generating students' historical trajectory data, further emphasising its novelty and potential to contribute to the field of student modelling for Intelligent Tutoring Systems.

Moving forward, research can explore fine-grained simulations, adaptive simulation techniques, transfer learning and generalisation, and human-in-the-loop simulations, to enhance Sim-GAIL's capabilities in student modelling even further, as discussed in Section 6. This study paves the way for these future endeavours by providing a robust, novel method for generating simulated student data for ITS training.

8 Declarations

8.1 Conflict of Interest

The authors declare that they have no conflicts of interest in this work.

8.2 Data Availability

The datasets analysed during the current study are available in the EdNet repository, doi.org/10.48550/arXiv.1912.03072 [12]. These datasets were derived from the following public domain resources: github.com/riiid/ednet#properties-of-ednet.

References

[1] Zhu, X.: Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29 (2015)
[2] Ritter, F.E., Nerb, J., Lehtinen, E., O'Shea, T.M.: In Order to Learn: How the Sequence of Topics Influences Learning. Oxford University Press (2007)
[3] Shi, L., Cristea, A.I., Awan, M.S.K., Hendrix, M., Stewart, C.: Towards understanding learning behavior patterns in social adaptive personalized e-learning systems. Association for Information Systems (2013)
[4] Bassen, J., Balaji, B., Schaarschmidt, M., Thille, C., Painter, J., Zimmaro, D., Games, A., Fast, E., Mitchell, J.C.: Reinforcement learning for the adaptive scheduling of educational activities. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2020)
[5] Stash, N.V., Cristea, A.I., De Bra, P.M.: Authoring of learning styles in adaptive hypermedia: problems and solutions. In: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pp. 114–123. ACM, New York, NY, USA (2004). https://doi.org/10.1145/1013367.1013387
[6] Li, Z., Shi, L., Cristea, A., Zhou, Y., Xiao, C., Pan, Z.: SimStu-Transformer: A transformer-based approach to simulating student behaviour. In: International Conference on Artificial Intelligence in Education, pp. 348–351. Springer (2022)
[7] Cristea, A.I., Okamoto, T.: Considering automatic educational validation of computerized educational systems. In: Proceedings IEEE International Conference on Advanced Learning Technologies, pp. 415–417. IEEE, Madison, WI, USA (2001). https://doi.org/10.1109/ICALT.2001.943962
[8] Jarboui, F., Gruson-Daniel, C., Durmus, A., Rocchisani, V., Goulet Ebongue, S.-H., Depoux, A., Kirschenmann, W., Perchet, V.: Markov decision process for MOOC users behavioral inference. In: European MOOCs Stakeholders Summit, pp. 70–80. Springer (2019)
[9] Zimmer, M., Viappiani, P., Weng, P.: Teacher-student framework: a reinforcement learning approach. In: AAMAS Workshop Autonomous Robots and Multirobot Systems (2014)
[10] Anderson, C.W., Draper, B.A., Peterson, D.A.: Behavioral cloning of student pilots with modular neural networks. In: ICML, pp. 25–32 (2000)
[11] Schaal, S.: Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3(6), 233–242 (1999)
[12] Choi, Y., Lee, Y., Shin, D., Cho, J., Park, S., Lee, S., Baek, J., Bae, C., Kim, B., Heo, J.: EdNet: A large-scale hierarchical dataset in education. In: International Conference on Artificial Intelligence in Education, pp. 69–73. Springer (2020)
[13] Shen, S., Chi, M.: Reinforcement learning: the sooner the better, or the later the better? In: Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, pp. 37–44 (2016)
[14] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (2018)
[15] Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for learning dialogue strategies. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98, vol. 1, pp. 201–204. IEEE (1998)
[16] Li, Z., Shi, L., Cristea, A.I., Zhou, Y.: A survey of collaborative reinforcement learning: Interactive methods and design patterns. In: Designing Interactive Systems Conference 2021, pp. 1579–1590 (2021)
[17] Hussein, A., Gaber, M.M., Elyan, E., Jayne, C.: Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR) 50(2), 1–35 (2017)
[18] Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems 1 (1988)
[19] Pomerleau, D.A.: Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3(1), 88–97 (1991)
[20] Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016)
[21] Bhattacharyya, R., Wulfe, B., Phillips, D., Kuefler, A., Morton, J., Senanayake, R., Kochenderfer, M.: Modeling human driving behavior through generative adversarial imitation learning. arXiv preprint arXiv:2006.06412 (2020)
[22] Ross, S., Bagnell, D.: Efficient reductions for imitation learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661–668. JMLR Workshop and Conference Proceedings (2010)
[23] Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. JMLR Workshop and Conference Proceedings (2011)
[24] Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 1 (2004)
[25] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014)
[26] Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000)
[27] Brusilovsky, P.: Adaptive hypermedia for education and training. Adaptive Technologies for Training and Education 46, 46–68 (2012)
[28] Shi, L., Al Qudah, D., Qaffas, A., Cristea, A.I.: Topolor: A social personalized adaptive e-learning system. In: Carberry, S., Weibelzahl, S., Micarelli, A., Semeraro, G. (eds.) User Modeling, Adaptation, and Personalization, pp. 338–340. Springer, Berlin, Heidelberg (2013)
[29] Shi, L., Cristea, A.I.: Learners thrive using multifaceted open social learner modeling. IEEE MultiMedia 23(1), 36–47 (2016). https://doi.org/10.1109/MMUL.2015.93
[30] Shi, L., Cristea, A.I., Toda, A.M., Oliveira, W.: Exploring navigation styles in a FutureLearn MOOC. In: Kumar, V., Troussas, C. (eds.) Intelligent Tutoring Systems, pp. 45–55. Springer, Cham (2020)
[31] Liu, Q., Shen, S., Huang, Z., Chen, E., Zheng, Y.: A survey of knowledge tracing. arXiv preprint arXiv:2105.15106 (2021)
[32] Alharbi, K., Cristea, A.I., Okamoto, T.: Agent-based classroom environment simulation: The effect of disruptive schoolchildren's behaviour versus teacher control over neighbours. In: Artificial Intelligence in Education. AIED 2021. Lecture Notes in Computer Science. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-78270-2_8
[33] Li, Z., Shi, L., Zhou, Y., Wang, J.: Towards student behaviour simulation: A decision transformer based approach. In: International Conference on Intelligent Tutoring Systems, pp. 553–562. Springer (2023)
[34] Doroudi, S., Aleven, V., Brunskill, E.: Where's the reward? International Journal of Artificial Intelligence in Education 29(4), 568–620 (2019)
[35] Iglesias, A., Martínez, P., Aler, R., Fernández, F.: Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems. Knowledge-Based Systems 22(4), 266–270 (2009)
[36] Yudelson, M.V., Koedinger, K.R., Gordon, G.J.: Individualized Bayesian knowledge tracing models. In: International Conference on Artificial Intelligence in Education, pp. 171–180. Springer (2013)
[37] Hambleton, R.K., Swaminathan, H., Rogers, H.J.: Fundamentals of Item Response Theory, vol. 2. Sage, Newbury Park, London, New Delhi (1991)
[38] Segal, A., David, Y.B., Williams, J.J., Gal, K., Shalom, Y.: Combining difficulty ranking with multi-armed bandits to sequence educational content. In: International Conference on Artificial Intelligence in Education, pp. 317–321. Springer (2018)
[39] Azhar, A.Z., Segal, A., Gal, K.: Optimizing representations and policies for question sequencing using reinforcement learning. International Educational Data Mining Society (2022)
[40] Tetreault, J.R., Litman, D.J.: A reinforcement learning approach to evaluating state representations in spoken dialogue systems. Speech Communication 50(8-9), 683–696 (2008)
[41] Rowe, J., Pokorny, B., Goldberg, B., Mott, B., Lester, J.: Toward simulated students for reinforcement learning-driven tutorial planning in GIFT. In: Sottilare, R. (ed.) Proceedings of the 5th Annual GIFT Users Symposium, Orlando, FL (2017)
[42] Chi, M., VanLehn, K., Litman, D.: Do micro-level tutorial decisions matter: Applying reinforcement learning to induce pedagogical tutorial tactics. In: International Conference on Intelligent Tutoring Systems, pp. 224–234. Springer (2010)
[43] Beck, J., Woolf, B.P., Beal, C.R.: ADVISOR: A machine learning architecture for intelligent tutor construction. In: AAAI/IAAI, pp. 552–557 (2000)
[44] Emond, B., Smith, J., Musharraf, M., Torbati, R.Z., Billard, R., Barnes, J., Veitch, B.: Development of AIS using simulated learners, Bayesian networks and knowledge elicitation methods. In: International Conference on Human-Computer Interaction, pp. 143–158. Springer (2022)
[45] Shen, S., Chi, M.: Aim low: Correlation-based feature selection for model-based reinforcement learning. International Educational Data Mining Society (2016)
[46] Ho, J., Gupta, J., Ermon, S.: Model-free imitation learning with policy optimization. In: International Conference on Machine Learning, pp. 2760–2769. PMLR (2016)
[47] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[48] Torabi, F., Warnell, G., Stone, P.: Behavioral cloning from observation. arXiv preprint arXiv:1805.01954 (2018)
[49] Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, 1179–1191 (2020)
[50] Lefèvre, S., Sun, C., Bajcsy, R., Laugier, C.: Comparison of parametric and non-parametric approaches for vehicle speed prediction. In: 2014 American Control Conference, pp. 3494–3499. IEEE (2014)
[51] Azhar, Z.A.Z.: Designing an offline reinforcement learning based pedagogical agent with a large scale educational dataset. Master of Science Thesis, Data Science, University of Edinburgh (2021)
[52] Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton, FL, USA (2010)
[53] Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: International Conference on Machine Learning, pp. 449–458. PMLR (2017)
[54] Hershey, J.R., Olsen, P.A.: Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), vol. 4, p. 317. IEEE (2007)
[55] Voloshin, C., Le, H.M., Jiang, N., Yue, Y.: Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854 (2019)
[56] Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J.A., Solowjow, E., Levine, S.: Residual reinforcement learning for robot control. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 6023–6029. IEEE (2019)
[57] Lapan, M.: Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing Ltd. (2018). https://doi.org/10.5555/3279266
[58] Weaver, L., Tao, N.: The optimal reward baseline for gradient-based reinforcement learning. arXiv preprint arXiv:1301.2315 (2013)
[59] Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., Popovic, Z.: Offline policy evaluation across representations with applications to educational games. In: AAMAS, vol. 1077 (2014)
[60] Saito, Y., Udagawa, T., Kiyohara, H., Mogi, K., Narita, Y., Tateno, K.: Evaluating the robustness of off-policy evaluation. In: Fifteenth ACM Conference on Recommender Systems, pp. 114–123 (2021)
[61] Tokdar, S.T., Kass, R.E.: Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics 2(1), 54–60 (2010)
[62] Tirinzoni, A., Salvini, M., Restelli, M.: Transfer of samples in policy search via multiple importance sampling. In: International Conference on Machine Learning, pp. 6264–6274. PMLR (2019)
[63] Shelton, C.R.: Importance sampling for reinforcement learning with multiple objectives (2001)
[64] Ju, S., Shen, S., Azizsoltani, H., Barnes, T., Chi, M.: Importance sampling to identify empirically valid policies and their critical decisions. In: EDM (Workshops), pp. 69–78 (2019)
[65] Mahmood, A.R., Van Hasselt, H.P., Sutton, R.S.: Weighted importance sampling for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems 27 (2014)