Educational data mining: prediction of students' academic performance using machine learning algorithms

Mustafa Yağcı*

*Correspondence: mustafayagci06@gmail.com
Kırşehir Ahi Evran University, Faculty of Engineering and Architecture, 40100 Kırşehir, Turkey

RESEARCH
Yağcı, Smart Learning Environments (2022) 9:11
https://doi.org/10.1186/s40561-022-00192-z

Abstract
Educational data mining has become an effective tool for exploring the hidden relationships in educational data and predicting students' academic achievements. This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, nearest neighbour, support vector machines, logistic regression, Naïve Bayes, and k-nearest neighbour algorithms were calculated and compared to predict the final exam grades of the students. The dataset consisted of the academic achievement grades of 1854 students who took the Turkish Language-I course at a state university in Turkey during the fall semester of 2019–2020. The results show that the proposed model achieved a classification accuracy of 70–75%. The predictions were made using only three types of parameters: midterm exam grades, Department data, and Faculty data. Such data-driven studies are very important for establishing a learning analytics framework in higher education and contributing to decision-making processes. Finally, this study contributes to the early prediction of students at high risk of failure and identifies the most effective machine learning methods.

Keywords: Machine learning, Educational data mining, Predicting achievement, Learning analytics, Early warning systems

Open Access © The Author(s) 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

Introduction
The application of data mining methods in the field of education has attracted great attention in recent years. Data mining (DM) is the field of discovering new and potentially useful information or meaningful results from big data (Witten et al., 2011). It also aims to obtain new trends and new patterns from large datasets by using different classification algorithms (Baker & Inventado, 2014).

Educational data mining (EDM) is the use of traditional DM methods to solve problems related to education (Baker & Yacef, 2009; cited in Fernandes et al., 2019). EDM is the use of DM methods on educational data such as student information, educational records, exam results, student participation in class, and the frequency of students' asking questions. In recent years, EDM has become an effective tool used to identify hidden patterns in educational data, predict academic achievement, and improve the learning/teaching environment.
Learning analytics has gained a new dimension through the use of EDM (Waheed et al., 2020). Learning analytics covers collecting student information, better understanding the learning environment by examining and analysing it, and revealing the best student/teacher performance (Long & Siemens, 2011). Learning analytics is the compilation, measurement and reporting of data about students and their contexts in order to understand and optimize learning and the environments in which it takes place. It also deals with institutions developing new strategies.
Another dimension of learning analytics is predicting student academic performance, uncovering patterns of system access and navigational actions, and determining students who are potentially at risk of failing (Waheed et al., 2020). Learning management systems (LMS), student information systems (SIS), intelligent teaching systems (ITS), MOOCs, and other web-based education systems leave digital data that can be examined to evaluate students' possible behaviour. Using EDM methods, these data can be employed to analyse the activities of successful students and those who are at risk of failure, to develop corrective strategies based on student academic performance, and therefore to assist educators in the development of pedagogical methods (Casquero et al., 2016; Fidalgo-Blanco et al., 2015).
The data collected on educational processes offer new opportunities to improve the learning experience and to optimize users' interaction with technological platforms (Shorfuzzaman et al., 2019). The processing of educational data yields improvements in many areas such as predicting student behaviour, analytical learning, and new approaches to education policies (Capuano & Toti, 2019; Viberg et al., 2018). This comprehensive collection of data will not only allow education authorities to make data-based policies, but also form the basis of software to be developed with artificial intelligence on the learning process.
EDM enables educators to predict situations such as dropping out of school or losing interest in a course, to analyse the internal factors affecting student performance, and to apply statistical techniques to predict students' academic performance. A variety of DM methods are employed to predict student performance and to identify slow learners and dropouts (Hardman et al., 2013; Kaur et al., 2015). Early prediction is a new phenomenon that includes assessment methods to support students by proposing appropriate corrective strategies and policies in this field (Waheed et al., 2020).
Especially during the pandemic period, learning management systems, quickly put into practice, have become an indispensable part of higher education. As students use these systems, the log records produced have become ever more accessible (Macfadyen & Dawson, 2010; Kotsiantis et al., 2013; Saqr et al., 2017). Universities should now improve their capacity to use these data to predict academic success and ensure student progress (Bernacki et al., 2020).
As a result, EDM provides educators with new information by discovering hidden patterns in educational data. Using this information, some aspects of the education system can be evaluated and improved to ensure the quality of education.
Literature
In various studies on EDM, e-learning systems have been successfully analysed (Lara et al., 2014). Some studies have classified educational data (Chakraborty et al., 2016), while others have tried to predict student performance (Fernandes et al., 2019). Asif et al. (2017) focused on two aspects of the performance of undergraduate students using DM methods. The first aspect is to predict the academic achievements of students at the end of a four-year study program. The second is to examine the development of students and combine it with the predictive results. They divided the students into low-achievement and high-achievement groups. They found that it is important for educators to focus on a small number of courses indicating particularly good or poor performance in order to offer timely warnings, support underperforming students, and offer advice and opportunities to high-performing students. Cruz-Jesus et al. (2020) predicted student academic performance with 16 demographic features such as age, gender, class attendance, internet access, computer possession, and the number of courses taken. Random forest, logistic regression, k-nearest neighbours and support vector machines were able to predict students' performance with accuracy ranging from 50 to 81%.
Fernandes etal. (2019) developed a model with the demographic characteristics of the
students and the achievement grades obtained from the in-term activities. In that study,
students’ academic achievement was predicted with classification models based on Gra-
dient Boosting Machine (GBM). e results showed that the best qualities for estimating
achievement scores were the previous year’s achievement scores and unattendance. e
authors found that demographic characteristics such as neighbourhood, school and age
information were also potential indicators of success or failure. In addition, he argued
that this model could guide the development of new policies to prevent failure. Similarly,
by using the student data requested during registration and environmental factors, Hof-
fait and Schyns (2017) determined the students with the potential to fail. He found that
students with potential difficulties could be classified more precisely by using DM meth-
ods. Moreover, their approach makes it possible to rank the students by levels of risk.
Rebai etal. (2020) proposed a machine learning-based model to identify the key factors
affecting academic performance of schools and to determine the relationship between
these factors. He concluded that the regression trees showed that the most important
factors associated with higher performance were school size, competition, class size,
parental pressure, and gender proportions. In addition, according to the random forest
algorithm results, the school size and the percentage of girls had a powerful impact on
the predictive accuracy of the model.
Ahmad and Shahzadi (2018) proposed a machine learning-based model to answer the question of whether students were at risk regarding their academic performance. Using the students' learning skills, study habits, and academic interaction features, they made a prediction with a classification accuracy of 85%. The researchers concluded that their proposed model could be used to identify academically unsuccessful students. Musso et al. (2020) proposed a machine learning model based on learning strategies, perception of social support, motivation, socio-demographics, health condition, and academic performance characteristics. With this model, they predicted academic performance and dropouts. They concluded that the predictive variable with the highest effect on predicting GPA was learning strategies, while the variable with the greatest effect on determining dropouts was background information.
Waheed et al., (2020) designed a model with artificial neural networks on stu-
dents’ records related to their navigation through the LMS. e results showed that
demographics and student clickstream activities had a significant impact on student
performance. Students who navigated through courses performed higher. Students’ par-
ticipation in the learning environment had nothing to do with their performance. How-
ever, he concluded that the deep learning model could be an important tool in the early
prediction of student performance. Xu etal. (2019) determined the relationship between
the internet usage behaviors of university students and their academic performance and
he predicted students’ performance with machine learning methods.e model he pro-
posed predicted students’ academic performance at a high level of accuracy. e results
suggested that Internet connection frequency features were positively correlated with
academic performance, whereas Internet traffic volume features were negatively corre-
lated with academic performance. In addition, he concluded that internet usage features
had an important role on students’ academic performance. Bernacki etal. (2020) tried to
find out whether the log records in the learning management system alone would be suf-
ficient to predict achievement. He concluded that the behaviour-based prediction model
successfully predicted 75% of those who would need to repeat a course. He also stated
that, with this model, students who might be unsuccessful in the subsequent semesters
could be identified and supported. Burgos etal. (2018) predicted the achievement grades
that the students might get in the subsequent semesters and designed a tool for students
who were likely to fail. He found that the number of unsuccessful students decreased by
14% compared to previous years. A comparative analysis of studies predicting the aca-
demic achievement grades using machine learning methods is given in Table1.
A review of previous research that aimed to predict academic achievement indicates that researchers have applied a range of machine learning algorithms, including multiple, probit and logistic regression, neural networks, and C4.5 and J48 decision trees. Random forests (Zabriskie et al., 2019), genetic programming (Xing et al., 2015), and Naïve Bayes algorithms (Ornelas & Ordonez, 2017) have also been used in recent studies. The prediction accuracy of these models reaches very high levels.
Accurate prediction of student academic performance requires a deep understanding of the factors and features that impact student results and achievement (Alshanqiti & Namoun, 2020). For this purpose, Hellas et al. (2018) reviewed 357 articles on student performance, detailing the impact of 29 features. These features were mainly related to course and pre-course performance, student participation, student demographics such as gender, high school performance, and self-regulation. However, dropout rates were mainly influenced by student motivation, habits, social and financial issues, lack of progress, and career transitions.
The literature review suggests that it is necessary to improve the quality of education by predicting students' academic performance and supporting those in the risk group. In the literature, academic performance has been predicted with many different variables: digital traces left by students on the internet (browsing, lesson time, percentage of participation) (Fernandes et al., 2019; Rubin et al., 2010; Waheed et al., 2020; Xu et al., 2019) and students' demographic characteristics
Table 1 Comparative analysis

Asif et al. (2017). Variables: the marks for all the courses taught in the four years of the degree programme. Objective: predicting students' performance. Level: undergraduate students. Dataset: 210. Algorithms: DT, 1-NN, NB, NN, RF. Accuracy: 62.50% (NN) to 83.65% (NB).

Cruz-Jesus et al. (2020). Variables: year of the study cycle, gender, age, number of enrolled years in high school, scholarship, internet access, class size, school size, economic level, population density, number of unit courses attended. Objective: predicting students' performance. Level: high school students. Dataset: 110,627. Algorithms: ANN, DT, ET, RF, SVM, kNN, LR. Accuracy: 51.2% (SVM) to 81.1% (LR).

Fernandes et al. (2019). Variables: class with persons with special needs, classroom usage environment, gender, age (mean), student benefit, city, neighbourhood, student with special needs, grade (mean), absence (mean). Objective: predict academic outcomes of student performance. Level: high school students. Dataset: dataset 1: 19,000; dataset 2: 19,834. Algorithm: Gradient Boosting Machine. Accuracy: 89.5% to 91.9%.

Hoffait and Schyns (2017). Variables: gender, nationality, studies, prior schooling, math, scholarship, success. Objective: predicting students at high risk of failure. Level: secondary school students. Dataset: 2244. Algorithms: RF, LR, ANN. Accuracy: 70.4% (ANN) to 90% (RF).

Rebai et al. (2020). Variables: socioeconomic status, school type, school location, competition, teacher characteristics (experience, salary), class size, school size, gender, parental education, political context, parental pressure. Objective: identify the key factors that impact schools' academic performance and explore their relationships. Level: secondary schools. Dataset: 105 schools. Algorithms: RT, RF. Accuracy: not reported.

Ahmad and Shahzadi (2018). Variables: previous degree marks, home environment, study habits, learning skills, hard work and academic interaction. Objective: identification of students in the risk group. Level: undergraduate students. Dataset: 300. Algorithm: MPNN. Accuracy: 95%.

Musso et al. (2020). Variables: learning strategies, coping strategies, cognitive factors, social support, background, self-concept, self-satisfaction, use of IT and reading. Objective: grade point average, academic retention, and degree completion. Level: undergraduate students. Dataset: 655. Algorithm: ANN. Accuracy: 60.5% to 80.7%.

Waheed et al. (2020). Variables: students' demographics, clickstream events. Objective: pass-fail, withdrawn-pass, distinction-fail, distinction-pass. Level: undergraduate students. Dataset: 32,593. Algorithms: ANN, SVM, LR. Accuracy: 84% to 93%.

Xu et al. (2019). Variables: internet usage behaviours comprising online time, internet connection frequency, and internet traffic volume. Objective: predicting students' performance. Level: undergraduate students. Dataset: 4000. Algorithms: DT, NN, SVM. Accuracy: 71% to 76%.

Bernacki et al. (2020). Variables: log records in the learning management system. Objective: predict achievement. Level: undergraduate students. Dataset: 337. Algorithms: LR, NB, J-48 DT, J-Rip DT. Accuracy: 53.71% (J-48) to 67.36% (LR).

Burgos et al. (2018). Variables: historical student course grade data. Objective: dropping out of a course. Level: undergraduate students. Dataset: 100. Algorithms: SVM, FFNN, PESFAM, LOGIT_Act. Accuracy: 62.50% (SVM) to 97.13% (LOGIT_Act).
(gender, age, economic status, number of courses attended, internet access, etc.) (Bernacki et al., 2020; Rizvi et al., 2019; García-González & Skrita, 2019; Rebai et al., 2020; Cruz-Jesus et al., 2020; Aydemir, 2017), learning skills, study approaches, and study habits (Ahmad & Shahzadi, 2018), learning strategies, perception of social support, motivation, socio-demographics, health condition, and academic performance characteristics (Costa-Mendes et al., 2020; Gök, 2017; Kılınç, 2015; Musso et al., 2020), and homework, projects, and quizzes (Kardaş & Güvenir, 2020). In almost all models developed in such studies, prediction accuracy ranges from 70 to 95%. However, collecting and processing such a variety of data both takes a lot of time and requires expert knowledge. Similarly, Hoffait and Schyns (2017) suggested that collecting so much data is difficult and that socio-economic data are unnecessary. Moreover, these demographic or socio-economic data may not always give the right idea for preventing failure (Bernacki et al., 2020).
This study concerns predicting students' academic achievement using grades only, with no demographic or socio-economic data. It aimed to develop a new model based on machine learning algorithms to predict the final exam grades of undergraduate students from their midterm exam grades and the Faculty and Department of the students.
For this purpose, the classification algorithms with the highest performance in predicting students' academic achievement were determined by using machine learning classification algorithms. The Turkish Language-I course was chosen because it is a compulsory course that all students enrolled in the university must take. Using this model, students' final exam grades were predicted. These models will enable the development of pedagogical interventions and new policies to improve students' academic performance. In this way, the number of potentially unsuccessful students can be reduced following the assessments made after each midterm.
Method
This section describes the details of the dataset, pre-processing techniques, and machine learning algorithms employed in this study.
Dataset
Educational institutions regularly store all available data about students in electronic media. These data are stored in databases for processing and can be of many types and volumes, from students' demographics to their academic achievements. In this study, the data were taken from the Student Information System (SIS), where all student records are stored at a state university in Turkey. From these records, the midterm exam grades, final exam grades, Faculty, and Department of 1854 students who took the Turkish Language-I course in the 2019–2020 fall semester were selected as the dataset. Table 2 shows the distribution of students by academic unit. The dataset is also provided as Additional file 1.
Midterm and final exam grades range from 0 to 100. In this system, the end-of-semester achievement grade is calculated as 40% of the midterm exam grade plus 60% of the final exam grade. Students with an achievement grade below 60 are unsuccessful, and those at or above 60 are successful. The midterm exam is usually held in the middle of the academic semester and the final exam at the end of the semester. There are approximately 9 weeks (2.5 months) from the midterm exam to the final exam. In other words, there is a two-and-a-half-month period for corrective actions for students who, according to the final exam predictions, are at risk of failing. The study thus investigated how effective a student's performance in the middle of the semester is on his or her performance at the end of the semester.
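The grading rule above can be sketched in a few lines; the function names are illustrative, not from the paper. A useful by-product is the lowest final-exam grade a student still needs in order to pass, given a midterm grade:

```python
def semester_grade(midterm, final):
    """End-of-semester achievement grade: 40% midterm + 60% final."""
    return 0.4 * midterm + 0.6 * final

def min_final_to_pass(midterm, threshold=60.0):
    """Lowest final-exam grade (0-100) that still reaches the passing threshold."""
    needed = (threshold - 0.4 * midterm) / 0.6
    return min(100.0, max(0.0, needed))

# std1 from Table 3: midterm 60, final 68 -> achievement grade 64.8 (pass)
print(round(semester_grade(60, 68), 1))
# A midterm of 34 leaves roughly 2.5 months to secure a final grade of about 77.3
print(round(min_final_to_pass(34), 1))
```

This is exactly the kind of arithmetic that makes the midterm a natural early-warning signal: the required final grade grows quickly as the midterm grade drops.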
Data identication andcollection
At this phase, it is determined from which source the data will be stored, which fea-
tures of the data will be used, and whether the collected data is suitable for the purpose.
Feature selection involves decreasing the number of variables used to predict a particu-
lar outcome.e goal; to facilitate the interpretability of the model, reduce complexity,
increase the computational efficiency of algorithms, and avoid overfitting.
Establishing the DM model and implementation of the algorithms
RF, NN, LR, SVM, NB, and kNN were employed to predict students' academic performance. The prediction accuracy was evaluated using tenfold cross-validation. The DM process serves two main purposes. The first is to make predictions by analysing the data in the database (predictive model). The second is to describe behaviours (descriptive model). In predictive models, a model is created using data with known results; this model is then used to predict the result values for datasets whose results are unknown. In descriptive models, the patterns in the existing data are defined to support decision-making.
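The evaluation procedure can be illustrated with a minimal, library-free sketch of tenfold cross-validation. The study itself ran RF, NN, LR, SVM, NB, and kNN through Orange; the trivial majority-class "learner" below is only a stand-in to keep the example self-contained:

```python
import random
from collections import Counter

def ten_fold_indices(n, seed=0):
    """Shuffle the row indices and deal them into 10 roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(X, y, fit, predict):
    """Mean accuracy over tenfold cross-validation: train on 9 folds, test on 1."""
    accuracies = []
    for fold in ten_fold_indices(len(y)):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# Stand-in "learner": always predicts the most frequent training label.
fit = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model

X = list(range(100))
y = [1] * 70 + [0] * 30  # 70% majority class
print(cross_validate(X, y, fit, predict))  # majority-class baseline, ~0.70
```

Any real classifier plugs into the same `fit`/`predict` slots; the point of the tenfold scheme is that every instance is used for testing exactly once.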
When the focus is on analysing the causes of success or failure, statistical methods such as logistic regression and time series can be employed (Ortiz & Dehon, 2008; Arias Ortiz & Dehon, 2013). However, when the focus is on forecasting, neural networks (Delen, 2010; Vandamme et al., 2007), support vector machines (Huang & Fang, 2013), decision trees (Delen, 2011; Nandeshwar et al., 2011) and random forests (Delen, 2010; Vandamme et al., 2007) are more efficient and give more accurate results. Statistical techniques aim to create a model that can successfully predict output values based on
Table 2 The dataset

Academic unit: Number of students
Faculty of Education: 404
Faculty of Arts and Sciences: 319
Faculty of Health Sciences: 296
Faculty of Economics and Administrative Sciences: 221
School of Physical Education and Sports: 192
Faculty of Engineering and Architecture: 116
School of Physical Therapy and Rehabilitation: 92
Faculty of Islamic Sciences: 88
Faculty of Agriculture: 68
Faculty of Fine Arts: 30
Vocational School of Applied Sciences: 28
Total number of students: 1854
available input data.On the other hand, machine learning methods automatically create
a model that matches the input data with the expected target values when a supervised
optimization problem is given.
The performance of the model was measured by confusion matrix indicators. It is understood from the literature that there is no single classifier that works best for all prediction problems. Therefore, it is necessary to investigate which classifiers are better suited to the analysed data (Asif et al., 2017).
Experiments andresults
e entire experimental phase was performed with Orange machine learning soft-
ware. Orange is a powerful and easy-to-use component-based DM programming tool
for expert data scientists as well as for data science beginners. In Orange, data analysis
is done by stacking widgets into workflows. Each widget includes some data retrieval,
data pre-processing, visualization, modelling, or evaluation task. A workflow is a series
of actions or actions that will be performed on the platform to perform a specific task.
Comprehensive data analysis charts can be created by combining different components
in a workflow. Figure1 shows the workflow diagram designed.
The dataset included the midterm exam grades, final exam grades, Faculty, and Department of 1854 students taking the Turkish Language-I course in the 2019–2020 fall semester. The entire dataset is provided as Additional file 1. Table 3 shows part of the dataset.
In the dataset, students' midterm exam grades, final exam grades, faculty, and department information were determined as features. Each record contains the data associated with one student. The midterm exam and final exam grade variables were explained under the heading "Dataset". The faculty variable represents the Faculties of Kırşehir Ahi Evran University, and the department variable represents the departments within the faculties. In the development of the model, the midterm grade, faculty, and department information were used as the independent variables, and the final grade was the dependent variable. Table 4 shows the variable model.
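Faculty and Department are nominal variables, while algorithms such as kNN, SVM, and LR operate on numeric inputs. Orange handles this conversion internally; a minimal hand-rolled sketch of the usual one-hot encoding (illustrative, not the paper's code) looks like this:

```python
def one_hot(values):
    """One-hot encode a categorical column: one 0/1 indicator per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1          # mark the student's category
        rows.append(row)
    return categories, rows

cats, encoded = one_hot([
    "Faculty of Education",
    "Faculty of Agriculture",
    "Faculty of Education",
])
# Each row now has exactly one 1, marking the student's faculty.
```

The same transformation applies to the department column; the midterm grade stays numeric as-is.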
Fig. 1 The workflow of the designed model
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 10 of 19
Yağcı Smar t Learning Environments (2022) 9:11
After the variable model was determined, the midterm exam grades and final exam grades were categorized according to the equal-width discretization model. Table 5 shows the criteria used to convert midterm exam grades and final exam grades into categorical format.
In Table 6, the values in the final column are the actual values, and the values in the RF, SVM, LR, kNN, NB, and NN columns are the values predicted by the proposed models. For example, according to Table 6, std1's actual final grade was in the range 55–77.5. While the RF, SVM, LR, NB, and NN models predicted a value in this range, the kNN model predicted a value of 77.5 or greater.
Evaluation ofthemodel performance
e performance of model was evaluated with confusion matrix, classificationaccu-
racy (CA), precision, recall, f-score (F1), and area under roc curve(AUC) metrics.
Table 3 Part of the dataset, which consists of 1854 rows

stdID | Midterm | Final | Faculty | Department
std1 | 60 | 68 | Faculty of Economics and Administrative Sciences | Political Science and Public Administration
std2 | 34 | 67 | School of Physical Education and Sports | Coaching Education
std3 | 25 | 75 | Faculty of Education | Computer Education and Instructional Technology
std4 | 50 | 66 | Faculty of Education | Social Sciences Teaching
std5 | 50 | 66 | Faculty of Education | Early Childhood Education
std6 | 88 | 72 | Faculty of Education | Garden Plants
std7 | 45 | 37 | School of Physical Education and Sports | Physical Education and Sports Teaching
std8 | 52 | 50 | School of Physical Education and Sports | Coaching Education
…
std1853 | 88 | 88 | School of Physical Therapy and Rehabilitation | Physiotherapy and Rehabilitation
std1854 | 84 | 96 | School of Physical Therapy and Rehabilitation | Physiotherapy and Rehabilitation
Table 4 The model of variables

Features: Midterm, Faculty, Department
Target variable: Final
Meta attributes: stdID
Table 5 Categorical criteria

Category 1: grade < 32.5
Category 2: 32.5 ≤ grade < 55
Category 3: 55 ≤ grade < 77.5
Category 4: grade ≥ 77.5
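The criteria in Table 5 translate directly into a small helper (a sketch; the function name is illustrative):

```python
def categorize(grade):
    """Map a 0-100 exam grade to the four equal-width categories of Table 5."""
    if grade < 32.5:
        return 1
    if grade < 55:
        return 2
    if grade < 77.5:
        return 3
    return 4

# std1 from Table 3: midterm 60 and final 68 both fall into category 3 (55-77.5)
print(categorize(60), categorize(68))
```

Both the features (midterm grade) and the target (final grade) are discretized this way before classification.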
Table 6 Probabilities and final decisions of the predictive models (RF, LR, SVM, kNN, NB, NN)

stdID | RF | SVM | LR | kNN | NB | NN | Final | Midterm | Faculty | Department
st1 | 55–77.5 | 55–77.5 | 55–77.5 | ≥ 77.5 | 55–77.5 | 55–77.5 | 55–77.5 | < 32.5 | Faculty of Education | Computer Education and Instructional Technology
st2 | 55–77.5 | ≥ 77.5 | ≥ 77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 32.5–55 | Faculty of Education | Social Sciences Teaching
st3 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 32.5–55 | Faculty of Education | Early Childhood Education
st4 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | ≥ 77.5 | Faculty of Agriculture | Garden Plants
st5 | 55–77.5 | 55–77.5 | 55–77.5 | < 32.5 | 55–77.5 | 55–77.5 | 32.5–55 | 32.5–55 | School of Physical Education and Sports | Physical Education and Sports Teaching
st6 | 32.5–55 | 55–77.5 | 55–77.5 | 55–77.5 | 32.5–55 | 32.5–55 | 32.5–55 | 32.5–55 | School of Physical Education and Sports | Coaching Education
st7 | 55–77.5 | 55–77.5 | ≥ 77.5 | 55–77.5 | < 32.5 | < 32.5 | < 32.5 | < 32.5 | Faculty of Education | Social Sciences Teaching
st8 | 55–77.5 | 55–77.5 | 55–77.5 | ≥ 77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | Faculty of Education | Psychological Counseling and Guidance
st9 | < 32.5 | 32.5–55 | ≥ 77.5 | < 32.5 | < 32.5 | 32.5–55 | ≥ 77.5 | < 32.5 | Faculty of Education | Primary Education
st10 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 55–77.5 | 32.5–55 | Faculty of Arts and Sciences | Archaeology
Confusion matrix
The confusion matrix shows the current situation in the dataset and the number of correct/incorrect predictions of the model. Table 7 shows the confusion matrix. The performance of the model is calculated from the numbers of correctly and incorrectly classified instances. The rows show the real numbers of the samples in the test set, and the columns represent the predictions of the model.
In Table 7, true positive (TP) and true negative (TN) show the numbers of correctly classified instances. False positive (FP) shows the number of instances predicted as 1 (positive) when they should be in the 0 (negative) class. False negative (FN) shows the number of instances predicted as 0 (negative) when they should be in class 1 (positive).
Table8 shows the confusion matrix for the RF algorithm. In the confusion matrix
of 4 × 4 dimensions, the main diagonal shows the percentage of correctly predicted
instances, and the matrix elements other than the main diagonal shows the percent-
age of errors predicted.
Table 8 shows that 84.9% of those with the actual final grade greater than 77.5,
71.2% of those with range 55–77.5, 65.4% of those with range 32.5–55, and 60% of
those with less than 32.5 were predicted correctly. Confusion matrixs of other algo-
rithms are shown in Tables9, 10, 11, 12, and 13.
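The matrices in Tables 8–13 can be reproduced from raw label lists with a few lines of code. The following is a minimal pure-Python sketch (function and variable names are illustrative, not from the paper); since table percentages can be normalized either by actual-class (row) totals or by predicted-class (column) totals, both options are provided:

```python
def confusion_matrix(actual, predicted, n_classes=4):
    """Count (actual, predicted) label pairs into an n_classes x n_classes grid.

    Rows index the actual class and columns the predicted class, matching
    the layout of Tables 8-13 (class labels here are integers 0..n_classes-1).
    """
    m = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m


def normalize(m, axis="row"):
    """Convert counts to percentages, per actual class (row) or predicted class (column)."""
    n = len(m)
    if axis == "row":
        return [[100.0 * v / max(sum(row), 1) for v in row] for row in m]
    col_totals = [max(sum(m[i][j] for i in range(n)), 1) for j in range(n)]
    return [[100.0 * m[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
```

For example, `normalize(confusion_matrix(y_true, y_pred), axis="row")` gives the share of each actual class that lands in each predicted class; the main diagonal then holds the per-class hit rates discussed above.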
Classification accuracy: CA is the ratio of the correct predictions (TP + TN) to the total number of instances (TP + TN + FP + FN):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Table 7 The Confusion matrix
Predicted
Positive (1) Negative (0)
Actual Positive (1) TP FP
Negative (0) FN TN
Table 8 Confusion matrix of the RF algorithm
Predicted
< 32.5 32.5–55 55–77.5 ≥ 77.5 Sum
Actual < 32.5 60% 3.8% 1.2% 0.6% 38
32.5–55 26.7% 65.4% 9.5% 0.8% 154
55–77.5 10.0% 30.8% 71.2% 13.6% 1016
≥ 77.5 3.3% 0.0% 18.1% 84.9% 646
Sum 30 26 1320 478 1854
Table 9 Confusion matrix of the NN algorithm
Predicted
< 32.5 32.5–55 55–77.5 ≥ 77.5 Sum
Actual < 32.5 64% 9.7% 1.2% 0.6% 38
32.5–55 24% 61.3% 9.6% 1.0% 154
55–77.5 12.0% 25.8% 71.8% 14.9% 1016
≥ 77.5 0.0% 3.2% 17.4% 83.5% 646
Sum 25 31 1296 502 1854
Table 10 Confusion matrix of the SVM algorithm
Predicted
< 32.5 32.5–55 55–77.5 ≥ 77.5 Sum
Actual < 32.5 68.8% 14.3% 1.6% 0.6% 38
32.5–55 31.2% 52.4% 9.9% 0.9% 154
55–77.5 0.0% 14.3% 70.1% 14.3% 1016
≥ 77.5 0.0% 19.0% 18.4% 84.2% 646
Sum 16 21 1349 468 1854
Table 11 Confusion matrix of the LR algorithm
Predicted
< 32.5 32.5–55 55–77.5 ≥ 77.5 Sum
Actual < 32.5 56.0% 8.3% 1.5% 0.8% 38
32.5–55 24.0% 41.7% 10.3% 1.7% 154
55–77.5 4.0% 25.0% 70.0% 20.1% 1016
≥ 77.5 16.0% 25.0% 18.1% 77.4% 646
Sum 25 12 1295 522 1854
Table 12 Confusion matrix of the NB algorithm
Predicted
< 32.5 32.5–55 55–77.5 ≥ 77.5 Sum
Actual < 32.5 40.0% 9.5% 0.9% 0.0% 38
32.5–55 18.2% 42.9% 9.4% 1.2% 154
55–77.5 18.2% 42.9% 70.4% 19.3% 1016
≥ 77.5 23.6% 4.8% 19.2% 79.5% 646
Sum 55 42 1270 487 1854
Precision: Precision is the ratio of the number of positive instances that are correctly classified to the total number of instances predicted as positive. It takes a value in the range [0, 1]:

Precision = TP / (TP + FP)

Recall: Recall is the ratio of the number of correctly classified positive instances to the number of all instances whose actual class is positive. Recall is also called the true positive rate. It takes a value in the range [0, 1]:

Recall = TP / (TP + FN)

F-Criterion (F1): There is an inverse relationship between precision and recall. Therefore, the harmonic mean of the two criteria is calculated for more accurate and sensitive results. This is called the F-criterion:

F1 = (2 × Precision × Recall) / (Precision + Recall)

Table 13 Confusion matrix of the kNN algorithm
Predicted
< 32.5 32.5–55 55–77.5 ≥ 77.5 Sum
Actual < 32.5 50.0% 2.6% 1.1% 0.5% 38
32.5–55 30.0% 31.3% 8.9% 1.5% 154
55–77.5 15.0% 55.7% 72.9% 24.9% 1016
≥ 77.5 5.0% 10.4% 17.1% 73.1% 646
Sum 40 115 1089 610 1854

Receiver operating characteristics (ROC) curve
The AUC-ROC curve is used to evaluate the performance of a classification model. AUC-ROC is a widely used metric for evaluating machine learning algorithms, especially on imbalanced datasets, and indicates how well the model separates the classes.

AUC: Area under the ROC curve. The larger the area covered, the better the machine learning algorithm is at distinguishing the given classes; the ideal AUC value is 1. The AUC, classification accuracy (CA), F-criterion (F1), precision, and recall values of the models are shown in Table 14. The AUC values of the RF, NN, SVM, LR, NB, and kNN algorithms were 0.860, 0.863, 0.804, 0.826, 0.810, and 0.810, respectively. The classification accuracies of the RF, NN, SVM, LR, NB, and kNN algorithms were 0.746, 0.746, 0.735, 0.717, 0.713, and 0.699, respectively. According to these findings, the RF algorithm, for example, achieved 74.6% accuracy; in other words, there was a very strong correlation between the predicted data and the actual data, and 74.6% of the samples were classified correctly.
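Assuming the per-class scores in Table 14 are combined by support-weighted averaging (a common convention for multiclass reports, though the paper does not state it explicitly), the CA, precision, recall, and F1 columns can be computed from a confusion matrix of counts as follows; the function name and layout are a hypothetical sketch:

```python
def weighted_metrics(cm):
    """Accuracy and support-weighted precision, recall, and F1 from a square
    confusion matrix of counts (rows = actual class, columns = predicted class),
    mirroring the columns of Table 14."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                    # actual instances of class i
        pred = sum(cm[r][i] for r in range(n))  # instances predicted as class i
        p = cm[i][i] / pred if pred else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                # weight each class by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1
```

One consistency check: support-weighted recall is algebraically identical to classification accuracy, which matches Table 14, where the Recall column repeats the CA column for every model.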
Discussion andconclusion
is study proposes a new model based on machine learning algorithms to predict the
final exam grades of undergraduate students, taking their midterm exam grades as the
source data. e performances of the Random Forests, nearest neighbour, support vec-
tor machines, Logistic Regression, Naïve Bayes, and k-nearest neighbour algorithms,
which are among the machine learning algorithms, were calculated and compared to
predict the final exam grades of the students. is study focused on two parameters. e
first parameter was the prediction of academic performance based on previous achieve-
ment grades. e second one was the comparison of performance indicators of machine
learning algorithms.
e results show that the proposed model achieved a classification accuracy of
70–75%. According to this result, it can be said that students’ midterm exam grades are
an important predictor to be used in predicting their final exam grades. RF, NN, SVM,
LR, NB, and kNN are algorithms with a very high accuracy rate that can be used to pre-
dict students’ final exam grades. Furthermore, the predictions were made using only
three types of parameters; midterm exam grades, Department data and Faculty data.
The results of this study were compared with studies that predicted students' academic achievement grades from various demographic and socio-economic variables. Hoffait and Schyns (2017) proposed a model that uses students' academic achievement in previous years to predict their performance in the courses they will take in the new semester. They found that 12.2% of the students had a very high risk of failure, with a 90% confidence rate. Waheed et al. (2020) predicted student achievement from demographic and geographic characteristics, found that these have a significant effect on students' academic performance, and predicted the failure or success of the students with 85% accuracy. Xu et al. (2019) found that internet usage data can distinguish and predict students' academic performance. Costa-Mendes et al. (2020) and Cruz-Jesus et al. (2020) predicted the academic achievement of students in the light of income, age, employment, cultural level indicators, place of residence, and socio-economic information. Similarly, Babić (2017) predicted students' performance with an accuracy of 65% to 100% using artificial neural networks, classification trees, and support vector machines.
Table 14 AUC, CA, F1, precision and recall values of the models
Model AUC Classification accuracy (CA) F1 Precision Recall
Random Forest 0.860 0.746 0.721 0.752 0.746
Neural Network 0.863 0.746 0.723 0.748 0.746
SVM 0.804 0.735 0.704 0.735 0.735
Logistic Regression 0.826 0.717 0.685 0.700 0.717
Naïve Bayes 0.810 0.713 0.692 0.706 0.713
kNN 0.810 0.699 0.694 0.691 0.699
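For reference, a binary AUC can be computed without plotting the ROC curve at all, via the Mann-Whitney rank-sum identity; multiclass AUC values such as those in Table 14 are then typically averages of one-vs-rest binary AUCs. The sketch below is illustrative and not taken from the paper:

```python
def auc_binary(labels, scores):
    """AUC via the Mann-Whitney rank-sum identity: the probability that a
    randomly chosen positive instance is scored higher than a randomly
    chosen negative one (tied scores count half)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    idx = 0
    while idx < n:
        j = idx
        while j < n and pairs[j][0] == pairs[idx][0]:
            j += 1                      # extend over a block of tied scores
        avg_rank = (idx + 1 + j) / 2.0  # average 1-based rank within the tie block
        rank_sum_pos += avg_rank * sum(1 for k in range(idx, j) if pairs[k][1] == 1)
        idx = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the values of 0.80–0.86 in Table 14 indicate models that separate the grade categories well.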
Another result of this study was that the RF, NN, and SVM algorithms had the highest classification accuracy, while kNN had the lowest. According to this result, RF, NN, and SVM produce more accurate results when predicting students' academic achievement grades with machine learning algorithms.
The results were also compared with research in which machine learning algorithms were employed to predict academic performance from various variables. For example, Hoffait and Schyns (2017) compared the performances of the LR, ANN, and RF algorithms in identifying students at high risk of academic failure from their various demographic characteristics. They ranked the algorithms from highest to lowest accuracy as LR, ANN, and RF. On the other hand, Waheed et al. (2020) found that the SVM algorithm performed better than the LR algorithm. According to Xu et al. (2019), the algorithm with the highest performance was SVM, followed by the NN algorithm, with the decision tree performing worst.
The proposed model predicted the final exam grades of students with 73% accuracy. According to this result, academic achievement can be predicted with this model in the future. By predicting students' achievement grades in advance, students can be prompted to review their working methods and improve their performance. The importance of the proposed method can be better understood considering that there are approximately 2.5 months between the midterm exams and the final exams in higher education. Similarly, Bernacki et al. (2020) worked on an early warning model, proposing to predict the academic achievements of students from their behavior data in the learning management system before the first exam. Their algorithm correctly identified 75% of students who failed to earn the grade of B or better needed to advance to the next course. Ahmad and Shahzadi (2018) predicted students at risk of poor academic performance with 85% accuracy by evaluating their study habits, learning skills, and academic interaction features. Cruz-Jesus et al. (2020) predicted students' end-of-semester grades with 16 independent variables and concluded that students could be given the opportunity of early intervention.
As a result, students' academic performances were predicted using different predictors, different algorithms, and different approaches. The results confirm that machine learning algorithms can be used to predict students' academic performance. More importantly, the prediction was made only with the parameters of midterm grade, faculty, and department. Teaching staff can benefit from the results of this research in the early recognition of students whose academic motivation is below or above average. For example, as Babić (2017) points out, they can then match students with below-average academic motivation with students with above-average academic motivation and encourage them to work in groups or on projects. In this way, students' motivation can be improved and their active participation in learning ensured. In addition, such data-driven studies should assist higher education in establishing a learning analytics framework and contribute to decision-making processes.
Future research can be conducted by including other parameters as input variables
and adding other machine learning algorithms to the modelling process. In addition, it
is necessary to harness the effectiveness of DM methods to investigate students’ learning
behaviors, address their problems, optimize the educational environment, and enable
data-driven decision making.
Abbreviations
EDM: Educational data mining; RF: Random forests; NN: Neural networks; SVM: Support vector machines; LR: Logistic
regression; NB: Naïve Bayes; kNN: K-nearest neighbour; DT: Decision trees; ANN: Artificial neural networks; ERT: Extremely
randomized trees; RT: Regression trees; MPNN: Multilayer perceptron neural network; FFNN: Feed-forward neural
network; PESFAM: Adaptive resonance theory mapping; LMS: Learning management systems; SIS: Student information
systems; ITS: Intelligent teaching systems; CA: Classification accuracy; F1: F-score; AUC: Area under the ROC curve; TP: True positive; TN: True negative; FP: False positive; FN: False negative; ROC: Receiver operating characteristics.
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s40561-022-00192-z.
Additional le1: Dataset.
Acknowledgements
Not applicable.
Authors’ contributions
All authors read and approved the final manuscript.
Funding
Not applicable.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable
request.
Declarations
Competing interests
The authors declare that they have no competing interests.
Received: 15 November 2021 Accepted: 15 February 2022
References
Ahmad, Z., & Shahzadi, E. (2018). Prediction of students' academic performance using artificial neural network. Bulletin of Education and Research, 40(3), 157–164.
Alshanqiti, A., & Namoun, A. (2020). Predicting student performance and its influential factors using hybrid regression and multi-label classification. IEEE Access, 8, 203827–203844. https://doi.org/10.1109/access.2020.3036572
Arias Ortiz, E., & Dehon, C. (2013). Roads to success in the Belgian French Community's higher education system: predictors of dropout and degree completion at the Université Libre de Bruxelles. Research in Higher Education, 54(6), 693–723. https://doi.org/10.1007/s11162-013-9290-y
Asif, R., Merceron, A., Ali, S. A., & Haider, N. G. (2017). Analyzing undergraduate students' performance using educational data mining. Computers and Education, 113, 177–194. https://doi.org/10.1016/j.compedu.2017.05.007
Aydemir, B. (2017). Predicting academic success of vocational high school students using data mining methods graduate. [Unpublished master's thesis]. Pamukkale University Institute of Science.
Babić, I. D. (2017). Machine learning methods in predicting the student academic motivation. Croatian Operational Research Review, 8(2), 443–461. https://doi.org/10.17535/crorr.2017.0028
Baker, R. S., & Inventado, P. S. (2014). Educational data mining and learning analytics. Learning analytics (pp. 61–75). Springer.
Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3–17.
Bernacki, M. L., Chavez, M. M., & Uesbeck, P. M. (2020). Predicting achievement and providing support before STEM majors begin to fail. Computers & Education, 158(August), 103999. https://doi.org/10.1016/j.compedu.2020.103999
Burgos, C., Campanario, M. L., De, D., Lara, J. A., Lizcano, D., & Martínez, M. A. (2018). Data mining for modeling students' performance: A tutoring action plan to prevent academic dropout. Computers and Electrical Engineering, 66(2018), 541–556. https://doi.org/10.1016/j.compeleceng.2017.03.005
Capuano, N., & Toti, D. (2019). Experimentation of a smart learning system for law based on knowledge discovery and cognitive computing. Computers in Human Behavior, 92, 459–467. https://doi.org/10.1016/j.chb.2018.03.034
Casquero, O., Ovelar, R., Romo, J., Benito, M., & Alberdi, M. (2016). Students' personal networks in virtual and personal learning environments: A case study in higher education using learning analytics approach. Interactive Learning Environments, 24(1), 49–67. https://doi.org/10.1080/10494820.2013.817441
Chakraborty, B., Chakma, K., & Mukherjee, A. (2016). A density-based clustering algorithm and experiments on student dataset with noises using Rough set theory. In Proceedings of 2nd IEEE international conference on engineering and technology, ICETECH 2016, March (pp. 431–436). https://doi.org/10.1109/ICETECH.2016.7569290
Costa-Mendes, R., Oliveira, T., Castelli, M., & Cruz-Jesus, F. (2020). A machine learning approximation of the 2015 Portuguese high school student grades: A hybrid approach. Education and Information Technologies, 26, 1527–1547. https://doi.org/10.1007/s10639-020-10316-y
Cruz-Jesus, F., Castelli, M., Oliveira, T., Mendes, R., Nunes, C., Sa-Velho, M., & Rosa-Louro, A. (2020). Using artificial intelligence methods to assess academic achievement in public high schools of a European Union country. Heliyon. https://doi.org/10.1016/j.heliyon.2020.e04081
Delen, D. (2010). A comparative analysis of machine learning techniques for student retention management. Decision Support Systems, 49(4), 498–506. https://doi.org/10.1016/j.dss.2010.06.003
Delen, D. (2011). Predicting student attrition with data mining methods. Journal of College Student Retention: Research, Theory and Practice, 13(1), 17–35. https://doi.org/10.2190/CS.13.1.b
Fernandes, E., Holanda, M., Victorino, M., Borges, V., Carvalho, R., & Van Erven, G. (2019). Educational data mining: Predictive analysis of academic performance of public school students in the capital of Brazil. Journal of Business Research, 94(February 2018), 335–343. https://doi.org/10.1016/j.jbusres.2018.02.012
Fidalgo-Blanco, Á., Sein-Echaluce, M. L., García-Peñalvo, F. J., & Conde, M. Á. (2015). Using Learning Analytics to improve teamwork assessment. Computers in Human Behavior, 47, 149–156. https://doi.org/10.1016/j.chb.2014.11.050
García-González, J. D., & Skrita, A. (2019). Predicting academic performance based on students' family environment: Evidence for Colombia using classification trees. Psychology, Society and Education, 11(3), 299–311. https://doi.org/10.25115/psye.v11i3.2056
Gök, M. (2017). Predicting academic achievement with machine learning methods. Gazi University Journal of Science Part C: Design and Technology, 5(3), 139–148.
Hardman, J., Paucar-Caceres, A., & Fielding, A. (2013). Predicting students' progression in higher education by using the random forest algorithm. Systems Research and Behavioral Science, 30(2), 194–203. https://doi.org/10.1002/sres.2130
Hellas, A., Ihantola, P., Petersen, A., Ajanovski, V. V., Gutica, M., Hynninen, T., Knutas, A., Leinonen, J., Messom, C., & Liao, S. N. (2018). Predicting academic performance: a systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education (pp. 175–199).
Hoffait, A., & Schyns, M. (2017). Early detection of university students with potential difficulties. Decision Support Systems, 101(2017), 1–11. https://doi.org/10.1016/j.dss.2017.05.003
Huang, S., & Fang, N. (2013). Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models. Computers and Education, 61(1), 133–145. https://doi.org/10.1016/j.compedu.2012.08.015
Kardaş, K., & Güvenir, A. (2020). Analysis of the effects of quizzes, homeworks and projects on final exam with different machine learning techniques. EMO Journal of Scientific, 10(1), 22–29.
Kaur, P., Singh, M., & Josan, G. S. (2015). Classification and prediction based data mining algorithms to predict slow learners in education sector. Procedia Computer Science, 57, 500–508. https://doi.org/10.1016/j.procs.2015.07.372
Kılınç, Ç. (2015). Examining the effects on university student success by data mining techniques. [Unpublished master's thesis]. Eskişehir Osmangazi University Institute of Science.
Kotsiantis, S., Tselios, N., Filippidi, A., & Komis, V. (2013). Using learning analytics to identify successful learners in a blended learning course. International Journal of Technology Enhanced Learning, 5(2), 133–150. https://doi.org/10.1504/IJTEL.2013.059088
Lara, J. A., Lizcano, D., Martínez, M. A., Pazos, J., & Riera, T. (2014). A system for knowledge discovery in e-learning environments within the European Higher Education Area—Application to student data from Open University of Madrid, UDIMA. Computers and Education, 72, 23–36. https://doi.org/10.1016/j.compedu.2013.10.009
Long, P., & Siemens, G. (2011). Penetrating the fog: Analytics in learning and education. Educause Review, 46(5), 31–40.
Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an "early warning system" for educators: A proof of concept. Computers & Education, 54(2), 588–599. https://doi.org/10.1016/j.compedu.2009.09.008
Musso, M. F., Hernández, C. F. R., & Cascallar, E. C. (2020). Predicting key educational outcomes in academic trajectories: A machine-learning approach. Higher Education, 80(5), 875–894. https://doi.org/10.1007/s10734-020-00520-7
Nandeshwar, A., Menzies, T., & Nelson, A. (2011). Learning patterns of university student retention. Expert Systems with Applications, 38(12), 14984–14996. https://doi.org/10.1016/j.eswa.2011.05.048
Ornelas, F., & Ordonez, C. (2017). Predicting student success: A naïve bayesian application to community college data. Technology, Knowledge and Learning, 22(3), 299–315. https://doi.org/10.1007/s10758-017-9334-z
Ortiz, E. A., & Dehon, C. (2008). What are the factors of success at University? A case study in Belgium. Cesifo Economic Studies, 54(2), 121–148. https://doi.org/10.1093/cesifo/ifn012
Rebai, S., Ben Yahia, F., & Essid, H. (2020). A graphically based machine learning approach to predict secondary schools performance in Tunisia. Socio-Economic Planning Sciences, 70(August 2018), 100724. https://doi.org/10.1016/j.seps.2019.06.009
Rizvi, S., Rienties, B., & Ahmed, S. (2019). The role of demographics in online learning; A decision tree based approach. Computers & Education, 137(August 2018), 32–47. https://doi.org/10.1016/j.compedu.2019.04.001
Rubin, B., Fernandes, R., Avgerinou, M. D., & Moore, J. (2010). The effect of learning management systems on student and faculty outcomes. The Internet and Higher Education, 13(1–2), 82–83. https://doi.org/10.1016/j.iheduc.2009.10.008
Saqr, M., Fors, U., & Tedre, M. (2017). How learning analytics can early predict under-achieving students in a blended medical education course. Medical Teacher, 39(7), 757–767. https://doi.org/10.1080/0142159X.2017.1309376
Shorfuzzaman, M., Hossain, M. S., Nazir, A., Muhammad, G., & Alamri, A. (2019). Harnessing the power of big data analytics in the cloud to support learning analytics in mobile learning environment. Computers in Human Behavior, 92(February 2017), 578–588. https://doi.org/10.1016/j.chb.2018.07.002
Vandamme, J.-P., Meskens, N., & Superby, J.-F. (2007). Predicting academic performance by data mining methods. Education Economics, 15(4), 405–419. https://doi.org/10.1080/09645290701409939
Viberg, O., Hatakka, M., Bälter, O., & Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Computers in Human Behavior, 89(July), 98–110. https://doi.org/10.1016/j.chb.2018.07.027
Waheed, H., Hassan, S. U., Aljohani, N. R., Hardman, J., Alelyani, S., & Nawaz, R. (2020). Predicting academic performance of students from VLE big data using deep learning models. Computers in Human Behavior, 104(October 2019), 106189. https://doi.org/10.1016/j.chb.2019.106189
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). Morgan Kaufmann.
Xing, W., Guo, R., Petakovic, E., & Goggins, S. (2015). Participation-based student final performance prediction model through interpretable Genetic Programming: Integrating learning analytics, educational data mining and theory. Computers in Human Behavior, 47, 168–181.
Xu, X., Wang, J., Peng, H., & Wu, R. (2019). Prediction of academic performance associated with internet usage behaviors using machine learning algorithms. Computers in Human Behavior, 98(January), 166–173. https://doi.org/10.1016/j.chb.2019.04.015
Zabriskie, C., Yang, J., DeVore, S., & Stewart, J. (2019). Using machine learning to predict physics course outcomes. Physical Review Physics Education Research, 15(2), 020120. https://doi.org/10.1103/PhysRevPhysEducRes.15.020120
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
... This is often used in chatbots. Chatbots can communicate with students and guide them to services or the answers they need [7,8]. AI is not only about data analysis and predictive modeling. ...
Article
Full-text available
Artificial Intelligence (AI) is rapidly redefining educational administration by automating routine tasks, enhancing decision-making processes, and enabling personalized learning experiences. This paper examines the current applications of AI in education, including predictive analytics, chatbots, and adaptive systems. It also highlights challenges such as financial constraints, ethical considerations, and the potential for data bias. The study underscores the necessity of integrating ethical frameworks and fostering professional development to ensure equitable and effective AI implementation. Lastly, it examines future trends, emphasizing the transformative potential of AI in resource allocation, stakeholder engagement, and fostering innovation while maintaining the human touch in decision-making. INTRODUCTION Artificial intelligence is a burgeoning area of technology development that has been extensively covered in the news media lately. AI research draws upon a wide variety of disciplines, from electronics and cognitive sciences to philosophy and system theory. Traditionally, AI has been concerned with a deep understanding of human beings and their many abilities and behaviors. It is now a commonly held belief in education that the future landscape of AI has the potential to fundamentally change the space of teaching and learning. The consensus is that the transformative power of AI is significantly underappreciated, but this is beginning to change [1, 2]. AI research on education has been concerned with the application of technologies to automate more mundane tasks such as assessment, scheduling, and personalization. For example, in the 1970s, one could purchase a personal computer whose vendors claimed that it could teach children arithmetic automatically. 
It managed this feat by asking math questions from a large pool, and if the user got the question incorrect, it would keep asking them similar questions until they got it correct-in essence, until it learned that it had taught them effectively. This case represents an early attempt at adaptive assessment or learning analytics, two practice areas popular in contemporary research on AI in education. The concept of learning analytics is connected with web-based educational systems and relies on the collection and analysis of web-based interactions in the name of improving student learning. In recent years, AI in academia has also been afforded the interpretative focus typical of inquiries. These systems support it in their classification of critical processes and inform academic decision-making about what resources it might be best to reassign or create [3, 4]. This essay begins with an exploration of some applications of AI to educational administration (the roles and the work of school leaders). It then offers a summary of possible critiques of this development, the purpose of which is to urge school leaders to address the matter of AI explicitly, to question whether the assets they are charged with administering are as seamlessly reportable as AI requires them to be [5, 6]. Current Applications of AI in Educational Administration There are many practical applications for AI in educational administration. One of the most popular optimizations on college campuses is machine learning to create predictive models and analyze student
... The use of DM techniques, particularly classification algorithms such as Decision Trees, Support Vector Machines, and Neural Networks, has the potential to revolutionize teaching and learning by enabling personalized education. The choice of technique depends on factors such as dataset size and complexity, and classification remains a promising area for improving educational effectiveness [19][20][21]. A reliable and objective admissions system is crucial for selecting students likely to succeed academically, supporting institutions in maintaining high academic standards [22]. ...
Article
Educational Data Mining (EDM) and Exploratory Data Analysis (EDA) collaboratively enhance the quality of learning outcomes. Academic institutions strive to establish admission criteria that enable the selection of high-performing and deserving students. In recent years, EDM has garnered significant attention in research, facilitating the prediction of student performance and the identification of systemic inefficiencies. This study investigates the application of stacked ensemble models for student performance prediction and evaluates their performance against baseline classification models, including Decision Trees, Random Forests, Naïve Bayes, and Support Vector Machines (SVM). The findings demonstrate that the stacked model surpasses the baseline models in predictive accuracy and reliability. However, the study also reveals that the achieved prediction accuracies remain suboptimal. It highlights the need for incorporating additional parameters, alongside prior academic records, to improve predictive performance and establish a more robust merit-based admission criterion. The analysis underscores that the dataset utilized for predicting student academic performance, based on their prior or ongoing academic standing, does not yield promising outcomes. No significant correlation was observed between the variables derived from preliminary academic records (NTS scores, intermediate percentage, and matriculation percentage) and the predicted CGPAs. This indicates that the dataset exhibits stochastic behavior, suggesting that the factors currently employed in admission decision-making lack predictive utility for forecasting student performance. Consequently, the findings highlight the need for policymakers to reevaluate the admission criteria, as the existing parameters are insufficient for reliably predicting academic outcomes using historical data.
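The correlation analysis described above (prior academic records versus predicted CGPAs) can be sketched with a simple Pearson correlation check. The variable names follow the abstract (NTS score, intermediate percentage, matriculation percentage), but the data below is synthetic and deliberately generated as independent of CGPA, to mimic the "stochastic" behaviour the study reports; it is not the study's dataset.

```python
# Hypothetical correlation check: admission variables vs. CGPA.
# Independent random data stands in for the (reportedly uncorrelated) records.
import numpy as np

rng = np.random.default_rng(0)
n = 500
records = {
    "nts_score": rng.uniform(40, 100, n),        # entrance test score
    "intermediate_pct": rng.uniform(50, 95, n),  # intermediate percentage
    "matric_pct": rng.uniform(50, 95, n),        # matriculation percentage
}
cgpa = rng.uniform(1.5, 4.0, n)  # generated independently of the records

for name, values in records.items():
    r = np.corrcoef(values, cgpa)[0, 1]
    # Values near zero indicate weak predictive utility for admissions
    print(f"corr({name}, cgpa) = {r:+.3f}")
```

With independent variables and 500 samples, each coefficient lands near zero, which is the pattern the abstract interprets as a lack of predictive utility in the current admission criteria.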
... Yağcı (2022) [18] compared various machine learning methods (random forests, nearest neighbour, support vector machines, logistic regression, Naïve Bayes, and k-nearest neighbour) to predict final exam scores using a dataset of 1854 students from a Turkish state university. ...
Article
Predicting students’ academic performance is a crucial initiative in the field of education, as it allows educators and administrators to spot students who may require extra assistance, customize educational resources to meet individual requirements, and improve overall educational results. Conventional approaches to forecasting academic achievement, such as statistical analysis and expert evaluation, have certain drawbacks in terms of precision and scalability. The emergence of machine learning (ML) methods provides a possible alternative by utilizing extensive datasets and advanced algorithms to reveal patterns and generate more precise predictions. This investigation’s primary goal is to explore the predictive modelling of student academic performance by improving the accuracy of predictions through machine learning techniques. The study was conducted in the Python programming environment. The prediction of student academic performance was carried out utilizing the Bidirectional Long Short-Term Memory (Bi-LSTM) based Weighted Cost Effective Random Forest algorithm. The study utilized the Deep Encoder CNN-Bi-LSTM for optimal feature extraction to forecast student academic performance. The extracted features were then classified using the Weighted Cost Effective Random Forest (WECRF) classifier, and the classification was evaluated in terms of accuracy, precision, specificity, sensitivity, and recall. The issues addressed include class imbalance, computational complexity, cost, and high dimensionality, among others. The random forest method achieved a precision score of 0.72, recall score of 0.68, F1 score of 0.69, and accuracy of 0.77 in this study. Moreover, the suggested technique facilitates the automated forecasting and enhancement of students' future academic performance. Keywords: Predictive Modelling; Students; Academic Performance; Machine Learning; Bi-LSTM; CNN-Bi-LSTM; Deep Learning.
... We chose to use ML models instead of DL models because DL models typically require a larger number of training samples (Sanusi et al., 2023), and training on a small dataset can easily lead to overfitting, making ML models more suitable for our dataset. Additionally, ML models require fewer computing resources, making them more cost-effective and practical for use in educational settings (Li et al., 2022; Yağcı, 2022). ML models can achieve comparable results to DL models on certain tasks (Jang et al., 2022) while maintaining a higher level of interpretability. ...
Article
Full-text available
The ability of large language models (LLMs) to generate code has raised concerns in computer science education, as students may use tools like ChatGPT for programming assignments. While much research has focused on higher education, especially for languages like Java and Python, little attention has been given to K-12 settings, particularly for pseudocode. This study seeks to bridge this gap by developing explainable machine learning models for detecting pseudocode plagiarism in online programming education. A comprehensive pseudocode dataset was constructed, comprising 7,838 pseudocode submissions from 2,578 high school students enrolled in an online programming foundations course from 2020 to 2023, along with 6,300 pseudocode samples generated by three versions of ChatGPT. An ensemble model (EM) was then proposed to detect AI-generated pseudocode and was compared with six other baseline models. SHapley Additive exPlanations were used to explain how these models differentiate AI-generated pseudocode from student submissions. The results show that students’ submissions have higher similarity with GPT-3 than with the other two GPT models. The proposed model can achieve a high accuracy score of 98.97%. The differences between AI-generated pseudocode and student submissions lie in several aspects: AI-generated pseudocode often begins with more complex verbs and features shorter sentence lengths. It frequently includes clear numerical or word-based indicators of sequence and tends to incorporate more comments throughout the code. This research provides practical insights for online programming and contributes to developing educational technologies and methods that strengthen academic integrity in such courses.
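The cues reported above (sequence indicators, comment density, sentence length) lend themselves to simple surface features. The sketch below is a hypothetical feature extractor in that spirit, not the paper's pipeline; the marker list and feature set are assumptions chosen for illustration, and a classifier would be trained on these features downstream.

```python
# Illustrative surface-feature extraction for pseudocode, inspired by the
# reported differences between AI-generated and student-written submissions.
# The specific markers and features are assumptions, not the study's model.
import re

def pseudocode_features(text: str) -> dict:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    comments = [ln for ln in lines if ln.lstrip().startswith(("#", "//"))]
    # Word-based or numbered sequence indicators ("Step 1", "then", ...)
    seq_markers = re.findall(
        r"\b(step\s*\d+|first|second|then|finally)\b", text.lower()
    )
    return {
        "avg_line_len": sum(len(ln) for ln in lines) / max(len(lines), 1),
        "comment_ratio": len(comments) / max(len(lines), 1),
        "sequence_markers": len(seq_markers),
        "n_words": len(text.split()),
    }

sample = "# Step 1: read input\nread n\n# Step 2: loop\nfor i in 1..n then print i"
print(pseudocode_features(sample))
```

Feeding such feature dictionaries into an ensemble classifier and explaining its outputs with SHAP values would mirror the study's general workflow.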
Article
Full-text available
In the higher education (HE) sector, balancing the number of students who enrol against those who graduate is a big challenge. This issue leads to the loss of potential talent and negatively affects higher education institutes financially and academically. Student dropout is a multi-factor problem that needs a contemporary approach to identify the main factors that predict which students are likely to drop out. Nowadays, applying various Machine Learning (ML) techniques in the education sector has gained much attention from educators and education administrators. The principal research objective of the paper is to develop a machine learning approach to predict the possibility of academic failure of a student on the higher education path. The authors define academic failure as dropping out in the middle of a course and academic success as completing the course within a particular duration. So, from the ML point of view, the study deals with a classification problem, specifically binary classification. Through this study, the authors identify suitable ML models for the issue by comparing the performance of different state-of-the-art algorithms on the Enrolled Students (ES) dataset. The authors use a stacking model with a multilayer perceptron as the meta-classifier and Random Forest and Gradient Boosting as base classifiers, which gave better results than classical algorithms such as Logistic Regression (LR), Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP). The developed stacking model achieved the best accuracy, 87%. This model also obtains good scores for other performance metrics such as sensitivity, specificity, ROC-AUC and Kappa statistics.
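The stacking architecture named above (Random Forest and Gradient Boosting as base classifiers, a multilayer perceptron as meta-classifier) maps directly onto scikit-learn's `StackingClassifier`. The sketch below uses synthetic data in place of the Enrolled Students dataset, so the accuracy it prints is illustrative only.

```python
# Minimal sketch of the stacking setup described above; synthetic data
# stands in for the Enrolled Students (ES) dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("gb", GradientBoostingClassifier(random_state=1)),
    ],
    # The MLP meta-classifier learns to combine the base models' predictions
    final_estimator=MLPClassifier(max_iter=1000, random_state=1),
)
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```

By default, `StackingClassifier` trains the meta-classifier on cross-validated base-model predictions, which guards against the base models leaking training-set fit into the second stage.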
Article
In today’s rapidly evolving job market, university students face unprecedented challenges in navigating their career paths. Traditional career guidance approaches sometimes fail to provide students with the knowledge and skills necessary for successful transitions from academics to the profession because of the changing nature of industries and the growing complexity of job possibilities. The study aims to explore the integration of AI and ML in providing predictive career guidance and entrepreneurial development for university students. This study proposes a novel Wild Horse Optimized Resilient Extreme Gradient Boosting (WHO-RXGBoost) model to generate the personalized recommendations that guide students in their career choices and entrepreneurial endeavors. University records and questionnaires are used to collect demographic data about students as well as information about any prior employment or entrepreneurial experience. The data was pre-processed using data cleaning and normalization with a robust scaler. The PCA feature extraction method is utilized to extract features from the datasets. By using this methodology, students can efficiently navigate a massive amount of employment information through an information recommendation system that is customized to satisfy their requirements. The results indicate that the proposed method outperforms traditional algorithms in providing relevant and timely career insights, with metrics such as F1-score (90%), precision (93%), accuracy (95%), and specificity (91%). User satisfaction results indicate that the technology considerably improves students’ experiences in entrepreneurship and career paths. This research contributes to enhancing career outcomes and encouraging an entrepreneurial spirit among university students by providing a practical and effective response to the job issues experienced by students.
Article
Full-text available
This research employs deep learning to enhance student assessment by analyzing the quality and structure of programming assignments, focusing on C code submissions. Traditional grading methods often fail to capture the intricate details of a student’s coding abilities, focusing primarily on code functionality over quality and comprehension. To address this limitation, the approach extracts in-depth metrics from the code, such as the total lines of code, the use and quality of comments, variable declarations, control structures, and overall readability, and integrates them with traditional academic data like past grades and attendance. Using a deep neural network, the model predicts students' grades and academic performance percentages based on these rich, combined inputs, providing a more comprehensive and nuanced evaluation. This innovative approach empowers educators with insights into each student’s overall performance and areas needing improvement, enabling personalized feedback and fostering a balanced, skill-based assessment framework that goes beyond conventional grading systems.
Article
Full-text available
Education systems produce a large amount of valuable data for all stakeholders. Processing these educational data and studying the future of education based on them reveal highly meaningful results. In this study, insight was developed from educational data collected from ninth-grade students by using data mining methods. The data contain demographic information about students and their families, studying routines, behaviours of attending learning activities, and their epistemological beliefs about science. Thus, this research addressed a classification problem: a two-class outcome (successful or unsuccessful according to the exam result) was estimated from the collected data. In the study, the prediction accuracy of the supervised classification algorithms was compared, and the variables effective in forming the classes were identified. When the prediction accuracy of the machine learning algorithms was compared, the findings indicated that the Neural Network algorithm (98.6%) had the highest score. The information gain coefficient of the variables was examined to determine the factors affecting prediction accuracy. It was revealed that demographic variables of the family, scientific epistemological beliefs of the student, study routines and attitudes towards some courses affected the classification. It can be concluded that there was a relationship between these variables and academic success. Studies on these variables will support students' academic success.
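The two analysis steps above, comparing classifier accuracy and then ranking variables by information gain, can be sketched with scikit-learn; mutual information is a common stand-in for the information gain coefficient. The data here is synthetic, so neither the accuracies nor the rankings correspond to the study's findings.

```python
# Sketch of the study's two steps on synthetic data: (1) cross-validated
# accuracy of supervised classifiers, (2) information gain per variable.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=5,
                           n_informative=3, random_state=2)

for name, clf in [("decision_tree", DecisionTreeClassifier(random_state=2)),
                  ("neural_network", MLPClassifier(max_iter=1000, random_state=2))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Higher mutual information marks variables that matter for the class label
gains = mutual_info_classif(X, y, random_state=2)
print("mutual information per feature:", gains.round(3))
```

In the study's setting, the second step is what surfaces family demographics, epistemological beliefs, and study routines as the influential variables.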
Article
Full-text available
Understanding, modeling, and predicting student performance in higher education poses significant challenges concerning the design of accurate and robust diagnostic models. While numerous studies attempted to develop intelligent classifiers for anticipating student achievement, they overlooked the importance of identifying the key factors that lead to the achieved performance. Such identification is essential to empower program leaders to recognize the strengths and weaknesses of their academic programs, and thereby take the necessary corrective interventions to ameliorate student achievements. To this end, our paper contributes, firstly, a hybrid regression model that optimizes the prediction accuracy of student academic performance, measured as future grades in different courses, and, secondly, an optimized multi-label classifier that predicts the qualitative values for the influence of various factors associated with the obtained student performance. The prediction of student performance is produced by combining three dynamically weighted techniques, namely collaborative filtering, fuzzy set rules, and Lasso linear regression. However, the multi-label prediction of the influential factors is generated using an optimized self-organizing map. We empirically investigate and demonstrate the effectiveness of our entire approach on seven publicly available and varying datasets. The experimental results show considerable improvements compared to single baseline models (e.g. linear regression, matrix factorization), demonstrating the practicality of the proposed approach in pinpointing multiple factors impacting student performance. As future work, this research emphasizes the need to predict the student attainment of learning outcomes.
Article
Full-text available
This article uses an anonymous 2014–15 school year dataset from the Directorate-General for Statistics of Education and Science (DGEEC) of the Portuguese Ministry of Education to carry out a predictive power comparison between the classic multilinear regression model and a chosen set of machine learning algorithms. A multilinear regression model is used in parallel with random forest, support vector machine, artificial neural network and extreme gradient boosting machine stacking ensemble implementations. A hybrid analysis is designed in which classical statistical analysis and artificial intelligence algorithms are blended to augment the ability to retain valuable conclusions and well-supported results. The machine learning algorithms attain a higher level of predictive ability. In addition, the stacking appropriateness increases as the determinant of the base learner output correlation matrix increases, and the random forest feature importance empirical distributions are correlated with the structure of p-values and statistical significance tests of the multiple linear model. An information system that supports the nationwide education system should be designed and further structured to collect meaningful and precise data about the full range of academic achievement antecedents. The article concludes that no evidence is found in favour of smaller classes.
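The core comparison above, a multilinear regression baseline against a stacking ensemble of regressors, can be condensed as follows. Extreme gradient boosting is replaced here by scikit-learn's `GradientBoostingRegressor` to keep the sketch dependency-free, and the synthetic data replaces the DGEEC dataset, so the R² values printed are purely illustrative.

```python
# Condensed version of the hybrid comparison: multilinear regression vs.
# a stacking ensemble of tree- and kernel-based regressors (synthetic data).
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

baseline = LinearRegression().fit(X_tr, y_tr)
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=3)),
                ("svr", SVR()),
                ("gbm", GradientBoostingRegressor(random_state=3))],
    final_estimator=LinearRegression(),  # blends the base predictions
).fit(X_tr, y_tr)

print(f"linear R^2:   {baseline.score(X_te, y_te):.3f}")
print(f"stacking R^2: {stack.score(X_te, y_te):.3f}")
```

On real, non-linear school data the ensemble is what gains the edge; on this linear synthetic data the baseline can match it, which is itself a useful reminder that the comparison's outcome depends on the data-generating process.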
Article
Full-text available
Understanding academic achievement (AA) is one of the most global challenges, as there is evidence that it is deeply intertwined with economic development, employment, and countries' wellbeing. However, research conducted on this topic is grounded in traditional (statistical) methods applied to survey (sample) data. This paper presents a novel approach, using state-of-the-art artificial intelligence (AI) techniques to predict the academic achievement of virtually every public high school student in Portugal, i.e., 110,627 students in the academic year 2014/2015. Different AI and non-AI methods are developed and compared in terms of performance. Moreover, important insights for policymakers are presented.
Article
Full-text available
Predicting and understanding different key outcomes in a student’s academic trajectory such as grade point average, academic retention, and degree completion would allow targeted intervention programs in higher education. Most of the predictive models developed for those key outcomes have been based on traditional methodological approaches. However, these models assume linear relationships between variables and do not always yield accurate predictive classifications. On the other hand, the use of machine-learning approaches such as artificial neural networks has been very effective in the classification of various educational outcomes, overcoming the limitations of traditional methodological approaches. In this study, multilayer perceptron artificial neural network models, with a backpropagation algorithm, were developed to classify levels of grade point average, academic retention, and degree completion outcomes in a sample of 655 students from a private university. Findings showed a high level of accuracy for all the classifications. Among the predictors, learning strategies had the greatest contribution to the prediction of grade point average. Coping strategies were the best predictors for degree completion, and background information had the largest predictive weight for identifying students who will or will not drop out of the university programs.
Article
Full-text available
The family environment and the economic and social conditions of families influence students' academic performance and, therefore, their results on academic tests. Nevertheless, in Colombia, research on these variables and the methods used to study them is limited. This study predicts the academic performance of students who took the 2016 state examination for admission to higher education (Saber 11) from observations and the students' own family characteristics. The data come from the database of the Colombian Institute for the Evaluation of Education (ICFES), and classification trees are built to predict academic results. The results show that the family variables that best predict academic results are, in order: the mother's educational level, the socioeconomic stratum of the household, the number of books, the father's educational level, and having a computer in the home.
Article
Full-text available
The use of machine learning and data mining techniques across many disciplines has exploded in recent years with the field of educational data mining growing significantly in the past 15 years. In this study, random forest and logistic regression models were used to construct early warning models of student success in introductory calculus-based mechanics (Physics 1) and electricity and magnetism (Physics 2) courses at a large eastern land-grant university. By combining in-class variables such as homework grades with institutional variables such as cumulative GPA, we can predict if a student will receive less than a “B” in the course with 73% accuracy in Physics 1 and 81% accuracy in Physics 2 with only data available in the first week of class using logistic regression models. The institutional variables were critical for high accuracy in the first four weeks of the semester. In-class variables became more important only after the first in-semester examination was administered. The student’s cumulative college GPA was consistently the most important institutional variable. Homework grade became the most important in-class variable after the first week and consistently increased in importance as the semester progressed; homework grade became more important than cumulative GPA after the first in-semester examination. Demographic variables including gender, race or ethnicity, and first generation status were not important variables for predicting course grade.
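An early-warning model of the kind described above pairs an institutional variable (cumulative GPA) with an in-class variable (homework grade) in a logistic regression. The sketch below follows that spirit on invented data: the risk rule generating the labels is an assumption made for the example, not the study's fitted model, so the printed coefficients and accuracy are illustrative.

```python
# Illustrative early-warning logistic regression: flag students at risk of a
# grade below "B" from cumulative GPA and first-week homework performance.
# The label-generating rule below is invented for this sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 300
gpa = rng.uniform(2.0, 4.0, n)        # institutional variable: cumulative GPA
homework = rng.uniform(0.0, 1.0, n)   # in-class variable: homework fraction
# Assumed rule: weak GPA plus weak homework (plus noise) -> below a "B"
below_b = ((4.0 - gpa) + (1.0 - homework) + rng.normal(0, 0.3, n)) > 1.6

X = np.column_stack([gpa, homework])
clf = LogisticRegression().fit(X, below_b)
print("coefficients (gpa, homework):", clf.coef_.round(2))
print("training accuracy:", round(clf.score(X, below_b), 3))
```

The study's finding that institutional variables dominate early in the semester, with homework grade overtaking GPA after the first exam, would correspond here to the relative magnitudes of the two coefficients shifting as later-week features are added.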
Article
Prediction models that underlie “early warning systems” need improvement. Some predict outcomes using entrenched, unchangeable characteristics (e.g., socioeconomic status) and others rely on performance on early assignments to predict the final grades to which they contribute. Behavioral predictors of learning outcomes often accrue slowly, to the point that time needed to produce accurate predictions leaves little time for intervention. We aimed to improve on these methods by testing whether we could predict performance in a large lecture course using only students’ digital behaviors in weeks prior to the first exam. Early prediction based only on malleable behaviors provides time and opportunity to advise students on ways to alter study and improve performance. Thereafter, we took the not-yet-common step of applying the model and testing whether providing digital learning support to those predicted to perform poorly can improve their achievement. Using learning management system log data, we tested models composed of theory-aligned behaviors using multiple algorithms and obtained a model that accurately predicted poor grades. Our algorithm correctly identified 75% of students who failed to earn the grade of B or better needed to advance to the next course. We applied this model the next semester to predict achievement levels and provided a digital learning strategy intervention to students predicted to perform poorly. Those who accessed advice outperformed classmates on subsequent exams, and more students who accessed the advice achieved the B needed to move forward in their major than those who did not access advice.
Article
The abundance of accessible educational data, supported by technology-enhanced learning platforms, provides opportunities to mine the learning behavior of students, address their issues, optimize the educational environment, and enable data-driven decision making. Virtual learning environments complement the learning analytics paradigm by effectively providing datasets for analysing and reporting the learning process of students and its reflection in, and contribution to, their respective performances. This study deploys a deep artificial neural network on a set of unique handcrafted features, extracted from virtual learning environment clickstream data, to predict at-risk students and provide measures for early intervention in such cases. The results show that the proposed model achieves a classification accuracy of 84%–93%. We show that a deep artificial neural network outperforms the baseline logistic regression and support vector machine models. While logistic regression achieves an accuracy of 79.82%–85.60%, the support vector machine achieves 79.95%–89.14%. In line with existing studies, our findings demonstrate that the inclusion of legacy data and assessment-related data significantly impacts the model. Students interested in accessing the content of previous lectures are observed to demonstrate better performance. The study intends to assist institutes in formulating the necessary framework for pedagogical support, facilitating the higher education decision-making process towards sustainable education.
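The three-way model comparison above (deep artificial neural network versus logistic regression and support vector machine baselines) can be sketched with scikit-learn. Synthetic features stand in for the handcrafted clickstream features, and the shallow MLP here is only a stand-in for the deep network, so the printed accuracies do not reproduce the study's ranges.

```python
# Sketch of the baseline comparison: ANN vs. logistic regression vs. SVM
# on synthetic stand-ins for clickstream-derived features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=10, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "ann": MLPClassifier(hidden_layer_sizes=(64, 32),
                         max_iter=1000, random_state=5),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(f"{name}: test accuracy {m.score(X_te, y_te):.3f}")
```

Which model wins on real data depends heavily on the feature set; the study attributes much of the neural network's edge to the inclusion of legacy and assessment-related features.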
Article
The main purpose of this paper is to identify the key factors that impact schools' academic performance and to explore their relationships through a two-stage analysis based on a sample of Tunisian secondary schools. In the first stage, we use the Directional Distance Function approach (DDF) to deal with undesirable outputs. The DDF is estimated using Data Envelopment Analysis method (DEA). In the second stage we apply machine-learning approaches (regression trees and random forests) to identify and visualize variables that are associated with a high school performance. The data is extracted from the Program for International Student Assessment (PISA) 2012 survey. The first stage analysis shows that almost 22% of Tunisian schools are efficient and that they could improve their students’ educational performance by 15.6% while using the same level of resources. Regression trees findings indicate that the most important factors associated with higher performance are school size, competition, class size, parental pressure and proportion of girls. Only, school location appears with no impact on school efficiency. Random forests algorithm outcomes display that proportion of girls at school and school size have the most powerful impact on the predictive accuracy of our model and hence could more influence school efficiency. The findings disclose also the high non-linearity of the relationships between these key factors and school performance and reveal the importance of modeling their interactions in influencing efficiency scores.