Running Out of STEM: A Comparative Study across STEM Majors
of College Students At-Risk of Dropping Out Early
Yujing Chen
George Mason University
Fairfax, Virginia
ychen37@gmu.edu
Aditya Johri
George Mason University
Fairfax, Virginia
johri@gmu.edu
Huzefa Rangwala
George Mason University
Fairfax, Virginia
rangwala@cs.gmu.edu
ABSTRACT
Higher education institutions in the United States and across the Western world face a critical problem of attrition of college students, and this problem is particularly acute within the Science, Technology, Engineering, and Mathematics (STEM) fields. Students are especially vulnerable in the initial years of their academic programs; more than 60% of the dropouts occur in the first two years. Therefore, early identification of at-risk students is crucial for a focused intervention if institutions are to support students towards completion. In this paper we developed and evaluated a survival analysis framework for the early identification of students at risk of dropping out. We compared the performance of survival analysis approaches to other machine learning approaches including logistic regression, decision trees, and boosting. The proposed methods show good performance for early prediction of at-risk students and are also able to predict when a student will drop out with high accuracy. We performed a comparative analysis of nine different majors with varying levels of academic rigor, challenge, and student body. This study enables advisors and university administrators to intervene in advance to improve student retention.
CCS CONCEPTS
• Applied computing → Computer-managed instruction; Computer-assisted instruction
KEYWORDS
Student retention, classification, regression, survival analysis
ACM Reference format:
Y. Chen, A. Johri, H. Rangwala. 2018. Running Out of STEM: A Comparative Study across STEM Majors of College Students At-Risk of Dropping Out Early. In Proceedings of the International Conference on Learning Analytics and Knowledge, Sydney, Australia, March 2018 (LAK'18), 10 pages. DOI: 10.1145/3170358.3170410
1 INTRODUCTION
There is a high need for Science, Technology, Engineering, and
Mathematics (STEM) professionals in the information economy
especially given its accelerated growth in the last couple of
decades. According to economic projections, there will be a
workforce deficit in these majors if college graduation rates
remain the same [25]. The rate at which students leave STEM
majors is alarming: according to a National Center for Education
Statistics (NCES) report 48 percent of bachelor’s degree students
and 69 percent of associate’s degree students who entered STEM
fields between 2003 and 2009 had left these fields by Spring 2009.
Roughly one-half of the students switched their major to a non-
STEM field and the rest of them left STEM fields by exiting
college before earning a degree or certificate [37]. Similar
findings have also been reported by other studies, which note that fewer than 40% of students enrolled in STEM actually receive their degree in STEM [26]. The high exit rate, either from a STEM major or from higher education altogether, has led many researchers to examine and analyze the matriculation and retention of STEM majors at colleges across the nation and to take actions to increase the retention rate and thus the production of STEM professionals [27]. These programs often conclude that the most expedient and direct path to producing professionals in STEM fields is to increase the retention rate in STEM majors, but also to look at contextual differences, since retention rates, availability of STEM programs, and student demographics vary across institutions of higher education [25].
Comparing engineering with other majors in terms of persistence, engagement, and migration, Ohland et al. [29] found that engineering has the highest persistence rate compared to other majors but the lowest inward migration rate. In their dataset, students generally stayed within engineering once admitted, and few students switched into engineering from other majors. In terms of engagement, they found that engineering students are as engaged in their majors as their peers are in theirs. The reasons for STEM students' attrition at the undergraduate level vary and include poor quality of teaching, lack of student-faculty interaction and advising, and lack of belonging [28, 32, 30]. There is also variation based on student demographics, with several studies showing that underrepresented groups fare worse than others [31, 30].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
LAK'18, March 2018, Sydney, NSW, Australia
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6400-3/18/03…$15.00
https://doi.org/10.1145/3170358.3170410
Studies show that most major changes and dropouts from STEM majors occur in the first or second year of college [32], and therefore efforts to reduce dropout focus on the first year [33]. Although it has been widely believed that if students are able to make it through the first year their likelihood of persisting improves significantly, recent research has also brought focus to the middle years, especially for engineering students.
According to Barefoot (2004) [34], "Dropout (or stop out) from higher education affects students differently, depending upon their level of maturity, college readiness, or personal feelings of belonging in college. Whereas the decision to leave a college or university may be permanent for some students, especially those who feel marginal in the first place, other students will take time off to clarify academic and career decisions, deal with external circumstances, or simply grow up" (pg. 10). The reasons
for dropping out are complex and interrelated, and although several models have been advanced to explain the phenomenon, especially Tinto's Student Integration model and Bean's Student Attrition model, the evidence for why students drop out is mixed [35].
Furthermore, the analysis of student drop-out and lack of retention has been generic in nature; except for a few studies that compare STEM majors with others (e.g., Thompson and Bolin, 2011 [36]), we have limited understanding of what the retention and drop-out rates are across different STEM majors. A better understanding of this will allow for improved interventions to support student completion. It will also allow contextual interpretation of findings on why students drop out.
Figure 1 shows that the cumulative number of dropouts at George Mason University increases with each semester (two semesters per academic year, Spring and Fall), and the overall dropout rate reaches 23.62% at the end of the 6th semester. This motivates the need for developing predictive models which can identify at-risk students effectively and analyze the varying characteristics associated with student dropout. Early identification of at-risk students is the first step in the process of providing them with help. Usually the at-risk students are notified by email, and then the director of the institution or the instructors can take measures to help them based on their situation. Advisors can help these students with degree planning and major suggestions, and provide feedback on the needed effort and focus on certain topics.
Many studies have focused on the student retention problem over the decades [23]. Recently, survival analysis has been shown to be useful for analyzing the student dropout problem [5]. Student attrition is not a sudden event, and there are indicators/signals that forecast dropout events. In this paper, we provide a thorough comparison between different machine learning models, including survival analysis approaches. We use students' early-stage information at the university, coupled with pre-enrollment information, to predict their future survival status. Our results show that survival analysis methods perform better at early identification of student dropout and can also be used for comparison across different majors.
Figure 1: Cumulative dropouts and percentage of students dropping out at each semester, for the first 6 semesters, at George Mason University from Fall 2009 to Spring 2013 (semester 1: 234, 2.17%; semester 2: 702, 6.53%; semester 3: 1628, 15.15%; semester 4: 1884, 17.53%; semester 5: 2343, 21.80%; semester 6: 2538, 23.62%).
2 RELATED WORK
Researchers have studied the causes of lack of student retention for decades, in different educational environments and settings [14]. As reported by Druzdzel et al. [11], the average national retention rate is about 55%, and in engineering programs the dropout rate is about 50% [20]. For many years, statistical methods have been widely used for predicting student dropouts [16, 19]. Within a regression framework, Golding et al. [1] identified key features associated with dropout, including students' demographic information, enrollment information, and first-year information. Logistic regression is one of the most widely used statistical methods for the student retention problem [21, 24, 13, 22].
Recently, researchers have adopted various data mining approaches to address the student retention problem [17, 18, 8]. Yu et al. [15] used decision trees with various features on data from Arizona State University. Barker et al. [2] predicted student graduation rates with Support Vector Machines and Neural Networks. Atwell et al. [6] used students' demographic and survey data to study the student retention problem. Yehuala [12] predicted the likelihood of success/failure at university with decision tree and Naïve Bayes methods. Pittman [7] studied the student retention problem by comparing various data mining methods (logistic regression, decision trees, Bayesian classifiers, and neural networks) and concluded that logistic regression had the best performance metrics. In [3, 9, 10], it was demonstrated empirically that no classifier does better than all others universally, and that some classifiers perform better with different feature combinations. Alkhasawneh [4] proposed a hybrid model with a neural network for prediction and genetic algorithms for feature engineering in order to identify at-risk students.
Although the methods discussed above have shown promising results, our approach is slightly different: in most cases student dropout is not a sudden event but a long process affected by time. Therefore, it is helpful to define it as a longitudinal problem. Survival analysis is aimed at utilizing longitudinal data to predict future status. Ameri et al. [5] applied the Cox Proportional
Hazards Model (COX) and the time-dependent COX (TD-COX) to predict student dropout. Compared with this prior work, we aim to accurately identify the dropout semester and use far fewer features. In this paper we study the performance of survival analysis approaches, in comparison to several standard machine learning approaches, at identifying students at risk of dropping out.
3 METHODS
The primary objective of this study is to predict which students drop out and when they drop out. Using this information, a decision guidance system can identify at-risk students and intervene to improve their chances of graduation by providing timely feedback. Specifically, we evaluated survival analysis approaches in comparison to standard machine learning algorithms: Logistic Regression, Decision Tree, Random Forest, Naïve Bayes, and AdaBoost. The survival analysis approaches tested were Aalen's Additive model and Cox's Proportional Hazards model. Below we provide key definitions used in this study.
Dropout: A student with a semester-wise GPA (Grade Point Average) of 0.0, or who does not register for two consecutive semesters by a cut-off time point, is defined as a dropout.
Duration: The number of semesters a student is enrolled continuously by a cut-off time point.
Censored: A student who is still enrolled and has not been subject to the dropout event (as defined above) by a cut-off time point is considered censored data in survival analysis.
The traditional definition of student retention counts students who, after completing a semester, return to the university in the following semester. However, this definition leads to classifying students who take a semester break as dropouts, a situation that is common among part-time students. We therefore consider a student who does not show up for two consecutive semesters as a dropout, as sketched below.
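To make these definitions concrete, the following is a minimal sketch (ours, not the authors' released code) of deriving the dropout, duration, and censoring labels from term-level enrollment records with pandas. The column names, the toy data, and the assumption of continuous enrollment up to each student's last observed semester are ours, based on the definitions above.

```python
import pandas as pd

# Hypothetical enrollment records: one row per student per enrolled semester,
# with semesters indexed 1, 2, ... up to the cut-off time point.
records = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 4, 4, 4, 4, 4, 4],
    "semester":   [1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 6],
    "gpa":        [3.1, 2.8, 3.0, 2.0, 0.0, 3.5, 3.2, 3.4, 3.1, 3.3, 3.0],
})
CUTOFF = 6  # cut-off semester for this observation window

def label_student(g: pd.DataFrame) -> pd.Series:
    last = int(g["semester"].max())          # assumes continuous enrollment
    zero_gpa = bool((g["gpa"] == 0.0).any()) # semester GPA of 0.0 => dropout
    # Missing the two semesters after the last enrolled one (still inside the
    # window) counts as not registering for two consecutive semesters.
    gap = last <= CUTOFF - 2
    return pd.Series({"duration": last, "dropout": int(zero_gpa or gap)})

# Students with dropout == 0 are the censored observations.
labels = records.groupby("student_id")[["semester", "gpa"]].apply(label_student)
print(labels)
```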
3.1 Survival Analysis
Survival analysis is a set of statistical methods for longitudinal
data analysis on the occurrence of events. In our study, the event
is student dropout. We list some common notations in Table 1.
Table 1: Notations used in this paper

| Notation | Definition |
| x_i | vector of features of student i |
| b | vector of survival regression coefficients |
| T | a random lifetime taken from the data |
| S(t) | survival function |
| h(t) | hazard function |
| b_0(t) | baseline hazard rate |
Figure 2: An illustration of the student dropout problem. When the cut-off time is the 8th semester, Student2 and Student5 dropped out before the 8th semester. Student4 dropped out after the 8th semester. Student1, Student3, and Student4 are censored data since they do not drop out by the cut-off time.
Survival analysis was originally developed to estimate right-censored data and was used in the clinical sciences (i.e., life/death). In survival applications, data can be left-censored, right-censored, or interval-censored. A left-censored value is one that is only known to be less than some certain value; a right-censored value is one that is only known to be greater than some certain value; an interval-censored value is one that lies within a specific interval. Student data is of the right-censored type, because for a still-enrolled student the observed number of enrolled semesters is only a lower bound on the true duration. For convenience, we use censored instead of right-censored in this paper. Figure 2 illustrates the student dropout problem with survival analysis, in which Student1, Student3, and Student4 are right-censored data because they have still survived at the cut-off time, while Student2 and Student5 dropped out before the 8th-semester cut-off. The survival
function S(t) defines the probability that the dropout event has not occurred yet at time t, and can be defined as:

S(t) = Pr(T ≥ t)    (1)
In contrast to the survival function, the probability of the dropout event occurring at time t is given by the hazard function:

h(t) = lim_{δt→0} Pr(t ≤ T < t + δt | T ≥ t) / δt    (2)
To estimate the survival function, we first use the Kaplan-Meier estimator, which is defined as

S(t) = ∏_{t_i < t} (n_i − d_i) / n_i    (3)

where d_i is the number of dropout students at time t_i and n_i is the number of students at risk of dropout just prior to time t_i.
For the estimation of hazard rates, we use the Nelson-Aalen estimator, which estimates the cumulative hazard rate and can be defined as

H(t) = Σ_{t_i ≤ t} d_i / n_i    (4)
where d_i is the number of dropout students at time t_i and n_i is the number of susceptible students.
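As a concrete illustration, here is a minimal sketch (ours, not the authors' released code) of computing these two estimators with the lifelines Python library; the toy data and the column names `duration` and `dropout` are assumptions following the definitions above.

```python
import pandas as pd
from lifelines import KaplanMeierFitter, NelsonAalenFitter

# Hypothetical per-student data: enrolled duration in semesters and a
# dropout indicator (1 = dropped out, 0 = censored / still enrolled).
df = pd.DataFrame({
    "duration": [2, 6, 3, 8, 5, 8, 4, 8],
    "dropout":  [1, 0, 1, 0, 1, 0, 1, 0],
})

# Kaplan-Meier estimate of the survival function S(t), Eq. (3)
kmf = KaplanMeierFitter()
kmf.fit(durations=df["duration"], event_observed=df["dropout"])
print(kmf.survival_function_)   # S(t) per semester, as in Figure 7

# Nelson-Aalen estimate of the cumulative hazard H(t), Eq. (4)
naf = NelsonAalenFitter()
naf.fit(durations=df["duration"], event_observed=df["dropout"])
print(naf.cumulative_hazard_)   # as in Figure 8
```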
There are two popular competing approaches for survival regression: Cox's model and Aalen's additive model.
3.1.1 Cox's model
Cox's model is a semi-parametric technique which makes no assumptions about the shape of the baseline hazard function. The hazard function of Cox's model is defined by

h(t) = b_0(t) · exp(b_1 x_1 + ⋯ + b_d x_d)    (5)

where b_0(t) is the baseline hazard function at time t and exp(b_1 x_1 + ⋯ + b_d x_d) estimates the risk associated with the covariate values.
3.1.2 Aalens Additive model
Aalens additive model (AAF) assumes the following form
* ) 1 #+) = #K/)0!K= L = #M/)0!N
(6)
where #+) is the same as Coxs model. The difference is,
Aalens additive model typically does not estimate he individual
#") but instead of estimating #"/O0EP
:
+.
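Both regression models can likewise be fit with lifelines. The following is a hedged sketch under our own assumptions (each row holds one student's covariates, duration, and dropout indicator; the feature names and synthetic data are illustrative, mirroring the names that appear later in Table 9).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, AalenAdditiveFitter

# Hypothetical covariate table; "first1" stands for first-semester GPA.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "first1":    rng.uniform(0.0, 4.0, n),
    "ENTRY_AGE": rng.integers(17, 30, n).astype(float),
    "duration":  rng.integers(1, 9, n).astype(float),
    "dropout":   rng.integers(0, 2, n),
})

# Cox's proportional hazards model, Eq. (5)
cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="dropout")
cph.print_summary()   # coef, exp(coef), se, z, p -- the format of Table 9

# Aalen's additive model, Eq. (6); estimates the cumulative coefficients
aaf = AalenAdditiveFitter()
aaf.fit(df, duration_col="duration", event_col="dropout")
print(aaf.cumulative_hazards_.head())
```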
3.2 Comparative approaches
To compare the viability of our approach, we further implemented several other machine learning methods, including Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), and AdaBoost (AB), to predict students who drop out. The common limitation of these approaches is their inability to infer directly when a student will drop out. We used several features associated with students, including their high school information, demographics, admissions variables, and data from their first few semesters; the features we used are listed in Table 2. The predicted results are binary: 1 indicates dropout and 0 indicates no dropout.
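A minimal sketch of how these baselines could be set up with scikit-learn follows (our illustration; the paper does not specify implementation details or hyperparameters). Here X stands in for the Table 2 features and y for the binary dropout label.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

# Stand-in data: X would hold the Table 2 features, y the dropout label.
X, y = make_classification(n_samples=500, n_features=13, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "NB": GaussianNB(),
    "AB": AdaBoostClassifier(),
}
for name, model in models.items():
    model.fit(X, y)                  # train on past cohorts
    preds = model.predict(X)         # 1 = dropout, 0 = retained
    print(name, (preds == y).mean()) # training accuracy, for illustration
```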
Table 2: Features used in the proposed methods

| Group | Features |
| High school information | High school GPA |
| Demographic | Race; Gender; Age |
| College enrollment | SAT (Scholastic Assessment Test) score; SAT math score; SAT verbal score |
| Semester-wise information | Primary major; Semester-wise GPA; Credit hours per semester; Duration; Graduation term; Enrolled year |
4 EXPERIMENTS
In this section, we will present the results of our proposed
framework for identifying at-risk students in a timely manner.
We first present details and relevant statistics about our dataset
followed by a comprehensive study evaluating the performance
of survival analysis methods in comparison to other supervised
learning frameworks. We also analyze the effect of different
features extracted from data in relation to the performance of the
predictive methods.
4.1 Data
We performed experiments on a dataset containing 12,293
students enrolled from Fall 2009 to Spring 2016 at George Mason
University.
Figure 3: Number of dropouts among students enrolled from Fall 2009 to Fall 2013 at each semester, for their first 6 semesters.
We focus on First Time Undergraduates (FTU), since transfer students have varying performance metrics as they relate to graduation rates, time to graduation, and choice of academic majors. Figure 3 shows the number of students who dropped from their field of study in the first six semesters. The dropout rates of the test data (Fall 2013) over the first 6 semesters are 0.13%, 3.28%, 8.37%, 2.64%, 3.99%, and 0%, respectively. We consider students who start between Fall 2009 and Fall 2013. Since most students start in Fall, we do not show data for students starting in Spring semesters. The highest dropout rate occurs in the 3rd semester. We consider a student a "dropout" when the student does not enroll for two consecutive semesters. After the needed data pre-processing, we end up with 13 features, which can be divided into 4 groups: high school information, demographic information, college enrollment, and semester-wise information. The complete list of the selected features is shown in Table 2.
4.2 Experimental Protocol
For a thorough evaluation of the proposed models and
comparative baselines we ran multiple experiments with
different feature combinations and test benchmarks. All our
models were implemented in Python. Due to our dropout definition, our training data begins with the first two semesters' data. We describe three different experimental settings below.
Figure 4: Illustration of the extending window for training and testing data. Sem is short for semester.
Experiment 1: We collect the information of students enrolled from Fall 2009 to Spring 2013, keep track of their status for up to 8 semesters, and use this as training data. Our test data is the students enrolled in Fall 2013. We use a sliding window to show how our model works. As shown in Figure 4, in Step 1 the training features come from the student information of their first year, and the training labels and predicted labels range from semester 2 to semester 6. In each subsequent step we slide the feature window by one semester, and likewise the label window, continuing this process through the following steps (see the sketch after this paragraph). In Step 4, our training features are information collected from the first five semesters, and the training labels and predicted labels range from the 5th semester to the 6th semester.
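One way to realize this extending-window protocol in code is sketched below (our own transcription of the steps described above, not the authors' implementation; the data layout is assumed).

```python
def window_steps(max_sem=6, steps=4):
    """Yield (step, feature_semesters, label_semesters) following Figure 4:
    Step 1 uses first-year features (semesters 1-2) with labels through
    semester 6; each later step extends the feature window by one semester
    and advances the start of the label window accordingly."""
    for step in range(1, steps + 1):
        k = step + 1  # last semester included in the feature window
        yield step, list(range(1, k + 1)), list(range(k, max_sem + 1))

for step, feat_sems, label_sems in window_steps():
    print(f"Step {step}: features = semesters {feat_sems}, "
          f"labels = semesters {label_sems}")
```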
Experiment 2: The settings of this experiment are the same as in Experiment 1, except that we include the duration feature for the different methods.
Experiment 3: In this experiment, we use only the survival analysis framework. The test data is the same as in Experiment 1, but we do not follow the students for 8 semesters. Instead, we cut the observation at different time points: Spring 2014, Fall 2014, Spring 2015, Fall 2015, and Spring 2016. We illustrate this in Figure 5.
Figure 5: An illustration of the implementation of Experiment 3, in which the cut-off times are Spring 2014, Fall 2014, Spring 2015, Fall 2015, and Spring 2016, respectively.
4.3 Evaluation Metrics
In order to assess the performance of the proposed models, we
used the following standard metrics:
4.3.1 F1-score is the harmonic mean of precision and recall. A high F1-score indicates that the precision and recall are both high. The F1-score is given by:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN); TP is true positives, FP is false positives, and FN is false negatives.
4.3.2 AUC is the area under the Receiver Operating
Characteristic (ROC)-curve, which is created by plotting the true
positive rate against the false positive rate under different
thresholds.
4.3.3 PRAUC is the area under the Precision-Recall curve, which shows the tradeoff between precision and recall scores. A high area under the curve indicates both good precision and recall. For an imbalanced dataset where the minority class is more important, compared with
AUC, the PRAUC curve gives more information about the
algorithm performance on the target class (i.e., whether the
student drops out or not).
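All three metrics are available in scikit-learn; the snippet below is a small sketch (ours) of how they could be computed from a model's predicted labels and scores.

```python
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1]               # 1 = dropout
y_pred  = [0, 1, 1, 1, 0, 0]               # hard predictions, for F1
y_score = [0.1, 0.6, 0.8, 0.7, 0.2, 0.4]   # predicted dropout probabilities

print("F1:   ", f1_score(y_true, y_pred))
print("AUC:  ", roc_auc_score(y_true, y_score))
# average_precision_score summarizes the precision-recall curve (PRAUC)
print("PRAUC:", average_precision_score(y_true, y_score))
```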
4.4 Results and Discussion
4.4.1 Analysis of Survival methods
In Figure 6, we show the results of the survival analysis approach on the nine largest majors (by enrollment). The y-axis represents the survival rate (blue lines) and cumulative hazard rate (orange lines) of the students at semester t, where the t-th semester is along the x-axis. Looking at retention rates alone gives little clarity on how these rates might differ across majors, even within STEM majors. There are reasons to believe that learning progresses differently across majors, and given the way courses and curricula are structured, there are bound to be differences in students' progress. One of the issues with analytical approaches is interpreting them within a specific context. The findings from our study show that within one institution, there are significant differences in retention rates across different STEM majors. The retention and graduation rates for the "Information Technology" major are the highest, followed by CS and the other engineering majors. We observe that students within Applied Information Technology (AIT) have a higher survival probability than students within the Physics (PHYS) major. Hence, if a student survives the first few semesters, they are more likely to graduate. On the other hand, Physics as a subject is known to be challenging, and students who do not put in effort tend to give up and drop out. The courses and curricula differ across these majors, as do student characteristics; our analysis shows that for IT the student age is much higher than for other majors, suggesting that these students are more mature.
Figure 7: Survival function for George Mason University Dataset
Figure 7 shows the survival rate (y-axis) of all students at each semester (x-axis), and Figure 8 shows the cumulative hazard rate (y-axis) of the dataset at each semester (x-axis). Specifically, we estimate the survival function using the Kaplan-Meier estimator and the hazard rates using the Nelson-Aalen estimator. The survival rate decreases and the cumulative hazard rate increases as semesters progress.
Figure 6: Survival function (blue lines) and cumulative hazard rate (orange lines) of the nine largest majors.
At the 8th semester, the survival rate is about 74%, which means that about 74% of the students stay enrolled for 8 semesters.
Figure 8: Cumulative hazard rate for George Mason University Dataset.
4.4.2 Comparison between survival analysis methods and non-survival methods
We ran three experiments to compare the performance of the survival analysis framework and the comparative approaches. The results of Experiment 1 and Experiment 3 are shown in Table 3, using the features of the first two semesters without the duration feature. AdaBoost performs best for the next semester's prediction; however, the survival models have the best performance metrics for the 4th, 5th, and 6th semesters. This shows that the survival analysis approaches are better at early identification of at-risk students. In particular, at the 2nd semester the survival analysis is able to identify students who drop out in their 4th, 5th, and 6th semesters. Furthermore, most dropouts happen in the third and fifth semesters. Adding more information about students as they take more classes leads to improved performance of the machine learning methods. Due to space limitations, we do not show the results with more semester-wise features but include them on our supplementary webpage.
Tables 4-7 show the results of the different methods with the duration feature added. We can see that the Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), and AdaBoost (AB) methods achieve better results. Semester-wise GPA and duration are the key features with high predictive power. Since our objective here is early identification of student dropout, we prefer to incorporate as few semester-wise features as possible while still achieving good results. Thus, the survival methods perform better than the non-survival methods when little semester-wise information is available.
Figures 9 and 10 show a comparison among the different methods on the prediction of True Positives (TP) and False Positives (FP). Clearly, we expect the TP to be high and the FP to be low; a high TP and a low FP indicate a strong predictive ability. From the charts we observe that TP increases as more semester-wise features are incorporated, but the growth is small once three semester-wise features have been used. This is explainable, as most dropouts occur in the 3rd semester. Methods like decision tree and Naïve Bayes can achieve very high TP but also lead to high FP. Logistic regression performs relatively better, with high TP and low FP. Survival methods achieve better results when predicting beyond the next semester. Figure 10 shows that the performance of the non-survival approaches improves with the addition of the duration feature.
With more semester-wise features, the predictions are more accurate. Take the prediction of the 6th semester for example: the results of adding information from three semesters are better than just using information from two semesters. Finally, the closer to the cut-off date, the more accurate the prediction is.
From the results we draw the following conclusions:
1. For the prediction of students with fewer than two semesters' features, the survival analysis approach is better than the compared non-survival methods.
2. Standard machine learning approaches like Logistic Regression and AdaBoost are good at utilizing engineered features with high predictive power (GPA and duration) to get more accurate predictions.
3. With more semester-wise features, the predictions are more accurate and reliable.
4.4.3 Predicting dropout students and the dropout semester
Another important objective of this work is to estimate whether and when students will drop out at the beginning of their studies. With the standard supervised learning methods, we can only answer the question of whether the students will drop out
Table 3: Results of Machine Learning methods (ML: LR, DT, RF, NB, AB) and Survival Analysis (SA: AAF, COX) for students beginning in Fall 2013, using features of the first two semesters except the duration feature (Experiment 1 and Experiment 3). "3rd" indicates the third semester, "4th" the fourth, "5th" the fifth, and "6th" the sixth; the same notation applies to Tables 4 to 7.

F-1 score
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 3rd | 0.706 | 0.612 | 0.563 | 0.73 | 0.734 | 0.695 | 0.654 |
| 4th | 0.54 | 0.295 | 0.45 | 0.318 | 0.533 | 0.549 | 0.552 |
| 5th | 0.518 | 0.315 | 0.439 | 0.332 | 0.537 | 0.551 | 0.562 |
| 6th | 0.497 | 0.386 | 0.453 | 0.308 | 0.495 | 0.525 | 0.525 |

AUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 3rd | 0.918 | 0.802 | 0.757 | 0.982 | 0.987 | 0.946 | 0.948 |
| 4th | 0.705 | 0.621 | 0.659 | 0.675 | 0.709 | 0.718 | 0.733 |
| 5th | 0.688 | 0.611 | 0.658 | 0.646 | 0.703 | 0.72 | 0.725 |
| 6th | 0.672 | 0.627 | 0.659 | 0.5 | 0.672 | 0.693 | 0.694 |

PRAUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 3rd | 0.73 | 0.62 | 0.57 | 0.78 | 0.79 | 0.74 | 0.741 |
| 4th | 0.6 | 0.37 | 0.54 | 0.48 | 0.59 | 0.6 | 0.6 |
| 5th | 0.6 | 0.4 | 0.5 | 0.51 | 0.61 | 0.6 | 0.61 |
| 6th | 0.62 | 0.45 | 0.52 | 0.59 | 0.61 | 0.61 | 0.62 |
in a specific semester, but cannot determine when the students will drop out. With survival methods we can predict the survival rate of students in future semesters. When the survival rate of a student drops below a certain value, this indicates the dropout of that student; thus we can predict the actual time when the dropout event will happen. We show these results in Table 8.
Table 8: Performance of Aalen's Additive model (AAF) and Cox on predicting dropouts and the actual dropout time (APDS)

| Method | F-1 score | Precision | Recall | APDS |
| AAF | 0.733 | 0.766 | 0.714 | 0.618 |
| COX | 0.725 | 0.81 | 0.693 | 0.614 |
We use the F1-score, precision, and recall to evaluate the prediction of whether students will drop out. We also report the accuracy of the predicted dropout semester (APDS), which is the accuracy of predicting when the students will drop out. For the features, we only used pre-enrollment information and the GPA of the first semester. We notice that the survival models can utilize few semester-wise features to achieve good performance on predicting students' future dropout status and dropout time.
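A minimal sketch of this thresholding idea using a fitted lifelines Cox model follows; the 0.5 threshold, the synthetic data, and the column names are our assumptions, since the paper does not state the exact values used.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Fit on hypothetical training data (stand-in for past cohorts).
rng = np.random.default_rng(1)
n = 100
train = pd.DataFrame({
    "first1":   rng.uniform(0.0, 4.0, n),   # first-semester GPA
    "duration": rng.integers(1, 9, n).astype(float),
    "dropout":  rng.integers(0, 2, n),
})
cph = CoxPHFitter().fit(train, duration_col="duration", event_col="dropout")

X_test = train[["first1"]].head(5)  # stand-in for new students
THRESHOLD = 0.5                     # assumed cut-off on the survival rate

surv = cph.predict_survival_function(X_test)  # rows: semesters, cols: students
for student in surv.columns:
    below = surv.index[surv[student] < THRESHOLD]
    # Predicted dropout semester = first t where S(t) < threshold;
    # if S(t) never crosses it, predict the student stays enrolled.
    when = float(below[0]) if len(below) else None
    print(student, when)
```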
4.5 Feature discussion
The features used in this study are shown in Table 2. From the above results, we see that duration is a useful feature for improving the predictions (see the results of Experiment 1 versus Experiment 2). The non-survival methods show big improvements for next-term dropout prediction and small improvements for predictions two, three, and four semesters ahead. Duration is a necessary feature in the survival methods, so the results of the survival methods do not change between Experiment 1 and Experiment 2.
With the survival methods, we obtain the summary of coefficients and statistics in Table 9. Column "z" gives the statistical significance and corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). Column "p" gives the corresponding p-value. The last two columns are the confidence intervals of the hazard ratios.
Table 4: Results of Machine Learning methods (ML) and Survival Analysis (SA) for students beginning in Fall 2013, using features of the first two semesters including the duration feature (Experiment 2 and Experiment 3).

F-measure
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 3rd | 0.937 | 0.905 | 0.905 | 0.808 | 0.937 | 0.695 | 0.654 |
| 4th | 0.492 | 0.308 | 0.448 | 0.482 | 0.533 | 0.549 | 0.552 |
| 5th | 0.49 | 0.311 | 0.472 | 0.485 | 0.537 | 0.551 | 0.562 |
| 6th | 0.462 | 0.405 | 0.433 | 0.471 | 0.505 | 0.529 | 0.525 |

AUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 3rd | 0.997 | 0.966 | 0.966 | 0.991 | 0.997 | 0.946 | 0.948 |
| 4th | 0.668 | 0.624 | 0.672 | 0.733 | 0.709 | 0.718 | 0.733 |
| 5th | 0.669 | 0.607 | 0.677 | 0.723 | 0.703 | 0.72 | 0.725 |
| 6th | 0.653 | 0.641 | 0.649 | 0.695 | 0.68 | 0.698 | 0.695 |

PRAUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 3rd | 0.94 | 0.91 | 0.91 | 0.84 | 0.94 | 0.74 | 0.741 |
| 4th | 0.64 | 0.37 | 0.49 | 0.52 | 0.59 | 0.6 | 0.6 |
| 5th | 0.63 | 0.4 | 0.53 | 0.53 | 0.61 | 0.6 | 0.61 |
| 6th | 0.63 | 0.46 | 0.49 | 0.53 | 0.59 | 0.61 | 0.62 |
Table 5: Results of Machine Learning methods (ML) and Survival Analysis (SA) for student data beginning in Fall 2013, using features of the first three semesters including the duration feature (Experiment 2 and Experiment 3).

F-measure
| Semester | LR | RF | NB | DT | AB | AAF | COX |
| 4th | 0.955 | 0.946 | 0.882 | 0.908 | 0.955 | 0.838 | 0.847 |
| 5th | 0.868 | 0.859 | 0.792 | 0.72 | 0.862 | 0.844 | 0.855 |
| 6th | 0.779 | 0.754 | 0.697 | 0.555 | 0.784 | 0.784 | 0.787 |

AUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 4th | 0.993 | 0.942 | 0.983 | 0.982 | 0.993 | 0.966 | 0.976 |
| 5th | 0.908 | 0.86 | 0.911 | 0.93 | 0.927 | 0.924 | 0.932 |
| 6th | 0.833 | 0.765 | 0.823 | 0.847 | 0.845 | 0.86 | 0.853 |

PRAUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 4th | 0.96 | 0.91 | 0.95 | 0.9 | 0.96 | 0.86 | 0.87 |
| 5th | 0.88 | 0.74 | 0.87 | 0.82 | 0.87 | 0.854 | 0.864 |
| 6th | 0.82 | 0.61 | 0.8 | 0.73 | 0.82 | 0.81 | 0.82 |
Table 6: Results of Machine Learning methods (ML) and Survival Analysis (SA) for students beginning in Fall 2013, using features of the first four semesters except the duration feature (Experiment 1 and Experiment 3).

F-measure
| Semester | LR | RF | NB | DT | AB | AAF | COX |
| 5th | 0.963 | 0.955 | 0.91 | 0.941 | 0.967 | 0.909 | 0.909 |
| 6th | 0.85 | 0.835 | 0.734 | 0.706 | 0.841 | 0.829 | 0.835 |

AUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 5th | 0.992 | 0.96 | 0.982 | 0.983 | 0.994 | 0.945 | 0.95 |
| 6th | 0.888 | 0.844 | 0.878 | 0.882 | 0.902 | 0.873 | 0.875 |

PRAUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 5th | 0.96 | 0.95 | 0.96 | 0.92 | 0.97 | 0.916 | 0.915 |
| 6th | 0.87 | 0.73 | 0.86 | 0.77 | 0.86 | 0.856 | 0.862 |
Table 7: Results of Machine Learning methods (ML) and Survival Analysis (SA) for student data beginning in Fall 2013, using features of the first five semesters except the duration feature (Experiment 1 and Experiment 3).

F-measure
| Semester | LR | RF | NB | DT | AB | AAF | COX |
| 6th | 0.981 | 0.996 | 0.926 | 0.944 | 0.876 | 0.879 | 0.885 |

AUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 6th | 0.993 | 0.956 | 0.998 | 0.981 | 0.89 | 0.914 | 0.92 |

PRAUC
| Semester | LR | DT | RF | NB | AB | AAF | COX |
| 6th | 0.98 | 0.95 | 0.99 | 0.93 | 0.91 | 0.895 | 0.899 |
first1 and first2 denote the first- and second-semester GPA features, and cohort represents the enrolled year. The number of stars indicates the importance/significance of each feature.
Table 9: Coefficients and related statistic values

| Feature | coef | exp(coef) | se(coef) | z | p | lower 0.95 | upper 0.95 | |
| cohort | -3.36e-03 | 9.97e-01 | 5.26e-04 | -6.39 | 1.66e-10 | -4.40e-03 | -2.33e-03 | *** |
| id | 1.44e-05 | 1.00 | 1.86e-05 | 0.775 | 0.438 | -2.20e-05 | 5.09e-05 | |
| first1 | -2.37e-01 | 7.89e-01 | 2.61e-02 | -9.07 | 1.22e-19 | -2.88e-01 | -1.86e-01 | *** |
| first2 | -8.43e-01 | 4.30e-01 | 2.12e-02 | -39.8 | 0.00 | -8.85e-01 | -8.02e-01 | *** |
| SAT_Total_1600 | 1.28e-04 | 1.00 | 3.58e-04 | 0.356 | 0.722 | -5.75e-04 | 8.30e-04 | |
| SAT_Verbal | 4.89e-04 | 1.00 | 4.70e-04 | 1.04 | 0.298 | -4.31e-04 | 1.41e-03 | |
| SAT_Math | -9.11e-04 | 9.99e-01 | 4.68e-04 | -1.95 | 5.15e-02 | -1.83e-03 | 6.17e-06 | . |
| ENTRY_AGE | 3.96e-02 | 1.04 | 1.40e-02 | 2.82 | 4.75e-03 | 1.21e-02 | 6.71e-02 | ** |
| SEX | -7.04e-02 | 9.32e-01 | 4.67e-02 | -1.51 | 0.132 | -1.62e-01 | 2.12e-02 | |
| race | -1.35e-02 | 9.87e-01 | 1.22e-02 | -1.11 | 0.269 | -3.74e-02 | 1.04e-02 | |
| HSGPA | -6.09e-03 | 9.94e-01 | 6.83e-03 | -0.891 | 0.373 | -1.95e-02 | 7.31e-03 | |
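To read the table concretely, consider the first2 row (our illustrative interpretation, following the standard reading of a Cox coefficient): exp(coef) = exp(-0.843) ≈ 0.43, so each additional GPA point in the second semester multiplies the dropout hazard by about 0.43, roughly a 57% reduction, holding the other covariates fixed.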
The most significant features are the semester GPAs and the enrolled year. The second most important feature is ENTRY_AGE (age), which indicates that students' age at enrollment can affect their dropout; perhaps students with more life experience have more specific study goals and are thus more likely to graduate.
Figure 9: The predicted True Positive (TP) and False Positive (FP) values of the proposed methods at different semesters (Experiment 1 and Experiment 3).
Figure 10: The predicted True Positive (TP) and False Positive (FP) values of the proposed methods at different semesters (Experiment 2 and Experiment 3).
5 CONCLUSIONS
Student retention is a challenging problem across multiple academic institutions. The problem is more acute for STEM majors. Student dropout is a complicated problem and cannot be explained by one single model. The reasons for student dropout vary based on a host of factors, including the major, the student body, and the institutional environment. In this paper, we propose a
survival analysis based approach to identify students at risk of dropping out. We also performed a comparison of nine different majors (mostly in STEM) and noticed variations in the survival rates using the proposed frameworks. Survival analysis approaches were found to be promising for early identification of student dropout with less semester-wise information. Survival methods can predict not only whether a student will drop out, but also when the student will drop out. Also, features like duration and GPA have strong predictive ability. Equipped with these algorithms, the future vision is to incorporate the proposed methods within degree planning and early warning systems to improve overall student graduation rates as well as time to degree completion.
ACKNOWLEDGEMENTS
This research was funded by NSF IIS grant 1447489.
REFERENCES
[1] Golding, P., and Donaldson, O. Predicting academic performance. In 36th ASEE/IEEE Frontiers in Education Conference, 2006.
[2] Barker, K., Trafalis, T., and Rhoads, T. R. Learning from student data. In Systems and Information Engineering Design Symposium, 79-86, 2004.
[3] Kabakchieva, D. Predicting student performance by using data mining methods for classification. Cybernetics and Information Technologies, 13(1), 61-72, 2013.
[4] Alkhasawneh, R. Developing a hybrid model to predict student first year retention and academic success in STEM disciplines using neural networks. Virginia Commonwealth University, 2011.
[5] Ameri, S., Fard, M. J., Chinnam, R. B., and Reddy, C. K. Survival analysis based framework for early prediction of student dropouts. In CIKM '16, October 24-28, 2016.
[6] Atwell, R. H., Ding, W., Ehasz, M., Johnson, S., and Wang, M. Using data mining techniques to predict student development and retention. In Proceedings of the National Symposium on Student Retention, 2006.
[7] Pittman, K. Comparison of data mining techniques used to predict student retention. Ph.D. thesis, Nova Southeastern University, 2008.
[8] Zhang, Y., and Oussena, S. Use data mining to improve student retention in higher education: a case study. In 12th International Conference on Enterprise Information Systems, Portugal, 2010.
[9] Kabakchieva, D., Stefanova, K., and Kisimov, V. Analyzing university data for determining student profiles and predicting performance. In 4th International Conference on Educational Data Mining, Eindhoven, the Netherlands, 2011.
[10] Oskouei, R. J., and Askari, M. Predicting academic performance with applying data mining techniques (generalizing the results of two different case studies). Computer Engineering and Applications Journal, 79-88, 2014.
[11] Druzdzel, M. J., and Glymour, C. Application of the TETRAD II program to the study of student retention in US colleges. In Working Notes of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, WA, 419-430, 1994.
[12] Yehuala, M. A. Application of data mining techniques for student success and failure prediction (the case of Debre Markos University). International Journal of Scientific & Technology Research, 4(4), 91-94, 2015.
[13] Luna, J. Predicting student retention and academic success at New Mexico Tech. Ph.D. thesis, New Mexico Institute of Mining and Technology, 2000.
[14] Rumberger, R. W., and Lim, S. A. Why students drop out of school: A review of 25 years of research. California Dropout Research Project, Policy Brief 15, 2008.
[15] Yu, C. H., DiGangi, S., Jannasch-Pennell, A., Lo, W., and Kaprolet, C. A data-mining approach to differentiate predictors of retention between online and traditional students. 2007.
[16] Jones-White, D. R., Radcliffe, P. M., Huesman Jr, R. L., and Kellogg, J. P. Redefining student success: Applying different multinomial regression techniques for the study of student graduation across institutions of higher education. Research in Higher Education, 51(2):154-174, 2010.
[17] Delen, D. Predicting student attrition with data mining methods. Journal of College Student Retention: Research, Theory & Practice, 13(1):17-35, 2011.
[18] Yadav, S. K., Bharadwaj, B., and Pal, S. Mining education data to predict student's retention: A comparative study. International Journal of Computer Science and Information Security, 10(2):113, 2012.
[19] Zhang, G., Anderson, T. J., Ohland, M. W., and Thorndyke, B. R. Identifying factors influencing engineering student graduation: A longitudinal and cross-institutional study. Journal of Engineering Education, 93(4):313-320, 2004.
[20] Scalise, A., Besterfield-Sacre, M., Shuman, L., and Wolfe, H. First term probation: Models for identifying high risk students. In 30th Annual Frontiers in Education Conference, Vol. 1, 11-16, Kansas City, MO, USA: Stripes Publishing, 2000.
[21] Lin, J., Imbrie, P., and Reid, K. J. Student retention modelling: An evaluation of different methods and their impact on prediction results. Research in Engineering Education Symposium, 1-6, 2009.
[22] Dey, E. L., and Astin, A. W. Statistical alternatives for studying college student retention: A comparative analysis of logit, probit, and linear regression. Research in Higher Education, 34(5):569-581, 1993.
[23] Tinto, V. Research and practice of student retention: What next? Journal of College Student Retention: Research, Theory & Practice, 8(1):1-19, 2006.
[24] DeBerard, M. S., Spielmans, G., and Julka, D. Predictors of academic achievement and retention among college freshmen: A longitudinal study. College Student Journal, 38(1):66-80, 2004.
[25] Olson, S., and Riordan, D. G. Engage to Excel: Producing One Million Additional College Graduates with Degrees in Science, Technology, Engineering, and Mathematics. Report to the President. Executive Office of the President, 2012.
[26] President's Council of Advisors on Science and Technology (PCAST). 2012.
[27] Hayes, R. Q., Whalen, S. K., and Cannon, B. CSRDE STEM retention report, 2008-2009. Center for Institutional Data Exchange and Analysis, University of Oklahoma, Norman, 2009.
[28] Watkins, J., and Mazur, E. Retaining students in science, technology, engineering, and mathematics (STEM) majors. Journal of College Science Teaching, 42:36-41, 2013.
[29] Ohland, M. W., Sheppard, S. D., Lichtenstein, G., Eris, O., Chachra, D., and Layton, R. A. Persistence, engagement, and migration in engineering. Journal of Engineering Education, 97(3):259-278, 2008.
[30] Marra, R. M., Rodgers, K. A., Shen, D., and Bogue, B. Leaving engineering: A multi-year single institution study. Journal of Engineering Education, 101(1), 6-27, 2012.
[31] Chang, C.-W., and Heo, J. Visiting theories that predict college students' self-disclosure on Facebook. 2014.
[32] Seymour, E., and Hewitt, N. M. Talking About Leaving: Why Undergraduates Leave the Sciences. Boulder: Westview Press, 1997.
[33] King, E. M., and Ozler, B. What's Decentralization Got To Do With Learning? School Autonomy and Student Performance. Discussion Paper No. 054, 2005.
[34] Barefoot, B. O. Higher education's revolving door: Confronting the problem of student dropout in US colleges and universities. Open Learning, 19(1):9-18, February 2004.
[35] Cabrera, A. F., Nora, A., and Castañeda, M. B. College persistence: Structural equations modeling test of an integrated model of student retention. The Journal of Higher Education, 64(2), 123-139, 1993.
[36] Thompson, R., and Bolin, G. Indicators of success in STEM majors: A cohort study. Journal of College Admission, Summer 2011, 18-24, 2011.
[37] Chen, X., and Soldner, M. STEM Attrition: College Students' Paths Into and Out of STEM Fields. Statistical Analysis Report, 2013.
!
... Even some works that explicitly mention temporal aspects in their methods do not segment the dataset in the same temporal way for the training and testing processes. Only a few set of papers Ren et al. [2017], Chen et al. [2018], Krauss et al. [2019], Nguyen and Vo [2019], Borrella et al. [2019] used inherently time-dependent approaches. Based on them, it can be concluded that temporal splitting is useful for creating more efficient predictive models for real-world educational data mining. ...
... Another way would be to group the feature vectors by the number of academic terms already attended, in order to train a ML model for each group, as carried out by Chen, Johri and Rangwala Chen et al. [2018]. ...
Preprint
Full-text available
The prediction of academic dropout, with the aim of preventing it, is one of the current challenges of higher education institutions. Machine learning techniques are a great ally in this task. However, attention is needed in the way that academic data are used by such methods, so that it reflects the reality of the prediction problem under study and allows achieving good results. In this paper, we study strategies for splitting and using academic data in order to create training and testing sets. Through a conceptual analysis and experiments with data from a public higher education institution, we show that a random proportional data splitting, and even a simple temporal splitting are not suitable for dropout prediction. The study indicates that a temporal splitting combined with a time-based selection of the students' incremental academic histories leads to the best strategy for the problem in question.
... Student dropout in higher education (HE) is a prominent topic in many countries, such as Spain [1,2], United States [3], Germany [4,5], as well as Indonesia. Based on data from Pangkalan Data Perguruan Tinggi (PDDIKTI) (Higher Education Database) (2018, 2019), the percentage of students dropping out within the last 2 years was getting higher in Indonesia. ...
... These variables are used to predict dropout students, and the resulting variables significantly affect dropout students. Chen et al. [3] also researched the predictions of dropouts in the United States. In Chen's study, the variables used to predict dropout were high school information, demographics, college enrollment, and information per semester. ...
Article
Full-text available
Dropout students are a severe problem in higher education (HE) in many countries. Student dropout has a tremendous negative impact not only on individuals but also on universities and socioeconomic. Consequently, preventing educational dropouts is a considerable challenge for HE’s institutions. Therefore, knowing the factors influencing student dropout is an essential first step in preventing students from dropping out. This study uses a mix of qualitative and quantitative approaches. To determine what variables affect student dropout, we use a qualitative approach, after which the variables found will be validated by the public and stakeholders using a quantitative approach. Then, the next step is to classify variables using a quantitative approach. This study observes dropout students at private universities in Central Java, Indonesia. The findings reveal that personal economic factors, academic satisfaction, academic performance, and family economics are the most influential. The results of this paper are significant for universities in Indonesia, especially Central Java, to overcome the problem of student dropouts, so that they are more precise in making decisions. In addition, the results of this study are also helpful for further research as a basis for predicting students dropping out of university.
... Such forecast messages are seen as interventions that serve to both inform students and motivate them to improve their academic performance [7,13,16]. The customizability and low implementation costs of AI-based solutions make them a potentially cost-effective, scalable approach for improving academic achievement, particularly in courses during the first two years of college, where STEM curriculum is fairly standardized, and performance is critical to long-term student retention [13,16,17]. ...
Article
Full-text available
We present results from a small-scale randomized controlled trial that evaluates the impact of just-in-time interventions on the academic outcomes of N = 65 undergraduate students in a STEM course. Intervention messaging content was based on machine learning forecasting models of data collected from 537 students in the same course over the preceding 3 years. Trial results show that the intervention produced a statistically significant increase in the proportion of students that achieved a passing grade. The outcomes point to the potential and promise of just-in-time interventions for STEM learning and the need for larger fully-powered randomized controlled trials.
Article
Full-text available
Previous research has demonstrated a link between prior knowledge and student success in engineering courses. However, while course-to-course relations exist, researchers have paid insufficient attention to internal course performance development. This study aims to address this gap—designed to quantify and thus extract meaningful insights—by examining a fundamental engineering course, Statics, from three perspectives: (1) progressive learning reflected in performance retention throughout the course; (2) critical topics and their influence on students’ performance progression; and (3) student active participation as a surrogate measure of progressive learning. By analyzing data collected from 222 students over five semesters, this study draws insights on student in-course progressive learning. The results show that early learning had significant implications in building a foundation in progressive learning throughout the semester. Additionally, insufficient knowledge on certain topics can hinder student learning progression more than others, which eventually leads to course failure. Finally, student participation is a pathway to enhance learning and achieve excellent course performance. The presented analysis approach provides educators with a mechanism for diagnosing and devising strategies to address conceptual lapses for STEM (science, technology, engineering, and mathematics) courses, especially where progressive learning is essential.
Chapter
Educational data mining (EDM) contributes cutting-edge methodologies, strategies, and applications to the advancement of the education system, hence playing a crucial part in its development. Utilising machine learning and data mining approaches to explore and utilise educational data, the current advancement gives essential tools for comprehending the student learning environment. Academic institutions in the twenty-first century operate in a highly competitive and complicated environment. Among the prevalent issues faced by universities are performance analysis, the provision of a high-quality education, systems for evaluating the performance of students, and the planning of future activities. Student intervention programmes must be created in these universities in order to address the academic difficulties encountered by students. From 2009 through 2021, the relevant EDM literature relative to predicting student attrition and students at risk is examined in this review. According to the review’s results, several machine learning (ML) methodologies are used to discover and address the fundamental challenges of forecasting students at risk and student withdrawal rate. Furthermore, the bulk of studies make use of data from student college/university database and online learning portals. It was determined that ML techniques play crucial roles in forecasting students at risk and withdrawal rates, hence boosting student performance.KeywordsMachine learningPredictionStudent performanceDeep learningEducation data mining (EDM)
Chapter
Full-text available
Machine translation (MT) aims to remove linguistic barriers and enables communication by allowing languages to be automatically translated. The availability of a substantial parallel corpus determines the quality of translations produced by corpus-based MT systems. This paper aims to develop a corpus-based bidirectional statistical machine translation (SMT) system for Punjabi-English, Punjabi-Hindi, and Hindi-English language pairs. To create a parallel corpus for English, Hindi, and Punjabi, the IIT Bombay Hindi-English parallel corpus is used. This paper discusses preprocessing steps to create the Hindi, Punjabi, and English corpus. This corpus is used to develop MT models. The accuracy of the MT system is carried out using an automated tool: Bilingual Evaluation Understudy (BLEU). The BLEU score claimed is 17.79 and 19.78 for Punjabi to English bidirectional MT system, 33.86 and 34.46.46 for Punjabi to Hindi bidirectional MT system, 23.68 and 23.78 for Hindi to English bidirectional MT system.KeywordsMachine translationSMTCorpus-basedParallel corpusBLEU
Chapter
This research aims to make a systematic review of the literature with the theme of predictive learning analytics (PLA) for student dropouts using data mining techniques. The method used in this systematic review research is the literature from empirical research regarding the prediction of dropping out of school. In this phase, a review protocol, selection requirements for potential studies, and methods for analyzing the content of the selected studies are provided. The PLA is a statistical analysis of current data and historical data derived from student learning processes to develop predictions for improving the quality of learning by identifying students who are at risk of failing in their studies. PLA in higher education (HE) is essential to improve knowledge. The failure of the HE to identify the potential factors contributing to student failure rate will risk both the HE images and the student’s life. The systematic literature review conducted in this study was taken from selected journals published from 2016 to 2021.KeywordsPLAStudent dropoutData miningSLR
Article
Background Though minoritized undergraduate engineering students earn less than 25% of engineering bachelor's degrees, minority‐serving institutions (MSIs) are leading the way in producing a large percentage of those underrepresented engineering bachelor's degree holders. However, much of the published research about the experiences of underrepresented engineering students occurs within the context of predominantly White institutions. Upon deeper inspection into the apparent success of some MSIs, graduation rates of specific minoritized populations (e.g., Black students) remain critically low. This suggests that there is more to be learned about how to better support Black engineering students' success. Purpose We explored the experiences of Black undergraduate engineering students at a large public doctoral university with very high research activity. Design/Method We used interpretative phenomenological analysis to understand the experiences of eight participants. Findings We inductively developed two themes to describe how Black engineering students experience success at a Hispanic‐serving institution, which include building success networks and implementing rules of engagement. Conclusion Participants enacted their cultural capital to construct their circles of success through the intentional engagement of others, resources, and themselves to realize success. This work sheds light on how Black students describe what it means to be successful in their engineering environment.
Article
Understanding the reasons behind the low enrollment and retention rates of Underrepresented Minority (URM) students (African Americans, Hispanic Americans, and Native Americans) in the disciplines of science, technology, engineering, and mathematics (STEM) has concerned researchers for decades. Statistics show that students of color have higher attrition rates compared with other groups, although this trend has been decreasing over the past twenty years (Besterfield-Sacre, Atman, & Shuman, 1997; Mitchell & Daniel, 2007; Fleming, Ledbetter, Williams, & McCain, 2008). These groups tend to enroll in STEM majors in small numbers and leave in higher numbers (Urban, Reyes, & Anderson-Rowland, 2002; Alkasawneh & Hobson, 2009). Increasing the number of minorities (women and ethnic groups) is a practical way of increasing the workforce pool in STEM fields where white male representation is still dominant. Unfortunately, this solution is difficult for many institutions. Only two out of five African American and/or Hispanic American students remain in their majors and receive bachelor’s degrees in a STEM discipline nationwide (Markley, 2005). In order to impact workforce demographics, the population of students choosing STEM majors must change. The literature reflects a substantial interest in increasing URM student retention in higher education (Sidle & McReynolds, 1999; Nave, Frizell, Obiomon, Cui, & Perkins, 2006; Hargrove & Burge, 2002). Retention is of significant interest because of its positive impact on college reputation and workforce demographics (Williford & Schaller, 2005). Several studies emphasize the importance of identifying college students at higher risk of dropping out at early stages in order to allocate the available resources based upon student needs (Herzog, 2006; Lin, Imbrie, & Reid, 2009). Research by Zhang, Anderson, Ohland, Carter, & Thorndyke (2002) found that identifying factors that affect student retention could play an effective role in the counseling and advising process for engineering students. This equips institutions to utilize their available resources based upon those groups’ needs (Herzog, 2006). Traditional methods of statistical analysis have been used to predict student retention, such as logistic regression (Gaskins, 2009). Recently, research has focused on data mining techniques to study student retention in higher education (Brown, 2007). These techniques are highly accurate, robust to missing data, and do not require a prior hypothesis. Data mining is defined as recognizing patterns in a large set of data and then trying to understand those patterns.
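To make the traditional statistical approach mentioned above concrete, here is a minimal sketch of a logistic-regression retention model; the feature names and synthetic data are hypothetical placeholders, not taken from any study cited in this abstract.

```python
# Minimal logistic-regression retention sketch (scikit-learn).
# Feature names and data are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical predictors: high-school GPA, first-semester GPA, credits attempted.
X = np.column_stack([
    rng.normal(3.0, 0.5, n),   # hs_gpa
    rng.normal(2.8, 0.7, n),   # first_sem_gpa
    rng.integers(9, 18, n),    # credits
])
# Synthetic label: 1 = retained, 0 = dropped out (illustrative rule plus noise).
y = ((X[:, 1] + 0.1 * rng.normal(size=n)) > 2.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
# Per-student dropout-risk probabilities, usable for ranking interventions:
risk = model.predict_proba(X_test)[:, 0]
```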
Conference Paper
Retention of students at colleges and universities has been a concern among educators for many decades. The consequences of student attrition are significant for students, academic staff, and the universities. Thus, increasing student retention is a long-term goal of any academic institution. The most vulnerable students are freshmen, who are at the highest risk of dropping out at the beginning of their study. Therefore, the early identification of "at-risk" students is a crucial task that needs to be effectively addressed. In this paper, we develop a survival analysis framework for early prediction of student dropout using the Cox proportional hazards regression model (Cox). We also applied time-dependent Cox (TD-Cox), which captures time-varying factors and can leverage that information to provide more accurate prediction of student dropout. For this prediction task, our model utilizes different groups of variables such as demographic, family background, financial, high school information, college enrollment and semester-wise credits. The proposed framework has the ability to address the challenge of predicting which students will drop out as well as the semester in which the dropout will occur. This study enables us to perform proactive interventions in a prioritized manner where limited academic resources are available. This is critical in the student retention problem because not only is correctly classifying whether a student is going to drop out important, but when this is going to happen is also crucial for a focused intervention. We evaluate our method on real student data collected at Wayne State University. Results show that the proposed Cox-based framework can predict student dropout and the semester of dropout with high accuracy and precision compared to other state-of-the-art methods.
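For readers unfamiliar with how such a Cox model is fit in practice, below is a minimal sketch using the lifelines library; the column names and synthetic records are assumptions for illustration, not the Wayne State variables used in the paper.

```python
# Minimal Cox proportional-hazards sketch (pip install lifelines).
# Columns are hypothetical stand-ins for the paper's variable groups.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "semesters_enrolled": [2, 6, 8, 3, 8, 5],   # observed duration
    "dropped_out":        [1, 1, 0, 1, 0, 1],   # 1 = dropout observed, 0 = censored
    "hs_gpa":             [2.9, 3.2, 3.8, 2.5, 3.6, 3.0],
    "first_sem_credits":  [12, 15, 16, 9, 15, 12],
    "financial_aid":      [0, 1, 1, 0, 1, 0],
})

# A small ridge penalty keeps the fit stable on this tiny toy sample.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="semesters_enrolled", event_col="dropped_out")
cph.print_summary()  # hazard ratios per covariate

# Predicted survival curves give, per student, the probability of
# remaining enrolled beyond each semester.
surv = cph.predict_survival_function(
    df.drop(columns=["semesters_enrolled", "dropped_out"]))
```

The time-dependent variant the abstract describes (TD-Cox) would additionally split each student's record into per-semester intervals so that covariates such as credits earned can vary over time.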
Article
Cabrera, Castaneda, Nora, and Hengstler [18] found considerable overlap between Tinto's [50, 52] and Bean's [4, 5, 7] models of student attrition. This study integrated the major propositions underlying both theoretical frameworks. Findings supported most of the hypothesized links and uncovered that environmental factors play a far more complex role than the one envisioned by Tinto [52].
Article
It has become universally known that we as a nation have fallen behind other nations in the areas of science, technology, engineering, and mathematics (STEM). According to the National Science and Engineering Indicators, produced by the National Science Foundation in 2006, the United States has one of the lowest STEM to non-STEM degree rates in the world. In 2002, STEM degrees accounted for only 16.8 percent of all first university degrees awarded in the US. The international average was 26.4 percent, Japan leading with 64 percent and Brazil just below the US with the lowest at 15.5 percent (NSF 2006).
Article
Data mining methods are often implemented at advanced universities today for analyzing available data and extracting information and knowledge to support decision-making. This paper presents the initial results from a data mining research project implemented at a Bulgarian university, aimed at revealing the high potential of data mining applications for university management.
Chapter
Producing sufficient numbers of graduates who are prepared for science, technology, engineering, and mathematics (STEM) occupations has become a national priority in the United States. To attain this goal, some policymakers have targeted reducing STEM attrition in college, arguing that retaining more students in STEM fields in college is a low-cost, fast way to produce the STEM professionals that the nation needs (President's Council of Advisors on Science and Technology [PCAST] 2012). Within this context, this Statistical Analysis Report (SAR) presents an examination of students' attrition from STEM fields over the course of 6 years in college using data from the 2004/09 Beginning Postsecondary Students Longitudinal Study (BPS:04/09) and the associated 2009 Postsecondary Education Transcript Study (PETS:09). In this SAR, the term STEM attrition refers to enrollment choices that result in potential STEM graduates (i.e., undergraduates who declare a STEM major) moving away from STEM fields by switching majors to non-STEM fields or leaving postsecondary education before earning a degree or certificate. The purpose of this study is to gain a better understanding of this attrition by:
• determining rates of attrition from STEM and non-STEM fields;
• identifying characteristics of students who leave STEM fields;
• comparing the STEM coursetaking and performance of STEM leavers and persisters; and
• examining the strength of various factors' associations with STEM attrition.
Data from a cohort of students who started their postsecondary education in a bachelor's or associate's degree program in the 2003-04 academic year were used to examine students' movement into and out of STEM fields over the subsequent 6 years through 2009. Analyses were performed separately for beginning bachelor's and associate's degree students. For brevity, these two groups are frequently referred to as bachelor's or associate's degree students in this study. Selected findings from this SAR are described below.
Article
BACKGROUND As estimates continue to indicate a growing demand for engineering professionals, retention in engineering remains an issue. Thus, the engineering education community remains concerned about students who leave engineering and must work to identify the factors that influence those students' decisions. PURPOSE (HYPOTHESIS) Our purpose was to identify a set of factors describing the experiences of students in a college of engineering that are strong influences on decisions to leave, and to study how those factors are related to both predictor variables (e.g., high school preparation) and future behaviors (e.g., new major chosen). DESIGN/METHOD We solicited survey data from students who had recently transferred out of a large engineering college. We conducted exploratory factor analysis to determine the main factors for leaving engineering and then used these factors to answer the research questions. RESULTS Results indicate that both academic factors (e.g., curriculum difficulty, poor teaching and advising) and a non-academic factor (lack of belonging in engineering) contribute to students' decisions to leave engineering. We did find differences for some factors between majority and non-majority students; however, there were no gender differences. CONCLUSIONS Both academic and non-academic factors contribute to students' decisions to leave engineering; however, our sample indicated the non-academic factor may be a stronger influence. Implications for educators focus on addressing both the academic factors and the belonging factor, and include examining pedagogical activities that may be less welcoming to a wide variety of student groups, providing opportunities for meaningful faculty interaction, and other activities designed to support students pursuing engineering degrees.
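As a concrete illustration of the exploratory-factor-analysis step described in that abstract, here is a minimal sketch with scikit-learn; the survey items, synthetic responses, and factor interpretation are hypothetical, not the instrument used in that study.

```python
# Minimal exploratory-factor-analysis sketch (scikit-learn >= 0.24 for rotation).
# Survey items below are hypothetical stand-ins for "reasons for leaving" questions.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 300
# Two latent drivers: academic difficulty and sense of belonging.
academic = rng.normal(size=n)
belonging = rng.normal(size=n)
items = pd.DataFrame({
    "curriculum_too_hard":  academic + 0.3 * rng.normal(size=n),
    "poor_teaching":        academic + 0.4 * rng.normal(size=n),
    "felt_out_of_place":    belonging + 0.3 * rng.normal(size=n),
    "few_peer_connections": belonging + 0.4 * rng.normal(size=n),
})

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(items)
# Rows = survey items, columns = extracted factors; large absolute loadings
# show which items group together (academic vs. belonging).
loadings = pd.DataFrame(fa.components_.T, index=items.columns,
                        columns=["factor_1", "factor_2"])
print(loadings.round(2))
```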
Article
Affecting university rankings, school reputation, and financial well-being, student retention has become one of the most important measures of success for higher education institutions. From the institutional perspective, improving student retention starts with a thorough understanding of the causes behind the attrition. Such an understanding is the basis for accurately predicting at-risk students and appropriately intervening to retain them. In this study, using 8 years of institutional data along with three popular data mining techniques, we developed analytical models to predict freshmen student attrition. Of the three model types (artificial neural networks, decision trees, and logistic regression), artificial neural networks performed the best, with an 81% overall prediction accuracy on the holdout sample. The variable importance analysis of the models revealed that the educational and financial variables are the most important among the predictors used in this study.
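To show what such a three-model comparison on a holdout sample looks like in code, here is a minimal sketch; the synthetic data, feature construction, and model settings are assumptions for illustration, not the institutional dataset or configurations the study analyzed.

```python
# Minimal sketch comparing the study's three model types on a holdout set.
# Data here is synthetic; real work would use institutional records.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000
# Hypothetical educational and financial predictors (standardized scores).
X = rng.normal(size=(n, 5))
# Synthetic retention label driven by two of the predictors, plus noise.
y = (X[:, 0] + 0.5 * X[:, 3] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                                    random_state=42),
}
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: holdout accuracy = {acc:.2f}")
```

Evaluating all candidate models on the same holdout sample, as above, is what allows the kind of head-to-head accuracy comparison (e.g., the 81% figure for neural networks) that the abstract reports.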