International Journal of Emerging Technology and Advanced Engineering
Website: (E-ISSN 2250-2459, Scopus Indexed, ISO 9001:2008 Certified Journal, Volume 11, Issue 11, November 2021)
Manuscript Received: 09 October 2021, Received in Revised form: 06 November 2021, Accepted: 11 November 2021 DOI: 10.46338/ijetae1121_21
Predictive Analysis of the Enrolment of Elementary Schools
Using Regression Algorithms
Elizalde Lopez Piol1, Luisito Lolong Lacatan2, Jaime P. Pulumbarit3
1Department of Education - Philippines (DepEd)
2College of Engineering, Laguna University, Philippines
3Bulacan State University, Malolos Bulacan, Philippines
Abstract: By fitting a linear equation to observed values, linear regression determines the relationship between two variables. The Department of Education in the Philippines, specifically the Schools Division of Batangas, needs enrollment data to produce learning modules. The data was collected from the division office, where data cleaning was applied. Deep Learning, Decision Tree, Random Forest, Gradient Boosted Tree, Support Vector Machine, and Linear Regression were used to perform the prediction, and linear regression performed best with an absolute error of 14.465 and a relative error of 84.81%.
Keywords: Prediction, Information Management, Linear Regression, Cloud Computing, LDM.
The Department of Education (DepEd) has established Learning Delivery Modalities (LDM) for its clients. As part of this, synchronous and asynchronous modes of learning [1] are also presented to parents and students. The LDM implementation has continuously bridged the gap for learners and teachers to acquire knowledge in a modular arrangement. Most of the learning materials are gathered through Cloud Computing (CC), an Information Management model [2]. However, the learning materials are still evaluated to see whether the modules adequately assess students' understanding of a particular area. Moreover, there is a need to assess the learners' success in all grade levels and across domains.
In the presence of COVID-19, the LDM provided a solution to continue the education of the learners. Elementary and high schools implemented the distribution of modules to learners, which have to be accomplished per week. Other institutions also prepared synchronous and asynchronous models of learning, depending on the type of system applied upon enrollment.
Two major difficulties arise from the implementation of the program: some learners have no computer or smartphone, and some have difficulty with internet connectivity. The implementation of learning modules also raises problems: an excessive or insufficient number of printed learning modules, modules damaged due to wear and tear, and others.
A model will be proposed to evaluate the process of allocating learning resources and the resources required by the various schools in DepEd Region 4A [3], which comprises 21 divisions. The study will concentrate on prediction for the province of Batangas in the elementary education department, from Kindergarten to Grade 6 only, covering the academic years 2016-2017 until 2019-2020. A prediction based on linear regression was used to measure each institution's performance and success rate using cloud-based learning resources [4]. Furthermore, the study analyzes the trend of the data collected from various sources and determines the acceptability of cloud-based learning.
Different predictive models have been used in other studies, such as the gradient boosted tree [5], naive Bayes, random forest, and others; however, these models may not be appropriate for predicting the enrollment of primary schools in Region 4A. With the changing trends and the pandemic, the enrollment pattern is harder to predict [6]. The research will be more accurate by determining the right parameters and the best-fit predictive algorithm.
One of the leading predictive algorithms is regression [7]; this algorithm has been used in medical, statistical, and environmental predictions, and even in enrollment analysis [8][9]. It has also been proven that the regression algorithm fits multidimensional datasets [10]. In this case, using this method would allow a broader scope with higher accuracy [11].
Linear regression is fitted to the dataset because the target depends on other variables in line with the time series. It works best for enrollment prediction since each year creates a trend that becomes the basis of a good prediction result. Applying a regression approach in a cloud computing environment [12] would be a more significant challenge, but the possibility of integration would show the study's value.
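The idea of fitting a linear trend to yearly totals can be sketched as follows; the enrollment figures below are hypothetical placeholders, not the study's actual division-level data.

```python
import numpy as np

# Hypothetical yearly enrollment totals for four academic years.
years = np.array([2016, 2017, 2018, 2019])
enrollees = np.array([51200, 52350, 53100, 54080])

# Ordinary least-squares fit of a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(years, enrollees, 1)

# Extrapolate the fitted trend one year forward.
predicted_2020 = slope * 2020 + intercept
print(round(predicted_2020))  # → 55030 for these illustrative numbers
```

With real data, the same one-line fit extends to per-district or per-grade series by fitting one line per group.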
The data was collected from the different district offices of the Cavite, Laguna, Batangas, Rizal, and Quezon (CALABARZON) area, but the study concentrates only on elementary education in the Schools Division of Batangas.
The distribution of data for the province of Batangas for
elementary education is for the three consecutive academic
years. This data will be used to predict the next academic
year by using the trend and linear regression.
In line with the data collection is the preparation for data cleaning. The different data representations collected are merged into a single dataset that represents the total number of students due to receive a learning module. The data must be rebuilt to check whether fields are incomplete or have different datatypes; in this process, the completeness of the data is important for data preparation. The data must be standardized so that it follows the same datatype and input patterns; having the same field size and defined input values is important in preparing the data. The data must be normalized to determine the significance of the data and the relationships among fields. Deduplication is done to remove redundant values in the table, which would otherwise make the prediction process invalid. The legitimacy of each entry is verified to determine whether its values are sound data that support the other data. Lastly, importation is needed to establish the cleaned data, ready for prediction.
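The cleaning steps above (merging, completion checks, deduplication, legitimacy checks) can be sketched with pandas; the column names and records are assumptions for illustration, not the Division Office's actual file layout.

```python
import pandas as pd

# Two hypothetical district extracts with inconsistent datatypes,
# a duplicate row, and one incomplete record.
district_a = pd.DataFrame({
    "school_id": ["S001", "S002", "S002"],
    "grade": ["K", "1", "1"],
    "enrollees": ["120", "98", "98"],   # stored as strings, not integers
})
district_b = pd.DataFrame({
    "school_id": ["S003", "S004"],
    "grade": ["2", "3"],
    "enrollees": [110, None],           # one incomplete record
})

# Merge the extracts into a single dataset.
merged = pd.concat([district_a, district_b], ignore_index=True)

# Standardize the datatype; unparseable entries become NaN.
merged["enrollees"] = pd.to_numeric(merged["enrollees"], errors="coerce")

# Completion check: drop records with missing enrollee counts.
merged = merged.dropna(subset=["enrollees"])

# Deduplication: remove redundant rows that would bias the prediction.
merged = merged.drop_duplicates()

# Legitimacy check: enrollment counts cannot be negative.
merged = merged[merged["enrollees"] >= 0]

print(len(merged))  # 3 clean rows remain
```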
The methods contain different stages, from the preparation of data to model generation for the different predictions. The data will be run through different algorithms; this simultaneous process will yield the best prediction model [13]. This strategy compares the accuracy of each model used.
The research will focus on the different layers of the process. The data collection process involves the Division Office of CALABARZON collecting the number of enrollees per year and per level for the different districts. Data cleaning is the preparation of the data so that each item is unique. A data splitting technique was used to separate the training and testing datasets: the training data is set to seventy percent (70%) and the testing data to thirty percent (30%) [14]. Training is the processing of the cleansed data to create a model. This model will determine the flow of the data and whether it obtains a higher accuracy rate, specified by the relative error and the absolute error. The relative error is a measure of precision that is scaled to the size of the item [15]. The model generation accepts two specific inputs: one is the training dataset, and the other is the testing dataset. It tests the model's absolute error with a leniency of plus or minus a percentage [3]. The analysis phase uses the generated predictive model; it has been noted that different algorithms give different results depending on the dataset used [16]. The predictive model to be used will be chosen based on the accuracy rate of each algorithm as applied to the dataset [17].
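The 70/30 split and the two evaluation metrics described above can be sketched in plain Python; the per-school records and the seed are illustrative assumptions, since the study does not publish its splitting code.

```python
import random

def split_70_30(rows, seed=42):
    """Shuffle the records, then cut 70% for training and 30% for testing."""
    rows = rows[:]                       # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.7)
    return rows[:cut], rows[cut:]

def mean_absolute_error(actual, predicted):
    """Average absolute deviation between actual and predicted counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_relative_error(actual, predicted):
    """Precision scaled to the size of each item, as described above."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical per-school records: (school index, enrollees).
records = [(i, 100 + 3 * i) for i in range(10)]
train, test = split_70_30(records)
print(len(train), len(test))  # 7 3
```

A trained model's predictions on the held-out 30% would then be scored with both metrics, and the algorithm with the lowest error kept.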
Linear regression requires that all values be converted to numerical counterparts before being evaluated to predict the enrollees for the academic year 2020-2021 for DepEd Batangas. The data was split into 70 percent training and 30 percent testing datasets with random sampling.
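The conversion of categorical parameters to numerical counterparts might look like the following; the code mappings are hypothetical, since the study does not publish its encoding scheme.

```python
# Assumed integer codes for the categorical parameters (illustrative only).
gender_codes = {"F": 0, "M": 1}
grade_codes = {"K": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6}

# Two hypothetical enrollment records.
rows = [
    {"gender": "M", "grade": "K", "enrollees": 120},
    {"gender": "F", "grade": "3", "enrollees": 98},
]

# Replace each categorical value with its numeric code so the record
# can be fed to a regression algorithm.
encoded = [(gender_codes[r["gender"]], grade_codes[r["grade"]], r["enrollees"])
           for r in rows]
print(encoded)  # [(1, 0, 120), (0, 3, 98)]
```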
The result of the prediction with the average number of
values is presented in the table below. The average values
represent the closest prediction for the dataset.
The ranges of values for the total and the prediction are closely related, but the minimum and maximum values differ over a significant range. The specific values contain the prediction, which varies across the three consecutive academic years; hence, the prediction follows the pattern assumed for the incoming kindergarten cohort based on the trend.
Based on the findings, the actual data is represented in green in Figure 4, while blue represents the prediction. The prediction overlaps the actual data based on the training data and assumes more students would enroll for the upcoming academic year.
Figure 4. Bell Curve Representation
Figure 5. Distribution of Enrollees
One of the main parameters in the prediction is gender.
The viability of prediction is based on the statistical value
of each parameter; thus, the strength of the dataset will
show the relationship of each predictive value shown in
figure 5.
Table 2. Comparative Results (algorithms compared: Deep Learning, Decision Tree, Random Forest, Gradient Boosted Tree, Support Vector Machine, and Linear Regression)
The dataset was tested on the different algorithms to compare which model is best to apply. Based on the results, linear regression has the lowest absolute error compared with the five other algorithms, obtaining 14.465, with the nearest algorithm, the support vector machine, 7.15 points higher. While the relative error shows that the support vector machine performed better than linear regression, the enrollee counts involved are small, making the difference minimal. The relative error of the linear regression is 84.81%, which fits the dataset.
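The point that small enrollee counts inflate relative error even when the absolute deviation is minimal can be made concrete; the school sizes below are illustrative numbers, not the study's data.

```python
def relative_error(actual, predicted):
    """Deviation scaled by the size of the actual value."""
    return abs(actual - predicted) / actual

# The same absolute error of 2 learners is severe for a tiny school
# but negligible for a large one.
small_school = relative_error(10, 8)       # 20% relative error
large_school = relative_error(1000, 998)   # 0.2% relative error
print(small_school, large_school)
```

This is why a model can win on absolute error while losing on relative error when the dataset mixes very small and very large schools.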
Integrating other neighboring municipalities to predict the enrollment of Region 4A is recommended. This would determine the predicted number of printed modules to be used in each school. Future studies should also include the junior high and senior high schools of the region.
[1] M. M. Shahabadi and M. Uplane, “Synchronous and Asynchronous e-learning Styles and Academic Performance of e-learners,” Procedia - Soc. Behav. Sci., vol. 176, pp. 129–138, 2015, doi:
[2] M. Chamilco, A. Pacheco, C. Peñaranda, E. Felix, and M. Ruiz, “Materials and methods on digital enrollment system for educational institutions,” Mater. Today Proc., no. xxxx, pp. 2–6, 2021, doi:
[3] E. Jimenez and Y. Sawada, “Public for private: The relationship between public and private school enrollment in the Philippines,” Econ. Educ. Rev., vol. 20, no. 4, pp. 389–399, 2001, doi:
[4] P. Singh and Y. P. Huang, “A new hybrid time series forecasting model based on the neutrosophic set and quantum optimization algorithm,” Comput. Ind., vol. 111, pp. 121–139, 2019, doi:
[5] M. D. Hernandez, A. C. Fajardo, and R. P. Medina, “A hybrid convolutional neural network-gradient boosted classifier for vehicle classification,” Int. J. Recent Technol. Eng., vol. 8, no. 2, pp. 213–216, 2019, doi: 10.35940/ijrte.B1016.078219.
[6] R. Bozick, D. M. Anderson, and L. Daugherty, “Patterns and
predictors of postsecondary re-enrollment in the acquisition of
stackable credentials,” Soc. Sci. Res., vol. 98, no. April 2020, p.
102573, 2021, doi: 10.1016/j.ssresearch.2021.102573.
[7] L. L. Lacatan and G. M. Penuliar, “Competency-Based Mapping
Tool in Personnel Management System using Analytical Hierarchy
Process,” 4th Int. Conf. Mach. Learn. Mach. Intell., 2021, doi:
[8] V. Vamitha, “A different approach on fuzzy time series forecasting model,” Mater. Today Proc., vol. 37, no. Part 2, pp. 125–128, 2020, doi: 10.1016/j.matpr.2020.04.579.
[9] M. A. Dela Cruz, “of State Universities and Colleges in Central
Luzon Philippines :,” 2019.
[10] A. Bender et al., “Dataset for multidimensional assessment to
incentivise decentralised energy investments in Sub-Saharan
Africa,” Data Br., vol. 37, p. 107265, 2021, doi:
[11] M. D. Hernandez, A. C. Fajardo, R. P. Medina, J. T. Hernandez, and
R. M. Dellosa, “Implementation of data augmentation in
convolutional neural network and gradient boosted classifier for
vehicle classification,” Int. J. Sci. Technol. Res., vol. 8, no. 12, pp.
185–189, 2019.
[12] N. K. Biswas, S. Banerjee, U. Biswas, and U. Ghosh, “An approach
towards development of new linear regression prediction model for
reduced energy consumption and SLA violation in the domain of
green cloud computing,” Sustain. Energy Technol. Assessments, vol.
45, no. February, p. 101087, 2021, doi: 10.1016/j.seta.2021.101087.
[13] A. A. Elacio, L. L. Lacatan, A. A. Vinluan, and F. G. Balazon, “Machine Learning Integration of Herzberg’s Theory using C4.5 Algorithm,” Int. J. Adv. Trends Comput. Sci. Eng., vol. 9, no. 1.1, pp. 57–63, 2020, doi: 10.30534/ijatcse/2020/1191.12020.
[14] A. S. Alon, M. C. A. Venal, S. V. Militante, M. D. Hernandez, and
H. B. Acla, “Lyco-frequency: A development of lycopersicon
esculentum fruit classification for tomato catsup production using
frequency sensing effect,” Int. J. Adv. Trends Comput. Sci. Eng.,
vol. 9, no. 4, pp. 4690–4695, 2020, doi:
[15] A. H. Ansari, “Collaboration or competition? Evaluating the impact
of Public Private Partnerships (PPPs) on public school enrolment,”
Int. J. Educ. Res., vol. 107, no. February, p. 101745, 2021, doi:
[16] J. Z. Bantog, L. L. Lacatan, and M. A. F. Quioc, “Cross-Platform
Relational Data Extraction Utilizing SQL Server (X-PRESS),” Int. J.
Comput. Appl., vol. 183, no. 31, pp. 34–41, 2021, doi:
[17] S. J. R. Manglapuz and L. L. Lacatan, “Academic management
android application for student performance analytics: A
comprehensive evaluation using ISO 25010:2011,” Int. J. Innov.
Technol. Explor. Eng., vol. 8, no. 12, pp. 5085–5089, 2019, doi:
For the last decade, many researchers designed different methods for forecasting enrollments, temperature prediction, stock price etc., in time variant and time -invariant first order, higher order, two factor dual variables. In this paper, the author developed and improved fuzzy time series forecasting model to predict the temperature using midpoints of the interval and membership values. In most of the realistic situation, the duplicates of data are significant. The proposed method uses the multiset concept to partition the universe of discourse and gives the comparison with other models.