JIMS 8i- International Journal of Information, Communication and Computing Technology (IJICCT)
_________________________________________________________________________________________________________
1 Assistant Professor, Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad
Email: 1 chitra19878@gmail.com
2 Professor, Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad
Email: 1 drrashmiagrawal78@gmail.com
Copyright ©IJICCT, Vol VIII, Issue II(Jul-Dec 2020):ISSN 2347-7202
This journal is cited as : JIMS 8i-Int’l J. of Inf. Comm. & Computing Technology(IJICCT)
Educational Data Mining Approaches, Challenges and Goals: A Review
Chitra Mehra1, Dr. Rashmi Agrawal2
DOI: 10.5958/2347-7202.2020.00008.0
ABSTRACT
Educational data mining (EDM) is an emerging research area
that offers various methods for exploring data in an
educational context. It applies computational approaches to
the analysis of educational data in order to study
educational questions. Over the last few years, the adoption
of learning management systems in education has grown. Data
mining techniques such as clustering, prediction and
relationship mining can be applied to educational data to
study the behavior and performance of students. The aim of
this study is to review the different techniques and
approaches of data mining in an educational context, as a
basis for structuring a new environment in which academic
predictions can be made.
KEYWORDS
EDM knowledge discovery, Educational data mining (EDM),
Data mining (DM), Learning management system, Educational
systems, Clustering, Relationship mining, Prediction
1. INTRODUCTION
Educational data mining currently offers a wide scope for
research. It aims to improve educational results by applying
data mining algorithms to educational data, and to explain
educational strategies in support of further decision making.
Educational data mining (EDM) is a field in which machine
learning, data mining (DM) algorithms and various statistical
approaches are applied to different kinds of educational
data. Analyzing such data with EDM helps to resolve
educational research issues. Many methods are being developed
within EDM to explore the distinctive types of data found in
the educational domain. These methods lead to a better
understanding of students and their learning environment.
Instrumented educational software and state databases of
student information have created large repositories of data
about the learning behavior of students. EDM uses these data
sources to better understand learners and learning.
Computational approaches allow data and theory to be combined
and put into practice for the benefit of learners.
The following are some common application areas of
educational data mining (EDM) for researchers all over the
world:
1) Offline education: It aims to convey skills and
knowledge through face-to-face contact, and also studies
psychologically how humans learn. Psychometric and
statistical techniques can be applied to data such as the
curriculum and student behavior/performance.
2) Learning management system (LMS) and e-learning:
E-learning provides online instruction, whereas a learning
management system (LMS) provides administration,
collaboration, communication and reporting tools. Student
data is stored in various log files and databases, and can
be analyzed using web mining (WM) techniques.
The task of educational data mining is to convert raw data
from various educational sources into valuable information
that can have an impact on educational research and practice.
The process of educational data mining (EDM) is the same as
the general data mining process and follows the same steps:
i) Preprocessing
ii) Data Mining
iii) Post processing
Educational data mining uses not only traditional data mining
techniques such as text mining, pattern mining and
classification, but also regression, visualization and
correlation. The following important issues distinguish the
application of data mining to education from its other
application areas:
1) Objective: Each application area of data mining has a
different objective. Educational data mining has two
objectives:
i) Applied research objectives: to improve the learning
process and to better guide students' learning.
ii) Pure research objectives: to attain a deeper
understanding of educational phenomena.
2) Data: A variety of data is available for mining in an
educational environment. This data is specific to the
educational domain, and therefore has intrinsic semantic
information and multiple levels of meaningful hierarchy.
3) Techniques: Educational data and problems have some
special characteristics that call for mining to be treated
in a different way. Nevertheless, most conventional data
mining techniques can be applied directly in the educational
domain.
2. LITERATURE REVIEW
In [1], the problem of imbalanced classification on the Pima
Indian Diabetes dataset was evaluated using Weka. In the
preprocessing stage, the class distribution was balanced with
two Weka filters: the SpreadSubsample filter for
under-sampling the majority class and the Resample filter for
random over-sampling of the minority class. The results
showed that sampling-based techniques improved all accuracy
measures of the minority class, namely recall, F1-score,
precision and ROC area. In [2], the importance of feature
selection for a classification problem was discussed. Two
feature selection approaches were used: correlation feature
selection (CFS) and wrapper-based feature selection. For
predicting low, medium and high student grades, the highest
accuracy was achieved by SMO and J48 with the correlation
feature selection algorithm, and by Naïve Bayes with the
wrapper subset feature selection algorithm. In [3], the
authors applied five machine learning algorithms (Support
Vector Machine, J48, Random Forest, Multilayer Perceptron and
Naïve Bayes) together with statistical techniques to the
academic performance and learning habits of students in order
to predict their performance level. The results showed that
the multilayer perceptron performed best among the
classifiers. In [4,5], various studies demonstrated the
significance of data mining techniques in education and
showed that they are a new means of extracting valid and
precise information about the performance and usefulness of
the learning process. In [6], data mining was also found
helpful in curriculum analysis, student performance analysis
and the subject matter of current research topics. In [7],
several investigations related to the present study were
made. For example, the naïve Bayes algorithm was used to
predict student performance on the basis of a set of
variables, and a model was built to identify in advance the
students at risk of failure. In [8], the k-means algorithm
was used to cluster 8000 students on the basis of five
variables. The results showed a strong association between
student performance and attendance. In [9], it was stated
that data mining can be used to improve an education system
with respect to orientation, student performance and
organizational management. In [10], a clustering algorithm
was used to present a student learning model. In [11], the
authors developed a system that recognizes facial expressions
associated with the frustration or understanding of students
in the classroom. They also used algorithms to discover
silent behaviors of the students and correlate them with the
gain of knowledge. In [12], the authors discussed the EDM
environment and its components, including the tools and
techniques used in the field of education. The study also set
out the challenges involved in educational data mining, which
represent opportunities for future research work.
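The class-balancing step reported in [1] can be illustrated with a minimal sketch. It is written in plain Python and does not reproduce Weka's SpreadSubsample or Resample filters; the function names, the toy records and the seed below are assumptions introduced only for illustration.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Collect the rows that belong to each class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    balanced = []
    for label, members in groups.items():
        # Sampling without replacement drops surplus majority rows.
        balanced += [(row, label) for row in rng.sample(members, target)]
    return balanced


def random_oversample(rows, labels, seed=0):
    """Grow every class to the size of the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    balanced = []
    for label, members in groups.items():
        # Duplicating randomly chosen rows pads out the minority class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced += [(row, label) for row in members + extra]
    return balanced


# Hypothetical imbalanced data: 8 negative records and 2 positive records.
rows = [[i] for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

under = random_undersample(rows, labels)
over = random_oversample(rows, labels)
print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
```

Under-sampling discards information from the majority class, while over-sampling repeats minority records, so either choice should be made before the data is split for evaluation only with care.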
3. DATA MINING AND ITS TECHNIQUES
Data mining is a versatile area that draws on a diversity of
techniques and approaches from various domains, such as
visualization, fuzzy sets, database systems, machine
learning, statistics, genetic algorithms and neural networks.
Some of these techniques are described below:
3.1 STATISTICAL APPROACHES
The terms 'statistical techniques' and 'statistics' are
occasionally used as synonyms for data mining. Statistics is
a data-driven approach used for pattern recognition and
predictive model generation, and because of this data-driven
nature it is one of the major techniques of data mining. In
other words, there is an intrinsic connection between
statistics and data mining [6]. Numerous tools of statistical
analysis, including Bayesian networks, factor analysis,
correlation analysis, discriminant analysis, regression
analysis and cluster analysis, are used extensively in data
mining. Most statistical models are built from a training
dataset, and a combination of patterns and rules is drawn
from these models. Most data mining tasks can be carried out
with the help of one or more statistical approaches. Commonly
used statistical methods in data mining are given below:
3.1.1 Correlation: It is used to find or check the
relationship between two or more variables.
3.1.2 Regression: Regression analysis is used to estimate the
relationship between an output (dependent variable) and one
or more predictors or inputs (independent variables).
3.1.3 Bayesian network: Also known as a Bayes network, belief
network or decision network, it represents a set of variables
and their conditional dependencies by means of a directed
acyclic graph.
3.1.4 Factor analysis: Factor analysis reduces a large number
of variables to a smaller number of factors.
3.1.5 Cluster analysis: Cluster analysis groups objects on
the basis of similarity measures, so that similar objects are
placed in the same cluster.
3.1.6 Discriminant analysis: Data objects can be assigned to
one of two or more groups on the basis of a discriminant
function.
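Two of the statistical methods listed above, correlation and regression, can be sketched in a few lines. The example computes the Pearson correlation coefficient and a least-squares straight line; the attendance and marks values are hypothetical and the function names are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: attendance percentage and final marks of five students.
attendance = [60, 70, 80, 90, 100]
marks = [52, 58, 67, 71, 80]

r = pearson(attendance, marks)
slope, intercept = fit_line(attendance, marks)
print(f"correlation = {r:.3f}")
print(f"marks = {slope:.2f} * attendance + {intercept:.2f}")
```

Here attendance plays the role of the independent variable and marks the dependent variable; a coefficient close to 1 indicates a strong positive linear relationship but does not by itself establish causation.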
3.2 MACHINE LEARNING
Machine learning is an application of artificial intelligence
(AI) that gives a system the ability to learn automatically
without being explicitly programmed. Its main focus is the
development of computer programs that can access data and use
it to learn for themselves. Machine learning is of great
importance in data mining, and the algorithms of data mining
are also applicable in machine learning. Machine learning has
been used to raise the level of automation in the process of
knowledge discovery in databases, improving its accuracy and
efficiency. Systems built with machine learning are widely
used in education, industry, healthcare and many other
sectors. In some application areas of data mining, machine
learning methods have been observed to give more accurate
results. Machine learning can be broadly classified into two
types: inductive learning and deductive learning.
a) Inductive learning: Inductive learning is also known as
discovery learning. In this type of learning, rules are
discovered by observing examples rather than from existing
knowledge.
b) Deductive learning: This type of learning starts from
concepts and knowledge that have existed over time and
applies them to specific examples in order to derive new
knowledge from long-standing data.
Intrusion detection systems (IDS) are one of the most
important applications of machine learning. An intrusion
detection system monitors computers and, in case of any
breach of security, generates alerts to report the violation.
An analyst can then evaluate these alerts and initiate
appropriate action.
3.3 NEURAL NETWORK
A neural network is used to recognize patterns and to
interpret data by clustering or labeling. It is a set of
algorithms modeled on a network or circuit of biological
neurons. Its ability to learn from examples makes it flexible
and powerful. Like a biological neural network with its
electrical signaling, an artificial neural network (ANN)
consists of artificial neurons (nodes) that pass signals to
one another. Knowledge in an ANN is represented by a set of
layered, interconnected processors or neurons. Various neural
network models help to solve business problems, and they also
play an important role as a modern operations research tool.
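The idea of learning from examples can be illustrated with a single artificial neuron trained by the perceptron learning rule. This is a minimal sketch rather than a full layered ANN; the training examples (the logical AND of two binary inputs), the learning rate and the number of epochs are assumptions chosen for illustration.

```python
def step(value):
    """Threshold activation: the neuron fires only for positive input."""
    return 1 if value > 0 else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs followed by the step activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(examples, epochs=20, rate=1):
    """Adjust weights and bias whenever a prediction is wrong."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training examples: output is 1 only when both inputs are 1.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in examples]
print(weights, bias, predictions)
```

A single neuron of this kind can only separate classes with a straight line; layered networks such as the multilayer perceptron mentioned in the literature review combine many neurons to model more complex patterns.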
4. EDM AND ITS METHODS
Educational data mining is used to explore and develop
methods for the distinctive types of data that come from
educational settings. These methods lead to a better
understanding of students and their learning environment.
EDM methods differ from general data mining methods in that
they explicitly exploit the multilevel hierarchy of
educational data and its lack of independence.
Educational data mining and its methods draw on various
bodies of literature, including psychometrics, machine
learning, data mining, statistics, information visualization
and computational modeling. Educational data mining methods
fall into two categories:
1) Statistics and Visualization
2) Web Mining
The following methods of educational data mining (EDM) were
proposed by Baker:
1) Clustering
2) Prediction
   a) Regression
   b) Classification
   c) Density estimation
3) Discovery with models
4) Relationship mining
   a) Correlation mining
   b) Association rule mining
   c) Sequential pattern mining
Most of the above items are general data mining categories.
Relationship mining is the most widely used approach in
educational data mining (EDM) research.
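Association rule mining, one of the relationship mining methods listed above, rests on two measures: the support of an itemset and the confidence of a rule. The sketch below computes both for a toy set of student activity records; the records, item names and function names are hypothetical and serve only as an illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    matches = sum(1 for items in transactions if itemset <= set(items))
    return matches / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical records: the activities logged for each of four students.
transactions = [
    {"watched_video", "did_quiz", "passed"},
    {"did_quiz", "passed"},
    {"watched_video"},
    {"watched_video", "did_quiz"},
]

rule_support = support(transactions, {"did_quiz", "passed"})
rule_confidence = confidence(transactions, {"did_quiz"}, {"passed"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

A rule such as "did_quiz implies passed" is usually reported only when both measures exceed thresholds chosen by the analyst.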
Apart from the above-mentioned educational data mining
methods, the following are some other methods that have not
been used as widely:
i) Text mining: Text mining is used to mine unstructured or
semi-structured datasets such as HTML files, emails and text
documents, and to analyze and evaluate their content. It is
particularly helpful for the automatic construction of
textbooks through web content mining. It can also cluster
documents on the basis of topic similarity.
ii) Social network analysis: It seeks to recognize and
measure relationships among entities within a network. Data
mining approaches can be used to study online interactions
in a network, and similar approaches can be used to mine
group activities in educational data mining.
iii) Outlier detection: Outlier detection finds data points
that are considerably different from the remaining data. In
educational data mining, it can detect students with
learning problems. In e-learning, outlier detection is
applied to learners' reaction times to discover irregular
learning processes. In addition, analytical behavior can be
identified through clusters of students. It is also used to
discover irregularities and deviations in the actions of
learners as well as educators.
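Outlier detection on reaction times can be sketched with the modified z-score, which is based on the median absolute deviation (MAD). This is one common technique among several and is not necessarily the one used in the studies cited here; the reaction times, the function name and the 3.5 cut-off are assumptions made for illustration.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        return []  # No spread around the median, so nothing can be scored.
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v, d in zip(values, deviations)
            if 0.6745 * d / mad > threshold]


# Hypothetical reaction times in seconds for one learner on six questions.
reaction_times = [2.1, 2.4, 2.2, 2.5, 2.3, 9.8]
print(mad_outliers(reaction_times))
```

The median is used instead of the mean because a single extreme value shifts the mean and standard deviation enough to hide itself in small samples.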
5. APPROACHES OF EDM
Data mining can accomplish a variety of tasks or applications
in educational environments. It is a field of computer
science that seeks to discover unusual and potentially useful
factors and patterns for better decision making. Baker has
recommended the following four key areas of application and
five approaches/methods of educational data mining:
Key Areas:
i) Improvement of student models
ii) Improvement in domain models
iii) Study of educational support provided by learning
software
iv) Scientific research interested in learning and
learners
Approaches/methods:
i) Distillation of data for human judgment
ii) Clustering
iii) Relationship mining
iv) Discovery with models
v) Prediction
Castro et al. [13] recommended the following EDM
subjects/tasks:
i) Evaluation of students' learning performance
ii) Learning suggestions and course adaptation on the
basis of students' learning behavior
iii) Assessment of web-based educational courses and
their learning material
iv) Feedback from teachers and students in e-learning
courses
v) Revealing the learning behavior of atypical students
Figure 2 shows how data mining is applied in an educational
system. Data mining is generally identified with knowledge
discovery in databases, which extracts or mines knowledge
from massive amounts of data. An educational system typically
provides a large amount of educational data, such as teacher
data, resource data, student data and alumni data.
Educational data mining (EDM) develops methods to investigate
the unique types of data that come from a number of sources,
both traditional (the face-to-face classroom environment) and
modern (educational software and online courseware).
Data mining techniques are generally used to uncover unknown
patterns and relationships in large volumes of data that will
be helpful for decision making. Numerous data mining
techniques and algorithms are used to discover knowledge from
databases, including regression, clustering, classification,
neural networks, artificial intelligence, decision trees,
genetic algorithms, the nearest neighbor method and
association rules.
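As an illustration of one of the techniques named above, the following is a minimal sketch of k-means clustering on one-dimensional data. The exam scores, the two starting centroids and the function names are assumptions; practical work would normally rely on an established library implementation and on several variables per student.

```python
def assign(values, centroids):
    """Place each value in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters


def kmeans_1d(values, centroids, max_iterations=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Hypothetical exam scores and two assumed starting centroids.
scores = [35, 40, 42, 78, 82, 85]
centroids, clusters = kmeans_1d(scores, centroids=[35, 85])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why the choice of initial centroids is itself a research topic, as in [8].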
[Figure 2: Educational system and data mining cycle.
Recoverable labels: Academicians and educators (designing,
planning, building and maintenance); Students (using,
interacting, participating and communicating); Data of
students (usage, interaction, course information and academic
data); Discovered knowledge; Recommendations.]
6. EDM’S APPLICATION PROCESS
The application of data mining in the field of education is
considered an iterative cycle of hypothesis formulation,
testing and refinement. The aim of this process is not merely
to convert raw data into significant outcomes; the outcomes
should be used to learn how to modify the educational
environment so as to improve learners' learning. Applying
educational data mining (EDM) resembles the general process
of extracting knowledge from data. The first step is the
collection of data from an educational environment. The data
collected in this step is raw and therefore requires cleaning
and preprocessing, for which some data mining techniques may
themselves be required. After data mining techniques are
applied, the result is a model that can restructure the
stored data. The last step is the interpretation and
evaluation of the results. This process is illustrated in
Figure 3.
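The steps described above (collection of raw data, cleaning and preprocessing, application of a mining method to obtain a model, and evaluation) can be sketched as a small pipeline. Everything in the sketch is hypothetical: the records, the field names and the very simple model, which is a single attendance threshold chosen on training data.

```python
def preprocess(raw_records):
    """Drop incomplete records and convert text fields to typed values."""
    clean = []
    for record in raw_records:
        attendance = record.get("attendance", "").strip()
        result = record.get("result", "").strip().lower()
        if attendance and result in ("pass", "fail"):
            clean.append((float(attendance), result == "pass"))
    return clean


def accuracy(data, threshold):
    """Share of students for whom 'attendance >= threshold' matches the result."""
    hits = sum((attendance >= threshold) == passed
               for attendance, passed in data)
    return hits / len(data)


def fit_threshold(data):
    """Model building: pick the attendance cut-off with the best accuracy."""
    candidates = sorted({attendance for attendance, _ in data})
    return max(candidates, key=lambda t: accuracy(data, t))


# Step 1: raw data collected from a hypothetical educational environment.
raw = [
    {"attendance": "92", "result": "pass"},
    {"attendance": "45", "result": "fail"},
    {"attendance": "", "result": "pass"},     # missing attendance
    {"attendance": "78", "result": "pass"},
    {"attendance": "50", "result": "fail"},
    {"attendance": "85", "result": "PASS "},  # inconsistent formatting
    {"attendance": "60", "result": "fail"},
    {"attendance": "70", "result": "n/a"},    # unusable outcome
    {"attendance": "88", "result": "pass"},
    {"attendance": "40", "result": "fail"},
]

clean = preprocess(raw)               # Step 2: cleaning and preprocessing
train, test = clean[:6], clean[6:]    # hold out the last records
model = fit_threshold(train)          # Step 3: apply a mining method
score = accuracy(test, model)         # Step 4: evaluation
print(model, score)
```

In practice the evaluation result feeds back into the earlier steps, which is what makes the process iterative.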
[Figure 3: Application process of data mining. Recoverable
labels: Educational systems (e-learning systems, web-based
educational systems, traditional classrooms); Education
environment; Raw data; Data preprocessing; Data modification;
EDM methods; Model; Evaluation/interpretation.]
7. EDUCATIONAL DATA MINING AND ITS CHALLENGES
i) Most data processing techniques are engineered for
obtaining business knowledge.
ii) Understanding how a model works on the basis of
its results.
iii) Building community in learning spaces at a time
when more emphasis is placed on scaling up and
personalization.
iv) Keeping formal education relevant in the virtual
world, where information is everything.
v) Facilitating and evaluating learning paths.
vi) Accommodating a fast-growing learner population,
both online and offline.
9. GOALS OF EDUCATIONAL DATA MINING
i) Predicting the future performance of students on
the basis of extracted patterns and knowledge.
ii) Personalizing the training method for each student
so that their strengths can be improved.
iii) Improving teaching and learning designs through
pedagogy, technology and socialization.
iv) Studying the after-effects of a learning system for
an educational support system.
10. REAL LIFE APPLICATIONS OF DATA MINING
Because of its analytical power, data mining is used in an
extensive range of real-life applications across a diversity
of domains. Figure 4 shows various real-life applications of
data mining:
[Figure 4: Real life applications of data mining. Recoverable
labels: Fraud detection and crime prevention; Customer
relationship management; Education; Healthcare; Financial;
Telecommunication.]
11. CONCLUSION
Every data mining system includes a group of methods that can
handle a variety of data, application areas and data mining
tasks. A number of the challenges of data mining can be
addressed by the research that many researchers have carried
out. The objectives of different data mining tasks and
techniques help many organizations to (i) expand their
understanding of data, i.e. their knowledge, and (ii) raise
their profitability by modifying their defined operations and
procedures. By analyzing hidden patterns and trends, data
mining makes appropriate business decision making easier.
12. FUTURE SCOPE
Every data mining system includes a group of methods that can
handle a variety of data, application areas and data mining
tasks. A number of the challenges of data mining can be
addressed by the research that many researchers have carried
out. The objectives of different data mining tasks and
techniques help many organizations to (i) expand their
understanding of data, i.e. their knowledge, and (ii) raise
their profitability by modifying their defined operations and
procedures. By analyzing hidden patterns and trends, data
mining makes appropriate business decision making easier.
REFERENCES
[1] C. Mehra and R. Agrawal, “Comparative study of
resampling techniques of imbalanced dataset”, International
Journal of Advance Science and Technology, vol. 29, no. 3,
pp. 12699-12710, Aug. 2020.
[2] C. Jalota and R. Agrawal, “Feature Selection
Algorithms and Student Academic Performance: A Study”,
International conference on innovative computing and
communication (ICICC-2020), Delhi University, Delhi, Feb.
2020.
[3] C. Jalota and R. Agrawal, “Analysis of data mining
using classification”, International conference on Machine
Learning, Big Data, Cloud and Parallel Computing
(COMITCON-2019), MRIIRS, Faridabad, Feb. 2019.
[4] B. Kiranmai and A. Damodaram, “A review on
evaluation measures for data mining tasks”, International
Journal of Engineering and Computer Science, vol. 3, no.
7, pp. 7217-7220, 2014.
[5] A. Sharma, M.K. Sharma and R.K. Dwivedi,
“Literature Review and Challenges of Data Mining
Techniques for Social Network Analysis”, Advances in
Computational Sciences and Technology, vol. 10, no. 5, pp.
1337-1354, 2017.
[6] M. Venkatadri and L. Reddy, “A Review on Data
Mining from past to the future”, International Journal of
Computer Applications, vol. 15, pp. 19-22, 2011.
[7] M. Chen and J. Han, “Data mining: an overview
from a database perspective”, vol. 8, no. 6, pp. 866-883,
Dec. 1996.
[8] M.K. Gupta and P. Chandra, “An Efficient
Approach for Selection of Initial Cluster Centroids for
k-means”, International Conference on Recent Developments
in Science, Engineering and Technology, pp. 3-13, 2019.
[9] P. Chandra and M.K. Gupta, “Comprehensive
survey on data warehousing research”, International Journal
of Information Technology, vol. 10, pp. 217-224, 2018.
[10] F. Angiulli and F. Fassetti, “Exploiting domain
knowledge to detect outliers”, International Journal of Data
Mining and Knowledge Discovery, vol. 28, pp. 519-568,
2013.
[11] S. Ahuja and S. Kaur, “Discriminant analysis-based
cluster ensemble”, International Journal of Data Mining,
Modelling and Management, vol. 7, no. 2, pp. 83-107, 2015.
[12] M. Bouguessa, “Clustering categorical data in
projected spaces”, International Journal of Data Mining and
Knowledge Discovery, vol. 29, pp. 3-38, 2015.
447
Article
Full-text available
In the era of big data, where the amount of information is growing exponentially, the importance of data mining has never been greater. Educational institutions today collect and store vast amounts of data, such as student enrollment and attendance records, and their exam results. With the need to sift through enormous amounts of data and present it in a way that anyone can understand, educational institutions are at the forefront of this trend, and this calls for a more sophisticated set of algorithms. Data mining in education was born as a response to this problem. Traditional data mining methods cannot be directly applied to educational problems because of the special purpose and function they serve. Defining at-risk students, identifying priority learning requirements for varied groups of students, increasing graduation rates, monitoring institutional performance efficiently, managing campus resources, and optimizing curriculum renewal are just a few of the applications of educational data mining. This paper reviews methodologies used as knowledge extractors to tackle specific education challenges from large data sets of higher education institutions to the benefit of all educational stakeholders.
Article
In the past few years, researchers are focused towards educational data mining (EDM) to improve the quality of education. Student’s academic performance prediction is a vital issue for improving the value of education. Research study conducted in the literature review mainly captivated the academic performance prediction at higher education. Though the academic performance at secondary level is infrequent, the same could be a scale for a student's performance at subsequent levels of education. Poor grades at lower levels also impact student’s future performance. In this paper, an effectual model is built with the help of significant factors that affect a student's academic performance at secondary level using single and ensemble techniques of classificification For this, both single and ensemble classification techniques are used in this paper To do the same, three single classifiers (classification algorithm) i.e., MLP, Random Forest and PART along with three well recognized ensemble algorithms Bagging (BAG), LogitBoost (LB) and Voting (VT) are applied on the datasets. For better performance of aforementioned classifiers, blended versions (single + ensemble-based classifiers) of classification models are also built. Assessment metrics i.e., accuracy, precision, recall and F-measure used to evaluate the performance of our proposed model. Evaluation results shows that Logitboost with Random Forest outperformed with 99.8% accuracy. It is clearly visible from results that the proposed model is useful for academic performance prediction to improve learning outcomes in future.
Article
In the past few years, researchers are focused towards educational data mining (EDM) to improve the quality of education. Student’s academic performance prediction is a vital issue for improving the value of education. Research study conducted in the literature review mainly captivated the academic performance prediction at higher education. Though the academic performance at secondary level is infrequent, the same could be a scale for a student's performance at subsequent levels of education. Poor grades at lower levels also impact student’s future performance. In this paper, an effectual model is built with the help of significant factors that affect a student's academic performance at secondary level using single and ensemble techniques of classificification For this, both single and ensemble classification techniques are used in this paper To do the same, three single classifiers (classification algorithm) i.e., MLP, Random Forest and PART along with three well recognized ensemble algorithms Bagging (BAG), LogitBoost (LB) and Voting (VT) are applied on the datasets. For better performance of aforementioned classifiers, blended versions (single + ensemble-based classifiers) of classification models are also built. Assessment metrics i.e., accuracy, precision, recall and F-measure used to evaluate the performance of our proposed model. Evaluation results shows that Logitboost with Random Forest outperformed with 99.8% accuracy. It is clearly visible from results that the proposed model is useful for academic performance prediction to improve learning outcomes in future.
Article
Full-text available
Data and Information or Knowledge has a significant role on human activities. Data mining is the knowledge discovery process by analyzing the large volumes of data from various perspectives and summarizing it into useful information. Due to the importance of extracting knowledge/information from the large data repositories, data mining has become an essential component in various fields of human life. Advancements in Statistics, Machine Learning, Artificial Intelligence, Pattern Recognition and Computation capabilities have evolved the present day?s data mining applications and these applications have enriched the various fields of human life including business, education, medical, scientific etc. Hence, this paper discusses the various improvements in the field of data mining from past to the present and explores the future trends.
Conference Paper
Choice of initial centroids has a major impact on the performance and accuracy of k-means algorithm to group the data objects into various clusters. In basic k-means, pure arbitrary choice of initial centroids lead to construction of different clusters in every run and consequently affects the performance and accuracy of it. To date, several attempts have been made by the researchers to increase the per-formance and accuracy of it. However, scope of improvement still exists in this area. Therefore, a new approach to initialize centroids for k-means is proposed in this paper on the basis of the concept to choose the well separated data-objects as initial cluster centroids instead of pure arbitrary selection. As a consequence, it leads to higher probability of closeness of the chosen centroids to the final cluster centroids. The proposed algorithm is empirically assessed on 6 different well-known datasets. The results confirms that the proposed approach is considerably better than the pure arbitrary selection of centroids.
Article
Data, information and knowledge have important role in various human activities because by processing data, information is extracted and by analyzing data and information the knowledge is extracted. The problem of storing, managing and analyzing the huge volumes of data, which is generated regularly by the various sources has been arisen which leads to the need of large data repositories, e.g. data warehouses. In view of the above, a considerable amount attention of research and industry has been attracted by the data warehousing (DW). Various issues and challenges in the field of data warehousing are presented in many studies during the recent years. In this paper, a comprehensive survey is presented to take a holistic view of the research trends in the fields of data warehousing. This paper presents a systematic division of work of researchers in the fields of data warehousing. Finally, current research issues and challenges in the area of data warehousing are summarized for future directions.
Instability and non-robustness in K-means clustering have been recognised as a serious problem in both scientific and business applications. Further, these problems are accentuated in the presence of outliers in the data. Cluster ensemble techniques have recently been developed to combat such problems and improve the overall quality of the clustering scheme. In this paper, we propose a cluster ensemble method based on discriminant analysis to obtain robust clustering and report noise to the user. Clustering schemes are generated by the partitional clustering algorithm (K-means) for constructing the ensemble. The proposed algorithm operates in three phases. During the first phase, input clustering schemes are reconciled by relabeling the clusters to correspond to an arbitrary reference scheme. This is accomplished using the Hungarian algorithm, which is a well-known optimisation approach. The second phase uses discriminant analysis and constructs a label matrix that is used for generating the consensus partition. In the final phase, the clustering scheme is refined to deliver a robust and stable clustering scheme. Empirical evaluation of the algorithm shows that the proposed method significantly improves the quality of the resultant ensemble. Further, comparison with the cluster ensembles generated by package R for 20 public datasets demonstrated improved quality of ensembles generated by the proposed algorithm.
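The relabeling step of the first phase can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `relabel_to_reference` is hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm, which solves the same assignment problem in polynomial time (SciPy exposes a solver for it as `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def relabel_to_reference(reference, labels, k):
    """Rename cluster ids in `labels` so they agree best with `reference`.

    Both inputs are lists of integer cluster ids in range(k), one per object.
    Brute force over all k! mappings: a stand-in for the Hungarian
    algorithm, suitable only for small k.
    """
    # overlap[i][j] = objects labelled i in `labels` and j in `reference`.
    overlap = [[0] * k for _ in range(k)]
    for ref, lab in zip(reference, labels):
        overlap[lab][ref] += 1
    # Choose the one-to-one mapping that maximises total agreement.
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(overlap[i][perm[i]] for i in range(k)),
    )
    return [best[lab] for lab in labels]


# Hypothetical example: the same partition under permuted cluster ids.
reference = [0, 0, 1, 1, 2, 2]
scheme = [2, 2, 0, 0, 1, 1]
aligned = relabel_to_reference(reference, scheme, k=3)
```

Once every input scheme is aligned to the same reference, cluster ids become comparable across schemes, which is what a later consensus step needs.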
The problem of clustering categorical data has been widely investigated and appropriate approaches have been proposed. However, the majority of the existing methods suffer from one or more of the following limitations: (1) difficulty detecting clusters of very low dimensionality embedded in high-dimensional spaces, (2) lack of an automatic mechanism for identifying relevant dimensions for each cluster, (3) lack of an outlier detection mechanism and (4) dependence on a set of parameters that need to be properly tuned. Most of the existing approaches are inadequate for dealing with these four issues in a unified framework. This motivates our effort to propose a fully automatic projected clustering algorithm for high-dimensional categorical data which is capable of facing the four aforementioned issues in a single framework. Our algorithm comprises two phases: (1) outlier handling and (2) clustering in projected spaces. The first phase of the algorithm is based on a probabilistic approach that exploits the beta mixture model to identify and eliminate outlier objects from a data set in a systematic way. In the second phase, the clustering process is based on a novel quality function that allows the identification of projected clusters of low dimensionality embedded in a high-dimensional space without any parameter setting by the user. The suitability of our proposal is demonstrated through empirical studies using synthetic and real data sets.
We present a novel definition of outlier whose aim is to embed available domain knowledge in the process of discovering outliers. Specifically, given background knowledge, encoded by means of a set of first-order rules, and a set of positive and negative examples, our approach aims at singling out the examples showing abnormal behavior. The technique proposed here is unsupervised, since there are no examples of normal or abnormal behavior, even though it has connections with supervised learning, since it is based on induction from examples. We provide a notion of compliance of a set of facts with respect to background knowledge and a set of examples, which is exploited to detect the examples that prevent improving the generalization of the induced hypothesis. By testing compliance with respect to both the direct and the dual concept, we are able to distinguish among three kinds of abnormalities, namely irregular, anomalous, and outlier observations. This allows us to provide a finer characterization of the anomaly at hand and to single out subtle forms of anomalies. Moreover, we are also able to provide explanations for the abnormality of an observation, which make intelligible the motivation underlying its exceptionality. We present both exact and approximate algorithms for mining abnormalities. The approximate algorithms improve execution time while guaranteeing good accuracy. Moreover, we discuss peculiarities of the novel approach, present examples of knowledge mined, analyze the scalability of the algorithms, and provide comparison with noise handling mechanisms and some alternative approaches.
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented. Index Terms --- Data mining, knowledge discovery, association rules, classification, da...
C. Jalota and R. Agrawal, "Feature Selection Algorithms and Student Academic Performance: A Study", International Conference on Innovative Computing and Communication (ICICC-2020), Delhi University, Delhi, Feb. 2020.
C. Jalota and R. Agrawal, "Analysis of Data Mining Using Classification", International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCON-2019), MRIIRS, Faridabad, Feb. 2019.