Furkan, Salman Page 66
An Introduction to Data Mining Technique
Mohd. Furkan, Agha Salman Haider
¹Sr Lecturer, Jazan University, Saudi Arabia
Sr Lecturer, Jazan University, Saudi Arabia
Data mining is the process of extracting valid and previously unknown information from large
databases and using it to make difficult business decisions (Gregory, 2000). Data mining, or
data analysis with large and complex datasets, brings the wealth of research and knowledge in
machine learning and statistics to the task of discovering new knowledge in large
databases. Over the past three decades, large amounts of complex business data have been stored
electronically, and this volume will continue to increase in the future. To manage such huge
volumes of data, data mining techniques are also becoming more sophisticated and
advanced day by day.
Keywords: Data Mining, DDM, Data Miners
Smyth, Mannila and Hand (2001) note that progress in digital data acquisition and storage
technology has resulted in the development of vast databases. This has happened in all areas of
human endeavour, from the mundane to the exotic. Little wonder, then, that interest has
grown in the possibility of tapping these data and extracting from them information
that might be of value to the owner of the database. The discipline concerned with this
task has become known as data mining. Defining a scientific discipline is always a
contentious task; researchers often disagree about the exact range and limits of their field of
study. Bearing this in mind, and accepting that others might disagree about the details, they
adopt the following working definition of data mining.
Data mining is the analysis of observational data sets to find unsuspected relationships and to
summarize the data in novel ways that are both understandable and useful to the data owner.
The relationships and summaries derived through a data mining exercise are often
referred to as patterns or models. Examples include linear equations, graphs, tree structures,
clusters, rules and recurrent patterns in time series. The definition above refers to
observational data as opposed to experimental data. Data mining typically deals with data
that have already been collected for some purpose other than the data mining analysis.
This means that the objectives of the data mining exercise play no role in the data
collection strategy. This is one way in which data mining differs from much of statistics,
in which data are often collected using efficient strategies to answer specific
questions (Tan, Kumar and Steinbach, 2006).
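As a minimal sketch of what a "pattern or model" in the form of a linear equation can look like, the following fits a least-squares line to a small set of observational records. The data, the variable names and the helper fit_line are hypothetical and serve only as illustration; they are not taken from the cited works.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical observational records: visits and spend per customer,
# assumed to have been collected for billing, not for this analysis.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 19.0, 31.0, 42.0, 48.0]

slope, intercept = fit_line(visits, spend)
print(f"spend = {slope:.2f} * visits + {intercept:.2f}")
```

The fitted equation is the "summary" of the data; whether it is useful to the data owner still has to be judged separately.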
Data mining techniques:
1. Web data mining:
The last decade has witnessed the web revolution, which has ushered in a new age of
information retrieval. The revolution has had a profound impact on the way people search for
and find information at home and at work. Searching the web has become an everyday experience for
millions of people all over the world. From its beginning in the early 1990s, the web had
grown to more than four billion pages by 2004 and was expected to grow to more than eight
billion pages by the end of 2006.
Figure 2.9: Web data mining
2. Multi-Relational Data mining:
Chong, Feng and Cao (2010) have described that “Multi-relational classification is an
important data mining task, since much real world data is organized in multiple relations. The
major challenges come from, firstly, the large high dimensional search spaces due to many
attributes in multiple relations and, secondly, the high computational cost in feature selection
and classifier construction due to the high complexity in the structure of multiple relations.
Mining multi-relational data repositories is an essential task in many applications such as
business intelligence. Multi-relational classification is arguably one of the fundamental
problems in multi-relational data mining. Multi-relational classification is challenging. First,
there may be a large number of attributes in a multi-relational database where classification is
conducted. Since relations are often connected in one way or another, multi-relational
classification virtually has to deal with a very high dimensional search space”.
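To make the dimensionality problem concrete, here is a minimal sketch of one common workaround, which is flattening a one-to-many relation into aggregate features on the target relation before classification. This is not the method of Chong, Feng and Cao; the relations, column names and the helper flatten are hypothetical.

```python
from collections import defaultdict

# Hypothetical relations: the target relation and one related relation.
customers = [
    {"customer_id": 1, "region": "north", "churned": False},
    {"customer_id": 2, "region": "south", "churned": True},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 40.0},
    {"order_id": 11, "customer_id": 1, "amount": 60.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]


def flatten(customers, orders):
    """Aggregate the one-to-many orders relation into per-customer features."""
    amounts = defaultdict(list)
    for order in orders:
        amounts[order["customer_id"]].append(order["amount"])
    rows = []
    for customer in customers:
        values = amounts.get(customer["customer_id"], [])
        rows.append({
            **customer,
            "order_count": len(values),
            "order_total": sum(values),
            "order_max": max(values, default=0.0),
        })
    return rows


flat = flatten(customers, orders)
for row in flat:
    print(row)
```

Each related relation and each aggregate function adds further columns, so the flattened table widens quickly. That growth is one way to see the high dimensional search space the quotation refers to.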
3. Distributed data mining:
Pralhad, Ramachandrarao and Adhikari (2010) have explained that “Distributed Data Mining
(DDM) algorithms deal with mining multiple databases distributed over different
geographical regions. In the last few years, researchers have started addressing problems
where databases stored at different places cannot be moved to a central storage area for
a variety of reasons. In multi-database mining there are no such restrictions. Thus distributed
data mining could be considered as a special type of multiple database mining. Distributed
data mining environment often comes with different distributed sources of computation”.
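The idea of mining without moving the databases can be sketched as follows, assuming each site is allowed to share only summary counts. Each site counts items locally and a coordinator merges the summaries. The site data and the function names local_item_counts and merge are hypothetical, and this is an illustration rather than any specific DDM algorithm from the literature.

```python
from collections import Counter


def local_item_counts(transactions):
    """Run at one site: count how many local transactions contain each item."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count an item once per transaction
    return counts, len(transactions)


def merge(summaries, min_support):
    """Run at the coordinator: combine per-site summaries into global support."""
    total = Counter()
    n_transactions = 0
    for counts, size in summaries:
        total.update(counts)
        n_transactions += size
    return {
        item: count / n_transactions
        for item, count in total.items()
        if count / n_transactions >= min_support
    }


# Hypothetical transaction databases held at two different sites.
site_a = [["bread", "milk"], ["bread"], ["milk", "eggs"]]
site_b = [["bread", "eggs"], ["bread", "milk"]]

summaries = [local_item_counts(site_a), local_item_counts(site_b)]
frequent = merge(summaries, min_support=0.5)
print(frequent)
```

Only the counts and the transaction totals leave each site; the raw transactions stay where they are stored.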
4. Graph mining:
Kantardzic (2011) has described that “Graph mining applications are far more challenging to
implement because of the additional constraints that arise from the structural nature of the
underlying graph. The problem of frequent pattern mining has been widely studied in the
context of mining transactional data. Recently, the techniques for frequent pattern mining
have also been extended to the case of graph data”.
Figure 19: Graph mining
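As a rough sketch of how frequent pattern mining carries over from transactions to graphs, the following counts the simplest possible graph pattern, a single labelled edge, across a collection of graphs. The graphs and the helper frequent_edges are hypothetical. Full frequent subgraph mining must also grow these edges into larger patterns and test subgraph isomorphism, which is where the additional structural constraints come in; that part is not shown here.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return labelled edges that occur in at least min_support graphs."""
    support = Counter()
    for edges in graphs:
        # Treat edges as undirected by sorting the two endpoint labels,
        # and use a set so an edge is counted once per graph.
        support.update({tuple(sorted(edge)) for edge in edges})
    return {edge: n for edge, n in support.items() if n >= min_support}


# Hypothetical collection of small graphs, each given as a list of
# (label, label) edges.
graphs = [
    [("C", "O"), ("C", "H"), ("C", "C")],
    [("O", "C"), ("C", "N")],
    [("C", "H"), ("O", "C"), ("N", "H")],
]

result = frequent_edges(graphs, min_support=2)
print(result)
```

The support count plays the same role as in transactional frequent itemset mining, with each graph standing in for one transaction.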
5. Visual data mining:
Kimani, Dix and Catarci (2010) have explained that “visual data mining is the use of
visualization techniques to allow data miners and analysts to evaluate, monitor and guide the
inputs, products and process of data mining”. The field of visual data mining revolves primarily
around the exploitation of the human visual system in mining knowledge. In essence, this can
be realized by placing the user at a strategic place in the system framework while at the same
time exploiting effective visual strategies. Visual data mining may therefore be defined as the
exploitation of appropriate visual strategies in order to allow, enable or empower the data
mining user to process data and also to drive, guide or direct the entire process of data mining.
Figure 4.10: visual data mining
This study has explored the evolution of data mining and discussed different data mining
techniques. We will discuss the best data mining technique in our
next paper.
Smyth, Mannila and Hand (2001), Principles of data mining, MIT Press, p 1.
Tan (2007), Introduction to Data Mining, Pearson Education India, p 1.
Hand (1998), “Data Mining: Statistics and More?”, The American Statistician, USA, p
Cerrito (2006), Introduction to data mining using SAS Enterprise Miner, SAS
Publishing, p 1.
Larose (2005), Discovering knowledge in data: an introduction to data mining, John
Wiley and Sons, p 2.
Sullivan (2011), Introduction to Data Mining for the Life Sciences, Springer, p 2.
Pei, Kamber and Han (2011), Data Mining: Concepts and Techniques, Elsevier, p 1.
Sivanandam and Sumathi (2006), Introduction to data mining and its applications,
Springer, p 2.