KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW
Ana Azevedo
CEISE – ISCAP – IPP
Rua Jaime Lopes de Amorim, s/n – 4465 S. M. de Infesta - Portugal
Manuel Filipe Santos
DSI - UM
Campus de Azurém – 4800-058 Guimarães
ABSTRACT
In recent years the Data Mining field has experienced considerable growth and consolidation. Several efforts are under way to establish standards in the area, among which SEMMA and CRISP-DM can be enumerated. Both grew as industrial standards and define a set of sequential steps intended to guide the implementation of data mining applications. The question arose of whether there are substantial differences between them and the traditional KDD process. This paper intends to establish a parallel between these processes and the KDD process, as well as an understanding of the similarities between them.
KEYWORDS
Data Mining Standards, Knowledge Discovery in Databases, Data Mining.
1. INTRODUCTION
Fayyad considers Data Mining (DM) one of the phases of the KDD process (Fayyad et al., 1996). The DM
phase concerns mainly the means by which patterns are extracted and enumerated from data. The
literature is a source of some confusion because the two terms are used interchangeably, making it difficult to
determine exactly what each concept covers (Benoît, 2002). The growing attention paid to the area emerged
from the rise of large databases in an increasing and diverse number of organizations. There is the risk
of wasting all the value and wealth of information contained in these databases unless adequate techniques
are used to extract useful knowledge (Chen et al, 1996) (Simoudis, 1996) (Fayyad, 1996). Efforts are being
made to establish standards in the area, both by academics and by practitioners in industry. The academic
efforts center on the attempt to formulate a general framework for DM (Dzeroski, 2006). The bulk of these
efforts concentrate on the definition of a language for DM that can be accepted as a standard, in the same
way that SQL was accepted as a standard for relational databases (Han et al, 1996) (Meo et al, 1998)
(Imielinski et al, 1999) (Sarawagi, 2000) (Botta et al, 2004). The efforts in industry concern mainly the
definition of processes/methodologies that can guide the implementation of DM applications. In this paper,
SEMMA and CRISP-DM were chosen because they are considered the most popular. Although this
perception is not scientifically established, it exists because SEMMA and CRISP-DM are presented in many
of the publications in the area and are actually used in practice. During the analysis of the documentation on
SEMMA and on CRISP-DM, the question of the existence of substantial differences between them and the
traditional KDD process arose. This paper intends to establish a parallel between these and the KDD process,
as well as an understanding of the similarities between them. The paper begins, in section 2, by presenting
KDD, SEMMA and CRISP-DM. Next, in section 3, a comparative study is carried out, presenting the
analogies and the differences between the three processes. Finally, in section 4, conclusions and future work
are presented.
2. KDD, SEMMA AND CRISP-DM DESCRIPTION
The term knowledge discovery in databases, or KDD for short, was coined in 1989 to refer to the broad
process of finding knowledge in data, and to emphasize the "high-level" application of particular DM
methods (Fayyad et al, 1996). In this paper the concern is with the overall KDD process. SEMMA was
developed by the SAS Institute. CRISP-DM was developed through the efforts of a consortium initially
composed of DaimlerChrysler, SPSS and NCR. Although SEMMA and CRISP-DM are usually referred to
as methodologies, in this paper they are referred to as processes, in the sense that they consist of a
particular course of action intended to achieve a result.
2.1 The KDD Process
The KDD process, as presented in (Fayyad et al, 1996), is the process of using DM methods to extract what is
deemed knowledge according to the specification of measures and thresholds, using a database along with
any required preprocessing, sub-sampling, and transformation of that database. Five stages are considered,
presented in figure 1: Selection - this stage consists of creating a target data set, or focusing on a
subset of variables or data samples, on which discovery is to be performed; Pre-processing - this stage
consists of cleaning and pre-processing the target data in order to obtain consistent data; Transformation -
this stage consists of transforming the data using dimensionality reduction or transformation
methods; Data Mining - this stage consists of searching for patterns of interest in a particular
representational form, depending on the DM objective (usually, prediction); Interpretation/Evaluation -
this stage consists of interpreting and evaluating the mined patterns.
Figure 1. The five stages of KDD
The KDD process is interactive and iterative, involving numerous steps with many decisions being made
by the user (Brachman & Anand, 1996). The KDD process is preceded by the development of an understanding
of the application domain, the relevant prior knowledge and the goals of the end-user. It must be followed
by knowledge consolidation, incorporating the discovered knowledge into the system (Fayyad et al, 1996).
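As an illustration only, the following Python sketch maps each of the five KDD stages onto one step of a simple predictive task; the data file, column names and the particular techniques (PCA, a random forest) are assumptions made for the example and are not prescribed by the KDD process itself.

```python
# Hypothetical sketch of the five KDD stages on a generic tabular dataset.
# The input file, column names and chosen algorithms are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Selection: focus on a target subset of variables and records.
raw = pd.read_csv("customers.csv")                      # assumed data source
data = raw[["age", "income", "visits", "churned"]]      # assumed columns

# Pre-processing: clean the target data to obtain consistent records.
data = data.dropna().drop_duplicates()

# Transformation: reduce the dimensionality of the predictor variables.
X, y = data.drop(columns="churned"), data["churned"]
X_reduced = PCA(n_components=2).fit_transform(X)

# Data Mining: search for patterns of interest, here through a predictive model.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Interpretation/Evaluation: interpret and evaluate the mined patterns.
print(classification_report(y_test, model.predict(X_test)))
```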
2.2 The SEMMA Process
The acronym SEMMA stands for Sample, Explore, Modify, Model, Assess, and refers to the process of
conducting a DM project. The SAS Institute considers a cycle with five stages for the process: Sample - this
stage consists of sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly; Explore - this stage consists of exploring
the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas;
Modify - this stage consists of modifying the data by creating, selecting, and transforming the variables to
focus the model selection process; Model - this stage consists of modeling the data by allowing the software
to search automatically for a combination of data that reliably predicts a desired outcome; Assess - this
stage consists of assessing the data by evaluating the usefulness and reliability of the findings from the DM
process and estimating how well they perform. SEMMA offers an easy-to-understand process, allowing an
organized and adequate development and maintenance of DM projects. It thus confers a structure on their
conception, creation and evolution, helping to present solutions to business problems as well as to find the
DM business goals (Santos & Azevedo, 2005).
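As a purely illustrative complement, and bearing in mind that SEMMA is tied to SAS Enterprise Miner, the short pandas sketch below gives one possible reading of the Sample and Explore stages outside that tool; the input file, the stratification column and the 10% sampling fraction are assumptions of the example.

```python
# Hypothetical illustration of SEMMA's Sample and Explore stages outside SAS Enterprise Miner.
# The input file, the "segment" column and the 10% fraction are assumptions of the sketch.
import pandas as pd

# Sample: extract a portion big enough to carry the significant information,
# yet small enough to manipulate quickly (here, a stratified 10% sample).
full = pd.read_csv("transactions.csv")                  # assumed large data set
sample = (full.groupby("segment", group_keys=False)
              .sample(frac=0.10, random_state=42))

# Explore: search for unanticipated trends and anomalies to gain understanding and ideas.
print(sample.describe())                                # distributions and ranges
print(sample.isna().mean())                             # share of missing values per column
print(sample.corr(numeric_only=True))                   # pairwise correlations of numeric columns
```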
2.3 The CRISP-DM Process
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It consists of a cycle that
comprises six stages (figure 2): Business understanding - this initial phase focuses on understanding the
project objectives and requirements from a business perspective, then converting this knowledge into a DM
problem definition and a preliminary plan designed to achieve the objectives; Data understanding - this
phase starts with an initial data collection and proceeds with activities aimed at getting familiar with the data,
identifying data quality problems, discovering first insights into the data, or detecting interesting subsets to
form hypotheses about hidden information; Data preparation - this phase covers all activities needed to
construct the final dataset from the initial raw data; Modeling - in this phase, various modeling techniques
are selected and applied, and their parameters are calibrated to optimal values; Evaluation - at this stage the
model (or models) obtained are more thoroughly evaluated, and the steps executed to construct the model
are reviewed to be certain it properly achieves the business objectives; Deployment - creation of the model
is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data,
the knowledge gained will need to be organized and presented in a way that the customer can use it
(Chapman et al, 2000).
Figure 2. The CRISP-DM life cycle
CRISP-DM is extremely complete and well documented. All its stages are duly organized, structured and
defined, allowing a project to be easily understood or revised (Santos & Azevedo, 2005).
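To make the breadth of the cycle tangible, the hypothetical skeleton below arranges one pass through the six CRISP-DM phases as plain Python functions; the data source, the churn-prediction goal, the success criterion and the chosen model are assumptions of the sketch, and only the ordering of the phases comes from CRISP-DM.

```python
# Hypothetical skeleton of one pass through the six CRISP-DM phases.
# Data source, target column, success criterion and model choice are assumptions.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def business_understanding():
    # Convert business objectives into a DM problem definition and a preliminary plan.
    return {"goal": "predict customer churn", "min_auc": 0.75}

def data_understanding(path="customers.csv"):          # assumed data source
    data = pd.read_csv(path)
    data.info()                                        # get familiar with the data, spot quality problems
    return data

def data_preparation(data):
    # Construct the final dataset from the initial raw data (numeric features assumed).
    data = data.dropna()
    return data.drop(columns="churned"), data["churned"]   # assumed binary target

def modeling(X_train, y_train):
    # Select and apply a modeling technique, calibrating its parameters.
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

def evaluation(model, X_test, y_test, plan):
    # Review whether the model properly achieves the business objectives.
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]) >= plan["min_auc"]

def deployment(model, path="churn_model.joblib"):
    # Organize and hand over the result so that the customer can use it.
    joblib.dump(model, path)

if __name__ == "__main__":
    plan = business_understanding()
    X, y = data_preparation(data_understanding())
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    model = modeling(X_tr, y_tr)
    if evaluation(model, X_te, y_te, plan):
        deployment(model)
```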
3. A COMPARATIVE STUDY
Comparing the KDD and SEMMA stages, we might, at first glance, state that they are equivalent: Sample can
be identified with Selection; Explore can be identified with Pre-processing; Modify can be identified with
Transformation; Model can be identified with DM; Assess can be identified with Interpretation/Evaluation.
Examining them more thoroughly, we may state that the five stages of the SEMMA process can be seen as a
practical implementation of the five stages of the KDD process, since SEMMA is directly linked to the SAS
Enterprise Miner software. Comparing the KDD stages with the CRISP-DM stages is not as straightforward
as in the SEMMA case. Nevertheless, we can first of all observe that CRISP-DM incorporates the steps that,
as referred above, must precede and follow the KDD process, that is to say: the Business Understanding
phase can be identified with the development of an understanding of the application domain, the relevant
prior knowledge and the goals of the end-user; the Deployment phase can be identified with the consolidation
obtained by incorporating the discovered knowledge into the system. Concerning the remaining stages, we
can say that: the Data Understanding phase can be identified with the combination of Selection and
Pre-processing; the Data Preparation phase can be identified with Transformation; the Modeling phase can
be identified with DM; the Evaluation phase can be identified with Interpretation/Evaluation. In table 1, we
present a summary of the correspondences.
Table 1. Summary of the correspondences between KDD, SEMMA and CRISP-DM

KDD                         SEMMA           CRISP-DM
Pre-KDD                     -------------   Business understanding
Selection                   Sample          Data understanding
Pre-processing              Explore         Data understanding
Transformation              Modify          Data preparation
Data mining                 Model           Modeling
Interpretation/Evaluation   Assess          Evaluation
Post-KDD                    -------------   Deployment
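For readers who wish to manipulate the correspondences programmatically, Table 1 can be encoded as a small lookup structure; the sketch below is simply one convenient representation of the table, with None marking the KDD steps for which SEMMA offers no counterpart.

```python
# Table 1 expressed as a dictionary keyed by KDD stage (including the pre/post steps).
KDD_CORRESPONDENCES = {
    "Pre-KDD":                   {"semma": None,      "crisp_dm": "Business understanding"},
    "Selection":                 {"semma": "Sample",  "crisp_dm": "Data understanding"},
    "Pre-processing":            {"semma": "Explore", "crisp_dm": "Data understanding"},
    "Transformation":            {"semma": "Modify",  "crisp_dm": "Data preparation"},
    "Data mining":               {"semma": "Model",   "crisp_dm": "Modeling"},
    "Interpretation/Evaluation": {"semma": "Assess",  "crisp_dm": "Evaluation"},
    "Post-KDD":                  {"semma": None,      "crisp_dm": "Deployment"},
}

# Example use: the CRISP-DM phases that have no explicit SEMMA counterpart.
print([v["crisp_dm"] for v in KDD_CORRESPONDENCES.values() if v["semma"] is None])
```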
4. CONCLUSIONS AND FUTURE WORK
Considering the presented analysis, we conclude that SEMMA and CRISP-DM can be viewed as
implementations of the KDD process described by (Fayyad et al, 1996). At first sight, one might conclude
that CRISP-DM is more complete than SEMMA. However, on deeper analysis, the development of an
understanding of the application domain, the relevant prior knowledge and the goals of the end-user can be
integrated into the Sample stage of SEMMA, because the data cannot be sampled unless there is a true
understanding of all these aspects. With respect to the consolidation obtained by incorporating the discovered
knowledge into the system, we can assume that it is present, because it is ultimately the reason for undertaking
the project. This leads to the conclusion that standards have been achieved concerning the overall process:
SEMMA and CRISP-DM do guide people on how DM can be applied in practice in real systems. In the
future we intend to analyze other aspects related to DM standards, namely SQL-based languages for DM, as
well as XML-based languages for DM. As a complement, we intend to investigate the existence of other
standards for DM.
REFERENCES
Fayyad, U. M. et al. 1996. From data mining to knowledge discovery: an overview. In Fayyad, U. M. et al. (Eds.),
Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.
Benoît, G., 2002. Data Mining. Annual Review of Information Science and Technology, Vol. 36, No. 1, pp 265-310.
Brachman, R. J. & Anand, T., 1996. The process of knowledge discovery in databases. In Fayyad, U. M. et al. (Eds.),
Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.
Chen, M. et al, 1996. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and
Data Engineering, Vol. 8, No. 6, pp 866-883.
Simoudis, E., 1996. Reality check for data mining. IEEE Expert, Vol. 11, No. 5, pp 26-33.
Fayyad, U. M., 1996. Data mining and knowledge discovery: making sense out of data. IEEE Expert, Vol. 11 No. 5, pp
20-25.
Dzeroski, S., 2006. Towards a General Framework for Data Mining. In Dzeroski, S. and Struyf, J. (Eds.), Knowledge
Discovery in Inductive Databases. LNCS 4747. Springer-Verlag.
Han, J. et al, 1996. DMQL: A Data Mining Query Language for Relational Databases. In proceedings of DMKD-96
(SIGMOD-96 Workshop on KDD). Montreal. Canada.
Meo, R. et al, 1998. An Extension to SQL for Mining Association Rules. Data Mining and Knowledge Discovery Vol. 2,
pp 195-224. Kluwer Academic Publishers.
Imielinski, T. & Virmani, A., 1999. MSQL: A Query Language for Database Mining. Data Mining and Knowledge
Discovery Vol. 3, pp 373-408. Kluwer Academic Publishers.
Sarawagi, S. et al, 2000. Integrating Association Rule Mining with Relational Database Systems: Alternatives and
Implications. Data Mining and Knowledge Discovery, Vol. 4, pp 89–125.
Botta, Marco, et al, 2004. Query Languages Supporting Descriptive Rule Mining: A Comparative Study. Database
Support for Data Mining Applications. LNAI 2682, pp 24-51.
SAS Enterprise Miner – SEMMA. SAS Institute.
Accessed from http://www.sas.com/technologies/analytics/datamining/miner/semma.html, on May 2008
Santos, M. & Azevedo, C., 2005. Data Mining – Descoberta de Conhecimento em Bases de Dados. FCA Publisher.
Chapman, P. et al, 2000. CRISP-DM 1.0 - Step-by-step data mining guide.
Accessed from http://www.crisp-dm.org/CRISPWP-0800.pdf on May 2008