Approaches of each phase of CRISP-DM.

Source publication
Article
CRISP-DM is the de-facto standard and an industry-independent process model for applying data mining projects. Twenty years after its release in 2000, we would like to provide a systematic literature review of recent studies published in IEEE, ScienceDirect and ACM about data mining use cases applying CRISP-DM. We give an overview of the research f...

Contexts in source publication

Context 1
... third research question answers how each of the six phases of CRISP-DM is implemented (see Figure 5). The identified methods serve as classes into which the phases can be classified. ...
Context 2
... several technologies can support the process. Figure 5 shows that the technologies used have been mentioned in several studies. For storage, the technologies used are MS Access, MySQL, or Microsoft SQL Server. ...
Context 3
... is an innovative understanding of the deployment phase because the user guide defines the deployment phase more technically. Technical solutions have been developed by three studies (see Figure 5). In all other cases, it is unclear why the deployment phase is missing. ...

Citations

... The research method used in this study is the Cross-Industry Standard Process for Data Mining (CRISP-DM). This method provides a structured approach to planning and executing data mining tasks and is known for its adaptability across multiple sectors and data-intensive applications [29]. The Cross-Industry Standard Process for Data Mining (CRISP-DM) method is applied in the sentiment analysis research of YouTube comments for the 2024 Indonesian presidential election as shown in Figure 3. ...
Article
Presidential elections are an important political event that often trigger intense debate. With more than 139 million users, YouTube serves as a significant platform for understanding public opinion through sentiment analysis. This study aimed to implement deep learning techniques for a multi-label sentiment analysis of comments on YouTube videos related to the 2024 Indonesian presidential election. Offering a fresh perspective compared to previous research that primarily employed traditional classification methods, this study classifies comments into eight emotional labels: anger, anticipation, disgust, joy, fear, sadness, surprise, and trust. By focusing on the emotional spectrum, this study provides a more nuanced understanding of public sentiment towards presidential candidates. The CRISP-DM method is applied, encompassing stages of business understanding, data understanding, data preparation, modeling, evaluation, and deployment, ensuring a systematic and comprehensive approach. This study employs a dataset comprising 32,000 comments, obtained via YouTube Data API, from the KPU and Najwa Shihab channels. The analysis is specifically centered on comments related to presidential candidate debates. Three deep learning models—Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and a hybrid model combining CNN and Bi-LSTM—are assessed using confusion matrix, Area Under the Curve (AUC), and Hamming loss metrics. The evaluation results demonstrate that the Bi-LSTM model achieved the highest accuracy with an AUC value of 0.91 and a Hamming loss of 0.08, indicating an excellent ability to classify sentiment with high precision and a low error rate. This innovative approach to multi-label sentiment analysis in the context of the 2024 Indonesian presidential election expands the insights into public sentiment towards candidates, offering valuable implications for political campaign strategies. 
Additionally, this research contributes to the fields of natural language processing and data mining by addressing the challenges associated with multi-label sentiment analysis.
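The Hamming loss reported in the abstract above can be illustrated with a minimal sketch. It assumes the usual definition (the fraction of individual label slots predicted incorrectly, averaged over all samples and labels); the two toy comments and their predictions are hypothetical and are not taken from the study.

```python
# The eight emotion labels named in the abstract; order is an assumption.
LABELS = ["anger", "anticipation", "disgust", "joy",
          "fear", "sadness", "surprise", "trust"]


def hamming_loss(y_true, y_pred):
    """Return the fraction of label slots where prediction and truth differ."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must contain the same number of samples")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each sample must have the same number of labels")
        wrong += sum(t != p for t, p in zip(true_row, pred_row))
        total += len(true_row)
    if total == 0:
        raise ValueError("no labels to compare")
    return wrong / total


# Hypothetical toy data: two comments, one binary indicator per label.
y_true = [[1, 0, 0, 0, 0, 0, 0, 1],   # anger + trust
          [0, 0, 1, 0, 0, 1, 0, 0]]   # disgust + sadness
y_pred = [[1, 0, 0, 0, 0, 0, 0, 0],   # "trust" missed
          [0, 0, 1, 0, 0, 1, 0, 0]]   # fully correct

loss = hamming_loss(y_true, y_pred)   # 1 wrong slot out of 16
print(f"Hamming loss: {loss:.4f}")
```

Lower values are better: a loss of 0.08 would mean that about 8% of all label slots were predicted incorrectly.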
... To support this, we adopt the CRISP-DM (CRoss-Industry Standard Process for Data Mining) framework. CRISP-DM is a well-established and widely accepted process model in data mining operations [25]. Additionally, CRISP-DM is well suited to accommodate user-centered explainable AI. ...
Article
Autosomal dominant polycystic kidney disease (ADPKD) is the predominant hereditary factor leading to end-stage renal disease (ESRD) worldwide, affecting individuals across all races with a prevalence of 1 in 400 to 1 in 1000. The disease presents significant challenges in management, particularly with limited options for slowing cyst progression, as well as the use of tolvaptan being restricted to high-risk patients due to potential liver injury. However, determining high-risk status typically requires magnetic resonance imaging (MRI) to calculate total kidney volume (TKV), a time-consuming process demanding specialized expertise. Motivated by these challenges, this study proposes alternative methods for high-risk categorization that do not rely on TKV data. Utilizing historical patient data, we aim to predict rapid kidney enlargement in ADPKD patients to support clinical decision-making. We applied seven machine learning algorithms—Random Forest, Logistic Regression, Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), Gradient Boosting Tree, XGBoost, and Deep Neural Network (DNN)—to data from the Polycystic Kidney Disease Outcomes Consortium (PKDOC) database. The XGBoost model, combined with the Synthetic Minority Oversampling Technique (SMOTE), yielded the best performance. We also leveraged explainable artificial intelligence (XAI) techniques, specifically Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP), to visualize and clarify the model’s predictions. Furthermore, we generated text summaries to enhance interpretability. To evaluate the effectiveness of our approach, we proposed new metrics to assess explainability and conducted a survey with 27 doctors to compare models with and without XAI techniques. 
The results indicated that incorporating XAI and textual summaries significantly improved expert explainability and increased confidence in the model’s ability to support treatment decisions for ADPKD patients.
... Finally, Section 7 presents final remarks, including suggestions for future research and practical applications in an enterprise environment. Schröer et al. [2021] conducted a systematic literature review including papers published between 2017 and 2019. This research summarizes domains of application for CRISP-DM, highlighting health, education, and research. ...
Article
The expansion of Data Science projects in organizations has been led by three factors: the growth in the amount of data generated, the evolution in storage capacity, and the increase in computational capabilities. However, most of these projects fail to deliver the expected value: 82% of the teams do not use any process model. Despite the popularity of Agile Methods, their adoption in Data Science projects is still scarce. Most of the existing research focuses on algorithms. There is a lack of studies on agility in Data Science. This Systematic Literature Review (SLR) was performed to identify and evaluate 16 studies that can answer how to adapt and apply CRISP-DM using different approaches (methods, frameworks, or process models). In addition, it shows how CRISP-DM has evolved over the last few decades, with derivations emerging from rigid processes to agile methods. This research then analyzes the 16 tailored models and examines the similarities and differences between CRISP-DM derivatives. As a result, it summarizes the CRISP-DM adaptation patterns identified, such as phase addition, phase modification, features and tools addition, and integration with other approaches. Consequently, this SLR showcases how CRISP-DM is a robust, flexible, and highly adaptable model that can be extended to different business domains. Finally, it proposes a theoretical guide to modify and customize CRISP-DM for Data Science projects.
... Finally, Section 7 presents final remarks, including suggestions for future research and practical applications in an enterprise environment. Schröer et al. (2021) conducted a systematic literature review including papers published between 2017 and 2019. This research summarizes domains of application for CRISP-DM, highlighting health, education, and research. ...
Preprint
The expansion of Data Science projects in organizations has been led by three factors: the growth in the amount of data generated, the evolution in storage capacity, and the increase in computational capabilities. However, most of these projects fail to deliver the expected value: 82% of the teams do not use any process model. Despite the popularity of Agile Methods, their adoption in Data Science projects is still scarce. Most of the existing research focuses on algorithms. There is a lack of studies on agility in Data Science. This Systematic Literature Review (SLR) was performed to identify and evaluate 16 studies that can answer how to adapt and apply CRISP-DM using different approaches (methods, frameworks, or process models). In addition, it shows how CRISP-DM has evolved over the last few decades, with derivations emerging from rigid processes to agile methods. This research then analyzes the 16 tailored models and examines the similarities and differences between CRISP-DM derivatives. As a result, it summarizes the CRISP-DM adaptation patterns identified, such as phase addition, phase modification, features and tools addition, and integration with other approaches. Consequently, this SLR showcases how CRISP-DM is a robust, flexible, and highly adaptable model that can be extended to different business domains. Finally, it proposes a theoretical guide to modify and customize CRISP-DM for Data Science projects.
... In the survey, based on the nine ML life cycle stages presented by Amershi et al. [4] and the CRISP-DM industry-independent process model phases [39], we abstracted seven generic life cycle stages [21] and asked about their perceived relevance and difficulty. The answers presented in Figure 3 Figure 5 together with the 95% confidence interval. ...
Conference Paper
[Context] In Brazil, 41% of companies use machine learning (ML) to some extent. However, several challenges have been reported when engineering ML-enabled systems, including unrealistic customer expectations and vagueness in ML problem specifications. Literature suggests that Requirements Engineering (RE) practices and tools may help to alleviate these issues, yet there is insufficient understanding of RE's practical application and its perception among practitioners. [Goal] This study aims to investigate the application of RE in developing ML-enabled systems in Brazil, creating an overview of current practices, perceptions, and problems in the Brazilian industry. [Method] To this end, we extracted and analyzed data from an international survey focused on ML-enabled systems, concentrating specifically on responses from practitioners based in Brazil. We analyzed the cluster of RE-related answers gathered from 72 practitioners involved in data-driven projects. We conducted quantitative statistical analyses on contemporary practices using bootstrapping with confidence intervals and qualitative studies on the reported problems involving open and axial coding procedures. [Results] Our findings highlight distinct aspects of RE implementation in ML projects in Brazil. For instance, (i) RE-related tasks are predominantly conducted by data scientists; (ii) the most common techniques for eliciting requirements are interviews and workshop meetings; (iii) there is a prevalence of interactive notebooks in requirements documentation; (iv) practitioners report problems that include a poor understanding of the problem to solve and the business domain, low customer engagement, and difficulties managing stakeholders' expectations. [Conclusion] These results provide an understanding of RE-related practices in the Brazilian ML industry, helping to guide research and initiatives toward improving the maturity of RE for ML-enabled systems.
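The "bootstrapping with confidence intervals" mentioned in the abstract above can be sketched as a percentile bootstrap. This is only an illustration under stated assumptions: the authors' exact procedure is not described here, the function name `bootstrap_ci` is hypothetical, and the 45/27 split of answers is made up (only the total of 72 respondents comes from the abstract).

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat([rng.choice(values) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical survey answers: 1 = respondent reports a given practice.
answers = [1] * 45 + [0] * 27          # 72 respondents, split is invented
point_estimate = statistics.mean(answers)
low, high = bootstrap_ci(answers)
print(f"proportion = {point_estimate:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The percentile method is chosen here for simplicity; other bootstrap interval variants exist and may give slightly different bounds.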
... However, generally, the stages commonly used in the education sector are referred to as the cross-industry standard process for data mining (CRISP-DM). CRISP-DM is a systematic guide for implementing data mining across sectors, including education as shown in Figure 1 [18], [19]. It comprises six stages: business understanding, data understanding, data preparation, modelling, evaluation, and implementation. ...
Article
Educational data mining (EDM) is a strategic technique for exploring data in educational environments to gain a deeper understanding of education. One of the goals of EDM is to predict future student-related outcomes, which can be done using a machine learning approach. In this paper, regression models are developed to predict student performance in the first semester and the waiting period for graduate employment, based on informatics management (MI) and non-informatics management (non-MI) student data. Four regression models are compared for both tasks: support vector regression (SVR), random forest regression (RFR), AdaBoost regression (ABR), and XGBoost regression. In the experiments, for student performance in the first semester, the highest R2 was 0.58 for MI (SVR) and 0.34 for non-MI (RFR). For the waiting period for graduate employment, the highest R2 was 0.44 for MI (AdaBoost regression) and 0.39 for non-MI (SVR).
... In this work, the focus is on the phases of the CRISP-DM methodology [87]. The main reason for orienting to CRISP-DM is that it is described in the literature as a de-facto standard in the industry and is widely used due to its generality [77,81]. Additionally, the phases of CRISP-DM can be applied to other sub-topics of data science projects that are not covered under the term data mining. ...
Article
Technical systems are becoming increasingly complex due to the increasing number of components, functions, and involvement of different disciplines. In this regard, model-driven engineering techniques and practices tame complexity during the development process by using models as primary artifacts. Modeling can be carried out through domain-specific languages whose implementation is supported by model-driven techniques. Today, the amount of data generated during product development is rapidly growing, leading to an increased need to leverage artificial intelligence algorithms. However, using these algorithms in practice can be difficult and time-consuming. Therefore, leveraging domain-specific languages and model-driven techniques for formulating AI algorithms or parts of them can reduce these complexities and be advantageous. This study aims to investigate the existing model-driven approaches relying on domain-specific languages in support of the engineering of AI software systems to sharpen future research further and define the current state of the art. We conducted a Systematic Literature Review (SLR), collecting papers from five major databases, resulting in 1335 candidate studies and eventually retaining 18 primary studies. Each primary study is evaluated and discussed with respect to the adoption of (1) MDE principles and practices and (2) the phases of AI development support aligned with the stages of the CRISP-DM methodology. The study’s findings show that language workbenches are of paramount importance in dealing with all aspects of modeling language development (metamodel, concrete syntax, and model transformation) and are leveraged to define domain-specific languages (DSL) explicitly addressing AI concerns. The most prominent AI-related concerns are training and modeling of the AI algorithm, while minor emphasis is given to the time-consuming preparation of the data sets. 
Early project phases that support interdisciplinary communication of requirements, such as the CRISP-DM Business Understanding phase, are rarely reflected. The study found that the use of MDE for AI is still in its early stages, and there is no single tool or method that is widely used. Additionally, current approaches tend to focus on specific stages of development rather than providing support for the entire development process. As a result, the study suggests several research directions to further improve the use of MDE for AI and to guide future research in this area.
... At this stage, the focus is on formulating relevant questions so that they can be translated into data analysis tasks. With a clear understanding of the business objectives and the problem to be solved, the data team can ensure that the data analysis approach taken is aligned with business needs (Schröer et al., 2021). ...
Article
The surge in global waste, particularly in Indonesia, with a total of 36.218 million tons per year, has become an urgent issue. Challenges in waste management are increasingly complex due to the lack of public understanding and awareness in classifying types of waste. One systemic approach to address waste classification issues involves the use of machine learning technology to categorize waste into two main types: organic and non-organic. The data used in this study comes from a Kaggle website dataset comprising 25,500 entries. This research employs a transfer learning approach with the Inception-V3 architecture and data augmentation implementation. Transfer learning is chosen for its proven performance in image data classification, while data augmentation is implemented to introduce diversity to the dataset. The research stages include business understanding, data preprocessing, data augmentation, data modelling, and evaluation. The results show that the use of transfer learning with the Inception-V3 approach and data augmentation implementation achieves an accuracy rate of 94%, which falls into the excellent category.
... CRISP-DM (CRoss Industry Standard Process for Data Mining), presented by Chapman et al., is a methodology for the implementation of data mining projects with a corporate focus [7]. CRISP-DM is based on the KDD and used in engineering applications as well as in the healthcare industry and the public sector [8]. However, when applied to production processes, this general focus does not provide proper guidance on how to correctly aggregate data on a domain-specific basis. ...
Article
This paper presents an approach for operator assistance in roll forming to overcome the challenges of progressive skilled labor shortage faced by manufacturers of profiled products. An introductory study proves the necessity and the willingness of the roll forming industry to use process data and machine learning based assistance for less experienced operators. A newly built framework contains the characterization of process behavior based on in-line collected data. To support operators during the setup and control of complex manufacturing processes, correlations between tool adjustments and process data are analyzed in a machine learning (ML) pipeline. Setup suggestions are directly provided to the operator for implementation and a feedback loop takes the results into account. To quantify the functionality of the newly developed Machine Learning based Operator Assistance (MLbOA), an exemplary roll forming process is investigated. The system localizes maladjustments in the setup of tool gaps caused by individual mechanical load behavior and offers corrective suggestions to operators with a mean absolute error of 1.26 ± 0.36 μm. This work demonstrates the potential of machine learning based assistance systems to enhance the resilience of manufacturing processes against the challenges posed by the shortage of skilled labor.
... CRISP-DM offers a structured approach to planning data mining and machine learning projects through its six sequential phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Logistic regression was selected as the predictive model for this study due to its effectiveness in binary classification, interpretability, robustness, and simplicity [3]. The dataset analyzed consists of 200,000 rows and 200 numerical variables [4]. ...