Conference Paper

KDD meets Big Data


Abstract

The Cross-Industry Standard Process for Data Mining (CRISP-DM) was developed in the late 1990s by a consortium of industry participants to facilitate the end-to-end data mining process for Knowledge Discovery in Databases (KDD). While there have been efforts to better integrate it with management and software development practices, there are no extensions to handle the new activities involved in using big data technologies. Data Science Edge (DSE) is an enhanced process model that accommodates big data technologies and data science activities. In recognition of these changes, the author promotes the use of a new term, Knowledge Discovery in Data Science (KDDS), as a call for the community to develop a new industry-standard data science process model.
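As background for the kind of extension the abstract argues for, the sketch below lists the six standard CRISP-DM phases and attaches, purely as illustrative assumptions, the sort of big-data activities an extended model might add; the extension activities are not taken from the DSE paper itself.

```python
# Minimal sketch: the six standard CRISP-DM phases, plus illustrative
# (assumed, not DSE-specified) big-data activities attached to some phases.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Hypothetical big-data activities one might bolt onto the standard phases.
BIG_DATA_EXTENSIONS = {
    "Data Understanding": ["profile distributed sources", "assess volume and velocity"],
    "Data Preparation": ["schema-on-read ingestion", "cluster-scale cleansing"],
    "Modeling": ["distributed model training"],
    "Deployment": ["streaming scoring pipeline"],
}

for phase in CRISP_DM_PHASES:
    extras = BIG_DATA_EXTENSIONS.get(phase, [])
    print(phase, "->", ", ".join(extras) if extras else "(standard activities only)")
```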


... Nevertheless, one of the main shortcomings of CRISP-DM is that it does not explain how teams should organize to carry out the defined processes and does not address any of the above-mentioned team management issues. In this sense, in the words of [64], CRISP-DM needs better integration with management processes, must align with software and agile development methodologies, and, instead of simple checklists, also needs method guidance for individual activities within stages. ...
... The Data Science Edge (DSE) methodology is introduced across two articles by Grady et al. [64,76]. It is an enhanced process model that accommodates big data technologies and data science activities. ...
Preprint
Full-text available
Data science has devoted great research effort to developing advanced analytics, improving data models and cultivating new algorithms. However, few authors have addressed the organizational and socio-technical challenges that arise when executing a data science project: lack of vision and clear objectives, a biased emphasis on technical issues, a low level of maturity for ad hoc projects and the ambiguity of roles in data science are among these challenges. Few methodologies proposed in the literature tackle these types of challenges; some of them date back to the mid-1990s and consequently are not updated to the current paradigm and the latest developments in big data and machine learning technologies. In addition, fewer methodologies offer a complete guideline across team, project, and data and information management. In this article we explore the necessity of developing a more holistic approach for carrying out data science projects. We first review methodologies that have been presented in the literature for data science projects and classify them according to their focus: project, team, data and information management. Finally, we propose a conceptual framework containing the general characteristics that a methodology for managing data science projects from a holistic point of view should have. This framework can be used by other researchers as a roadmap for the design of new data science methodologies or the updating of existing ones.
... Additionally, this outbreak can be exacerbated by poor air quality caused by urban air pollution and uncontrolled forest fires; however, a better analysis of this correlation is still a challenge due to the large amount of data. To overcome this challenge, artificial intelligence approaches like Knowledge Discovery in Databases (KDD) 16,17 could be applied to this type of data, which involves the 4 Vs (volume, variety, velocity, and veracity). 18 In this work we propose the use of a KDD-related algorithm, k-means, together with a time series analysis (Autoregressive Integrated Moving Average with Explanatory Variable, ARIMAX) to identify and analyze clusters of cities more and less affected by the Amazon forest fires, and the relationship between these scenarios and hospitalizations for respiratory diseases (HRD), as well as hospitalizations and mortality rates due to SARS-CoV-2, considering the State of Pará in Brazil. ...
Article
Full-text available
Background Brazil has faced two simultaneous problems related to respiratory health: forest fires and the high mortality rate due to the COVID-19 pandemic. The Amazon rain forest is one of the Brazilian biomes that suffers most from fires caused by droughts and illegal deforestation. These fires can bring respiratory diseases associated with air pollution, and the State of Pará in Brazil is the most affected. The COVID-19 pandemic, combined with air pollution, can potentially increase hospitalizations and deaths related to respiratory diseases. Here, we aimed to evaluate the association of fire occurrences with COVID-19 mortality rates and general respiratory disease hospitalizations in the State of Pará, Brazil. Methods We employed the k-means clustering technique, accompanied by the elbow method to identify the ideal number of clusters, to group the cities of the State of Pará into 10 clusters, from which we selected the clusters with the highest and lowest fire occurrence from 2015 to 2019. Next, an Auto-regressive Integrated Moving Average with Exogenous variables (ARIMAX) model was proposed to study the serial correlation of respiratory disease hospitalizations and their association with fire occurrences. Regarding the COVID-19 analysis, we computed the mortality risk and its confidence level considering the quarterly incidence rate ratio in clusters with high and low exposure to fires. Findings Using the k-means algorithm we identified, from the ten clusters dividing the State of Pará, two clusters with similar HDI (Human Development Index) and GDP (Gross Domestic Product) but with diverse behavior with respect to hospitalizations and forest fires in the Amazon biome. The ARIMAX model showed that, besides the serial correlation, fire occurrences contribute to the increase in respiratory diseases, with an observed lag of six months after the fires for the cluster with high exposure to fires. A highlight that deserves attention concerns the relationship between fire occurrences and deaths. Historically, the risk of mortality from respiratory diseases is about twice as high in regions and periods with high exposure to fires as in those with low exposure. The same pattern remains during the COVID-19 pandemic, where the risk of mortality from COVID-19 was 80% higher in the region and period with high exposure to fires. Regarding the SARS-CoV-2 analysis, the risk of mortality related to COVID-19 is higher in the period with high exposure to fires than in the period with low exposure. Another highlight concerns the relationship between fire occurrences and COVID-19 deaths: regions with high fire occurrences are associated with more COVID-19 deaths. Interpretation Decision making is a critical problem, especially when it involves environmental and health control policies. Environmental policies are often more cost-effective as health measures than the use of public health services. This highlights the importance of data analysis to support decision making and to identify populations in need of better infrastructure due to historical environmental factors and the associated health risks. The results suggest that fire occurrences contribute to the increase in respiratory disease hospitalizations, that the mortality rate related to COVID-19 was higher in the period with high exposure to fires than in the period with low exposure, and that regions with high fire occurrences are associated with more COVID-19 deaths, mainly in the months with a high number of fires. Funding No additional funding source was required for this study.
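As a rough illustration of the pipeline summarized above, the following sketch clusters city-level features with k-means, applies the elbow heuristic, and fits an ARIMAX-style model (SARIMAX with an exogenous fire-count regressor); the synthetic data, the assumed six-month lag and the model order are placeholders, not the study's actual inputs or parameters.

```python
# Sketch of the described pipeline: k-means + elbow heuristic, then an
# ARIMAX-style fit (SARIMAX with an exogenous regressor). All data and
# model settings below are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)

# Placeholder feature matrix: one row per city (e.g. fire counts, HDI, GDP proxy).
cities = rng.normal(size=(50, 3))

# Elbow heuristic: inspect within-cluster sum of squares (inertia) over k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(cities).inertia_
            for k in range(2, 11)}
print(inertias)

# Cluster with a chosen k (here 10, mirroring the ten groups in the study).
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(cities)

# ARIMAX-style model: monthly hospitalizations with fire counts as exogenous input.
months = 60
fires = rng.poisson(lam=20, size=months).astype(float)
hosp = 100 + 0.5 * np.roll(fires, 6) + rng.normal(scale=5, size=months)  # assumed 6-month lag

result = SARIMAX(hosp, exog=fires, order=(1, 0, 1)).fit(disp=False)
print(result.summary())
```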
... Also, Espinoza and Armour [5] identified that building big data & analytics capabilities is not enough to successfully carry out a big data project; coordination and governance also play a major role. On the other hand, Grady [7] identified mission expertise, domain data and processes, statistics, software systems and engineering, analytic systems, and research and algorithms as the skills that a big data team needs to successfully undertake a big data project. Bhardwaj [2] identified that collaborative data analysis and data science are often done following an ad hoc methodology and by trial and error. ...
Chapter
Full-text available
The development of big data & analytics projects with the participation of several corporate divisions and research groups within and among organizations is a non-trivial problem and requires well-defined roles and processes. Since there is no accepted standard for the implementation of big data & analytics projects, project managers have to either adapt an existing data mining process methodology or create a new one. This work presents a use case for a big data & analytics project in the banking sector. The authors found that an adaptation of ASUM-DM, a refinement of CRISP-DM, with the addition of big data analysis, application prototyping and prototype evaluation, plus strong project management with an emphasis on communications, proved the best solution for developing a cross-disciplinary, multi-organization, geographically distributed big data & analytics project.
... It is common to extend or modify CRISP-DM for the particular needs of a project. For example, in [5] the authors extend CRISP-DM to process big scientific data. CRISP-DM is data centric and does not tackle service composition or deployment problems in detail. ...
Conference Paper
Full-text available
The growing digitization and networking process within our society has a large influence on all aspects of everyday life. Large amounts of data are being produced continuously, and when these are analyzed and interlinked they have the potential to create new knowledge and intelligent solutions for the economy and society. To process this data, we developed the Big Data Integrator (BDI) Platform with various Big Data components available out of the box. The integration of the components inside the BDI Platform requires component homogenization, which leads to the standardization of the development process. To support these activities we created the BDI Stack Lifecycle (SL), which consists of development, packaging, composition, enhancement, deployment and monitoring steps. In this paper, we show how we support the BDI SL with the enhancement applications developed in the BDE project. As an evaluation, we demonstrate the applicability of the BDI SL on three pilots in the domains of transport, social sciences and security.
Chapter
An overview of common process models for the implementation of data science is presented in this article. Since the development of KDD and CRISP-DM, the central ideas have been examined from broader perspectives, and further frameworks have been created. In addition to the core activities that are conducted in the individual process phases, typical roles and project-supporting artifacts are outlined. In summary, a distinction can be made between four process phases that relate to ideation, data, analysis, and deployment. These phases are considered as a holistic methodology of data science. The overview is an orientation for data scientists and project managers in the preparation and realization of data science projects. However, many challenges need to be overcome to further specify and specialize the processes, which may also lead to new approaches to data science methodology in the future.
Article
Full-text available
There is an increasing number of big data science projects aiming to create value for organizations by improving decision making, streamlining costs or enhancing business processes. However, many of these projects fail to deliver the expected value. It has been observed that a key reason many data science projects don't succeed is not technical in nature, but rather the process aspect of the project. The lack of established and mature methodologies for executing data science projects has been frequently noted as a reason for these project failures. To help move the field forward, this study presents a systematic review of research focused on the adoption of big data science process frameworks. The goal of the review was to identify (1) the key themes, with respect to current research on how teams execute data science projects, (2) the most common approaches regarding how data science projects are organized, managed and coordinated, (3) the activities involved in a data science project's life cycle, and (4) the implications for future research in this field. In short, the review identified 68 primary studies thematically classified in six categories. Two of the themes (workflow and agility) accounted for approximately 80% of the identified studies. The findings regarding workflow approaches consist mainly of adaptations to CRISP-DM (vs. entirely new proposed methodologies). With respect to agile approaches, most of the studies only explored the conceptual benefits of using an agile approach in a data science project (vs. actually evaluating an agile framework being used in a data science context). Hence, one finding from this research is that future research should explore how to best achieve the theorized benefits of agility. Another finding is the need to explore how to efficiently combine workflow and agile frameworks within a data science context to achieve a more comprehensive approach for project execution.
Conference Paper
Full-text available
Data science is an emerging discipline with a particular research focus on improving the available techniques for data analysis. While the number of data science projects is growing, unfortunately there is little consideration of how a team performs a data science project. Although the existence of a repeatable, well-defined process could address many challenges of data science projects, research conducted in recent years indicates a convergence toward agile methodologies as appropriate for these projects. In this paper, first, the tasks and roles of individuals in data science projects are addressed; then, research on the methodologies used in such projects is reviewed. The study shows that agile methodologies could resolve many issues of data science projects by increasing the communication and cooperation of team members and investors.
Article
Full-text available
Mining ubiquitous sensing data is important but also challenging, due to many factors, such as heterogeneous large-scale data that is often at various levels of abstraction. This also relates particularly to the important aspects of the explainability and interpretability of the applied models and their results, and thus ultimately to the outcome of the data mining process. With this, in general, the inclusion of domain knowledge leading towards semantic data mining approaches is an emerging and important research direction. This article aims to survey relevant works in these areas, focusing on semantic data mining approaches and methods, but also on selected applications of ubiquitous sensing in some of the most prominent current application areas. Here, we consider in particular: (1) environmental sensing; (2) ubiquitous sensing in industrial applications of artificial intelligence; and (3) social sensing relating to human interactions and the respective individual and collective behaviors. We discuss these in detail and conclude with a summary of this emerging field of research. In addition, we provide an outlook on future directions for semantic data mining in ubiquitous sensing contexts.
Article
Full-text available
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely accepted framework in production and manufacturing. This data-driven knowledge discovery framework provides an orderly partition of the often complex data mining processes to ensure a practical implementation of data analytics and machine learning models. However, the practical application of robust industry-specific data-driven knowledge discovery models faces multiple data- and model development-related issues. These issues need to be carefully addressed by allowing a flexible, customized and industry-specific knowledge discovery framework. For this reason, extensions of CRISP-DM are needed. In this paper, we provide a detailed review of CRISP-DM and summarize extensions of this model into a novel framework we call the Generalized Cross-Industry Standard Process for Data Science (GCRISP-DS). This framework is designed to allow dynamic interactions between different phases to adequately address data- and model-related issues for achieving robustness. Furthermore, it also emphasizes the need for a detailed business understanding and the interdependencies with the developed models and data quality for fulfilling higher business objectives. Overall, such a customizable GCRISP-DS framework provides an enhancement for model improvements and reusability by minimizing robustness issues.
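One way to picture the "dynamic interactions between phases" that GCRISP-DS emphasizes is a directed graph in which feedback edges are explicit; the edge set in this minimal sketch is an assumption for illustration, not the paper's formal definition.

```python
# Sketch: phases as a directed graph with explicit feedback edges, as one way
# to picture dynamic interactions between phases. The edge set is assumed.
FORWARD = {
    "business understanding": ["data understanding"],
    "data understanding": ["data preparation"],
    "data preparation": ["modeling"],
    "modeling": ["evaluation"],
    "evaluation": ["deployment"],
}
FEEDBACK = {
    "modeling": ["data preparation"],
    "evaluation": ["data preparation", "business understanding"],
    "deployment": ["business understanding"],
}

def next_phases(phase):
    """All phases reachable in one step, either forward or via a feedback edge."""
    return FORWARD.get(phase, []) + FEEDBACK.get(phase, [])

print(next_phases("evaluation"))
```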
Article
Full-text available
Nowadays, we are in the era of advanced technologies where tremendous amounts of data are produced by multiple sources such as sensors, devices, social media, user experiences, etc. Furthermore, this raw data has a low value, and a major part of it is not really useful or important for business. One way to add value to this stored data is to extract useful knowledge from it, for the end system or the end users, through a process commonly called Knowledge Discovery in Databases (KDD). Smart Farming uses a large number of connected technologies, which also produce a huge amount of data, in order to maximize production while reducing human effort, environmental impact and the waste of natural resources. In this paper, we develop a new data analytic architecture dedicated to Precision Livestock Farming (PLF) to improve, in particular, livestock production, animal welfare, and farming processes. We present a new data processing architecture for a knowledge-base management system (KBMS) that eases decision support and monitoring operations and can help farmers and stakeholders better exploit data and have a long-term view of the evolution of the knowledge it contains. Our main contribution in the present paper is a new architecture specifically developed for precision livestock farming that integrates a periodic data reevaluation, addressing the problem of data conservation and the decrease in data value over time.
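The periodic data reevaluation mentioned above can be sketched as a value score that decays over time and is refreshed when an item is reconfirmed by new data; the exponential decay model and the half-life below are assumptions, not the architecture's actual mechanism.

```python
# Sketch of periodic reevaluation: a knowledge item's value decays over time
# and is refreshed when reconfirmed. The decay model and half-life are assumed.
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 90  # assumed half-life of a knowledge item's value

def current_value(initial_value, last_confirmed, now=None):
    """Exponentially decayed value since the item was last confirmed."""
    now = now or datetime.utcnow()
    age_days = (now - last_confirmed).total_seconds() / 86400
    return initial_value * 0.5 ** (age_days / HALF_LIFE_DAYS)

def reevaluate(item, confirmed_by_new_data, now=None):
    """Periodic pass: refresh confirmed items, let the others keep decaying."""
    now = now or datetime.utcnow()
    if confirmed_by_new_data:
        item["last_confirmed"] = now
    item["value"] = current_value(item["initial_value"], item["last_confirmed"], now)
    return item

item = {"initial_value": 1.0, "last_confirmed": datetime.utcnow() - timedelta(days=180)}
print(reevaluate(item, confirmed_by_new_data=False))
```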
Chapter
Processes and practices used in data science projects have been reshaped, especially over the last decade. These differ from their software engineering counterparts. However, to a large extent, data science relies on software, and, once taken into use, the results of a data science project are often embedded in a software context. Hence, seeking synergy between software engineering and data science might open promising avenues. However, while there are various studies on data science workflows and data science project teams, there have been no attempts to combine these two closely interlinked aspects. Furthermore, existing studies usually focus on practices within one company. Our study fills these gaps with a multi-company case study concentrating both on the roles found in data science project teams and on the process. In this paper, we studied a number of practicing data scientists to understand a typical process flow for a data science project. In addition, we studied the involved roles and the teamwork that takes place in the data context. Our analysis revealed three main elements of data science projects: Experimentation, Development Approach, and Multi-disciplinary Team(work). These key concepts are further broken down into 13 different sub-themes in total. The themes pinpoint critical elements and challenges found in data science projects, which are still often run in an ad hoc fashion. Finally, we compare the results with modern software development to analyze how good a match there is.
Article
Big Data is a term that gained popularity due to its potential benefits in various fields and is progressively being used. However, there are still many gaps and challenges to overcome, especially when it comes to the selection and handling of relevant technologies. As a consequence of the huge number of manifestations in this area, which grows each year, uncertainty and complexity increase. The lack of a classification approach causes a growing demand for experts with broad knowledge and expertise. Using various techniques of ontology engineering and following the design science methodology, this work proposes the Big Data Technology Ontology (BDTOnto) as a comprehensive and sustainable approach to classifying big data technologies and their manifestations. In particular, a reusable, extensible and adaptable artifact in the form of an ontology is developed and evaluated.
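To give a flavour of the kind of classification such an ontology enables, here is a minimal rdflib sketch with a tiny, assumed class hierarchy; the class and instance names are illustrative and do not reproduce BDTOnto's actual vocabulary.

```python
# Minimal ontology sketch with rdflib. The class/instance names below are
# illustrative assumptions, not BDTOnto's actual vocabulary.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/bigdata#")
g = Graph()
g.bind("ex", EX)

# Tiny class hierarchy: big data technologies grouped by capability.
g.add((EX.BigDataTechnology, RDF.type, RDFS.Class))
g.add((EX.StreamProcessing, RDFS.subClassOf, EX.BigDataTechnology))
g.add((EX.DistributedStorage, RDFS.subClassOf, EX.BigDataTechnology))

# Example manifestations classified under those capabilities.
g.add((EX.ApacheKafka, RDF.type, EX.StreamProcessing))
g.add((EX.ApacheKafka, RDFS.label, Literal("Apache Kafka")))
g.add((EX.HDFS, RDF.type, EX.DistributedStorage))

print(g.serialize(format="turtle"))
```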
Conference Paper
Full-text available
Most commercial data mining products provide a large number of models and tools for performing various data mining tasks, but few provide intelligent assistance for addressing the many important decisions that must be considered during the mining process. In this paper, we propose the realization of a hybrid data mining assistant, based on the CBR paradigm and the use of an ontology, in order to empower the user during the various phases of the data mining process.
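A hedged sketch of the case-based reasoning idea: past data mining cases are described by a few features, and the assistant retrieves the most similar one to reuse its recommendation; the feature set, case base and similarity measure are assumptions, not the paper's actual design.

```python
# Sketch of CBR-style retrieval for a data mining assistant: find the most
# similar past case and reuse its recommendation. Features, cases and the
# similarity measure are assumptions, not the paper's actual design.
CASES = [
    {"task": "classification", "rows": 10_000, "mostly_numeric": True,
     "recommendation": "gradient-boosted trees"},
    {"task": "classification", "rows": 500, "mostly_numeric": False,
     "recommendation": "decision tree with careful validation"},
    {"task": "clustering", "rows": 50_000, "mostly_numeric": True,
     "recommendation": "k-means after standardization"},
]

def similarity(case, query):
    """Crude similarity: count matching attributes (row counts compared by magnitude)."""
    score = 0
    score += case["task"] == query["task"]
    score += case["mostly_numeric"] == query["mostly_numeric"]
    score += abs(case["rows"] - query["rows"]) < 0.5 * max(case["rows"], query["rows"])
    return score

query = {"task": "classification", "rows": 8_000, "mostly_numeric": True}
best = max(CASES, key=lambda c: similarity(c, query))
print(best["recommendation"])
```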
Article
Full-text available
Knowledge Discovery and Data Mining is a very dynamic research and development area that is reaching maturity. As such, it requires stable and well-defined foundations, which are well understood and popularized throughout the community. This survey presents a historical overview, description and future directions concerning a standard for a Knowledge Discovery and Data Mining process model. It presents a motivation for use and a comprehensive comparison of several leading process models, and discusses their applications to both academic and industrial problems. The main goal of this review is the consolidation of the research in this area. The survey also proposes to enhance existing models by embedding other current standards to enable automation and interoperability of the entire process.
Article
Full-text available
Data mining projects are implemented by following the knowledge discovery process. This process is highly complex and iterative in nature and comprises several phases, starting with business understanding and followed by data understanding, data preparation, modeling, evaluation and deployment or implementation. Each phase comprises several tasks. Knowledge Discovery and Data Mining (KDDM) process models are meant to provide prescriptive guidance for the execution of the end-to-end knowledge discovery process, i.e. such models prescribe how exactly each of the tasks in a data mining project can be implemented. Given this role, the quality of the process model used affects the effectiveness and efficiency with which the knowledge discovery process can be implemented, and therefore the outcome of the overall data mining project. This paper presents the results of a rigorous evaluation of the Integrated Knowledge Discovery and Data Mining (IKDDM) process model and compares it to the CRISP-DM process model. Results of statistical tests confirm that the IKDDM leads to a more effective and efficient implementation of the knowledge discovery process.
Article
Full-text available
Up to now, many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. In this paper, we describe the most used (in industrial and academic projects) and most cited (in the scientific literature) data mining and knowledge discovery methodologies and process models, providing an overview of their evolution along data mining and knowledge discovery history and setting out the state of the art in this topic. For every approach, we provide a brief description of the proposed knowledge discovery in databases (KDD) process, discussing special features, outstanding advantages and disadvantages of each approach. Apart from that, a global comparison of all presented data mining approaches is provided, focusing on the different steps and tasks in which every approach interprets the whole KDD process. As a result of the comparison, we propose a new data mining and knowledge discovery process, named the refined data mining process, for developing any kind of data mining and knowledge discovery project. The refined data mining process is built on specific steps taken from the analyzed approaches.
Conference Paper
Full-text available
As the world becomes increasingly dynamic, traditional static modeling may not be able to cope. One solution is to use agile modeling, which is characterized by flexibility and adaptability. On the other hand, data mining applications require a greater diversity of technology, business skills, and knowledge than typical applications, which means they may benefit greatly from features of agile software development. In this paper, we propose a framework named ASD-DM, based on Adaptive Software Development (ASD), that can easily adapt to predictive data mining applications. A case study in the automotive manufacturing domain is described and used to evaluate the ASD-DM methodology.
Article
Full-text available
They also enable integration and automation of DMKD tasks. The chapter describes a six-step DMKD process model, the above-mentioned technologies, and their implementation details. Knowledge Discovery (KD) is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from large collections of data [30]. One of the KD steps is Data Mining (DM). DM is the step concerned with the actual extraction of knowledge from data, in contrast to the KD process, which is concerned with many other things such as understanding and preparation of the data, and verification and application of the discovered knowledge. In practice, however, people use the terms DM, KD, and DMKD as synonyms. The design of a framework for a knowledge discovery process is an important issue. Several researchers have described a series of steps that constitute the KD process, ranging from very simple models incorporating few steps...
Article
Doing data science is difficult. Projects are typically very dynamic, with requirements that change as data understanding grows. The data itself arrives piecemeal, is added to, is replaced, contains undiscovered flaws and comes from a variety of sources. Teams also have mixed skill sets and tooling is often limited. Despite these disruptions, a data science team must get off the ground fast and begin demonstrating value with traceable, tested work products. This is when you need Guerrilla Analytics. In this book, you will learn about: the Guerrilla Analytics Principles, simple rules of thumb for maintaining data provenance across the entire analytics life cycle from data extraction, through analysis, to reporting; reproducible, traceable analytics, i.e. how to design and implement work products that are reproducible, testable and stand up to external scrutiny; practice tips and war stories, 90 practice tips and 16 war stories based on real-world project challenges encountered in consulting, pre-sales and research; preparing for battle, i.e. how to set up your team's analytics environment in terms of tooling, skill sets, workflows and conventions; and data gymnastics, over a dozen analytics patterns that your team will encounter again and again in projects. © 2015 Enda Ridge. Published by Elsevier Inc. All rights reserved.
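In the spirit of the provenance principles described above, the following sketch shows one possible convention in which every data drop and work product gets a numbered, immutable folder and a log entry; the naming scheme and log format are assumptions, not necessarily the book's exact conventions.

```python
# Sketch of one possible provenance convention: each incoming data drop and
# each work product gets the next numbered folder plus a log entry.
# The naming scheme and log format are assumptions for illustration.
import csv
from datetime import datetime
from pathlib import Path

ROOT = Path("project")

def register(kind, description):
    """Create the next numbered folder for a data drop or work product and log it."""
    base = ROOT / kind                                   # e.g. project/data or project/wp
    base.mkdir(parents=True, exist_ok=True)
    next_id = f"{kind}{len(list(base.iterdir())) + 1:03d}"
    (base / next_id).mkdir()
    with open(ROOT / "provenance_log.csv", "a", newline="") as f:
        csv.writer(f).writerow([next_id, datetime.utcnow().isoformat(), description])
    return base / next_id

folder = register("data", "customer extract received from source system")
print(folder)
```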
Article
The number, variety and complexity of projects involving data mining or knowledge discovery in databases activities have lately increased at such a pace that aspects related to their development process need to be standardized for results to be integrated, reused and interchanged in the future. Data mining projects are quickly becoming engineering projects, and current standard processes, like CRISP-DM, need to be revisited to incorporate this engineering viewpoint. This is the central motivation of this paper, which makes the point that experience gained about the software development process over almost 40 years could be reused and integrated to improve data mining processes. Consequently, this paper proposes to reuse ideas and concepts underlying the IEEE Std 1074 and ISO 12207 software engineering process models to redefine and extend the CRISP-DM process and make it a data mining engineering standard.