Article · PDF Available

Abstract and Figures

In this article, we present OntoDM-core, an ontology of core data mining entities. OntoDM-core defines the most essential data mining entities in a three-layered ontological structure comprising a specification, an implementation and an application layer. It provides a representational framework for the description of mining structured data, and in addition provides taxonomies of datasets, data mining tasks, generalizations, data mining algorithms and constraints, based on the type of data. OntoDM-core is designed to support a wide range of applications/use cases, such as semantic annotation of data mining algorithms, datasets and results; annotation of QSAR studies in the context of drug discovery investigations; and disambiguation of terms in text mining. The ontology has been thoroughly assessed following the practices in ontology engineering, is fully interoperable with many domain resources and is easy to extend. OntoDM-core is available at http://www.ontodm.com
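To make the three-layered structure concrete, the sketch below annotates a decision-tree algorithm at the specification, implementation and application layers using rdflib. The namespace and the property names (is_implementation_of, realizes, has_input) are illustrative assumptions, not the identifiers of the released OntoDM-core OWL file, which should be consulted at http://www.ontodm.com.

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

# Hypothetical namespaces: the real OntoDM-core IRIs may differ.
ONTODM = Namespace("http://www.ontodm.com/OntoDM-core/example#")
EX = Namespace("http://example.org/run#")

g = Graph()
g.bind("ontodm", ONTODM)
g.bind("ex", EX)

# Specification layer: the algorithm as an information entity.
g.add((EX.C45_spec, RDF.type, ONTODM.data_mining_algorithm))
g.add((EX.C45_spec, RDFS.label, Literal("C4.5 decision tree algorithm (specification)")))

# Implementation layer: a concrete realization of the specification.
g.add((EX.J48_impl, RDF.type, ONTODM.data_mining_algorithm_implementation))
g.add((EX.J48_impl, ONTODM.is_implementation_of, EX.C45_spec))

# Application layer: a single execution of the implementation on a dataset.
g.add((EX.run_001, RDF.type, ONTODM.data_mining_algorithm_execution))
g.add((EX.run_001, ONTODM.realizes, EX.J48_impl))
g.add((EX.run_001, ONTODM.has_input, EX.iris_dataset))

print(g.serialize(format="turtle"))
```

Serializing the graph to Turtle makes the three layers explicit: one specification entity, one implementation that realizes it, and one execution linked to its input dataset.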
... We follow an "explanation-by-exploration" approach [47], which enables users to obtain an abstract view of the journey of data through an AI workflow and then dive into the code corresponding to those steps. Fortunately, the Semantic Web community has developed a variety of knowledge representation formalisms to capture fundamental elements of data science workflows, such as provenance, data flows, and high-level activities [33,45,52,62]. That being said, data science workflows are complex and, although models and techniques for representing code as data graphs exist [2,18], generating high-level, compact representations automatically is still an open and challenging problem. ...
... OntoDM-core [52] was developed for representing complex data mining activities. Its conceptualisation distinguishes between a design phase, an implementation phase, and an application phase. ...
... In our work, we avoided including ML-specific terminology for the sake of generality, and consider those activities a special type of Analysis. However, DJO could be extended and specialised to support ML-specific activities, for example, reusing notions from OntoDM [52]. ...
Article
Full-text available
Artificial intelligence systems are not simply built on a single dataset or trained model. Instead, they are made by complex data science workflows involving multiple datasets, models, preparation scripts, and algorithms. Given this complexity, in order to understand these AI systems, we need to provide explanations of their functioning at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys from these workflows. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. Finally, we report on a user survey to reflect on the challenges and opportunities presented by computational data journeys for explainable AI.
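A data journey of this kind can be sketched as a small RDF graph that links one high-level activity both to the data it consumes and produces and to the notebook cells it abstracts. All class and property names below (Preparation, hasInput, derivedFromCell) are hypothetical placeholders, not the terms of the published ontology.

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

# Hypothetical data-journey vocabulary; the published ontology defines its own IRIs.
DJ = Namespace("http://example.org/datajourney#")
EX = Namespace("http://example.org/notebook42#")

g = Graph()

# One high-level activity abstracting over several notebook cells.
g.add((EX.clean_step, RDF.type, DJ.Preparation))
g.add((EX.clean_step, RDFS.label, Literal("Drop missing values and normalize features")))

# Link the activity to the data it consumes and produces ...
g.add((EX.clean_step, DJ.hasInput, EX.raw_csv))
g.add((EX.clean_step, DJ.hasOutput, EX.clean_frame))
# ... and back to the code cells it summarizes, for drill-down.
g.add((EX.clean_step, DJ.derivedFromCell, EX.cell_7))
g.add((EX.clean_step, DJ.derivedFromCell, EX.cell_8))
```

The two layers of the representation, the compact activity graph and the links back to code, are what allows a reader to start from the abstract journey and drill down to the underlying cells.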
... Another example, KDDONTO [175], emphasizes the development of data mining techniques. In addition, Panov et al. [176] created a heavy-weight ontology that offers ways to express data mining items and inductive queries. Confalonieri et al. [177] proposed an extension of Trepan [178] that integrates ontologies in the generation of explanations. ...
Article
Full-text available
Artificial intelligence (AI) is currently being utilized in a wide range of sophisticated applications, but the outcomes of many AI models are challenging to comprehend and trust due to their black-box nature. Usually, it is essential to understand the reasoning behind an AI model's decision-making. Thus, the need for eXplainable AI (XAI) methods for improving trust in AI models has arisen. XAI has become a popular research subject within the AI field in recent years. Existing survey papers have tackled the concepts of XAI, its general terms, and post-hoc explainability methods, but there have not been any reviews that have looked at the assessment methods, available tools, XAI datasets, and other related aspects. Therefore, in this comprehensive study, we provide readers with an overview of the current research and trends in this rapidly emerging area with a case study example. The study starts by explaining the background of XAI, common definitions, and summarizing recently proposed techniques in XAI for supervised machine learning. The review divides XAI techniques into four axes using a hierarchical categorization system: (i) data explainability, (ii) model explainability, (iii) post-hoc explainability, and (iv) assessment of explanations. We also introduce available evaluation metrics as well as open-source packages and datasets with future research directions. Then, the significance of explainability in terms of legal demands, user viewpoints, and application orientation is outlined, termed as XAI concerns. This paper advocates for tailoring explanation content to specific user types. An examination of XAI techniques and evaluation was conducted by looking at 410 critical articles, published between January 2016 and October 2022, in reputed journals and using a wide range of research databases as a source of information. The article is aimed at XAI researchers who are interested in making their AI models more trustworthy, as well as at researchers from other disciplines who are looking for effective XAI methods to complete tasks with confidence while communicating meaning from data.
... Such applications can be found in biomedicine [19], food and nutrition [20], environmental studies [21], etc. In recent years, ontologies are also used to represent computer science domains, such as the domains of data mining (DM) and machine learning (ML) [22]. ...
Article
Full-text available
Many optimization algorithm benchmarking platforms allow users to share their experimental data to promote reproducible and reusable research. However, different platforms use different data models and formats, which drastically complicates the identification of relevant datasets, their interpretation, and their interoperability. Therefore, a semantically rich, ontology-based, machine-readable data model that can be used by different platforms is highly desirable. In this paper, we report on the development of such an ontology, which we call OPTION (OPTImization algorithm benchmarking ONtology). Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process, such as algorithms, problems, and evaluation measures. It also provides means for automatic data integration, improved interoperability, and powerful querying capabilities, thereby increasing the value of the benchmarking data. We demonstrate the utility of OPTION, by annotating and querying a corpus of benchmark performance data from the BBOB collection of the COCO framework and from the Yet Another Black-Box Optimization Benchmark (YABBOB) family of the Nevergrad environment. In addition, we integrate features of the BBOB functional performance landscape into the OPTION knowledge base using publicly available datasets with exploratory landscape analysis. Finally, we integrate the OPTION knowledge base into the IOHprofiler environment and provide users with the ability to perform meta-analysis of performance data.
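A minimal sketch of such an annotation-and-query workflow is shown below, assuming hypothetical OPTION-style IRIs (BenchmarkRun, appliesAlgorithm, onProblem); the released ontology defines the actual vocabulary.

```python
from rdflib import Graph, Namespace, RDF, Literal, XSD

# Hypothetical OPTION-style terms; the released ontology defines the real IRIs.
OPT = Namespace("http://example.org/option#")
EX = Namespace("http://example.org/bench#")

g = Graph()
# Annotate one benchmark run: algorithm, problem, and an evaluation measure.
g.add((EX.run_17, RDF.type, OPT.BenchmarkRun))
g.add((EX.run_17, OPT.appliesAlgorithm, EX.CMA_ES))
g.add((EX.run_17, OPT.onProblem, EX.bbob_f01_dim10))
g.add((EX.run_17, OPT.hasMeasure, EX.run_17_target))
g.add((EX.run_17_target, RDF.type, OPT.TargetPrecision))
g.add((EX.run_17_target, OPT.hasValue, Literal(1e-8, datatype=XSD.double)))

# Query: which algorithms have annotated runs on the chosen BBOB problem?
q = """
PREFIX opt: <http://example.org/option#>
PREFIX ex:  <http://example.org/bench#>
SELECT ?alg WHERE {
  ?run a opt:BenchmarkRun ;
       opt:appliesAlgorithm ?alg ;
       opt:onProblem ex:bbob_f01_dim10 .
}
"""
for row in g.query(q):
    print(row.alg)
```

The same query would work unchanged over runs contributed by different platforms, which is the interoperability benefit the abstract describes.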
... This approach is similar to other modeling approaches in multiple domains. For example, the General Formal Ontology [45] in biological and biomedical areas, the OntoDM [46], and the Data Mining Optimization Ontology (DMOP) [47] for data mining processes. A biological interaction was viewed as a process, with actors (such as herbivory) performing actions (such as consuming part of the plant) and causing something (i.e., damage to parts of the plant). ...
Chapter
Herbarium specimens are one of the most valuable data sources for biodiversity research. They provide an insight into plant distribution and diversity across places over time. With those insights, we would be able to understand the past and predict the future of global change. Millions of images of herbarium specimens have been produced and shared on multiple repositories on the Internet. The images, alongside their meta-data, have driven advanced analytic techniques for scientific discovery, including machine learning. Machine learning techniques, especially supervised ones, build statistical models from training data, such as a collection of images that have been annotated. Typically, the annotation is performed manually by humans. It is an essential process to ensure the quality of the annotated data, since low-quality training data will negatively impact the outcome of the learning process. Object-of-interest annotation is one type of annotation, wherein relevant objects in images are marked and labeled. This annotation has been widely used, especially for object identification tasks. There are multiple tools available to assist the annotation process. However, since each tool employs a different annotation format, working with annotated data from multiple sources is challenging. Furthermore, existing tools provide only an immediate label for an object and ignore how objects are related to each other. In this work, we propose a solution to overcome the challenges and limitations of current methods. Our solution converts the various annotation formats into a single common data model while also extracting the relations between objects. As a result, machine learning algorithms can be trained on richer features of distributed annotated datasets. We evaluated the solution on annotated datasets of insect-plant interactions (herbivory). In this case, the objects of interest are damages caused by insects on a plant's parts (such as leaves). Furthermore, multiple relationships can be extracted from the annotation, for example, how multiple damages can be found on the same specimen. Our results indicated that the solution could be adapted to various classification tasks and to labeled data from multiple sources.
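The conversion idea described above can be sketched as a small common data model plus a per-format converter; the COCO-style keys, the labels and the locatedOn relation below are illustrative assumptions rather than the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """One object of interest on a specimen image (common model)."""
    region_id: str
    label: str          # e.g. "leaf", "herbivory damage"
    bbox: tuple         # (x, y, width, height) in pixels

@dataclass
class Annotation:
    image_id: str
    regions: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (subject_id, predicate, object_id)

def from_coco(coco_image, coco_anns, categories):
    """Convert COCO-style records into the common model (sketch).

    Assumes the usual COCO keys ('id', 'bbox', 'category_id'); a real
    converter would validate them and map category ids more robustly.
    """
    ann = Annotation(image_id=str(coco_image["id"]))
    for a in coco_anns:
        ann.regions.append(
            Region(region_id=str(a["id"]),
                   label=categories[a["category_id"]],
                   bbox=tuple(a["bbox"])))
    # Example relation extraction: damage regions located on a leaf region.
    leaves = [r for r in ann.regions if r.label == "leaf"]
    damages = [r for r in ann.regions if r.label == "herbivory damage"]
    for d in damages:
        for leaf in leaves:
            ann.relations.append((d.region_id, "locatedOn", leaf.region_id))
    return ann
```

Converters for other formats (VIA, Label Studio, etc.) would target the same Region/Annotation classes, so downstream training code sees one model regardless of the source tool.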
... Panov et al. [71] ... Thus, the application and usability of ontologies will differ from application to application. ...
Chapter
Pattern Mining over Data Streams (PMDS) is one of the most significant tasks in data mining. A major challenge is to define a representational framework that unifies PMDS algorithms dealing with different pattern types (frequent itemset, high-utility itemset, uncertain frequent itemset), using different methods (test-and-generate, pattern-growth, hybrid) and different window models (landmark, sliding, decay, tilted) in a uniform fashion. This helps standardize the process, creates a better understanding of algorithm design, and provides a basis for unification and further research. It also facilitates variability management and allows the derivation of tools for wide experimentation. In this publication, we propose a reference ontology to formalize the domain knowledge around PMDS. The design process of the ontology followed leading practices in ontology engineering. It is aligned with the most popular data mining and machine learning ontologies and thus represents a major contribution toward PMDS domain ontologies.
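The three dimensions mentioned above (pattern type, method, window model) lend themselves to simple subclass taxonomies; the sketch below uses hypothetical PMDS IRIs, not the reference ontology's actual terms, to illustrate how an algorithm could be described by the dimensions it combines.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

# Hypothetical PMDS vocabulary; the reference ontology defines its own IRIs.
PMDS = Namespace("http://example.org/pmds#")
g = Graph()

def subclass(child, parent):
    """Declare child as an OWL class and a subclass of parent."""
    g.add((child, RDF.type, OWL.Class))
    g.add((child, RDFS.subClassOf, parent))

# Pattern-type taxonomy.
subclass(PMDS.FrequentItemset, PMDS.PatternType)
subclass(PMDS.HighUtilityItemset, PMDS.PatternType)
subclass(PMDS.UncertainFrequentItemset, PMDS.PatternType)

# Window-model taxonomy.
for w in ("Landmark", "Sliding", "Decay", "Tilted"):
    subclass(PMDS[w + "Window"], PMDS.WindowModel)

# Method taxonomy.
for m in ("TestAndGenerate", "PatternGrowth", "Hybrid"):
    subclass(PMDS[m + "Method"], PMDS.Method)

# An algorithm can then be described by the dimensions it combines.
g.add((PMDS.ExampleAlgorithm, PMDS.minesPattern, PMDS.HighUtilityItemset))
g.add((PMDS.ExampleAlgorithm, PMDS.usesWindowModel, PMDS.SlidingWindow))
g.add((PMDS.ExampleAlgorithm, PMDS.usesMethod, PMDS.PatternGrowthMethod))
```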
Article
Full-text available
While the Web Ontology Language (OWL) provides a mechanism to import ontologies, this mechanism is not always suitable. First, given the current state of editing tools and the issues they have working with large ontologies, direct OWL imports have sometimes proven impractical for day-to-day development. Second, ontologies chosen for integration may be under active development and not aligned with the chosen design principles. Importing heterogeneous ontologies in their entirety may lead to inconsistencies or unintended inferences. In this paper we propose a set of guidelines for importing required terms from an external resource into a target ontology. We describe the guidelines, their implementation, present some examples of application, and outline future work and extensions.
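A minimal, MIREOT-style term import of this kind can be sketched with rdflib as follows; the function name and the choice of copied annotation properties are assumptions, and the usage example pulls a real IAO term only for illustration (network access assumed).

```python
from rdflib import Graph, URIRef, RDF, RDFS, OWL

def import_term(source, target, term_iri, new_parent=None):
    """Minimal-import sketch: copy only the term's class declaration,
    label and comment into the target ontology, optionally re-parenting
    it under a class of the target's own hierarchy."""
    term = URIRef(term_iri)
    target.add((term, RDF.type, OWL.Class))
    for prop in (RDFS.label, RDFS.comment):
        for value in source.objects(term, prop):
            target.add((term, prop, value))
    if new_parent is not None:
        target.add((term, RDFS.subClassOf, URIRef(new_parent)))

# Usage: pull one IAO term into a local ontology under a local parent class
# (the parent IRI is a made-up example).
src = Graph().parse("http://purl.obolibrary.org/obo/iao.owl")
dst = Graph()
import_term(src, dst, "http://purl.obolibrary.org/obo/IAO_0000030",
            new_parent="http://example.org/onto#InformationEntity")
```

Importing only the needed terms, rather than the whole external ontology, avoids the tooling and consistency problems that the paper's guidelines address.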
Article
YAMATO sharply distinguishes itself from other existing upper ontologies in the following respects. (1) Most importantly, YAMATO is designed with both engineering and philosophical minds. (2) YAMATO is based on a sophisticated theory of roles, given that the world is full of roles. (3) YAMATO has a tenable theory of functions which helps to deal with artifacts effectively. (4) Information is a ‘content-bearing’ entity and it differs significantly from the entities that philosophers have traditionally discussed. Taking into account the modern society in which a flood of information occurs, YAMATO has a sophisticated theory of informational objects (representations). (5) Quality and quantity are carefully organized for the sake of greater interoperability of real-world data. (6) The philosophical contribution of YAMATO includes a theory of objects, processes, and events. These features are illustrated with several case studies and have led to the intensive application of YAMATO in domains such as biomedicine and learning engineering.
Book
Introduction to Bio-Ontologies explores the computational background of ontologies. Emphasizing computational and algorithmic issues surrounding bio-ontologies, this self-contained text helps readers understand ontological algorithms and their applications. The first part of the book defines ontology and bio-ontologies. It also explains the importan…
Article
This chapter introduces an ontology-based framework for automated construction of complex interactive data mining workflows as a means of improving productivity of Grid-enabled data exploration systems. The authors first characterize existing manual and automated workflow composition approaches and then present their solution called GridMiner Assistant (GMA), which addresses the whole life cycle of the knowledge discovery process. GMA is specified in the OWL language and is being developed around a novel data mining ontology, which is based on concepts of industry standards such as the Predictive Model Markup Language (PMML), the Cross Industry Standard Process for Data Mining (CRISP-DM), and the Java Data Mining API. The ontology introduces basic data mining concepts like data mining elements, tasks, services, and so forth. In addition, conceptual and implementation architectures of the framework are presented and its application to an example taken from the medical domain is illustrated. The authors hope that further research and development of this framework can lead to productivity improvements, which can have a significant impact on many real-life spheres. For example, it can be a crucial factor in the achievement of scientific discoveries, optimal treatment of patients, productive decision making, cutting costs, and so forth.
Article
We describe the Data Mining OPtimization Ontology (DMOP), which was developed to support informed decision-making at various choice points of the knowledge discovery (KD) process. It can be used as a reference by data miners, but its primary purpose is to automate algorithm and model selection through semantic meta-mining, i.e., ontology-based meta-analysis of complete data mining processes in view of extracting patterns associated with mining performance. DMOP contains in-depth descriptions of DM tasks (e.g., learning, feature selection), data, algorithms, hypotheses (mined models or patterns), and workflows. Its development raised a number of non-trivial modeling problems, the solution to which demanded maximal exploitation of OWL 2 representational potential. We discuss a number of modeling issues encountered and the choices made that led to version 5.3 of the DMOP ontology.
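The meta-mining use case can be sketched in two steps: annotate each workflow with DMOP-style descriptors, then extract (algorithm, performance) pairs from the graph as training data. The IRIs and property names below (DMWorkflow, usesAlgorithm, achievedAccuracy) are placeholders, not actual DMOP 5.3 terms.

```python
from rdflib import Graph, Namespace, RDF, Literal

# Hypothetical DMOP-style terms; version 5.3 of DMOP defines the actual IRIs.
DMOP = Namespace("http://example.org/dmop#")
EX = Namespace("http://example.org/experiment#")

g = Graph()
# Step 1: describe one complete data mining workflow and its outcome.
g.add((EX.wf_3, RDF.type, DMOP.DMWorkflow))
g.add((EX.wf_3, DMOP.addressesTask, DMOP.ClassificationTask))
g.add((EX.wf_3, DMOP.usesAlgorithm, DMOP.NaiveBayes))
g.add((EX.wf_3, DMOP.achievedAccuracy, Literal(0.87)))

# Step 2 (meta-mining, sketch): collect (algorithm, accuracy) pairs over all
# annotated workflows as input for learning performance patterns.
rows = [(str(alg), float(acc))
        for wf in g.subjects(RDF.type, DMOP.DMWorkflow)
        for alg in g.objects(wf, DMOP.usesAlgorithm)
        for acc in g.objects(wf, DMOP.achievedAccuracy)]
print(rows)
```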
Article
Knowledge Discovery in Databases (KDD) has grown considerably in recent years, but providing user support for constructing workflows is still problematic. The large number of operators available in current KDD systems makes it difficult for a user to successfully solve her task. Also, workflows can easily reach a huge number of operators (hundreds), and parts of the workflows are applied several times. Therefore, it becomes hard for the user to construct them manually. In addition, workflows are not checked for correctness before execution. Hence, it frequently happens that the execution of the workflow stops with an error after several hours of runtime. In this paper we present a solution to these problems. We introduce a knowledge-based representation of Data Mining (DM) workflows as a basis for cooperative interactive planning. Moreover, we discuss workflow templates, i.e., abstract workflows that can mix executable operators and tasks to be refined later into sub-workflows. This new representation helps users structure and handle workflows, as it constrains the number of operators that need to be considered. Finally, workflows can be grouped into templates, which foster re-use and further simplify DM workflow construction.
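The template idea can be sketched with a plain data structure in which a workflow step is either a concrete, executable operator or an abstract task awaiting refinement into a sub-workflow; all names below are illustrative and not taken from any particular KDD system.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    """A concrete, executable workflow step."""
    name: str
    params: dict = field(default_factory=dict)

@dataclass
class Task:
    """An abstract step to be refined later into a sub-workflow."""
    goal: str
    refinement: list = None   # filled in later with Operators/Tasks

@dataclass
class WorkflowTemplate:
    steps: list

    def is_executable(self):
        """A template is executable only once every abstract task is refined."""
        return all(isinstance(s, Operator) or (isinstance(s, Task) and s.refinement)
                   for s in self.steps)

# A template mixing a concrete operator with a task to be refined later.
wf = WorkflowTemplate(steps=[
    Operator("csv_loader", {"path": "train.csv"}),
    Task(goal="handle missing values"),        # abstract, constrains later choices
    Operator("decision_tree", {"max_depth": 5}),
])
print(wf.is_executable())   # False until the abstract task is refined
```

Checking executability before running the workflow is exactly the kind of early validation that avoids failures after hours of runtime, and grouping recurring step sequences into templates supports the re-use the paper describes.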