
Michel Dumontier
- PhD
- Professor (Full) at Maastricht University
About
325 Publications
91,779 Reads
22,750 Citations
Current institution
Maastricht University
Additional affiliations
September 1999 - September 2004
July 2005 - August 2013
Publications (325)
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a co...
Data science is an interdisciplinary field of research that works with data from many different domains. When analyzing these data, data scientists implicitly agree to follow the rules governing those fields. However, the responsibilities of the involved actors are not necessarily explicit. While novel frameworks supporting open science are being proposed, there are c...
Sound recognition is effortless for humans but poses a significant challenge for artificial hearing systems. Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have recently surpassed traditional machine learning in sound classification. However, current DNNs map sounds to labels using binary categorical variables, neglec...
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-producing events and scenes. Recent advances in computing power and dataset creation have led to significant progress in this domain. This paper surveys existing datasets used for training audio-language models, emphasizing the recent trend towards using large,...
Rule mining on knowledge graphs allows for explainable link prediction. In contrast, embedding-based methods for link prediction are well known for their generalization capabilities, but their predictions are not interpretable. Several approaches combining the two families have been proposed in recent years. The majority of the resulting hybrid appr...
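As a rough illustration of the idea behind such hybrid approaches (not the specific methods compared in this work), the sketch below combines an interpretable rule-based score with a TransE-style embedding score for a candidate triple; the toy rule representation, the scoring functions, and the weight alpha are assumptions made for this example.

```python
# Illustrative sketch only (not the surveyed hybrid methods): combine an
# interpretable rule-based score with an embedding-based score for a
# candidate triple (h, r, t).
import numpy as np

# A "rule" here is just a callable body: body(h, t, facts) -> bool, plus a head relation.
rules = [
    {"head": "treats", "body": lambda h, t, facts: ("targets", h, t) in facts},
]

def rule_score(h, r, t, facts):
    applicable = [rule for rule in rules if rule["head"] == r]
    if not applicable:
        return 0.0
    return sum(rule["body"](h, t, facts) for rule in applicable) / len(applicable)

def embedding_score(h, r, t, ent, rel):
    # TransE-style: the smaller ||e_h + e_r - e_t||, the more plausible the triple.
    return -np.linalg.norm(ent[h] + rel[r] - ent[t])

def hybrid_score(h, r, t, facts, ent, rel, alpha=0.5):
    return alpha * rule_score(h, r, t, facts) + (1 - alpha) * embedding_score(h, r, t, ent, rel)

# Tiny usage example with random embeddings standing in for trained ones.
facts = {("targets", "drugA", "proteinX")}
ent = {k: np.random.randn(16) for k in ["drugA", "proteinX"]}
rel = {"treats": np.random.randn(16)}
print(hybrid_score("drugA", "treats", "proteinX", facts, ent, rel))
```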
The emerging European Health Data Space (EHDS) Regulation opens new prospects for large-scale sharing and re-use of health data. Yet, the proposed regulation suffers from two important limitations: it is designed to benefit the whole population with limited consideration for individuals, and the generation of secondary datasets from heterogeneous,...
A large amount of personal health data that is highly valuable to the scientific community is still not accessible or requires a lengthy request process due to privacy concerns and legal restrictions. As a solution, synthetic data has been studied and proposed to be a promising alternative to this issue. However, generating realistic and privacy-pr...
Developing personal data sharing tools and standards in conformity with data protection regulations is essential to empower citizens to control and share their health data with authorized parties for any purpose they approve. This can be, among others, for primary use in healthcare, or secondary use for research to improve human health and well-bei...
Data science is an interdisciplinary research area where scientists are typically working with data coming from different fields. When using and analyzing data, the scientists implicitly agree to follow standards, procedures, and rules set in these fields. However, guidance on the responsibilities of the data scientists and the other involved actor...
The objective of the FAIR Digital Objects Framework (FDOF) is for objects published in a digital environment to comply with a set of requirements, such as identifiability, and the use of a rich metadata record (Santos 2021, Schultes and Wittenburg 2019, Schwardmann 2020). With the increasing prevalence of the FAIR (Findable, Accessible, Interoperab...
Taxonomies and ontologies for the characterization of everyday sounds have been developed in several research fields, including auditory cognition, soundscape research, artificial hearing, sound design, and medicine. Here, we surveyed 36 of such knowledge organization systems, which we identified through a systematic literature search. To evaluate...
The mining of personal data collected by multiple organizations remains challenging in the presence of technical barriers, privacy concerns, and legal and/or organizational restrictions. While a number of privacy-preserving data mining frameworks have recently emerged, much remains to be done to demonstrate their practical utility. In this study, we implement an...
Easy access to data is one of the main avenues to accelerate scientific research. As a key element of scientific innovation, data sharing allows the reproduction of results and helps prevent data fabrication, falsification, and misuse. Although the benefits of data reuse are widely acknowledged, the data collections existing today are stil...
Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development owing to some unique challenges such as capturing dependencies in imbalanced data, optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose...
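For readers unfamiliar with the underlying setup, the following is a minimal, generic GAN sketch for tabular records in PyTorch. It does not implement the privacy-preserving or imbalance-aware techniques proposed in this work; the network sizes, feature count, and training loop are illustrative assumptions.

```python
# Minimal GAN sketch for tabular data (PyTorch), purely illustrative.
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM = 10, 32

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_FEATURES),          # outputs one synthetic record
)
discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),                   # real/fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(512, N_FEATURES)   # stand-in for a real patient table

for step in range(200):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, NOISE_DIM))

    # Discriminator: real records -> 1, synthetic records -> 0.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into predicting 1 for synthetic records.
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic = generator(torch.randn(100, NOISE_DIM)).detach()  # 100 synthetic rows
```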
This is the authors' responses to peer review reports related to MS#34979.
Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph‐based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. Howe...
There are thousands of distinct disease entities and concepts, each of which are known by different and sometimes contradictory names. The lack of a unified system for managing these entities poses a major challenge for both machines and humans that need to harmonize information to better predict causes and treatments for disease. The Mondo Disease...
Background
The European Platform on Rare Disease Registration (EU RD Platform) aims to address the fragmentation of European rare disease (RD) patient data, scattered among hundreds of independent and non-coordinating registries, by establishing standards for integration and interoperability. The first practical output of this effort was a set of 1...
Advancements in oncology and radiology are driving more specific, and thus improved, treatment and diagnostic opportunities. This creates challenges in the assessment of management options, as more information is needed to make an informed decision. One of the methods is to use machine- and deep-learning techniques to develop predictive models. Alth...
Wilkinson et al. claimed in previous work that adherence of a dataset to the FAIR Guiding Principles enables its automated discovery. We present here a formalization of that claim, stating that all things of class “adherence to the FAIR Guiding principles” that are in the context of a thing of class “data set” can generally have a relation of type...
While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In particular, metadata are useful to understand the nature and provenance of the data. A common approach to improving the quality of metadata relies on expensive...
BACKGROUND
In the poorly studied field of physician suicide, various factors can contribute to misinformation or information distortion, which in turn can influence evidence-based policies and prevention of suicide in this unique population.
OBJECTIVE
The aim of this paper is to use nanopublications as a scientific publishing approach to establish...
Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, there are a number of issues that pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining techniques (PPDDM) aim to...
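One classic building block behind such privacy-preserving analyses is additive secret sharing, sketched below for a distributed sum of patient counts. This is a textbook primitive shown for illustration only, not the specific PPDDM techniques reviewed here.

```python
# Additive secret sharing of local counts: each party splits its private value
# into random shares so that no single party learns another's value, yet the
# shares sum to the true total.
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def make_shares(value, n_parties):
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three hospitals each hold a private patient count.
private_counts = [120, 340, 95]
n = len(private_counts)

# Each party sends one share of its value to every party (including itself).
all_shares = [make_shares(v, n) for v in private_counts]

# Each party sums the shares it received; combining the partial sums yields the total.
partial_sums = [sum(all_shares[p][i] for p in range(n)) % PRIME for i in range(n)]
total = sum(partial_sums) % PRIME
print(total)  # 555, without any party revealing its individual count
```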
Research using health data is challenged by its heterogeneous nature, description and storage. The COVID-19 outbreak made clear that rapid analysis of observations such as clinical measurements across a large number of healthcare providers can have enormous health benefits. This has brought into focus the need for a common model of quantitative hea...
The COST Action Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC, CA15205, www.greekc.org) organized nine workshops in a four-year period, starting September 2016. The workshops brought together a wide range of experts from all over the world working on various parts of the knowledge cycle that is central to understanding gene regu...
The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting,...
Background
Artificial intelligence (AI) typically requires a significant amount of high-quality data to build reliable models, where gathering enough data within a single institution can be particularly challenging. In this study we investigated the impact of using sequential learning to exploit very small, siloed sets of clinical and imaging data...
Background
The amount of available data, which can facilitate answering scientific research questions, is growing. However, the different formats of published data are expanding as well, creating a serious challenge when multiple datasets need to be integrated for answering a question.
Results
This paper presents a semi-automated framework that pr...
Despite the significant health impacts of adverse events associated with drug-drug interactions, no standard models exist for managing and sharing evidence describing potential interactions between medications. Minimal information models have been used in other communities to establish community consensus around simple models capable of communicati...
While the publication of Linked Data has become increasingly common, the process tends to be a relatively complicated and heavy-weight one. Linked Data is typically published by centralized entities in the form of larger dataset releases, which has the downside that there is a central bottleneck in the form of the organization or individual respons...
Accurate and precise information about the therapeutic uses (indications) of a drug is essential for applications in drug repurposing and precision medicine. Leading online drug resources such as DrugCentral and DrugBank provide rich information about various properties of drugs, including their indications. However, because indications in such dat...
The quality of a Knowledge Graph (also known as Linked Data) is an important aspect to indicate its fitness for use in an application. Several quality dimensions are identified, such as accuracy, completeness, timeliness, provenance, and accessibility, which are used to assess the quality. While many prior studies offer a landscape view of data qua...
To better allocate funds in the new EU research framework programme Horizon Europe, an assessment of current and past efforts is crucial. In this paper we develop and apply a multi-method qualitative and computational approach to provide a catalogue of climate crisis mitigation technologies on the EU level between 2014 and 2020. Using the approach,...
One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of ent...
It is essential for the advancement of science that researchers share, reuse and reproduce each other’s workflows and protocols. The FAIR principles are a set of guidelines that aim to maximize the value and usefulness of research data, and emphasize the importance of making digital objects findable and reusable by others. The question of how to ap...
Author summary
Everything we do today is becoming more and more reliant on the use of computers. The field of biology is no exception; but most biologists receive little or no formal preparation for the increasingly computational aspects of their discipline. In consequence, informal training courses are often needed to plug the gaps; and the demand...
In the poorly studied field of physician suicide, various factors can contribute to misinformation or information distortion, which in turn can influence evidence-based policies and prevention of suicide in this unique population. Here, we report on the use of nanopublications as a scientific publishing approach to establish a citation network of...
Validating RDF data becomes necessary in order to ensure data compliance against the conceptualization model it follows, e.g., schema or ontology behind the data, and improve data consistency and completeness. There are different approaches to validate RDF data, for instance, JSON schema, particularly for data in JSONLD format, as well as Shape Exp...
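As a concrete illustration of one of these validation approaches (SHACL), the snippet below validates a small made-up RDF graph against a made-up shape using rdflib and pySHACL; ShEx and JSON Schema validation of JSON-LD would follow a similar pattern but are not shown.

```python
# Validate a toy RDF graph against a toy SHACL shape with rdflib and pySHACL.
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:alice a ex:Person .          # missing the required ex:name
""", format="turtle")

shapes = Graph().parse(data="""
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:name ; sh:minCount 1 ] .
""", format="turtle")

conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False: ex:alice violates the minCount constraint
print(report_text)
```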
We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data...
The FAIR principles have been widely cited, endorsed and adopted by a broad range of stakeholders since their publication in 2016. By intention, the 15 FAIR guiding principles do not dictate specific technological implementations, but provide guidance for improving Findability, Accessibility, Interoperability and Reusability of digital resources. T...
The utility of Artificial Intelligence (AI) in healthcare strongly depends upon the quality of the data used to build models, and the confidence in the predictions they generate. Access to sufficient amounts of high-quality data to build accurate and reliable models remains problematic owing to substantive legal and ethical constraints in making cl...
Results
As a result, we selected the 25 most relevant tools, which are highly recommended for use in policy development at the EU level and by national consumer policy enforcement authorities, particularly the Consumer Protection Cooperation (CPC) network, responsible for enforcing EU consumer protection laws to protect consumers' interests in EU...
Background:
Current approaches to identifying drug-drug interactions (DDIs), including safety studies during drug development and post-marketing surveillance after approval, offer important opportunities to identify potential safety issues, but are unable to provide a complete set of all possible DDIs. Thus, drug discovery researchers and healthca...
It is essential for the advancement of science that scientists and researchers share, reuse and reproduce workflows and protocols used by others. The FAIR principles are a set of guidelines that aim to maximize the value and usefulness of research data, and emphasize a number of important points regarding the means by which digital objects are foun...
Combining data from varied sources has considerable potential for knowledge discovery: collaborating data parties can mine data in an expanded feature space, allowing them to explore a larger range of scientific questions. However, data sharing among different parties is highly restricted by legal conditions, ethical concerns, and / or data volume....
In recent years, as newer technologies have evolved around the healthcare ecosystem, more and more data have been generated. Advanced analytics could power the data collected from numerous sources, both from healthcare institutions, or generated by individuals themselves via apps and devices, and lead to innovations in treatment and diagnosis of di...
The FAIR principles were received with broad acceptance in several scientific communities. However, there is still some degree of uncertainty on how they should be implemented. Several self-report questionnaires have been proposed to assess the implementation of the FAIR principles. Moreover, the FAIRmetrics group released 14 general-purpose matur...
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
Publishing databases in the Resource Description Framework (RDF) model is becoming widely accepted to maximize the syntactic and semantic interoperability of open data in life sciences. Here we report advancements made in the 6th and 7th annual BioHackathons which were held in Tokyo and Miyagi respectively. This review consists of two major section...
Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to a...
To reuse the enormous amounts of biomedical data available on the Web, there is an urgent need for good quality metadata. This is extremely important to ensure that data is maximally Findable, Accessible, Interoperable and Reusable. The Gene Expression Omnibus (GEO) allows users to specify metadata in the form of textual key: value pairs (e.g. sex:...
It is widely anticipated that the use and analysis of health-related big data will enable further understanding and improvements in human health and wellbeing. Here, we propose an innovative infrastructure, which supports secure and privacy-preserving analysis of personal health data from multiple providers with different governance policies. Our o...
The learning health system depends on a cycle of evidence generation, translation to practice, and continuous practice-based data collection. Clinical practice guidelines (CPGs) represent medical evidence, translated into recommendations on appropriate clinical care. The FAIR guiding principles offer a framework for publishing the extensive knowled...
In this paper we present our preliminary work on monitoring data License accoUntability and CompliancE (LUCE). LUCE is a blockchain platform solution designed to stimulate data sharing and reuse, by facilitating compliance with licensing terms. The platform enables data accountability by recording the use of data and their purpose on a blockchain-s...
To the Editors:
In the commentary by Guise et al., the authors describe a learning cycle for learning health systems in which evidence is rapidly generated, integrated into practice, and further evidence can be generated for further medical and clinical insights.1 The novel aspect suggested is the archiving of data, whether from clinical trials, s...
This open access book comprehensively covers the fundamentals of clinical data science, focusing on data collection, modelling and clinical applications. Topics covered in the first section on data collection include: data sources, data at scale (big data), data stewardship (FAIR data) and related privacy concerns. Aspects of predictive modelling u...
It is widely anticipated that the use of health-related big data will enable further understanding and improvements in human health and wellbeing. Our current project, funded through the Dutch National Research Agenda, aims to explore the relationship between the development of diabetes and socio-economic factors such as lifestyle and health care u...
Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center’s Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dat...
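To illustrate the kind of statistic such a resource exposes, the toy example below counts how many patients share each pair of clinical concepts using pandas; the table layout and column names are assumptions, not COHD's actual schema or derivation pipeline.

```python
# Toy pairwise concept co-occurrence: count patients sharing each concept pair.
from itertools import combinations
from collections import Counter
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2, 3],
    "concept":    ["type 2 diabetes", "metformin",
                   "type 2 diabetes", "metformin", "hypertension",
                   "hypertension"],
})

pair_counts = Counter()
for _, concepts in records.groupby("patient_id")["concept"]:
    for a, b in combinations(sorted(set(concepts)), 2):
        pair_counts[(a, b)] += 1

n_patients = records["patient_id"].nunique()
for (a, b), count in pair_counts.items():
    print(f"{a} + {b}: {count}/{n_patients} patients")
```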
Recent developments in machine learning have led to a rise in the number of methods for extracting features from structured data. The features are represented as vectors and may encode some semantic aspects of the data. They can be used in machine learning models for different tasks or to compute similarities between the entities of the data...
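As a minimal illustration of using such feature vectors to compare entities, the snippet below computes cosine similarity between placeholder embedding vectors; the vectors and entity names are made up.

```python
# Cosine similarity between entity feature vectors (random placeholders here).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
embeddings = {name: rng.normal(size=64) for name in ["aspirin", "ibuprofen", "insulin"]}

print(cosine_similarity(embeddings["aspirin"], embeddings["ibuprofen"]))
print(cosine_similarity(embeddings["aspirin"], embeddings["insulin"]))
```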
Prescribing the right drug with the right dose is a central tenet of precision medicine. We examined the use of patients’ prior Electronic Health Records to predict a reduction in drug dosage. We focus on drugs that interact with the P450 enzyme family, because their dosage is known to be sensitive and variable. We extracted diagnostic codes, condi...
Nanopublications are a Linked Data format for scholarly data publishing that has received considerable uptake in the last few years. In contrast to the common Linked Data publishing practice, nanopublications work at the granular level of atomic information snippets and provide a consistent container format to attach provenance and metadata at this...
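For readers unfamiliar with the format, the sketch below builds the basic nanopublication container (assertion, provenance, and publication info as named graphs, linked from a head graph) with rdflib; the URIs and the example statement are invented for illustration.

```python
# Minimal sketch of the nanopublication container structure using rdflib.
from rdflib import Dataset, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

NP = Namespace("http://www.nanopub.org/nschema#")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/np1#")

ds = Dataset()
head = ds.graph(EX.Head)
assertion = ds.graph(EX.assertion)
provenance = ds.graph(EX.provenance)
pubinfo = ds.graph(EX.pubinfo)

# Head graph: declares the nanopublication and links its three parts.
head.add((EX.np1, RDF.type, NP.Nanopublication))
head.add((EX.np1, NP.hasAssertion, EX.assertion))
head.add((EX.np1, NP.hasProvenance, EX.provenance))
head.add((EX.np1, NP.hasPublicationInfo, EX.pubinfo))

# Assertion: the atomic information snippet.
assertion.add((EX.drugA, EX.interactsWith, EX.drugB))

# Provenance: where the assertion came from (example DOI is a placeholder).
provenance.add((EX.assertion, PROV.wasDerivedFrom, URIRef("https://doi.org/10.1234/example")))

# Publication info: metadata about the nanopublication itself.
pubinfo.add((EX.np1, PROV.generatedAtTime, Literal("2024-01-01", datatype=XSD.date)))

print(ds.serialize(format="trig"))
```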
With the increased adoption of the FAIR Principles, a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers, are seeking ways to transparently evaluate resource FAIRness. We describe the FAIR Evaluator, a software infrastructure to register and execute tests of compliance with the recently published FAIR Metr...
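The toy check below illustrates the general flavour of such an automated compliance test (does a GUID resolve and return machine-readable metadata?); it is not the FAIR Evaluator itself nor any published Metric test, just an illustration of the idea.

```python
# Toy FAIR-style indicator check: does an identifier resolve over HTTP and
# return structured (e.g. JSON-LD) metadata?
import requests

def check_resolvable_with_structured_metadata(guid: str) -> dict:
    try:
        resp = requests.get(guid, headers={"Accept": "application/ld+json"}, timeout=10)
    except requests.RequestException as exc:
        return {"resolvable": False, "structured_metadata": False, "error": str(exc)}
    content_type = resp.headers.get("Content-Type", "")
    return {
        "resolvable": resp.ok,
        "structured_metadata": "json" in content_type,  # crude heuristic
    }

# DOI of the FAIR Guiding Principles article, used here only as a test input.
print(check_resolvable_with_structured_metadata("https://doi.org/10.1038/sdata.2016.18"))
```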
Crowdsourcing involves the creation of HITs (Human Intelligence Tasks), submitting them to a crowdsourcing platform, and providing a monetary reward for each HIT. One of the advantages of using crowdsourcing is that the tasks can be highly parallelized, that is, the work is performed by a large number of workers in a decentralized setting. The design...
A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies. In this paper we present the se...
“FAIRness” - the degree to which a digital resource is Findable, Accessible, Interoperable, and Reusable - is aspirational, yet the means of reaching it may be defined by increased adherence to measurable indicators. We report on the production of a core set of semi-quantitative metrics having universal applicability for the evaluation of FAIRness,...
Therapeutic intent, the reason behind the choice of a therapy and the context in which a given approach should be used, is an important aspect of medical practice. There are unmet needs with respect to current electronic mapping of drug indications. For example, the active ingredient sildenafil has 2 distinct indications, which differ solely on dos...
Various approaches and systems have been presented in the context of scholarly communication for what has been called semantic publishing. Closer inspection, however, reveals that these approaches are mostly not about publishing semantic representations, as the name seems to suggest. Rather, they take the processes and outcomes of the current narra...
Background
The ability to efficiently search and filter datasets depends on access to high quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: fem...
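The sketch below illustrates the kind of cleaning such free-text key-value metadata typically needs; the normalization rules and example records are invented, not the method developed in this work.

```python
# Toy normalization of free-text key-value sample metadata (GEO-style).
RAW_SAMPLES = [
    {"Sex": "fem", "age (yrs)": "63", "tissue": "Lung"},
    {"gender": "M", "Age": "58 years", "Tissue ": "lung"},
]

KEY_MAP = {"sex": "sex", "gender": "sex", "age": "age", "age (yrs)": "age", "tissue": "tissue"}
VALUE_MAP = {"sex": {"f": "female", "fem": "female", "female": "female",
                     "m": "male", "male": "male"}}

def normalize(sample):
    clean = {}
    for key, value in sample.items():
        key = KEY_MAP.get(key.strip().lower(), key.strip().lower())
        value = value.strip().lower().replace(" years", "")
        clean[key] = VALUE_MAP.get(key, {}).get(value, value)
    return clean

print([normalize(s) for s in RAW_SAMPLES])
# [{'sex': 'female', 'age': '63', 'tissue': 'lung'},
#  {'sex': 'male', 'age': '58', 'tissue': 'lung'}]
```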
Background
Patient data, such as electronic health records or adverse event reporting systems, constitute an essential resource for studying Adverse Drug Events (ADEs). We explore an original approach to identify frequently associated ADEs in subgroups of patients.
Results
Because ADEs have complex manifestations, we use formal concept analysis and...
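As a minimal illustration of the formal concept analysis machinery mentioned here, the snippet below applies the two derivation operators to a toy patient-by-ADE context; it is only a sketch of the FCA idea, not the paper's pipeline.

```python
# The two derivation operators of formal concept analysis on a toy context.
# A formal concept is a pair (patients, ADEs) where each set is exactly the
# derivation of the other.
context = {
    "patient1": {"nausea", "headache"},
    "patient2": {"nausea", "headache", "rash"},
    "patient3": {"nausea"},
}

def patients_with(ades):
    """All patients exhibiting every ADE in `ades`."""
    return {p for p, attrs in context.items() if ades <= attrs}

def common_ades(patients):
    """All ADEs shared by every patient in `patients`."""
    return set.intersection(*(context[p] for p in patients)) if patients else set()

extent = patients_with({"nausea", "headache"})
intent = common_ades(extent)
print(extent, intent)   # {'patient1', 'patient2'} {'nausea', 'headache'} -> a formal concept
```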
Biomedical data are growing at an incredible pace and require substantial expertise to organize data in a manner that makes them easily findable, accessible, interoperable and reusable. Massive effort has been devoted to using Semantic Web standards and technologies to create a network of Linked Data for the life sciences, among others. However, wh...
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure....
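A small sketch of the compact-identifier (prefix:accession) convention that underpins this integration infrastructure is shown below; the prefix registry is a tiny illustrative subset, not an authoritative list.

```python
# Expand compact identifiers (prefix:accession) to resolvable URLs.
PREFIX_REGISTRY = {
    "uniprot": "https://www.uniprot.org/uniprot/",
    "doi":     "https://doi.org/",
}

def expand(curie: str) -> str:
    """Expand a compact identifier like 'uniprot:P04637' to a resolvable URL."""
    prefix, accession = curie.split(":", 1)
    base = PREFIX_REGISTRY.get(prefix.lower())
    if base is None:
        raise ValueError(f"Unknown prefix: {prefix}")
    return base + accession

print(expand("uniprot:P04637"))   # https://www.uniprot.org/uniprot/P04637
```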
Evidence is lacking for patient-reported effectiveness of treatments for most medical conditions and specifically for lower back pain. In this paper, we examined a consumer-based social network that collects patients' treatment ratings as a potential source of evidence. Acknowledging the potential biases of this data set, we used propensity score m...