Douglas Teodoro

  • PhD
  • Professor (Assistant) at University of Geneva

About

124
Publications
14,708
Reads
5,309
Citations
Current institution
University of Geneva
Current position
  • Professor (Assistant)
Additional affiliations
September 2008 - February 2013
University of Geneva
Position
  • Research Assistant
Education
September 2008 - October 2012
University of Geneva
Field of study
  • Informatics
March 2000 - December 2005
Federal University of Itajubá
Field of study
  • Computer Engineering

Publications

Preprint
Full-text available
Large Language Models (LLMs) have shown remarkable progress in medical question answering (QA), yet their effectiveness remains predominantly limited to English due to imbalanced multilingual training data and scarce medical resources for low-resource languages. To address this critical language gap in medical QA, we propose Multilingual Knowledge...
Preprint
Full-text available
Introduction Considering numerous radiological images and the heavy workload of writing corresponding reports in clinical work, it is significant to leverage artificial intelligence (AI) to facilitate this process and reduce the burden of radiologists. In the past few years, particularly with the advent of vision language models, some works explore...
Preprint
Full-text available
Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasonin...
Article
Full-text available
Adverse drug events (ADEs) are a major safety issue in clinical trials. Thus, predicting ADEs is key to developing safer medications and enhancing patient outcomes. To support this effort, we introduce CT-ADE, a dataset for multilabel ADE prediction in monopharmacy treatments. CT-ADE encompasses 2,497 drugs and 168,984 drug-ADE pairs from clinical...
Preprint
Full-text available
Equitable distribution of physicians across specialties is a significant public health challenge. While previous studies primarily relied on classic statistics models to estimate factors affecting medical students' career choices, this study explores the use of machine learning techniques to predict decisions early in their studies. We evaluated va...
Preprint
Full-text available
Recently, machine learning methods have emerged to predict dental disease progression, often relying on costly annotated datasets and frequently exhibiting low generalization performance. This study evaluates the application of Siamese networks for detecting subtle changes in longitudinal dental x-rays and predicting the time span category between...
Preprint
Full-text available
Over the past few years, discriminative and generative large language models (LLMs) have emerged as the predominant approaches in natural language processing. However, despite significant advancements, there remains a gap in comparing the performance of discriminative and generative LLMs in cross-lingual biomedical concept normalization. In this pa...
Preprint
Full-text available
Artificial intelligence (AI) is increasingly applied to clinical trial risk assessment, aiming to improve safety and efficiency. This scoping review analyzes 142 studies published between 2013 and 2024, focusing on safety (n=55), efficacy (n=46), and operational (n=45) risk prediction. AI techniques, including traditional machine learning, deep lea...
Article
Full-text available
Large language models (LLMs) have the potential to enhance the verification of health claims. However, issues with hallucination and comprehension of logical statements require these models to be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset created from hypothesis testing results in clinical res...
Preprint
Full-text available
Interoperability in health information systems is crucial for accurate data exchange across environments such as electronic health records, clinical notes, and medical research. The main challenge arises from the wide variation in biomedical concepts, their representation across different systems and languages, and the limited context, complicating...
Article
Full-text available
Objectives Clinical trials (CTs) are essential for improving patient care by evaluating new treatments’ safety and efficacy. A key component in CT protocols is the study population defined by the eligibility criteria. This study aims to evaluate the effectiveness of large language models (LLMs) in encoding eligibility criterion information to suppo...
Article
Full-text available
Introduction The 2030 Sustainable Development Agenda and the United Nations Convention on the Rights of Persons with Disabilities (CRPD) aspire to leave no one behind and call for the inclusion of persons with disabilities in all spheres of life. To monitor this goal of inclusion, CRPD's Article 31 requires state parties to collect data about the si...
Preprint
Full-text available
Objectives Clinical trials (CTs) are essential for improving patient care by evaluating new treatments' safety and efficacy. A key component in CT protocols is the study population defined by the eligibility criteria. This study aims to evaluate the effectiveness of large language models (LLMs) in encoding eligibility criterion information to suppo...
Preprint
Full-text available
Large language models (LLMs) have the potential to enhance the verification of health claims. However, issues with hallucination and comprehension of logical statements require these models to be closely scrutinized in healthcare applications. We introduce CliniFact, a scientific claim dataset created from hypothesis testing results in clinical res...
Article
Full-text available
Objectives In the midst of the pandemic, face-to-face data collection for national censuses and surveys was suspended due to limitations on mobility and social distancing, limiting the collection of already scarce disability data. Responses to these constraints were met with a surge of high-frequency phone surveys (HFPSs) that aimed to provide time...
Preprint
Full-text available
The Medical Subject Headings (MeSH), one of the main knowledge organization systems in the biomedical domain, is constantly evolving following the latest scientific discoveries in health and life sciences. Previous research focused on quantifying information in MeSH using its hierarchical structure. In this work, we propose a data-driven approach b...
Article
Full-text available
Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding...
Article
Full-text available
Background: While Enterobacteriaceae bacteria are commonly found in the healthy human gut, their colonization of other body parts can potentially evolve into serious infections and health threats. We investigate a graph-based machine learning model to predict risks of inpatient colonization by multidrug-resistant (MDR) Enterobacteriaceae. Methods:...
Preprint
Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding...
Preprint
BACKGROUND Medical student career choice directly influences the physician workforce shortage and the misdistribution of resources. Individual and contextual factors related to career choice have been evaluated separately, but their interaction over time is unclear. Secondly, actual career choice, reasons for this choice, and the influence of natio...
Article
Full-text available
Background A medical student’s career choice directly influences the physician workforce shortage and the misdistribution of resources. First, individual and contextual factors related to career choice have been evaluated separately, but their interaction over time is unclear. Second, actual career choice, reasons for this choice, and the influence...
Preprint
Full-text available
The increasing significance of Adverse Drug Events (ADEs) extracted from social media, such as Twitter data, has led to the development of various end-to-end resolution methodologies. Despite recent advancements, there remains a substantial gap in normalizing ADE entities coming from social media, particularly with informal and diverse expressions...
Preprint
Full-text available
This paper outlines the performance evaluation of a system for adverse drug event normalization, developed by the Data Science for Digital Health group for the Social Media Mining for Health Applications 2023 shared task 5. Shared task 5 targeted the normalization of adverse drug event mentions in Twitter to standard concepts from the Medical Dicti...
Preprint
Full-text available
This paper presents the results of the Data Science for Digital Health (DS4DH) group in the MEDIQA-Chat Tasks at ACL-ClinicalNLP 2023. Our study combines the power of a classical machine learning method, Support Vector Machine, for classifying medical dialogues, along with the implementation of one-shot prompts using GPT-3.5. We employ dialogues and...
Article
Full-text available
Background The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the eviden...
Preprint
Full-text available
Effective representation of medical concepts is crucial for secondary analyses of electronic health records. Neural language models have shown promise in automatically deriving medical concept representations from clinical data. However, the comparative performance of different language models for creating these empirical representations, and the e...
Preprint
Full-text available
While Enterobacteriaceae bacteria are commonly found in healthy human gut, their colonisation of other body parts can potentially evolve into serious infections and health threats. We aim to design a graph-based machine learning model to assess risks of inpatient colonisation by multi-drug resistant (MDR) Enterobacteriaceae. The colonisation predic...
Preprint
Background: The presence of widespread misinformation in Web resources and the limited quality control provided by search engines can lead to serious implications for individuals seeking health advice. Objective: We aimed to investigate a multi-dimensional information quality assessment model based on deep learning to enhance the reliability of onl...
Article
Full-text available
The prediction of chemical reaction pathways has been accelerated by the development of novel machine learning architectures based on the deep learning paradigm. In this context, deep neural networks initially designed for language translation have been used to accurately predict a wide range of chemical reactions. Among models suited for the task...
Preprint
Full-text available
Current approaches for clinical information extraction are inefficient in terms of computational costs and memory consumption, hindering their application to process large-scale electronic health records (EHRs). We propose an efficient end-to-end model, the Joint-NER-RE-Fourier (JNRF), to jointly learn the tasks of named entity recognition and rela...
Article
Full-text available
Success rate of clinical trials (CTs) is low, with the protocol design itself being considered a major risk factor. We aimed to investigate the use of deep learning methods to predict the risk of CTs based on their protocols. Considering protocol changes and their final status, a retrospective risk assignment method was proposed to label CTs accord...
Preprint
Full-text available
Background: The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evide...
Chapter
Full-text available
The study of existing links among different types of medical concepts can support research on optimal pathways for the treatment of human diseases. Here, we present a clustering analysis of medical concept learned representations generated from MIMIC-IV, an open dataset of de-identified digital health records. Patient’s trajectory information were...
Preprint
BACKGROUND The presence of widespread misinformation in Web resources and the limited quality control provided by search engines can lead to serious implications for individuals seeking health advice. OBJECTIVE We aimed to investigate a multi-dimensional information quality assessment model based on deep learning to enhance the reliability of onli...
Article
Full-text available
Background Widespread misinformation in web resources can lead to serious implications for individuals seeking health advice. Despite that, information retrieval models are often focused only on the query-document relevance dimension to rank results. Objective We investigate a multidimensional information quality retrieval model based on deep lear...
Article
Full-text available
Background Identifying and removing reference duplicates when conducting systematic reviews (SRs) remain a major, time-consuming issue for authors who manually check for duplicates using built-in features in citation managers. To address issues related to manual deduplication, we developed an automated, efficient, and rapid artificial intelligence-...
Chapter
Full-text available
A recent trend in health-related machine learning proposes the use of Graph Neural Networks (GNNs) to model biomedical data. This is justified due to the complexity of healthcare data and the modelling power of graph abstractions. Thus, GNNs emerge as the natural choice to learn from increasing amounts of healthcare data. While formulating the pr...
Chapter
Full-text available
We present an analysis of supplementary materials of PubMed Central (PMC) articles and show their importance in indexing and searching biomedical literature, in particular for the emerging genomic medicine field. On a subset of articles from PubMed Central, we use text mining methods to extract MeSH terms from abstracts, full texts, and text-based...
Chapter
Full-text available
The importance of genomic data for health is rapidly growing but accessing and gathering information about variants from different sources is hindered by highly heterogeneous representations of variants, as outlined by clinical associations (AMP/ASCO/CAP) in their recommendations. To enable a smooth and effective retrieval of variant-containing doc...
Preprint
Full-text available
This paper describes the work of the Data Science for Digital Health (DS4DH) group at the TREC Health Misinformation Track 2021. The TREC Health Misinformation track focused on the development of retrieval methods that provide relevant, correct and credible information for health related searches on the Web. In our methodology, we used a two-step r...
Chapter
Full-text available
As the world’s population continues to expand, maritime transport is critical to ensure economic growth. To improve security and safety of maritime transportation, the Automatic Identification System (AIS) collects real-time data about vessels and their positions. While a large portion of the AIS data is provided via an automatic tracking system, s...
Article
Full-text available
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we asse...
Article
Full-text available
The 2019 coronavirus (COVID-19) pandemic revealed the urgent need for the acceleration of vaccine development worldwide. Rapid vaccine development poses numerous risks for each category of vaccine technology. By using the Risklick artificial intelligence (AI), we estimated the risks associated with all types of COVID-19 vaccine during the early pha...
Preprint
Full-text available
We consider the hierarchical representation of documents as graphs and use geometric deep learning to classify them into different categories. While graph neural networks can efficiently handle the variable structure of hierarchical documents using the permutation invariant message passing operations, we show that we can gain extra performance impr...
Preprint
BACKGROUND The COVID-19 global health crisis has led to an exponential surge in the published scientific literature. In the attempt to tackle the pandemic, extremely large COVID-19-related corpora are being created, sometimes with inaccurate information, which is no longer at scale of human analyses. OBJECTIVE In the context of searching for scien...
Article
Full-text available
Background: The coronavirus disease (COVID-19) global health crisis has led to an exponential surge in the published scientific literature. In the attempt to tackle the pandemic, extremely large COVID-19-related corpora are being created, sometimes with inaccurate information, which is no longer at scale of human analyses. Objective: In the cont...
Article
Full-text available
Introduction: The SARS-CoV-2 pandemic has led to one of the most critical and boundless waves of publications in the history of modern science. The necessity to find and pursue relevant information and quantify its quality is broadly acknowledged. Modern information retrieval techniques combined with artificial intelligence (AI) appear as one of t...
Preprint
Full-text available
The health and life science domains are well-known for their wealth of entities. These entities are presented as free text in large corpora, such as biomedical scientific and electronic health records. To enable the secondary use of these corpora and unlock their value, named entity recognition (NER) methods are proposed. Inspired by the success of...
Conference Paper
Full-text available
Objectives: Clinical Named Entity Recognition is a critical Natural Language Processing task, as it could support biomedical research and healthcare systems. While most extracted clinical entities are based on single-label concepts, entities that belong to more than one semantic category simultaneously are very common in the clinical domain. This work prop...
Article
Full-text available
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately...
Conference Paper
Full-text available
With the growing volume of electronic health record data, clinical NLP tasks have become increasingly relevant to unlock valuable information from unstructured clinical text. Although the performance of downstream NLP tasks, such as named-entity recognition (NER), on English corpora has recently been improved by contextualised language models, less resea...
Conference Paper
With the growing volume of electronic health record data, clinical NLP tasks have become increasingly relevant in healthcare, unlocking valuable information from unstructured clinical text. Although the performance of downstream tasks, such as named-entity recognition (NER), on English corpora has recently been improved by contextualised language models...
Preprint
Full-text available
Chemical patent documents describe a broad range of applications holding key information, such as chemical compounds, reactions, and specific properties. However, the key information should be enabled to be utilized in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction t...
Conference Paper
Named entity recognition (NER) is key for biomedical applications as it allows knowledge discovery in free text data. As entities are semantic phrases, their meaning is conditioned to the context to avoid ambiguity. In this work, we explore contextualized language models for NER in French biomedical text as part of the Défi Fouille de Textes challe...
Article
Full-text available
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigat...
Preprint
Full-text available
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliography in UniProt, we investigate...
Article
Full-text available
The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate di...
Data
ORBDA source database tables–PostgreSQL datatypes. (DOCX)
Data
FETCH query pseudo-code implementation. (DOCX)
Data
Parameters used in the query latency assessments. (DOCX)
Data
SEARCH query pseudo-code implementation. (DOCX)
Article
Full-text available
The Brazilian Ministry of Health has selected the openEHR model as a standard for electronic health record systems. This paper presents a set of archetypes to represent the main data from the Brazilian Public Hospital Information System and the High Complexity Procedures Module of the Brazilian public Outpatient Health Information System. The arche...
Article
Assessing care quality and performance is essential to improve healthcare processes and population health management. However, due to bad system design and lack of access to required data, this assessment is often delayed or not done at all. The goal of our research is to investigate an advanced analytics platform that enables healthcare quality an...
Preprint
Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE D...
Article
Full-text available
http://dx.doi.org/10.1371/journal.pone.0150069 This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistenc...
Conference Paper
Full-text available
Antibiotics resistance poses a significant problem in today’s hospital care. Although large amounts of resistance data are gathered locally, they cannot be compared globally due to format and access diversity. We present an ontology-based integration approach serving an EU project in making antibiotics resistance data semantically and geographicall...
Article
Full-text available
Background: Working in a clinical environment requires unfettered mobility. This is especially true for nurses who are always on the move providing patients' care in different locations. Since the introduction of clinical information systems in hospitals, this mobility has often been considered hampered by interactions with computers. The populari...
Article
Full-text available
Improving antibiotic prescribing practices is an important public-health priority given the widespread antimicrobial resistance. Establishing clinical practice guidelines is crucial to this effort, but their development is a complex task and their quality is directly related to the methodology and source of knowledge used. We present the design and...
Article
Full-text available
Antibiotic resistance is a major worldwide public health concern. In clinical settings, timely antibiotic resistance information is key for care providers as it allows appropriate targeted treatment or improved empirical treatment when the specific results of the patient are not yet available. To improve antibiotic resistance trend analysis algorit...
Data
Full-text available
Decomposition of the K. pneumoniae time series using the EMD method. Components used in the *DECA, °DECF and †DECS models. (PDF)
Data
Full-text available
Correlation between temperature and resistance components. °,*components mutually statistically significantly different from noise; ° correlation not significant (); *correlation significant (). (PDF)
Data
Full-text available
Decomposition of the P. aeruginosa time series using the EMD method. Components used in the *DECA, °DECF and †DECS models. (PDF)
Data
Full-text available
Resistance time series for the test period. (PDF)
Data
Full-text available
Decomposition of the E. coli time series using the EMD method. Components used in the *DECA, °DECF and †DECS models. (PDF)
Data
Full-text available
Decomposition of the S. aureus time series using the EMD method. Components used in the *DECA, °DECF and †DECS models. (PDF)
Data
Full-text available
Results for 1-week-ahead forecasts. Raw signal: black; RW: red; KNN: green; DECA: dark blue; DECF: light blue; DECS: purple. (PDF)
Article
Full-text available
This paper describes an approach to build a Data Definition Ontology (DDO) in the context of full domain ontology integration with datasets in order to share and query clinical heterogeneous data repositories. We have adapted an existing semantic web tool (D2RQ) to implement a process that automatically generates the DDO from a database information...
Article
Full-text available
Patent collections contain a substantial amount of medical-related knowledge, but existing tools were reported to lack useful functionalities. We present here the development of TWINC, an advanced search engine dedicated to patent retrieval in the domain of health and life sciences. Our tool embeds two search modes: an ad hoc search to retrieve r...
Article
Full-text available
Antimicrobial resistance has reached globally alarming levels and is becoming a major public health threat. Lack of efficacious antimicrobial resistance surveillance systems was identified as one of the causes of increasing resistance, due to the lag time between new resistances and alerts to care providers. Several initiatives to track drug resist...
Article
We present a new approach for pathogen and gene product normalization in the biomedical literature. This approach was motivated by needs such as literature curation, in particular as applied to the field of infectious diseases; thus, variants of bacterial species (S. aureus, Staphylococcus aureus, ...) and their gene products (protein ArsC...
Article
Full-text available
Health-related information retrieval is complicated by the variety of nomenclatures available to name entities, since different communities of users will use different ways to name the same entity. We present in this report the development and evaluation of a user-friendly interactive Web application aimed at facilitating health-related patent searc...
Article
Full-text available
Objective: As the broad use of antibiotics has reached its limits with the emergence of bacterial resistance, regulating antibiotic prescriptions has become a major priority. In this paper, we present KART, a system to facilitate the creation of clinical guidelines in the context of infectious diseases. Methods: This system is composed of...
Conference Paper
The BiTeM group participated in the first TREC Medical Records Track in 2011 relying on a strong background in medical records processing and medical terminologies. For this campaign, we submitted a baseline run, computed with a simple free-text index in the Terrier platform, which achieved fair results (0.468 for P10). We also performed automatic...
Conference Paper
For the third year, the BiTeM group participated in the TREC Chemical IR Track. For this campaign, we applied strategies that had already proven effective, such as Citations Feedback, which exploits the citations of the retrieved documents to re-rank the results. We also investigated a new inter-lingua model built wit...
