Xavier Tannier

  • Professor
  • Professor (Full) at Sorbonne University

About

212
Publications
36,657
Reads
3,654
Citations
Introduction
I am a Full Professor at Sorbonne Université (formerly known as Pierre and Marie Curie University, Paris 6, UPMC). My research topics concern natural language processing and information retrieval and extraction. I teach Computer Science at Polytech Paris Sorbonne.
Current institution
Sorbonne University
Current position
  • Professor (Full)
Additional affiliations
September 2017 - present
Sorbonne University
Position
  • Professor (Full)
Description
  • Currently head of the master's program Computer Science and Applied Mathematics (MAIN) at Polytech Sorbonne.
Position
  • Professor (Full)
October 2006 - September 2007
Xerox Corporation
Position
  • Researcher
Education
September 2003 - September 2006
Mines Saint-Étienne
Field of study
  • Information Retrieval, Natural Language Processing
September 1998 - September 2003

Publications (212)
Preprint
Rules could be an information extraction (IE) default option, compared to ML and LLMs in terms of sustainability, transferability, interpretability, and development burden. We suggest a sustainable and combined use of rules and ML as an IE method. Our approach starts with an exhaustive expert manual highlighting in a single working session of a rep...
Article
Full-text available
Objectives: To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select the best papers published in 2023. Methods: A bibliographic search using a combination of MeSH descriptors and free-text terms on CRI was performed using PubMed, followed by a double-blind review in order to select a list...
Chapter
Semantic interoperability is a growing and challenging subject in the healthcare domain. It aims to ensure a coherent and unambiguous exchange, use, and reuse of health information among different systems and applications. In the context of the EUCAIM (Cancer Image Europe) project, semantic interoperability among various heterogeneous cancer image...
Preprint
BACKGROUND Valuable insights gathered by clinicians during their inquiries and documented in textual reports are often unavailable in the structured data recorded in the electronic health records (EHRs). OBJECTIVE This work highlights that mining unstructured textual data with natural language processing (NLP) techniques complements the available...
Article
Full-text available
Background Valuable insights gathered by clinicians during their inquiries and documented in textual reports are often unavailable in the structured data recorded in electronic health records (EHRs). Objective This study aimed to highlight that mining unstructured textual data with natural language processing techniques complements the available s...
Article
The morphological classification of nucleated blood cells is fundamental for the diagnosis of hematological diseases. Many Deep Learning algorithms have been implemented to automatize this classification task, but most of the time they fail to classify images coming from different sources. This is known as “domain shift”. Whereas some research has...
Article
Full-text available
Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must na...
Article
White blood cell classification plays a key role in the diagnosis of hematologic diseases. Models can perform classification either from images or based on morphological features. Image-based classification generally yields higher performance, but feature-based classification is more interpretable for clinicians. In this study, we employed a Multim...
Article
Interoperability is crucial to overcoming various challenges of data integration in the healthcare domain. While OMOP and FHIR data standards handle syntactic heterogeneity among heterogeneous data sources, ontologies support semantic interoperability to overcome the complexity and disparity of healthcare data. This study proposes an ontological ap...
Article
Full-text available
Malaria is a deadly disease that is transmitted through mosquito bites. Microscopists use a microscope to examine thin blood smears at high magnification (1000x) to identify parasites in red blood cells (RBCs). Estimating parasitemia is essential in determining the severity of the Plasmodium falciparum infection and guiding treatment. However, this...
Preprint
BACKGROUND Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must nav...
Article
Full-text available
Mosquito-borne diseases like malaria are rising globally, and improved mosquito vector surveillance is needed. Survival of Anopheles mosquitoes is key for epidemiological monitoring of malaria transmission and evaluation of vector control strategies targeting mosquito longevity, as the risk of pathogen transmission increases with mosquito age. Howe...
Article
Objective To develop and validate a natural language processing (NLP) pipeline that detects 18 conditions in French clinical notes, including 16 comorbidities of the Charlson index, while exploring a collaborative and privacy-enhancing workflow. Materials and Methods The detection pipeline relied both on rule-based and machine learning algorithms,...
Article
Full-text available
Objective The objective of this study is to address the critical issue of deidentification of clinical reports to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals...
Article
Full-text available
There is an urgent need to monitor the mental health of large populations, especially during crises such as the COVID-19 pandemic, to timely identify the most at-risk subgroups and to design targeted prevention campaigns. We therefore developed and validated surveillance indicators related to suicidality: the monthly number of hospitalisations caus...
Article
Full-text available
Diabetic foot ulcers can have serious consequences for patients, such as amputation. The primary purpose of this study is to predict the amputation risk of diabetic foot patients using machine‐learning classification algorithms. In this research, 407 patients treated with the diagnosis of diabetic foot between January 2009 and September 2019 in Istanbul...
Article
Full-text available
Objectives: To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select best papers published in 2022. Method: A bibliographic search using a combination of Medical Subject Headings (MeSH) descriptors and free-text terms on CRI was performed using PubMed, followed by a double-blind review in...
Article
Aspergillosis of the new-born remains a rare but severe disease. We report four cases of primary cutaneous A. flavus infections in premature new-borns linked to incubator contamination by putative clonal strains. Our objective was to evaluate the ability of MALDI-TOF coupled to Convolutional Neural Network (CNN) for clone recognition in a context...
Article
Full-text available
Introduction Since the 1970s, fetal scalp blood sampling (FSBS) has been used as a second‐line test of the acid–base status of the fetus to evaluate fetal well‐being during labor. The commonly employed thresholds that delineate normal pH (>7.25), subnormal (7.20–7.25), and pathological pH (<7.20) guide clinical decisions. However, these experienced...
Article
Objectives Medico-administrative data are promising to automate the calculation of Healthcare Quality and Safety Indicators. Nevertheless, not all relevant indicators can be calculated with this data alone. Our feasibility study objective is to analyze 1) the availability of data sources; 2) the availability of each indicator elementary variables,...
Article
Full-text available
Background The SARS CoV‐2 pandemic disrupted healthcare systems. We compared the cancer stage for new breast cancers (BCs) before and during the pandemic. Methods We performed a retrospective multicenter cohort study on the data warehouse of Greater Paris University Hospitals (AP‐HP). We identified all female patients newly referred with a BC in 2...
Preprint
Full-text available
Objective To develop and validate advanced natural language processing pipelines that detect 18 conditions in clinical notes written in French, among which 16 comorbidities of the Charlson index, while exploring a collaborative and privacy-preserving workflow. Materials and methods The detection pipelines relied both on rule-based and machine learn...
Article
Full-text available
The SARS‐COV‐2 pandemic disrupted healthcare systems. We assessed its impact on the presentation, care trajectories and outcomes of new pancreatic cancers (PCs) in the Paris area. We performed a retrospective multicenter cohort study on the data warehouse of Greater Paris University Hospitals (AP‐HP). We identified all patients newly referred with...
Article
Full-text available
Real-world data (RWD) hold great promise for improving the quality of care. However, specific infrastructures and methodologies are required to derive robust knowledge and bring innovations to the patient. Drawing upon the national case study of the 32 French regional and university hospitals governance, we highlight key aspects of modern clinical...
Preprint
BACKGROUND Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical datasets remains a challenge. OBJECTIVE The objective of our study is to determine whether using English tools to extract and normalize French medi...
Article
Background: Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical datasets remains a challenge. Objective: The aim of our study is to determine whether the use of English tools to extract and normalize French m...
Article
Background Natural language processing tools are powerful for mining rheumatology databases, extracting patient information directly from clinical notes. However, these algorithms come with a high computational cost and are often not applicable at the scale of very large databases in the temporality of clinical practice. Objectives The objective o...
Preprint
Full-text available
Objective: Develop and validate an algorithm for analyzing the layout of PDF clinical documents to improve the performance of downstream natural language processing tasks. Materials and Methods: We designed an algorithm to process clinical PDF documents and extract only clinically relevant text. The algorithm consists of several steps: initial text...
Article
Purpose: To compare the computability of Observational Medical Outcomes Partnership (OMOP)-based queries related to prescreening of patients using two versions of the OMOP common data model (CDM; v5.3 and v5.4) and to assess the performance of the Greater Paris University Hospital (APHP) prescreening tool. Materials and methods: We identified th...
Article
Full-text available
Identifying fungal clones propagated during outbreaks in hospital settings is a problem that increasingly confronts biologists. Current tools based on DNA sequencing or microsatellite analysis require specific manipulations that are difficult to implement in the context of routine diagnosis. Using deep learning to classify the mass spectra obtained...
Preprint
Full-text available
The objective of this study is to address the critical issue of de-identification of clinical reports in order to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals...
Preprint
Full-text available
The objective of our study is to determine whether using English tools to extract and normalize French medical concepts on translations provides comparable performance to French models trained on a set of annotated French clinical notes. We compare two methods: a method involving French language models and a method involving English language models...
Preprint
Full-text available
Real World Data (RWD) hold great promise for improving the quality of care. However, specific infrastructures and methodologies are required to derive robust knowledge and bring innovations to the patient. Drawing upon the national case study of the 32 French regional and university hospitals governance, we highlight key aspects of modern Clinical...
Article
Full-text available
Objectives: To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select best papers published in 2021. Method: Using PubMed, we did a bibliographic search using a combination of MeSH descriptors and free-text terms on CRI, followed by a double-blind review in order to select a list of candida...
Article
Malaria is a fatal disease transmitted by bites from mosquito-type vectors. Biologists examine blood smears under a microscope at high magnification (1000×) to identify the presence of parasites in red blood cells (RBCs). Such an examination is laborious and time-consuming. Moreover, microscopists sometimes have difficulty identifying parasitized...
Article
Full-text available
This paper presents a model aiming to automatically detect sections in medieval Latin charters. These legal sources are some of the most important sources for medieval studies as they reflect economic and social dynamics as well as legal and institutional writing practices. An automatic linear segmentation can greatly facilitate charter indexation...
Article
Full-text available
Objective: This study aimed to analyze risk factors for amputation (overall, minor and major) in patients with diabetic foot ulcers (DFUs). Methods: 407 patients with DFUs (286 male, 121 female; mean age = 60, age range = 32-92) who were managed in a tertiary care centre from 2009 to 2019 were retrospectively identified and included in the study...
Preprint
Full-text available
The ongoing worldwide emergence of epidemic azole-resistant Candida parapsilosis clones is a threat to human health. In order to monitor a clonal outbreak which involved 2 hospitals in Paris, we used a technology that most clinical laboratories are equipped with: Matrix-Assisted Laser Desorption-Ionization Time of Flight mass spectrometry (MALDI-TO...
Preprint
BACKGROUND Reliable and interpretable automatic extraction of clinical phenotypes from large electronic medical records databases remains a challenge, especially in a language other than English. OBJECTIVE We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases. ME...
Article
Full-text available
Background Reliable and interpretable automatic extraction of clinical phenotypes from large electronic medical record databases remains a challenge, especially in a language other than English. Objective We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases. Met...
Preprint
Full-text available
Background Clinical studies using real-world data may benefit from exploiting clinical reports, a particularly rich albeit unstructured medium. To that end, natural language processing can extract relevant information. Methods based on transfer learning using pre-trained language models have achieved state-of-the-art results in most NLP application...
Article
Full-text available
Background Intervening in and preventing diabetes distress requires an understanding of its causes and, in particular, from a patient’s perspective. Social media data provide direct access to how patients see and understand their disease and consequently show the causes of diabetes distress. Objective Leveraging machine learning methods, we aim to...
Article
Introduction The SARS-CoV-2 pandemic has impacted the care of cancer patients. This study sought to assess the pandemic’s impact on the clinical presentations and outcomes of newly referred patients with lung cancer from the Greater Paris area. Methods We retrospectively retrieved the electronic health records and administrative data of 11.4 mil...
Preprint
Full-text available
Objective: We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases. Materials and Methods: Our multistep algorithm includes a named-entity recognition step, a multilabel classification using Medical Subject Headings ontology and the computation of patient similarity....
Article
A vast amount of crucial information about patients resides solely in unstructured clinical narrative notes. There has been growing interest in the clinical Named Entity Recognition (NER) task using deep learning models. Such approaches require sufficient annotated data. However, there are few publicly available annotated corpora in the medical fie...
Article
Background The development of electronic health records has provided a large volume of unstructured biomedical information. Extracting patient characteristics from these data has become a major challenge, especially in languages other than English. Methods Inspired by the French Text Mining Challenge (DEFT 2021) [1] in which we participated, our s...
Preprint
BACKGROUND Intervening in and preventing diabetes distress requires an understanding of its causes and, in particular, from a patient’s perspective. Social media data provide direct access to how patients see and understand their disease and consequently show the causes of diabetes distress. OBJECTIVE Leveraging machine learning methods, we aim to...
Article
Full-text available
The spread of fungal clones is hard to detect in the daily routines in clinical laboratories, and there is a need for new tools that can facilitate clone detection within a set of strains. Currently, Matrix Assisted Laser Desorption-Ionization Time-of-Flight Mass Spectrometry is extensively used to identify microbial isolates at the species level....
Article
Full-text available
The SARS‐CoV‐2 pandemic may have affected care trajectories, patient overall survival (OS), and tumor stage at initial presentation for new colorectal cancer (CRC) cases. This study aimed at assessing those indicators before and after the beginning of the pandemic in France. In this retrospective cohort study, we collected prospectively the clinical data of the...
Article
Introduction Osteoporotic fractures are associated with excess morbidity and mortality. Implementing care pathways of the fracture liaison service type is effective in reducing the risk of new fractures and excess mortality. The need to mobilize human resources and the difficulty of identifying eligible patients are among the limitations to...
Preprint
Full-text available
Objective: Leveraging machine learning methods, we aim to extract both explicit and implicit cause-effect associations in patient-reported, diabetes-related tweets and provide a tool to better understand opinion, feelings and observations shared within the diabetes online community from a causality perspective. Materials and Methods: More than 30 m...
Article
Purpose The Coronavirus disease 2019 (COVID-19) has led to an unparalleled influx of patients. Prognostic scores could help optimize healthcare delivery, but most of them have not been comprehensively validated. We aim to externally validate existing prognostic scores for COVID-19. Methods We used “COVID-19 Evidence Alerts” (McMaster University) to...
Chapter
This paper studies the effect of the order of depth of mention on nested named entity recognition (NER) models. NER is an essential task in the extraction of biomedical information, and nested entities are common since medical concepts can assemble to form larger entities. Conventional NER systems only predict disjointed entities. Thus, iterative m...
Preprint
Full-text available
This paper studies the effect of the order of depth of mention on nested named entity recognition (NER) models. NER is an essential task in the extraction of biomedical information, and nested entities are common since medical concepts can assemble to form larger entities. Conventional NER systems only predict disjointed entities. Thus, iterative m...
Article
Full-text available
Introduction The dissemination of SARS-CoV-2 may have delayed the diagnosis of new cancers. This study aimed at assessing the number of new cancers during and after the lockdown. Methods We prospectively collected the clinical data of the 11.4 million patients referred to the Assistance Publique Hôpitaux de Paris Teaching Hospital. We identified...
Preprint
BACKGROUND The amount of available textual health data such as scientific and biomedical literature is constantly growing and it becomes more and more challenging for health professionals to properly summarise those data and in consequence to practice evidence-based clinical decision making. Moreover, the exploration of large unstructured health te...
Article
Full-text available
Background: The amount of available textual health data such as scientific and biomedical literature is constantly growing, and it is becoming more and more challenging for health professionals to properly summarize those data and practice evidence-based clinical decision making. Moreover, the exploration of unstructured health text data is challenging f...
Article
Introduction Concept normalization is the task of linking terms from textual medical documents to their concept in terminologies such as the UMLS®. Traditional approaches to this problem depend heavily on the coverage of available resources, which poses a problem for languages other than English. Objective We present a system for concept normaliza...
Article
Full-text available
Vector control programmes are a strategic priority in the fight against malaria. However, vector control interventions require rigorous monitoring. Entomological tools for characterizing malaria transmission drivers are limited and are difficult to establish in the field. To predict Anopheles drivers of malaria transmission, such as mosquito age, b...
Article
Full-text available
Introduction Little research has been done to systematically evaluate concerns of people living with diabetes through social media, which has been a powerful tool for social change and to better understand perceptions around health-related issues. This study aims to identify key diabetes-related concerns in the USA and primary emotions associated w...
Preprint
Full-text available
Objective Little is known about the concerns of people living with diabetes. This study aims to identify key diabetes-related concerns in the USA and primary emotions associated with those concerns using information shared on Twitter. Research Design and Methods A total of 11.7 million diabetes-related tweets in English were collected between April...
Article
Objective: We aimed to enhance the performance of a supervised model for clinical named-entity recognition (NER) using medical terminologies. In order to evaluate our system in French, we built a corpus for 5 types of clinical entities. Methods: We used a terminology-based system as baseline, built upon UMLS and SNOMED. Then, we evaluated a biGR...
Conference Paper
An important class of journalistic fact-checking scenarios involves verifying the claims and knowledge of different actors at different moments in time. Claims may be about facts, or about other claims, leading to chains of hearsay. We have recently proposed a data model for (time-anchored) facts, statements and beliefs. It builds upon the W3C's RD...
Chapter
Claims on statistic (numerical) data, e.g., immigrant populations, are often fact-checked. We present a novel approach to extract from text documents, e.g., online media articles, mentions of statistic entities from a reference source. A claim states that an entity has a certain value, at a certain time. This completes a fact-checking pipeline from t...
Conference Paper
Full-text available
News agencies produce thousands of multimedia stories describing events happening in the world that are either scheduled such as sports competitions, political summits and elections, or breaking events such as military conflicts, terrorist attacks, natural disasters, etc. When writing up those stories, journalists refer to contextual background and...
Preprint
We aimed to enhance the performance of a supervised model for clinical named-entity recognition (NER) using medical terminologies. In order to evaluate our system in French, we built a corpus for 5 types of clinical entities. We used a terminology-based system as baseline, built upon UMLS and SNOMED. Then, we evaluated a biGRU-CRF, and a hybrid sy...
Preprint
Full-text available
News agencies produce thousands of multimedia stories describing events happening in the world that are either scheduled such as sports competitions, political summits and elections, or breaking events such as military conflicts, terrorist attacks, natural disasters, etc. When writing up those stories, journalists refer to contextual background and...
Preprint
Full-text available
Objective: Natural language processing can help minimize human intervention in identifying patients meeting eligibility criteria for clinical trials, but there is still a long way to go to obtain a general and systematic approach that is useful for researchers. We describe two methods taking a step in this direction and present their results obtain...
Article
Full-text available
Data journalism designates journalistic work inspired by digital data sources. A particularly popular and active area of data journalism is concerned with fact-checking. The term was born in the journalist community and referred to the process of verifying and ensuring the accuracy of published media content; since 2012, however, it has increasingly f...
Conference Paper
The proliferation of falsehood and misinformation, in particular through the Web, has led to increasing energy being invested into journalistic fact-checking. Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistic databases such as those compiled by state agencies are often used as trusted dat...
Article
Integrating Biology and the Bedside (i2b2) is the de-facto open-source medical tool for cohort discovery. Fast Healthcare Interoperability Resources (FHIR) is a new standard for exchanging health care information electronically. Substitutable Modular third-party Applications (SMART) defines the SMART-on-FHIR specification on how applications shall...
Conference Paper
Full-text available
Fact checking has captured the attention of the media and the public alike; it has also recently received strong attention from the computer science community, in particular from data and knowledge management, natural language processing and information retrieval; we denote these together under the term "content management". In this paper, we ident...
Conference Paper
Statistic data is an important sub-category of open data; it is interesting for many applications, including but not limited to data journalism, as such data is typically of high quality, and reflects (under an aggregated form) important aspects of a society's life such as births, immigration, economy etc. However, such open data is often not publi...
Conference Paper
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and, in a second step, determine which is the best...
