
Pierre Zweigenbaum, French National Centre for Scientific Research
Publications: 314
Reads: 69,465
Citations: 5,693
Publications (314)
In a parallel corpus we know which document is a translation of what by design. If the link between documents in different languages is not known, it needs to be established. In this chapter we will discuss methods for measuring document similarity across languages and how to evaluate the results. Then, we will proceed to discussing methods for bui...
With the advent of Neural Machine Translation (NMT), a breakthrough has been achieved with regard to translation quality when compared to previous approaches such as rule-based, example-based, and statistical machine translation (MT). NMT systems tend to be considered as black boxes and it is not easy to predict their behavior.
The aim of the Bilingual Lexicon Induction (BLI) task is to produce a bilingual lexicon using a pair of comparable corpora and either a small set of seed translations (a supervised setting) or no seeds at all (an unsupervised setting). A traditional bilingual dictionary usually offers a structure of senses and conditions for their translations, as...
As explained in Chap. 1 and later developed in Chap. 6, Machine Translation (MT) engines need to be trained with large numbers of parallel sentences or segments. The quantity and diversity of existing parallel text is limited however. This motivates the search for parallel sentences in comparable corpora. By exploring a larger share of the levels o...
In the beginning of the 2000s the use of comparable corpora was on the margins of NLP research. Existing MT systems were nearly always based on fully parallel corpora, while NLP applications were mostly built separately in each language without the advantages of cross-lingual transfer.
When we start working across languages, we need to determine ways of measuring similarity of entities (such as a word, a phrase, a sentence or a link between entities) within each language and across languages. Modern approaches usually rely on the Vector Space Model (VSM), which uses a numerical vector \(x\) of a specified dimensionality \(D\) to...
This section concerns applications of comparable corpora beyond pure machine translation. It has been argued [1, 2] that downstream applications such as cross-lingual document classification, information retrieval or natural language inference, apart from proving the practical utility of NLP methods
Collecting relations between chemicals and drugs is crucial in biomedical research. The pre-trained transformer model, e.g. Bidirectional Encoder Representations from Transformers (BERT), is shown to have limitations on biomedical texts; more specifically, the lack of annotated data makes relation extraction (RE) from biomedical texts very challeng...
Recently many studies have been conducted on the topic of relation extraction. The DrugProt track at BioCreative VII provides a manually-annotated corpus for the purpose of the development and evaluation of relation extraction systems, in which interactions between chemicals and genes are studied. We describe the ensemble system that we used for ou...
We investigate a method to extract relations from texts based on global alignment and syntactic information. Combined with SVM, this method is shown to have a performance comparable or even better than LSTM on two RE tasks.
Simulated consultations through virtual patients allow medical students to practice history-taking skills. Ideally, applications should provide interactions in natural language and be multi-case, multi-specialty. Nevertheless, few systems handle or are tested on a large variety of cases. We present a virtual patient dialogue system in which a medic...
Background
Entity normalization is an important information extraction task which has gained renewed attention in the last decade, particularly in the biomedical and life science domains. In these domains, and more generally in all specialized domains, this task is still challenging for the latest machine learning-based approaches, which have diffi...
The European MAPA (Multilingual Anonymisation for Public Administrations) project aims at developing an open-source solution for automatic de-identification of medical and legal documents. We introduce here the context, partners and aims of the project, and report on preliminary results.
Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system even if it is not intrinsically linked to the notion of Transformer. While this system is thought to achieve a good balance between the fle...
Background:
It is of utmost importance to investigate novel therapies for cancer, as it is a major cause of death. In recent years, immunotherapies, especially those against immune checkpoints, have been developed and brought significant improvement in cancer management. However, on the other hand, immune checkpoints blockade (ICB) by monoclonal a...
This chapter provides an overview of the role of artificial intelligence in natural language processing. We follow the chronology of the development of natural language processing systems (Sect. 2). This review is necessarily partial and subjective: rather than providing a general introduction to natural language processing, it focuses on logical a...
This paper proposes a new collaborative and inclusive model for Knowledge Organization Systems (KOS) for sustaining cultural heritage and language diversity. It is based on contributions of end-users as well as scientific and scholarly communities from across borders, languages, nations, continents, and disciplines. It consists in collecting knowle...
Recent work in cross-lingual contextual word embedding learning cannot handle multi-sense words well. In this work, we explore the characteristics of contextual word embeddings and show the link between contextual word embeddings and word senses. We propose two improving solutions by considering contextual multi-sense word embeddings as noise (remo...
We report initial experiments for analyzing social media through an NLP annotation tool on web posts about medications of current interests (baclofen, levothyroxine and vaccines) and summaries of product characteristics (SPCs). We conducted supervised experiments on a subset of messages annotated by experts according to positive or negative misuse;...
Timely mortality surveillance in France is based on the monitoring of electronic death certificates to provide information to health authorities. This study aims to analyze the performance of a rule-based and a supervised machine learning method to classify medical causes of death into 60 mortality syndromic groups (MSGs). Performance was first mea...
Virtual patient software allows health professionals to practise their skills by interacting with tools simulating clinical scenarios. A natural language dialogue system can provide natural interaction for medical history-taking. However, the large number of concepts and terms in the medical domain makes the creation of such a system a demanding ta...
Background:
Mortality surveillance is of fundamental importance to public health surveillance. The real-time recording of death certificates, thanks to Electronic Death Registration System (EDRS), provides valuable data for reactive mortality surveillance based on medical causes of death in free-text format. Reactive mortality surveillance is base...
Objective
This study aims to implement and evaluate two automatic classification methods of free-text medical causes of death into Mortality Syndromic Groups (MSGs) in order to be used for reactive mortality surveillance.
Introduction
Mortality is an indicator of the severity of the impact of an event on the population. In France mortality surveillan...
Introduction
Medical causes of death are recorded by physicians on death certificates as free text, with a wide variety of expressions. Natural language processing (NLP) methods make it possible to consider exploiting them within short time frames. This article describes the approach adopted to develop these methods and illustrates...
Objectives: To summarize recent research and present a selection of the best papers published in 2017 in the field of clinical Natural Language Processing (NLP).
Methods: A survey of the literature was performed by the two editors of the NLP section of the International Medical Informatics Association (IMIA) Yearbook. Bibliographic databases PubMed...
Integrating Biology and the Bedside (i2b2) is the de-facto open-source medical tool for cohort discovery. Fast Healthcare Interoperability Resources (FHIR) is a new standard for exchanging health care information electronically. Substitutable Modular third-party Applications (SMART) defines the SMART-on-FHIR specification on how applications shall...
Despite considerable recent attention to problems with reproducibility of scientific research, there is a striking lack of agreement about the definition of the term. That is a problem, because the lack of a consensus definition makes it difficult to compare studies of reproducibility, and thus to have even a broad overview of the state of the issu...
Background:
Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area.
Main b...
Objectives: To summarize recent research and present a selection of the best papers published in 2016 in the field of clinical Natural Language Processing (NLP).
Method: A survey of the literature was performed by the two section editors of the IMIA Yearbook NLP section. Bibliographic databases were searched for papers with a focus on NLP efforts a...
Prior knowledge of the distributional characteristics of linguistic phenomena can be useful for a variety of language processing tasks. This paper describes the distribution of negation in two types of biomedical texts: scientific journal articles and progress notes. Two types of negation are examined: explicit negation at the syntactic level and a...
Objective:
To summarize recent research and present a selection of the best papers published in 2015 in the field of clinical Natural Language Processing (NLP).
Method:
A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts a...
This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab which extended the previous information extraction tasks of ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities includi...
This paper presents the SeeDev Task of the BioNLP Shared Task 2016. The purpose of the SeeDev Task is the extraction from scientific articles of the descriptions of genetic and molecular mechanisms involved in seed development of the model plant, Arabidopsis thaliana. The SeeDev task consists in the extraction of many different event types that inv...
This paper highlights some of the recent developments in the field of machine translation using comparable corpora. We start by updating previous definitions of comparable corpora and then look at bilingual versions of continuous vector space models. Recently, neural networks have been used to obtain latent context representations with only few dim...
We introduce a dialogue task between a virtual patient and a doctor where the dialogue system, playing the patient part in a simulated consultation, must reconcile a specialized level, to understand what the doctor says, and a lay level, to output realistic patient-language utterances. This increases the challenges in the analysis and generation ph...
While measuring the readability of texts has been a long-standing research topic, assessing the technicality of terms has only been addressed more recently and mostly for the English language. In this paper, we train a learning-to-rank model to determine a specialization degree for each term found in a given list. Since no training data for this ta...
The number of patients that benefit from remote monitoring of cardiac implantable electronic devices, such as pacemakers and defibrillators, is growing rapidly. Consequently, the huge number of alerts that are generated and transmitted to the physicians represents a challenge to handle. We have developed a system based on a formal ontology that int...
We describe a prototype dialogue system that simulates a patient during a medical consultation and whose purpose is the training of healthcare personnel. Input is typed on a keyboard and output is spoken. We focus here on the methods put in place to conduct a natural dialogue in order to avoid creating a feeling...
Information Extraction Challenge Gene Regulation Network in Arabidopsis thaliana (GRNA)
Aims
Remote monitoring of cardiac implantable electronic devices is a growing standard; yet, remote follow-up and management of alerts represents a time-consuming task for physicians or trained staff. This study evaluates an automatic mechanism based on artificial intelligence tools to filter atrial fibrillation (AF) alerts based on their medical s...
Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities related to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem. An essential aspect in PV is to acquire knowledge about Drug-Drug Interactions (DDI). The shared tasks on DDI-Extraction organized i...
A comprehensive understanding of the molecular network underlying seed development regulations remains a major scientific challenge with important potential impact for fundamental research, agriculture and industry. Seed development requires the coordinated growth of different tissues that involves complex genetics and environmental regulations. Mo...
This paper describes the work-in-progress prototype of a dialog system that simulates a virtual patient (VP) consultation. We report some challenges and difficulties that are found during its development, especially in managing the interaction and the vocabulary from the medical domain.
To summarize recent research and present a selection of the best papers published in 2014 in the field of clinical Natural Language Processing (NLP).
A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical tex...
The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its adv...
The acquisition of knowledge about relations between bacteria and their locations (habitats and geographical locations) in short texts about bacteria, as defined in the BioNLP-ST 2013 Bacteria Biotope task, depends on the detection of co-reference links between mentions of entities of each of these three types. To our knowledge, no participant in t...
The determination of risk factors and their temporal relations in natural language patient records is a complex task which has been addressed in the i2b2/UTHealth 2014 shared task. In this context, in most systems it was broadly decomposed into two sub-tasks implemented by two components: entity detection, and temporal relation determination. Task-...
Seed is the main vector for breeding and production of annual field crops, and the accumulation of seed storage compounds (sugars, lipids, proteins) is of primary importance for food, feed and industrial uses. Seed development requires the coordinated growth of different tissues and involves complex genetics and environmental regulations. A compreh...
We summarise the organisation and results of the first shared task aimed at detecting the most similar texts in a large multilingual collection. The dataset of the shared task was based on Wikipedia dumps with interlanguage links with further filtering to ensure comparability of the paired articles. The eleven system runs we received have been evaluated...
Unsupervised word classes induced from unannotated text corpora are increasingly used to help tasks addressed by supervised classification, such as standard named entity detection. This paper studies the contribution of unsupervised word classes to a medical entity detection task with two specific objectives: How do unsupervised word classes compar...
Building the regulatory network involved in seed development in Arabidopsis thaliana
ABSTRACT. We present an approach, called MEANS, for finding answers to medical questions asked in natural language. This approach is based on NLP techniques for extracting the medical entities and semantic relations expressed in the questions and the medical corpora. MEANS uses semantic Web languages...
This chapter explains why it is hard to use medical language in computer applications and why the computer must adopt the human interpretation of medical words to avoid misunderstandings linked to ambiguity, homonymy and synonymy. Terminological resources are specific representations of medical language for dedicated use in particular health domain...
We present a medical question answering approach, called MEANS. This approach relies on natural language processing techniques to extract medical entities and relations from the user questions and medical corpora. It also uses semantic Web languages to represent and query the information searched by the users. This feature makes it possible to share the infor...
The beginning of the 1990s marked a radical turn in various NLP applications towards using large collections of texts.
Paraphrases are a key feature in many natural language processing applications, and their extraction and generation are important tasks to tackle. Given two comparable corpora in the same language and the same domain, but displaying two different discourse types (lay and specialized), specific paraphrases can be spotted which provide a dimension al...
Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of language...
This paper presents an extension of the standard approach used for bilingual lexicon extraction from comparable corpora. We study the ambiguity problem revealed by the seed bilingual dictionary used to translate context vectors and augment the standard approach by a Word Sense Disambiguation process. Our aim is to identify the translations of word...
This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the problem associated with polysemous words found in the seed bilingual lexicon when translating source context vectors. To improve the adequacy of context vectors, the use of a WordNet-based Word Sense Disamb...
This research tackles the automatic annotation of texts written in a language L1 by exploiting resources and tools available for another language L2. Our approach involves the use of a parallel corpus (L1-L2) aligned at the level of sentences and words. To address the lack of annotated French corpus in the medical field, we focus on the French-Engl...
We present a named-entity recognition (NER) system for parallel multilingual text. Our system handles three languages (i.e., English, French, and Spanish) and is tailored to the biomedical domain. For each language, we design a supervised knowledge-based CRF model with rich biomedical and general domain information. We use the sentence alignment...
We present our participation in Task 1a of the 2013 CLEFeHEALTH Challenge, whose goal was the identification of disorder named entities from electronic medical records. We developed a supervised CRF model that, based on a rich set of features, learns to predict disorder named entities. The CRF system uses external knowledge from specialized biomedica...
We present our participation in Task 2 of the 2013 CLEFeHEALTH Challenge, whose goal was to determine the UMLS concept unique identifier (CUI), if available, of an abbreviation or acronym. We hypothesize that considering only the abbreviations of the training corpus could be sufficient to provide a strong baseline for this task. We therefore test h...
In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two...
Medical entity recognition is currently generally performed by data-driven methods based on supervised machine learning. Expert-based systems, where linguistic and domain expertise are directly provided to the system are often combined with data-driven systems. We present here a case study where an existing expert-based medical entity recognition s...
In the context of prenatal diagnosis of malformation, knowledge of “similar” and resolved cases (i.e. previous cases with a diagnosis validated by fetus autopsy) is essential for diagnosis orientation. Therefore, access to biomedical data accumulated over the years by fetopathology experts specializing in the study of foetal malformations is crucia...
This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the unresolved problem of polysemous words revealed by the bilingual dictionary and introduce a use of a Word Sense Disambiguation process that aims at improving the adequacy of context vectors. On two specia...
This article presents a new method aimed at improving the results of the standard approach used for bilingual lexicon extraction from specialized comparable corpora. We attempt to solve the problem of word polysemy in context vectors by introducing a semantic disambiguation process based...
Objective:
To identify the temporal relations between clinical events and temporal expressions in clinical reports, as defined in the i2b2/VA 2012 challenge.
Design:
To detect clinical events, we used rules and Conditional Random Fields. We built Random Forest models to identify event modality and polarity. To identify temporal expressions we bu...
Identification of co-referent entity mentions inside text has significant importance for other natural language processing (NLP) tasks (e.g. event linking). However, this task, known as co-reference resolution, remains a complex problem, partly because of the confusion over different evaluation metrics and partly because the well-researched existing...
After reading this chapter, you should:
have understood the particularities of medical language and the notions of synonymy, homonymy, and ambiguity;
have understood the need to formalize medical language in the context of the computerization of health activities;
understand the notion of concept, the link between reality and verbal expression;
co...