About
27
Publications
6,795
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
263
Citations
Introduction
Additional affiliations
September 2020 - present
BuildingMinds
Position
- Analyst
January 2020 - August 2020
March 2011 - January 2015
Education
April 2015 - June 2020
May 2010 - April 2012
January 2005 - December 2009
Publications
Publications (27)
Objective:
Automated clinical phenotyping is challenging because word-based features quickly turn it into a high-dimensional problem, in which the small, privacy-restricted, training datasets might lead to overfitting. Pretrained embeddings might solve this issue by reusing input representation schemes trained on a larger dataset. We sought to eva...
Objectives: We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basi...
Clinical narratives in electronic health record systems are a rich resource of patient-based in- formation. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambigu...
Word embeddings have become the predominant representation scheme on a token-level for various clinical natural language processing (NLP) tasks. More recently, character-level neural language models, exploiting recurrent neural networks, have again received attention, because they achieved similar performance against various NLP benchmarks. We inve...
Acronyms frequently occur in clinical text, which makes their identification, disambiguation and resolution an important task in clinical natural language processing. This paper contributes to acronym resolution in Spanish through the creation of a set of sense inventories organized by clinical specialty containing acronyms, their expansions, and c...
From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task on precision medicine using documents from medical publications (PubMed) and clinical trials. Despite lots of performance measurements carried out in these evaluation campaigns, the scientific community is still pretty unsure about the impact individual system features and...
The 2019 Precision Medicine Track at TREC (TREC-PM) aimed at identifying relevant documents from two collections, namely PubMed (biomedical abstracts) and ClinicalTrials.gov (clinical trials), given 40 precision medicine topics representing (virtual) patients. The organizers also proposed a new subtask on treatment retrieval from PubMed. We describ...
The TREC-PM challenge aims for advances in the field of information retrieval applied to precision medicine. Here we describe our experimental setup and the achieved results in its 2018 edition. We explored the use of unsupervised topic models, supervised document classification, and rule-based query-time search term boosting and expansion. We part...
The TREC Conference held in 2017 the Precision Medicine Track with the challenge of finding relevant documents from two collections, namely biomedical abstracts and clinical trials, given a set of 30 input topics representing cancer patients. We proposed a free and open-source (FOSS) Java framework for design, testing, and validation of ranking str...
In this paper we report on our participation in the TREC 2017 Precision Medicine track (team name: imi_mug ). We submitted 5 fully automatic runs to both the biomedical articles and clinical trials subtasks, focusing strongly on the former. Our system was based on Elasticsearch, whose queries were generated modularly via our own open source framewo...
Clinical narratives are typically produced under time pressure, which incites the use of abbreviations and acronyms. To expand such short forms in a correct way eases text comprehension and further semantic processing. We propose a completely unsupervised and data-driven algorithm for the resolution of non-lexicalised and potentially ambiguous abbr...
Pathology reports are a main source of information regarding cancer diagnosis and are commonly written following semi-structured templates that include tumour localisation and behaviour. In this work, we evaluated the efficiency of support vector machines (SVMs) to classify pathology reports written in Portuguese into the International Classificati...
TNM is a classification system for assessment of progression stage of malignant tumors. The physician, upon patient examination, classifies a tumor using three variables: T, N and M. Definitions of values for T, N and M depend on the tumor topography (or body part), specified as ICD-O codes. These values are then used to infer the Clinical Stage (C...
This work develops an automated classifier of pathology reports which infers the topography and the morphology classes of a tumor using codes from the International Classification of Diseases for Oncology (ICD-O). Data from 94,980 patients of the A.C. Camargo Cancer Center was used for training and validation of Naive Bayes classifiers, evaluated b...
Clinical trials are studies designed to assess whether a new intervention is better than the current alternatives. However, most of them fail to recruit participants on schedule. It is hard to use Electronic Health Record (EHR) data to find eligible patients, therefore studies rely on manual assessment, which is time consuming, inefficient and requ...
This work aims at developing an automated classifier of pathology reports, which should be able to infer the localization (topography) and the histological type (morphology) of a tumor in the International Classification of Diseases for Oncology (ICD-O). We used data provided by the A.C. Camargo Cancer Center located in São Paulo for training and v...
Clinical reports are usually written in natural language due to its descriptive power and ease of communication among specialists. Processing data for knowledge discovery and statistical analysis requires information retrieval techniques, already established for newswire texts, but still rare in the medical subdomain. The present work aims at devel...
Context: Many clinical documents are written in narrative form and often use free text. In order to identify information contained in clinical narratives, natural language processing (NLP) tools can be applied. An NLP tool can recognize sentences, determine individual words (tokens) from a text group (corpus) and tag them according to language sema...
Part of speech taggers need a considerable amount of data to train their models. Such data is not readily available for medical texts in Portuguese. We evaluated the accuracy of a morphological tagger against a gold standard when trained with corpora of different sizes and domains. Accuracy was the highest with a medical corpus during the complete...
Internet pages response time is commonly related as too high by its users. This delay is due to several factors and, although numerous efforts have been made in order to decrease it, it is not possible to eliminate it completely. Prefetching links in the interval between two requests is the usual solution. In our work, future access prediction is c...