About
26 Publications
5,779 Reads
179 Citations (since 2017)
Introduction
Additional affiliations
September 2020 - present: BuildingMinds, Analyst
January 2020 - August 2020
March 2011 - January 2015
Education
April 2015 - June 2020
May 2010 - April 2012
January 2005 - December 2009
Publications (26)
Objective: Automated clinical phenotyping is challenging because word-based features quickly turn it into a high-dimensional problem, in which the small, privacy-restricted training datasets might lead to overfitting. Pretrained embeddings might solve this issue by reusing input representation schemes trained on a larger dataset. We sought to eva...
Objectives: We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basi...
Clinical narratives in electronic health record systems are a rich resource of patient-based information. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambigu...
Word embeddings have become the predominant representation scheme on a token-level for various clinical natural language processing (NLP) tasks. More recently, character-level neural language models, exploiting recurrent neural networks, have again received attention, because they achieved similar performance against various NLP benchmarks. We inve...
Acronyms frequently occur in clinical text, which makes their identification, disambiguation and resolution an important task in clinical natural language processing. This paper contributes to acronym resolution in Spanish through the creation of a set of sense inventories organized by clinical specialty containing acronyms, their expansions, and c...
From 2017 to 2019 the Text REtrieval Conference (TREC) held a challenge task on precision medicine using documents from medical publications (PubMed) and clinical trials. Despite the many performance measurements carried out in these evaluation campaigns, the scientific community is still largely unsure about the impact individual system features and...
The 2019 Precision Medicine Track at TREC (TREC-PM) aimed at identifying relevant documents from two collections, namely PubMed (biomedical abstracts) and ClinicalTrials.gov (clinical trials), given 40 precision medicine topics representing (virtual) patients. The organizers also proposed a new subtask on treatment retrieval from PubMed. We describ...
The TREC-PM challenge aims for advances in the field of information retrieval applied to precision medicine. Here we describe our experimental setup and the achieved results in its 2018 edition. We explored the use of unsupervised topic models, supervised document classification, and rule-based query-time search term boosting and expansion. We part...
In 2017 the TREC Conference held the Precision Medicine Track with the challenge of finding relevant documents from two collections, namely biomedical abstracts and clinical trials, given a set of 30 input topics representing cancer patients. We proposed a free and open-source (FOSS) Java framework for design, testing, and validation of ranking str...
In this paper we report on our participation in the TREC 2017 Precision Medicine track (team name: imi_mug). We submitted 5 fully automatic runs to both the biomedical articles and clinical trials subtasks, focusing strongly on the former. Our system was based on Elasticsearch, whose queries were generated modularly via our own open source framewo...
Clinical narratives are typically produced under time pressure, which incites the use of abbreviations and acronyms. To expand such short forms in a correct way eases text comprehension and further semantic processing. We propose a completely unsupervised and data-driven algorithm for the resolution of non-lexicalised and potentially ambiguous abbr...
Pathology reports are a main source of information regarding cancer diagnosis and are commonly written following semi-structured templates that include tumour localisation and behaviour. In this work, we evaluated the efficiency of support vector machines (SVMs) to classify pathology reports written in Portuguese into the International Classificati...
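The kind of pipeline described in this abstract can be illustrated with a minimal, hypothetical sketch: bag-of-words TF-IDF features from report text feeding a linear SVM. The report snippets and ICD-O codes below are placeholders, not the actual A.C. Camargo data or the published model configuration.
```python
# Minimal sketch (assumed setup): classifying pathology report text into
# ICD-O topography codes with a linear SVM over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical report snippets and labels for illustration only
reports = [
    "carcinoma ductal invasivo da mama esquerda",
    "adenocarcinoma de colon ascendente",
]
icdo_topography = ["C50", "C18"]

# Unigram + bigram TF-IDF features feeding a linear SVM classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(reports, icdo_topography)

print(model.predict(["tumor maligno do colon ascendente"]))
```
In practice such a classifier would be trained on thousands of labelled reports and evaluated per ICD-O class; the two-example fit above only demonstrates the shape of the pipeline.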
TNM is a classification system for assessment of progression stage of malignant tumors. The physician, upon patient examination, classifies a tumor using three variables: T, N and M. Definitions of values for T, N and M depend on the tumor topography (or body part), specified as ICD-O codes. These values are then used to infer the Clinical Stage (C...
This work develops an automated classifier of pathology reports which infers the topography and the morphology classes of a tumor using codes from the International Classification of Diseases for Oncology (ICD-O). Data from 94,980 patients of the A.C. Camargo Cancer Center was used for training and validation of Naive Bayes classifiers, evaluated b...
Clinical trials are studies designed to assess whether a new intervention is better than the current alternatives. However, most of them fail to recruit participants on schedule. It is hard to use Electronic Health Record (EHR) data to find eligible patients; therefore, studies rely on manual assessment, which is time-consuming, inefficient and requ...
This work aims at developing an automated classifier of pathology reports, which should be able to infer the localization (topography) and the histological type (morphology) of a tumor in the International Classification of Diseases for Oncology (ICD-O). We used data provided by the A.C. Camargo Cancer Center located in São Paulo for training and v...
Clinical reports are usually written in natural language due to its descriptive power and ease of communication among specialists. Processing data for knowledge discovery and statistical analysis requires information retrieval techniques, already established for newswire texts, but still rare in the medical subdomain. The present work aims at devel...
Context: Many clinical documents are written in narrative form and often use free text. In order to identify information contained in clinical narratives, natural language processing (NLP) tools can be applied. An NLP tool can recognize sentences, determine individual words (tokens) from a text group (corpus) and tag them according to language sema...
Part of speech taggers need a considerable amount of data to train their models. Such data is not readily available for medical texts in Portuguese. We evaluated the accuracy of a morphological tagger against a gold standard when trained with corpora of different sizes and domains. Accuracy was the highest with a medical corpus during the complete...
The response time of Internet pages is commonly reported as too high by users. This delay is due to several factors and, although numerous efforts have been made to decrease it, it is not possible to eliminate it completely. Prefetching links in the interval between two requests is the usual solution. In our work, future access prediction is c...
Projects
Project (1)
CBmed: 1.2 - Innovative Use of Information for Clinical Care and Biomarker Research
Abstract
In CBmed Project 1.2, called IICCAB (Innovative Use of Information for Clinical Care and Biomarker Research), large-scale clinical data sets are processed for better re-use.
Data for biomarker research come not only from specialized laboratories but also from different sources, in which routine clinical data are stored. The merging of these data requires automatic semantic normalization in order to aggregate them and to make them available for innovative applications.
The core of the system developed in IICCAB is a high-performance database based on SAP HANA technology. Since an important part of clinical information is found exclusively as narratives within free-text fields of clinical databases, human language technology methods are required to analyse this content and map it to a standardized vocabulary. Together with already available structured data such as lab parameters and disease codes, the extracted data allow the creation of semantically standardized patient profiles.
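As a rough illustration of the mapping step described above, the following sketch normalizes free-text findings against a small standardized vocabulary and combines the result with structured data into a patient profile. The terms, codes, and field names are hypothetical placeholders, not the IICCAB terminology or data model.
```python
# Hypothetical sketch: dictionary-based mapping of clinical free text to a
# standardized vocabulary, merged with structured data into a patient profile.
import re

VOCABULARY = {
    "myocardial infarction": "I21",   # placeholder ICD-10-style codes
    "diabetes mellitus": "E11",
    "pneumonia": "J18",
}

def map_narrative_to_codes(narrative: str) -> list:
    """Return vocabulary codes for every known term found in the free text."""
    text = narrative.lower()
    return [code for term, code in VOCABULARY.items() if re.search(term, text)]

profile = {
    "structured": {"hba1c": 8.2},      # already available structured data
    "coded_findings": map_narrative_to_codes(
        "Patient with known diabetes mellitus, admitted for pneumonia."
    ),
}
print(profile)   # coded_findings: ['E11', 'J18']
```
A production system would of course use full terminology resources and context-aware NLP rather than exact phrase matching; the sketch only shows where the normalization step sits between free text and the standardized patient profile.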
The processed data is the basis for four different application scenarios:
1. "Recruiting" will facilitate patient cohorts in terms of various characteristics, using advanced graphical interfaces for querying and visualization. This is a crucial prerequisite for all kinds of clinical research, especially regarding biomarkers and the use of biosamples.
2. "Prediction" will focus on predictive analytics based on semantically enriched patient profiles, in order to help estimate the probability of future events, such as hospital re-admissions.
3. "Patient QuickView" will provide an automatic summary of decision-relevant patient data, depending on the preferences of each user group and task. Thus, an alternative is provided for time-consuming browsing through numerous documents in electronic health records.
4. "Coding" supports physicians in the coding of disease cases for administrative purposes. IICCAB analyses available clinical data and proposes appropriate disease and procedure codes.
The first phase of the IICCAB project, funded by the Austrian Research Promotion Agency (FFG) within the Austrian competence center for biomarker research CBmed (http://www.cbmed.org/en), runs until the end of 2018 and is headed by Stefan Schulz, university professor of medical informatics at the Medical University of Graz. Cooperation partners are the Styrian Hospital Association (KAGes), the Medical University of Graz, the Biobank Graz, and the German software company SAP.
Project Leader:
Schulz Stefan
Duration:
01.07.2015-31.12.2018