Figure: General overview of the Med1 corpus

Source publication
Conference Paper
Full-text available
Many people share information in social media or forums, like the food they eat, the sports activities they do, or events they have visited. This also applies to information about a person's health status. The information we share online directly or indirectly unveils information about our lifestyle and health situation and thus provides a valuable data r...

Similar publications

Preprint
Full-text available
In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. Together with a high topic imbalance, this makes it a very challenging dataset, since the same symptom can often have several causes and is not always related to medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.

Citations

... deleted messages) and the use of laymen terms and abbreviations (e.g. "AD" as a generic name for any anti-depressant) complicates the processing of these data (Seiffe et al., 2020;Basaldella et al., 2020). It is furthermore difficult to collect these resources (e.g. ...
... Although this is not a multi-lingual model, it might perform better on rare words like medication names, and it might handle spelling mistakes better than wordpiece-based models. Including negation of ADRs (Scaboro et al., 2021) and other corpora, e.g. the TLC corpus (Seiffe et al., 2020), to disambiguate user terms using a mapping from technical to laymen terms and vice versa might also benefit performance. ...
Preprint
Full-text available
In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. Together with a high topic imbalance, this makes it a very challenging dataset, since the same symptom can often have several causes and is not always related to medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.
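The two-stage fine-tuning described above can be illustrated with a hedged sketch based on the Hugging Face transformers and datasets libraries; the file names, the CSV layout with "text" and "label" columns, and the hyperparameters are assumptions for illustration, not the authors' actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # binary: ADR vs. no ADR

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

def finetune(csv_path, output_dir):
    # Hypothetical CSV with "text" and "label" columns.
    ds = load_dataset("csv", data_files=csv_path)["train"]
    ds = ds.map(tokenize, batched=True)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=ds,
            tokenizer=tokenizer).train()

# Stage 1: English patient-forum ADR data; stage 2: the new German corpus
# (both file names are placeholders).
finetune("english_adr_forum.csv", "out/stage1")
finetune("german_adr_forum.csv", "out/stage2")
```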
... The paper lists 13 different German text corpora with a clinical/biomedical context, but only three are freely available: first, GGPOnc [17], a dataset of clinical practice guidelines; second, TLC [18], posts from a patient forum with annotated laymen expressions; and finally JSynCC [19], a dataset of German case reports extracted from medical literature. In the case of JSynCC, the authors provide software to extract the relevant text passages from digital medical books instead of providing the data itself, due to legal reasons. ...
... In order to explore this, we carried out a small proof of concept. To do so, we tested our concept detection on two additional biomedical text datasets in German, namely (a) GGPONC [17], a dataset of clinical practice guidelines, and (b) a set of posts published in a German health forum, taken from the TLC corpus [18]. In both cases we applied the model to 600 sentences from each dataset. ...
Preprint
Full-text available
Background: In the information extraction and natural language processing domain, accessible datasets are crucial for reproducing and comparing results. Publicly available implementations and tools can serve as benchmarks and facilitate the development of more complex applications. However, in the context of clinical text processing, accessible datasets are scarce, and so are existing tools. One of the main reasons is the sensitivity of the data. This problem is even more evident for non-English languages. Approach: In order to address this situation, we introduce a workbench: a collection of German clinical text processing models. The models are trained on a de-identified corpus of German nephrology reports. Result: The presented models provide promising results on in-domain data. Moreover, we show that our models can also be successfully applied to other biomedical text in German. Our workbench is made publicly available so it can be used out of the box, as a benchmark, or transferred to related problems.
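As a rough illustration of how such a workbench model might be applied out of the box, here is a hedged sketch assuming a transformers-compatible NER checkpoint; the model identifier is a placeholder, and the actual workbench may distribute its models in a different format.

```python
from transformers import pipeline

# Placeholder model id; the real workbench models may be shipped
# differently (e.g., as task-specific archives rather than Hub checkpoints).
ner = pipeline("ner",
               model="some-org/german-clinical-ner",
               aggregation_strategy="simple")

# Example German clinical sentence (invented for illustration).
sentence = "Der Patient erhielt 5 mg Ramipril gegen die Hypertonie."
for ent in ner(sentence):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```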
... Additionally, the community has explored leveraging social media postings to monitor public health (Paul and Dredze, 2012;Choudhury et al., 2013;Sarker et al., 2016;Stefanidis et al., 2017), and detect personal health mentions (Yin et al., 2015;Klein et al., 2017;Karisani and Agichtein, 2018). A few studies compare biomedical information in scientific documents with social media: Thorne and Klinger (2017) explore how disease names are referred to across both domains, while Seiffe et al. (2020) look into laypersons' medical vocabulary. A related task is entity normalization which links a given mention of an entity to the respective concept in a formalized medical ontology. ...
Preprint
Full-text available
Text mining and information extraction for the medical domain have focused on scientific text generated by researchers. However, researchers' direct access to individual patient experiences or patient-doctor interactions can be limited. Information provided on social media, e.g., by patients and their relatives, complements the knowledge in scientific text. It reflects the patient's journey and their subjective perspective on the process of developing symptoms, being diagnosed and offered a treatment, and being cured or learning to live with a medical condition. The value of this type of data is therefore twofold: Firstly, it offers direct access to people's perspectives. Secondly, it might cover information that is not available elsewhere, including self-treatment or self-diagnoses. Named entity recognition and relation extraction are methods to structure information that is available in unstructured text. However, existing medical social media corpora have focused on a comparatively small set of entities and relations and on particular domains, rather than putting the patient at the center of analysis. With this paper we contribute a corpus with a rich set of annotation layers, motivated by the goal of uncovering and modeling patients' journeys and experiences in more detail. We label 14 entity classes (incl. environmental factors, diagnostics, biochemical processes, patients' quality-of-life descriptions, pathogens, medical conditions, and treatments) and 20 relation classes (e.g., prevents, influences, interactions, causes), most of which have not been considered before for social media data. The publicly available dataset consists of 2,100 tweets with approx. 6,000 entity and 3,000 relation annotations. In a corpus analysis we find that over 80 % of documents contain relevant entities. Over 50 % of tweets express relations which we consider essential for uncovering patients' narratives about their journeys.
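To make the annotation layers concrete, a minimal (hypothetical) data model for span-based entities and typed relations might look like the sketch below; the entity and relation labels follow the abstract, but the exact schema of the released dataset may differ.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    start: int   # character offset into the tweet text (inclusive)
    end: int     # character offset (exclusive)
    label: str   # one of the 14 entity classes, e.g. "medical_condition"

@dataclass
class Relation:
    head: Entity
    tail: Entity
    label: str   # one of the 20 relation classes, e.g. "causes"

# Invented example tweet with one entity pair and one relation.
text = "Ibuprofen gave me stomach pain."
drug = Entity(0, 9, "treatment")
symptom = Entity(18, 30, "medical_condition")
rel = Relation(drug, symptom, "causes")
print(text[drug.start:drug.end], "->", text[symptom.start:symptom.end], rel.label)
```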
... For German, only GGPONC [3] has been published during our work on our project as a dataset that carries annotation information; other German datasets [5,35] do not. Moreover, the Technical-Laymen corpus [34] provides an annotated corpus, yet it is based on crawled texts from non-professional online forums. Various other German medical text corpora exist [4,6,8,9,14-16,20,22,32,36,37] as a basis for certain NLP and information extraction use cases, but are inaccessible for public distribution. ...
Preprint
Full-text available
Background: Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant information. For German NLP, no open medical neural named entity recognition (NER) model has been published prior to this work. A major issue can be attributed to the lack of German training data. Objective: We develop a novel German medical NER model for public access. In order to bypass legal restrictions due to potential data leaks through model analysis, we do not make use of internal, proprietary datasets. Methods: The underlying German dataset is retrieved by translation and word alignment of a public English dataset. The dataset serves as the foundation for model training and evaluation. Results: The obtained dataset consists of 8,599 sentences including 30,233 annotations. The model achieves an averaged F1 score of 0.82 on the test set after training across seven different NER types. The model is publicly available. Conclusions: We demonstrate the feasibility of training a German medical NER model by the exclusive use of public training data. The sample code and the statistical model are available on GitHub.
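The translation-plus-word-alignment step can be sketched in a few lines: given token-level alignments between the English source and its German translation, an entity span is projected onto the aligned target tokens. The toy alignment below is written by hand; in practice it would come from an alignment tool, and the details of the authors' pipeline may differ.

```python
# Project a source-token entity span [start, end) onto the target sentence
# via a word alignment given as (src_idx, tgt_idx) pairs.
def project_span(src_span, alignment):
    tgt = [t for s, t in alignment if src_span[0] <= s < src_span[1]]
    return (min(tgt), max(tgt) + 1) if tgt else None

src = ["The", "patient", "received", "ibuprofen"]  # English source tokens
tgt = ["Der", "Patient", "erhielt", "Ibuprofen"]   # German translation
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]       # hand-written toy alignment

# A DRUG annotation covers source tokens [3, 4); project it onto the target.
span = project_span((3, 4), alignment)
print(span, tgt[span[0]:span[1]])  # -> (3, 4) ['Ibuprofen']
```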
... A small number of studies looked into the comparison of biomedical information in social media and scientific text: Thorne and Klinger (2017) analyze quantitatively how disease names are referred to across these domains. Seiffe et al. (2020) analyze laypersons' medical vocabulary. ...
Preprint
Full-text available
Social media contains unfiltered and unique information, which is potentially of great value but, in the case of misinformation, can also do great harm. With regard to biomedical topics, false information can be particularly dangerous. Methods of automatic fact-checking and fake news detection address this problem, but have not yet been applied to the biomedical domain in social media. We aim to fill this research gap and annotate a corpus of 1200 tweets for implicit and explicit biomedical claims (the latter also with span annotations for the claim phrase). With this corpus, which we sample to be related to COVID-19, measles, cystic fibrosis, and depression, we develop baseline models which automatically detect tweets that contain a claim. Our analyses reveal that biomedical tweets are densely populated with claims (45 % in a corpus sampled to contain 1200 tweets focused on the domains mentioned above). Baseline classification experiments with embedding-based classifiers and BERT-based transfer learning demonstrate that the detection is challenging; however, performance is acceptable for the identification of explicit expressions of claims. Implicit claim tweets are more challenging to detect.
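For intuition, a minimal claim-detection baseline can be sketched with scikit-learn; note that this uses bag-of-words TF-IDF features rather than the paper's embedding-based or BERT-based models, and the toy tweets and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (invented): 1 = contains a biomedical claim, 0 = no claim.
tweets = ["Vitamin D prevents measles, studies show.",
          "Just got my flu shot, arm hurts a bit."]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["Depression is caused by a chemical imbalance."]))
```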
... A small number of studies looked into the comparison of biomedical information in social media and scientific text: Thorne and Klinger (2018) analyze quantitatively how disease names are referred to across these domains. Seiffe et al. (2020) analyze laypersons' medical vocabulary. ...
Chapter
Due to the vast amount of health-related data on the Internet, a trend toward digital health literacy is emerging among laypersons. We hypothesize that providing trustworthy explanations of informal medical terms in social media can improve information quality. Entity linking (EL) is the task of associating terms with concepts (entities) in a knowledge base. The challenge with EL in lay medical texts is that the source texts are often written in loose and informal language. We propose an end-to-end entity linking approach that involves identifying informal medical terms, normalizing medical concepts according to SNOMED-CT, and linking entities to Wikipedia to provide explanations for laypersons. Keywords: Medical entity linking, Medical concept normalization, Named entity recognition
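The three-stage pipeline (term detection, SNOMED-CT normalization, Wikipedia linking) can be illustrated with a deliberately simplified sketch; the dictionary-lookup detection and the tiny lexicon below are stand-ins for the chapter's actual models. The SNOMED-CT code 22298006 is the well-known concept for myocardial infarction.

```python
# Toy lexicons standing in for learned term detection and normalization.
LAY_TO_SNOMED = {"heart attack": ("22298006", "Myocardial infarction")}
SNOMED_TO_WIKI = {"22298006": "https://en.wikipedia.org/wiki/Myocardial_infarction"}

def link_entities(text):
    links = []
    for lay_term, (code, preferred) in LAY_TO_SNOMED.items():
        if lay_term in text.lower():               # stage 1: detect informal term
            links.append({"mention": lay_term,
                          "snomed_ct": code,       # stage 2: normalize to SNOMED-CT
                          "preferred": preferred,
                          "wikipedia": SNOMED_TO_WIKI[code]})  # stage 3: link for laypersons
    return links

print(link_entities("My dad had a heart attack last year."))
```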
Conference Paper
The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For medical applications, unfortunately, all language communities other than English are low-resourced. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions. Moreover, GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield, and it provides a variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to compare the use of medical language with that in other corpora, both medical and non-medical.
Preprint
Full-text available
The lack of publicly available text corpora is a major obstacle for progress in clinical natural language processing, for non-English speaking countries in particular. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines in the field of oncology. The corpus is one of the largest corpora of German medical text to date. It does not contain any patient-related data and can therefore be used without data protection restrictions. Moreover, it is the first corpus for the German language covering diverse conditions in a large medical subfield. In addition to the textual sources, we provide a large variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to compare the use of medical language with that in other medical text corpora.