Fig 2 - uploaded by Milos Jovanovik
Source publication
Medical datasets that contain data relating to drugs and chemical substances generally tend to contain multiple variations of a generic name that denotes the same drug or drug product. This ambiguity arises because a single drug, referenced by a unique code, has an active substance that can be known under different chemical names in di...
Similar publications
In the current era of data explosion, data cleaning has become an important part of data analysis and one of the important means of improving data quality. In this paper, the concept, principle, process, detection method and related cleaning algorithm of structural data cleaning are introduced in detail through the data cleaning technology b...
Citations
... Our interest in this subject stems from a challenge we are currently working on with our LinkedDrugs dataset [3], where the manufacturers (Pharmaceutical Organization) and active ingredients (Drug entities) of the collected drug products can be expressed in varying forms, depending on the data source, country of registration, language, etc. Encouraged by the results of our preliminary research [4], we wish to build on it. Given the ambiguity in entity naming in our drug products dataset, we aimed to greatly enhance the dataset's quality, as well as the outcomes of any downstream analytical work by utilizing NER to normalize these name values for the active components and manufacturers. ...
... • The extension of our prior work [3,4] based on the existing BioBERT model to be able to extract two new entity types, namely Drug and Pharmaceutical Organization, and proposing a technique for automatically building the training set, which is beneficial to multiple downstream tasks. • We show how to create the labeled dataset, if we know many class representatives that we want to learn, which can be considered as a semi-supervised method. ...
... The Drug entities set had 20,266 distinct drug brand names, whereas the Pharmaceutical Organization entities set had 3633 unique values. As a part of our earlier effort [3,4], these sets were already extracted and released. ...
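The idea of building a labeled set automatically from known class representatives can be illustrated with a minimal sketch. It assumes plain dictionary (gazetteer) matching with BIO tags. The function names, the label names DRUG and PHARMA_ORG, and the example entity names are hypothetical and are not taken from the cited work, whose actual labeling procedure may differ.

```python
import re


def tokenize(text):
    # Split into word tokens and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def bio_label(tokens, gazetteer):
    """Tag tokens with BIO labels by longest match against known names.

    gazetteer maps an entity label to a list of known surface forms
    (an assumed input format, chosen for illustration).
    """
    # Pre-tokenize every known name so matching happens token by token.
    patterns = []
    for label, names in gazetteer.items():
        for name in names:
            name_tokens = [t.lower() for t in tokenize(name)]
            if name_tokens:  # skip empty names to avoid zero-length matches
                patterns.append((name_tokens, label))
    # Try longer names first so a two-word name wins over its one-word prefix.
    patterns.sort(key=lambda p: len(p[0]), reverse=True)

    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for name_tokens, label in patterns:
            n = len(name_tokens)
            if lowered[i:i + n] == name_tokens:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            # No known name starts at this token.
            i += 1
    return tags


# Hypothetical entity names, used only to exercise the function.
gazetteer = {
    "DRUG": ["Examplin"],
    "PHARMA_ORG": ["Acme", "Acme Pharma"],
}
tokens = tokenize("Acme Pharma markets Examplin in Europe.")
tags = bio_label(tokens, gazetteer)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Exact matching of this kind misses spelling variants and inflected forms, which is one reason a model trained on the resulting labels can still be useful for entities outside the dictionary.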
Even though named entity recognition (NER) has seen tremendous development in recent years, some domain-specific use-cases still require tagging of unique entities, which is not well handled by pre-trained models. Solutions based on enhancing pre-trained models or creating new ones are efficient, but creating reliable labeled training data for them to learn from is still challenging. In this paper, we introduce PharmKE, a text analysis platform tailored to the pharmaceutical industry that uses deep learning at several stages to perform an in-depth semantic analysis of relevant publications. The proposed methodology is used to produce reliably labeled datasets leveraging cutting-edge transfer learning, which are later used to train models for specific entity labeling tasks. By building models for the well-known text-processing libraries spaCy and AllenNLP, this technique is used to find Pharmaceutical Organizations and Drugs in texts from the pharmaceutical domain. The PharmKE platform also incorporates the NER findings to resolve co-references of entities and examine the semantic linkages in each phrase, creating a foundation for further text analysis tasks, such as fact extraction and question answering. Additionally, the knowledge graph created by DBpedia Spotlight for a specific pharmaceutical text is expanded using the identified entities. The proposed methodology achieves an F1-score of about 96% on the NER tasks, which is up to 2% higher than the scores of the fine-tuned BERT and BioBERT models developed using the same dataset. The ultimate benefit of the platform is that pharmaceutical domain specialists may more easily identify the knowledge extracted from the input texts thanks to the platform's visualization of the model findings. Likewise, the proposed techniques can be integrated into mobile and pervasive systems to give patients more relevant and comprehensive information from scanned medication guides.
Similarly, it can provide preliminary insights to patients and even medical personnel on whether a drug from a different vendor is compatible with the patient’s prescription medication.
... Each correctly classified pharmaceutical text is further analyzed by recognizing combined entities through the proposed models, as well as by using BioBERT for the detection of BC5CDR and BioNLP13CG tags [36], which include Disease, Chemical, Cell, Organ, Organism, Gene, etc. Additionally, we use a fine-tuned BioBERT model in order to detect Pharmaceutical Organizations and Drugs, entity classes that are not covered by the standard NER tasks. ...
The challenge of recognizing named entities in a given text has been a very dynamic field in recent years. This is due to advances in neural network architectures, the increase in computing power, and the availability of diverse labeled datasets, which deliver pre-trained, highly accurate models. These tasks are generally focused on tagging common entities, but domain-specific use-cases require tagging custom entities which are not part of the pre-trained models. This can be solved by either fine-tuning the pre-trained models or training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, since manual labeling would be a highly tedious task. In this paper we present PharmKE, a text analysis platform focused on the pharmaceutical domain, which applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles. It performs text classification using state-of-the-art transfer learning models, and thoroughly integrates the results obtained through a proposed methodology. The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks, centered on the pharmaceutical domain. The obtained results are compared to the fine-tuned BERT and BioBERT models trained on the same dataset. Additionally, the PharmKE platform integrates the results obtained from named entity recognition tasks to resolve co-references of entities and analyze the semantic relations in every sentence, thus setting up a baseline for additional text analysis tasks, such as question answering and fact extraction. The recognized entities are also used to expand the knowledge graph generated by DBpedia Spotlight for a given pharmaceutical text.
The challenge of recognizing named entities in a given text has been a very dynamic field in recent years. This task is generally focused on tagging common entities, such as Person, Organization, Date, etc. However, many domain-specific use-cases exist which require tagging custom entities that are not part of the pre-trained models. This can be solved by fine-tuning the pre-trained models or training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, and manual labeling would be a highly tedious task.
This paper presents a text analysis platform focused on the pharmaceutical domain. We perform text classification using state-of-the-art transfer learning models based on spaCy, AllenNLP, BERT, and BioBERT. We developed a methodology for creating accurately labeled training and test datasets, which are used to fine-tune custom entity labeling models. Finally, this methodology is applied to detecting Pharmaceutical Organizations and Drugs in texts from the pharmaceutical domain. The obtained F1 scores are 96.14% for the entities occurring in the training set and 95.14% for the unseen entities, which is noteworthy compared to other state-of-the-art methods. The proposed approach implemented in the platform could be applied in mobile and pervasive systems, since it can provide more relevant and understandable information to patients by allowing them to scan the medication guides of their drugs. Furthermore, the proposed methodology has a potential application in verifying whether another drug from another vendor is compatible with the patient's prescription medicine. Such approaches are the future of patient empowerment.
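For readers unfamiliar with how an F1 score for entity recognition is obtained, the following is a minimal sketch of one common convention: exact-match, entity-level scoring over BIO tag sequences. The abstract does not state which F1 variant the authors report, so this convention, the function names, and the example tags are all assumptions made for illustration.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray "I-" tag is treated leniently as the start of a span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a prediction counts only on an exact span match."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative tags only: the prediction truncates the organization span,
# so one of two gold entities is matched exactly.
gold = ["B-DRUG", "O", "B-PHARMA_ORG", "I-PHARMA_ORG"]
pred = ["B-DRUG", "O", "B-PHARMA_ORG", "O"]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Under this convention a partially matched entity counts as both a false positive and a false negative, which makes the score stricter than token-level F1.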