Abstract
Recognizing named entities in a given text has been a very dynamic research field in recent years. The task generally focuses on tagging common entities, such as Person, Organization, Date, etc. However, many domain-specific use-cases exist which require tagging custom entities that are not part of the pre-trained models. This can be solved by fine-tuning the pre-trained models or by training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, since manual labeling is a highly tedious task.
This paper presents a text analysis platform focused on the pharmaceutical domain. We perform text classification using state-of-the-art transfer learning models based on spaCy, AllenNLP, BERT, and BioBERT. We developed a methodology for creating accurately labeled training and test datasets, which are then used to fine-tune custom entity labeling models. Finally, this methodology is applied in the process of detecting Pharmaceutical Organizations and Drugs in texts from the pharmaceutical domain. The obtained F1 scores are 96.14% for the entities occurring in the training set and 95.14% for unseen entities, which is noteworthy compared to other state-of-the-art methods. The proposed approach implemented in the platform could be applied in mobile and pervasive systems, since it can provide more relevant and understandable information to patients by allowing them to scan the medication guides of their drugs. Furthermore, the proposed methodology has a potential application in verifying whether a drug from another vendor is compatible with the patient's prescription medicine. Such approaches are the future of patient empowerment.
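As a hedged illustration of the custom-entity fine-tuning described above, the sketch below shows how new entity types could be added to a pre-trained spaCy pipeline and trained on a handful of labeled examples. The labels PHARM_ORG and DRUG, the example sentences, and the training loop are illustrative assumptions, not the platform's actual code.

```python
# Minimal sketch (not the platform's actual code): fine-tuning a spaCy NER
# component on a few labeled pharmaceutical examples.
# The entity labels PHARM_ORG and DRUG are illustrative assumptions.
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")          # start from a pre-trained pipeline
ner = nlp.get_pipe("ner")
for label in ("PHARM_ORG", "DRUG"):
    ner.add_label(label)

# Each item: (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    ("Pfizer recalled a batch of atorvastatin.",
     {"entities": [(0, 6, "PHARM_ORG"), (27, 39, "DRUG")]}),
]

# Update only the NER component, keeping the rest of the pipeline frozen.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(10):                      # a few epochs, for illustration
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.2)

doc = nlp("Novartis markets valsartan in Europe.")
print([(ent.text, ent.label_) for ent in doc.ents])
```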
... A frequent information source is the PubMed engine together with the PubTator model [211] for automated annotation. The PharmKE tool [91] labels pharmaceutical entities and the relationships between them. For new diseases, such as COVID-19, LBD techniques have proved useful for extracting relevant information [159; 139]. ...
... It contains ML models for NER, POS tagging, dependency parsing, sentence segmentation, text classification, entity linking, morphological analysis, etc. The library is employed for entity recognition of Pharmaceutical Organizations and Drugs in PharmKE, a text analysis platform focused on the pharmaceutical domain [91]. ...
... Dobreva et al. [51] highlighted drug entities with the help of this tool in the process of extracting drug-disease relations and drug effectiveness. AllenNLP [60] is an open-source research library, built on PyTorch, for developing deep learning models for a wide variety of linguistic tasks. The PharmKE [91] model uses AllenNLP for NER of drugs and pharmaceutical organizations that appear in texts. ...
Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process human language, understand it to a certain degree, and use it in various applications. This area has developed rapidly in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adverse drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.
... In our instance, this tactic is used on texts from the pharmaceutical domain, i.e., in news articles from the domain. The research described in [11], where the basic findings about named entity recognition and knowledge extraction from pharmaceutical texts were reported, is expanded in this paper. ...
Even though named entity recognition (NER) has seen tremendous development in recent years, some domain-specific use-cases still require tagging of unique entities, which is not well handled by pre-trained models. Solutions based on enhancing pre-trained models or creating new ones are efficient, but creating reliable labeled training data for them to learn from is still challenging. In this paper, we introduce PharmKE, a text analysis platform tailored to the pharmaceutical industry that uses deep learning at several stages to perform an in-depth semantic analysis of relevant publications. The proposed methodology is used to produce reliably labeled datasets leveraging cutting-edge transfer learning, which are later used to train models for specific entity labeling tasks. By building models for the well-known text-processing libraries spaCy and AllenNLP, this technique is used to find Pharmaceutical Organizations and Drugs in texts from the pharmaceutical domain. The PharmKE platform also incorporates the NER findings to resolve co-references of entities and examine the semantic linkages in each phrase, creating a foundation for further text analysis tasks, such as fact extraction and question answering. Additionally, the knowledge graph created by DBpedia Spotlight for a specific pharmaceutical text is expanded using the identified entities. The proposed methodology achieves about a 96% F1-score on the NER tasks, which is up to 2% better than that of the fine-tuned BERT and BioBERT models developed using the same dataset. The ultimate benefit of the platform is that pharmaceutical domain specialists may more easily identify the knowledge extracted from the input texts thanks to the platform’s visualization of the model findings. Likewise, the proposed techniques can be integrated into mobile and pervasive systems to give patients more relevant and comprehensive information from scanned medication guides. Similarly, it can provide preliminary insights to patients and even medical personnel on whether a drug from a different vendor is compatible with the patient’s prescription medication.
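The knowledge-graph expansion step mentioned above can be pictured with the following minimal sketch, which adds recognized entity mentions as RDF triples next to DBpedia resources using rdflib. The namespaces, property names, and entity list are hypothetical and are not taken from the PharmKE implementation.

```python
# Illustrative sketch only (not PharmKE's code): adding newly recognized
# entities to an RDF graph alongside DBpedia Spotlight annotations.
# URIs, namespaces, and the entity list are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/pharmke/")   # hypothetical namespace
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("ex", EX)
g.bind("dbr", DBR)

# Entities produced by the NER step: (surface form, label, DBpedia resource or None)
recognized = [
    ("Pfizer", "PHARM_ORG", DBR.Pfizer),
    ("atorvastatin", "DRUG", DBR.Atorvastatin),
]

doc_uri = URIRef(EX["doc/1"])
for surface, label, resource in recognized:
    entity = URIRef(EX[surface.replace(" ", "_")])
    g.add((doc_uri, EX.mentions, entity))
    g.add((entity, RDF.type, EX[label]))
    g.add((entity, RDFS.label, Literal(surface)))
    if resource is not None:                    # link to DBpedia when resolvable
        g.add((entity, EX.sameAsCandidate, resource))

print(g.serialize(format="turtle"))
```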
... recognition [77]–[80], extractive and abstractive document summarization [81]–[84], among others. ...
Rapid technological developments in the last decade have contributed to using machine learning (ML) in various economic sectors. Financial institutions have embraced technology and have applied ML algorithms in trading, portfolio management, and investment advising. Large-scale automation capabilities and cost savings make the ML algorithms attractive for personal and corporate finance applications. Using ML applications in finance raises ethical issues that need to be carefully examined. We engage a group of experts in finance and ethics to evaluate the relationship between ethical principles of finance and ML. The paper compares the experts’ findings with the results obtained using natural language processing (NLP) transformer models, given their ability to capture semantic text similarity. The results reveal that the finance principles of integrity and fairness have the most significant relationships with ML ethics. The study includes a use case with SHapley Additive exPlanations (SHAP) and Microsoft Responsible AI Widgets explainability tools for error analysis and visualization of ML models. It analyzes credit card approval data and demonstrates that the explainability tools can address ethical issues in fintech and improve transparency, thereby increasing the overall trustworthiness of ML models. The results show that both humans and machines could err in approving credit card requests despite using their best judgment based on the available information. Hence, human-machine collaboration could contribute to improved decision-making in finance. We propose a conceptual framework for addressing ethical challenges in fintech such as bias, discrimination, differential pricing, conflict of interest, and data protection.
... There was no part-of-speech tagging involved, since the morphological information is irrelevant to us; a rule-based method was used instead to annotate the sentences. Having an annotated domain-specific dataset provides better results, as was shown in [34]. Table 2 presents the content of the Architecture type, Activation functions and Building blocks categories. ...
Choosing an optimal Deep Learning (DL) architecture and hyperparameters for a particular problem is still not a trivial task for researchers. The most common approach relies on popular architectures proven to work on specific problem domains, evaluated in the same experimental environment and setup. However, this limits the opportunity to choose or invent novel DL networks that could lead to better results. This paper proposes a novel approach for providing general recommendations of an appropriate DL architecture and its hyperparameters based on different configurations presented in thousands of published research papers that examine various problem domains. This architecture can further serve as a starting point for investigating a DL architecture for a concrete data set. Natural language processing (NLP) methods are used to create structured data from unstructured scientific papers, upon which intelligent models are learned to propose the optimal DL architecture, layer types, and activation functions. The advantage of the proposed methodology is multifold. The first is the ability to eventually use the knowledge and experience from thousands of DL papers published through the years. The second is the contribution to forthcoming novel research by aiding the process of choosing an optimal DL setup for the particular problem to be analyzed. The third advantage is the scalability and flexibility of the model, meaning that it can be easily retrained as new papers are published in the future, and therefore be constantly improved.
Given the increase in production of data for the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one of such IE tasks useful for professionals in different areas. There are several settings where biomedical NER is needed, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges including its high cost of annotation, ambiguity, and lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage.
The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages the progress in Neural Machine Translation (NMT) to create a synthetic version of the CRAFT (Colorado Richly Annotated Full-Text) dataset in Spanish. Additionally, a new CRAFT dataset is constructed by replacing 20% of the entities in the original dataset, generating a new augmented dataset. Further, we evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best-performing NER system on the development set achieved an F1 score of 86.39%. The novel methodology proposed in this paper presents the first bilingual NER system, and it has the potential to improve applications across under-resourced languages.
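To make the entity-replacement augmentation concrete, the following simplified sketch swaps a fraction of annotated entity mentions with other mentions of the same type drawn from the corpus. It assumes single-token mentions in BIO format and is only an illustration of the idea, not the paper's actual pipeline.

```python
# Simplified illustration of entity-replacement augmentation (assumed data
# structures; not the paper's actual pipeline). A fraction of annotated
# single-token entity mentions is swapped with other mentions of the same type.
import random

def augment(sentences, replacement_rate=0.2, seed=13):
    """sentences: list of (tokens, tags) pairs with BIO-style tags."""
    rng = random.Random(seed)

    # Collect surface forms per entity type from the whole corpus.
    lexicon = {}
    for tokens, tags in sentences:
        for tok, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                lexicon.setdefault(tag[2:], set()).add(tok)

    augmented = []
    for tokens, tags in sentences:
        new_tokens = list(tokens)
        for i, tag in enumerate(tags):
            if tag.startswith("B-") and rng.random() < replacement_rate:
                candidates = list(lexicon[tag[2:]] - {tokens[i]})
                if candidates:
                    new_tokens[i] = rng.choice(candidates)
        augmented.append((new_tokens, list(tags)))
    return augmented

# Labels B-CHEBI / B-PR are only illustrative placeholders.
corpus = [(["Aspirin", "inhibits", "COX-1", "."],
           ["B-CHEBI", "O", "B-PR", "O"]),
          (["Ibuprofen", "reduces", "fever", "."],
           ["B-CHEBI", "O", "O", "O"])]
print(augment(corpus, replacement_rate=1.0))
```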
Drug repurposing, which is concerned with the study of the effectiveness of existing drugs on new diseases, has been growing in importance in the last few years. One of the core methodologies for drug repurposing is text mining, where novel biological entity relationships are extracted from existing biomedical literature and publications, whose number has skyrocketed in the last couple of years. This paper proposes an NLP approach for drug-disease relation discovery and labeling (DD-RDL), which employs a series of steps to analyze a corpus of abstracts of scientific biomedical research papers. The proposed ML pipeline restructures the free text from a set of words into drug-disease pairs using state-of-the-art text mining methodologies and natural language processing tools. The model’s output is a set of extracted triples in the form (drug, verb, disease), where each triple describes a relationship between a drug and a disease detected in the corpus. We evaluate the model on a gold standard dataset for drug-disease relationships, and we demonstrate that it is possible to achieve similar results without requiring a large amount of annotated biological data or predefined semantic rules. Additionally, as an experimental case, we analyze the research papers published as part of the COVID-19 Open Research Dataset (CORD-19) to extract and identify relations between drugs and diseases related to the ongoing pandemic.
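A minimal sketch of the (drug, verb, disease) extraction idea is shown below, assuming drug and disease mentions are already identified (for example, by an upstream NER step). The function, the lexicons, and the matching strategy are simplified assumptions rather than the DD-RDL implementation.

```python
# Hedged sketch of (drug, verb, disease) triple extraction from a sentence,
# assuming drug and disease mentions are already identified (e.g., by NER).
# This illustrates the idea only; it is not the DD-RDL pipeline itself.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text, drugs, diseases):
    doc = nlp(text)
    triples = []
    for sent in doc.sents:
        verbs = [t for t in sent if t.pos_ == "VERB"]
        found_drugs = [t.text for t in sent if t.text.lower() in drugs]
        found_diseases = [t.text for t in sent if t.text.lower() in diseases]
        for verb in verbs:
            for drug in found_drugs:
                for disease in found_diseases:
                    triples.append((drug, verb.lemma_, disease))
    return triples

print(extract_triples("Remdesivir may treat influenza.",
                      drugs={"remdesivir"}, diseases={"influenza"}))
```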
Motivation:
Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing, extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in natural language processing to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.
Results:
We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement), and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
Availability and implementation:
We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
Supplementary information:
Supplementary data are available at Bioinformatics online.
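For readers who want to experiment, the snippet below loads a BioBERT checkpoint for token classification with the Hugging Face transformers library. The hub identifier "dmis-lab/biobert-base-cased-v1.1" and the label count are assumptions; the repositories above distribute the original pre-trained weights and fine-tuning code.

```python
# Hedged sketch: loading a BioBERT checkpoint for token classification with
# Hugging Face transformers. The hub id "dmis-lab/biobert-base-cased-v1.1"
# is an assumption; the label set below is purely illustrative.
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=5,  # e.g., O, B-DRUG, I-DRUG, B-ORG, I-ORG (illustrative)
)

inputs = tokenizer("Metformin is marketed by Merck.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, tokens, num_labels) before fine-tuning
```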
Medical datasets that contain data relating to drugs and chemical substances generally contain multiple variations of a generic name denoting the same drug or drug product. This ambiguity lies in the fact that a single drug, referenced by a unique code, has an active substance which can be known under different chemical names in different countries, thus forming an obstacle in the process of extracting relevant and useful information. To overcome the issues presented by this ambiguity, we developed a scalable, term-frequency-based data cleaning algorithm that solely uses the data available in the dataset to infer the correct generic name for each drug based on text similarities, thus forming the basis for building a model that would be able to predict generic names for related and previously unseen drug records with high accuracy. This paper describes the application of the algorithm to the cleaning and standardization of an already populated drug product availability dataset, representing all of the variations of a substance under a single generic name and thus eliminating ambiguity. Our proposed algorithm is also evaluated against a Linked Data approach for detecting related drug products in the dataset.
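The general idea of frequency-driven name standardization can be sketched as follows: cluster name variants by string similarity and keep the most frequent form as the canonical generic name. This is a simplified illustration under assumed thresholds, not the paper's exact algorithm.

```python
# Simplified illustration (not the paper's exact algorithm): group drug-name
# variants by string similarity and keep the most frequent form as the
# canonical generic name.
from collections import Counter
from difflib import SequenceMatcher

def canonicalize(names, threshold=0.85):
    counts = Counter(n.strip().lower() for n in names)
    canonical = {}                        # variant -> canonical form
    # Most frequent names become cluster seeds.
    for name, _ in counts.most_common():
        for seed in set(canonical.values()):
            if SequenceMatcher(None, name, seed).ratio() >= threshold:
                canonical[name] = seed
                break
        else:
            canonical[name] = name        # new cluster seed
    return canonical

records = ["Paracetamol", "paracetamol ", "Paracetamolum", "Ibuprofen", "ibuprofen"]
print(canonicalize(records))
```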
Background: Drug product data is available on the Web in a distributed fashion. The reasons lie within the regulatory domains, which exist on a national level. As a consequence, the drug data available on the Web are independently curated by national institutions from each country, leaving the data in varying languages, with a varying structure, granularity level and format, on different locations on the Web. Therefore, one of the main challenges in the realm of drug data is the consolidation and integration of large amounts of heterogeneous data into a comprehensive dataspace, for the purpose of developing data-driven applications. In recent years, the adoption of the Linked Data principles has enabled data publishers to provide structured data on the Web and contextually interlink them with other public datasets, effectively de-siloing them. Defining methodological guidelines and specialized tools for generating Linked Data in the drug domain, applicable on a global scale, is a crucial step to achieving the necessary levels of data consolidation and alignment needed for the development of a global dataset of drug product data. This dataset would then enable a myriad of new usage scenarios, which can, for instance, provide insight into the global availability of different drug categories in different parts of the world.
Results: We developed a methodology and a set of tools which support the process of generating Linked Data in the drug domain. Using them, we generated the LinkedDrugs dataset by seamlessly transforming, consolidating and publishing high-quality, 5-star Linked Drug Data from twenty-three countries, containing over 248,000 drug products, over 99,000,000 RDF triples and over 278,000 links to generic drugs from the LOD Cloud. Using the linked nature of the dataset, we demonstrate its ability to support advanced usage scenarios in the drug domain.
Conclusions: The process of generating the LinkedDrugs dataset demonstrates the applicability of the methodological guidelines and the supporting tools in transforming drug product data from various, independent and distributed sources, into a comprehensive Linked Drug Data dataset. The presented user-centric and analytical usage scenarios over the dataset show the advantages of having a de-siloed, consolidated and comprehensive dataspace of drug data available via the existing infrastructure of the Web.
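A Linked Drug Data dataset of this kind is typically consumed through SPARQL. The sketch below shows a hypothetical query using SPARQLWrapper; the endpoint URL is a placeholder, and the schema.org-based vocabulary is an assumption that may differ from the actual LinkedDrugs deployment.

```python
# Hypothetical sketch of querying a Linked Drug Data endpoint with SPARQL.
# The endpoint URL is a placeholder and the schema.org-based vocabulary is an
# assumption; both may differ from the actual LinkedDrugs deployment.
from SPARQLWrapper import JSON, SPARQLWrapper

endpoint = SPARQLWrapper("http://example.org/sparql")   # placeholder endpoint
endpoint.setQuery("""
    PREFIX schema: <http://schema.org/>
    SELECT ?drug ?name WHERE {
        ?drug a schema:Drug ;
              schema:name ?name .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["drug"]["value"], row["name"]["value"])
```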
We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and among commercial and government users of open-source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good-quality analysis components, and not requiring use of a large amount of associated baggage.
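A typical way to use the toolkit from Python is through the official client in the stanza package, as in the hedged sketch below. It assumes a local CoreNLP installation reachable via the CORENLP_HOME environment variable; the example text and annotator list are illustrative.

```python
# Hedged sketch: running a locally installed Stanford CoreNLP server through
# the official Python client (stanza.server.CoreNLPClient). Assumes CoreNLP
# is installed and CORENLP_HOME points to it.
from stanza.server import CoreNLPClient

text = "Pfizer acquired Wyeth in 2009."
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                   memory="4G", be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.ner)
```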
Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: string-based, corpus-based and knowledge-based similarities. Furthermore, samples of combinations of these similarities are presented.
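The following small example contrasts two of the surveyed families: a string-based measure (character edit ratio) and a corpus-based measure (TF-IDF cosine similarity). The library choices and sentences are illustrative and not taken from the survey.

```python
# Small illustration of two similarity families discussed in the survey:
# a string-based measure (edit ratio) and a corpus-based measure (TF-IDF
# cosine). Library choices here are illustrative, not taken from the survey.
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "The drug reduces blood pressure."
b = "Blood pressure is lowered by the medication."

# String-based: character-level similarity of the raw strings.
print("string-based:", SequenceMatcher(None, a, b).ratio())

# Corpus-based: cosine similarity of TF-IDF vectors.
tfidf = TfidfVectorizer().fit_transform([a, b])
print("corpus-based:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```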
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.
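A hedged sketch of annotating text with the public DBpedia Spotlight web service is given below; the endpoint path, parameters, and response fields reflect the public deployment at the time of writing and may change.

```python
# Hedged sketch of calling the public DBpedia Spotlight REST endpoint.
# Endpoint path, parameters, and response fields reflect the public web
# service at the time of writing and may change.
import requests

response = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Aspirin is produced by Bayer.", "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
```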
Motivation:
State-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type.
Results:
We propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora.
Availability:
Our source code is available at https://github.com/yuzhimanhua/lm-lstm-crf.
Supplementary information:
Supplementary data are available at Bioinformatics online.
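The shared-encoder idea behind the multi-task framework can be sketched schematically in PyTorch: one BiLSTM encoder shared across datasets, with a separate classification head per entity type. This is a simplified illustration and omits the character-level language models and CRF decoding used in the actual system.

```python
# Schematic PyTorch sketch of multi-task sequence labeling: one shared BiLSTM
# encoder with a separate classification head per dataset/entity type. This
# is a simplified illustration, not the authors' LM-LSTM-CRF model (which
# also uses character-level language models and CRF decoding).
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size, num_labels_per_task, emb_dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        # One linear head per task (e.g., genes, chemicals, diseases).
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden, n) for n in num_labels_per_task
        )

    def forward(self, token_ids, task_id):
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task_id](states)      # (batch, seq_len, num_labels)

model = MultiTaskTagger(vocab_size=10000, num_labels_per_task=[3, 5, 3])
tokens = torch.randint(1, 10000, (2, 12))       # dummy batch of 2 sentences
logits = model(tokens, task_id=1)
print(logits.shape)                             # torch.Size([2, 12, 5])
```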
The CoNLL-2012 shared task involved predicting coreference in three languages -- English, Chinese and Arabic -- using OntoNotes data. It was a follow-on to the English-only task organized in 2011. Until the creation of the OntoNotes corpus, resources in this subfield of language processing tended to be limited to noun phrase coreference, often on a restricted set of entities, such as ACE entities. OntoNotes provides a large-scale corpus of general anaphoric coreference not restricted to noun phrases or to a specified set of entity types and covering multiple languages. OntoNotes also provides additional layers of integrated annotation, capturing additional shallow semantic structure. This paper briefly describes the OntoNotes annotation (coreference and other layers), then describes the parameters of the shared task, including the format, pre-processing information and evaluation criteria, and presents and discusses the results achieved by the participating systems. Because the task has a complex evaluation history and multiple evaluation conditions, it has, in the past, been difficult to judge the improvement of new algorithms over previously reported results. Having a standard test set and evaluation parameters, all based on a resource that provides multiple integrated annotation layers (parses, semantic roles, word senses, named entities and coreference) that could support joint models, should help to energize ongoing research in the task of entity and event coreference.
spaCy 2: Natural Language Understanding with Bloom Embeddings