Contexts in source publication
Context 1
... are Machine Learning (AI domain), PubMed (biology domain), and Amazon (shopping domain). Table 4 presents the performance of using different auxiliary corpora on the benchmarks. We observe that model performance correlates with the domain relevance between the auxiliary corpus and the target corpus. ...
Context 2
... Table 4, we know the top 30K sentences in PubMed help BioNER via pre-fine-tuning; the top 30K sentences in ML help on SciERC. We also show the performance when choosing the bottom 30K ...
Table 5: The merged set of entity candidates (combining existing dictionaries, pattern mining, and phrase mining results) performs the best to support the MLM CAND and ECB tasks.
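The contexts above describe selecting the top 30K sentences of an auxiliary corpus (PubMed for BioNER, ML for SciERC) by domain relevance before pre-fine-tuning. Below is a minimal sketch of how such a ranking could be computed, assuming TF-IDF vectors and cosine similarity to the target-corpus centroid as the relevance score; the scoring function and variable names are illustrative assumptions, not the paper's actual selection procedure.

```python
# Hypothetical sketch: rank auxiliary-corpus sentences by relevance to a
# target-domain corpus and keep the top K for pre-fine-tuning. The TF-IDF /
# centroid-cosine scoring here is an assumption, not the cited paper's method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_top_k(aux_sentences, target_sentences, k=30_000):
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(aux_sentences + target_sentences)    # shared vocabulary
    aux_vecs = vectorizer.transform(aux_sentences)
    target_centroid = np.asarray(vectorizer.transform(target_sentences).mean(axis=0))
    scores = cosine_similarity(aux_vecs, target_centroid).ravel()
    top_idx = np.argsort(-scores)[:k]                   # highest relevance first
    return [aux_sentences[i] for i in top_idx]

# Toy usage: pubmed_sents stands in for the auxiliary corpus, bio_ner_sents for
# the target-domain corpus; the selected sentences would feed pre-fine-tuning.
pubmed_sents = ["the protein binds to the receptor",
                "mice were injected with the compound daily",
                "stock prices fell sharply after the report"]
bio_ner_sents = ["BRCA1 protein interacts with the estrogen receptor"]
print(select_top_k(pubmed_sents, bio_ner_sents, k=2))
```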
Citations
... For example, the 2011 Protein Coreference dataset is used for the evaluation of methods for coreference extraction among proteins/genes [70]. A combination of the datasets from BioNLP 2009, 2011, and 2013 is used in [71] to benchmark a novel fine-tuning approach. ...
The advent of large language models (LLMs) such as BERT and, more recently, GPT, is transforming our approach to analyzing and understanding biomedical texts. To stay informed about the latest advancements in this area, there is a need for up-to-date summaries on the role of LLMs in Natural Language Processing (NLP) of biomedical texts. Thus, this scoping review aims to provide a detailed overview of the current state of biomedical NLP research and its applications, with a special focus on the evolving role of LLMs. We conducted a systematic search of PubMed, EMBASE, and Google Scholar for studies and conference proceedings published from 2017 to December 19, 2023, that develop or utilize LLMs for NLP tasks in biomedicine. We evaluated the risk of bias in these studies using a 3-item checklist. From 13,823 references, we selected 199 publications and conference proceedings for our review. LLMs are being applied to a wide array of tasks in the biomedical field, including knowledge management, text mining, drug discovery, and evidence synthesis. Prominent among these tasks are text classification, relation extraction, and named entity recognition. Although BERT-based models remain prevalent, the use of GPT-based models has substantially increased since 2023. We conclude that, despite offering opportunities to manage the growing volume of biomedical data, LLMs also present challenges, particularly in clinical medicine and evidence synthesis, such as issues with transparency and privacy concerns.
... Research on multi-task pre-adaptation can be roughly divided into three categories: (1) exploring the effectiveness of pre-adaptation. First, big PTMs can further learn task capabilities that are not reflected in the self-supervised learning signals by incorporating intermediate knowledge transfer from auxiliary tasks, such as text classification [78], named entity recognition [128], relation extraction [82], and question answering [30]. Second, pre-adaptation on domain-specific unlabeled data for downstream tasks can provide rich domain-specific knowledge for PTMs [33,35,56,83]. ...
Pre-training then fine-tuning has recently become a new paradigm in natural language processing, learning better representations of words, sentences, and documents in a self-supervised manner. Pre-trained models not only unify the semantic representations of multiple tasks, languages, and modalities but also exhibit emergent high-level capabilities approaching those of humans. In this chapter, we introduce pre-trained models for representation learning, from pre-training tasks to adaptation approaches for specific tasks. We then discuss several advanced topics toward better pre-trained representations, including better model architectures, multilingual learning, multi-task learning, efficient representations, and chain-of-thought reasoning.
... (1) Limited term coverage - They identify new topics from a set of candidate terms, relying on entity extraction tools (Zeng et al., 2020) or phrase mining techniques (Liu et al., 2015; Shang et al., 2018; Gu et al., 2021) to obtain the high-frequency candidate terms in a corpus. Such extraction techniques miss many low-frequency topic-related terms and thus lead to an incomplete set of candidate terms (Zeng et al., 2021). ...
Topic taxonomies display hierarchical topic structures of a text corpus and provide topical knowledge to enhance various NLP applications. To dynamically incorporate new topic information, several recent studies have tried to expand (or complete) a topic taxonomy by inserting emerging topics identified in a set of new documents. However, existing methods focus only on frequent terms in documents and the local topic-subtopic relations in a taxonomy, which leads to limited topic term coverage and fails to model the global topic hierarchy. In this work, we propose a novel framework for topic taxonomy expansion, named TopicExpan, which directly generates topic-related terms belonging to new topics. Specifically, TopicExpan leverages the hierarchical relation structure surrounding a new topic and the textual content of an input document for topic term generation. This approach encourages newly-inserted topics to further cover important but less frequent terms as well as to keep their relation consistency within the taxonomy. Experimental results on two real-world text corpora show that TopicExpan significantly outperforms other baseline methods in terms of the quality of output taxonomies.
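The limited-coverage issue raised in the context above (and again in the GenTaxo contexts further below) stems from mining candidate terms by corpus frequency. The toy sketch below, assuming a plain n-gram counter with a frequency cutoff rather than a real phrase-mining tool such as AutoPhrase, shows how low-frequency multi-word concepts drop out of the candidate set.

```python
# Toy illustration of frequency-thresholded candidate mining: multi-word terms
# that occur rarely never enter the candidate set, which is the coverage gap
# the cited works describe. The threshold and corpus here are invented.
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    return zip(*(islice(tokens, i, None) for i in range(n)))

def mine_candidates(sentences, min_freq=3, max_n=3):
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for n in range(1, max_n + 1):
            counts.update(" ".join(g) for g in ngrams(tokens, n))
    return {term for term, c in counts.items() if c >= min_freq}

corpus = ["graph neural networks learn node embeddings"] * 5 \
       + ["a graph attention network for recommendation"]      # rare concept, seen once
candidates = mine_candidates(corpus, min_freq=3)
print("graph neural networks" in candidates)    # True: frequent term is kept
print("graph attention network" in candidates)  # False: low-frequency term is missed
```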
... Task-specific multi-task pre-training. Under a typical "pre-train then fine-tune" paradigm, many NLP works attempted to design pre-training tasks that are relevant to downstream objectives (Zeng et al., 2020;Févry et al., 2020;Yu et al., 2022b;Wang et al., 2021b). Such approaches endow the model with task-specific knowledge acquired from massive pre-training data. ...
Multi-task learning (MTL) has become increasingly popular in natural language processing (NLP) because it improves the performance of related tasks by exploiting their commonalities and differences. Nevertheless, it is still not understood very well how multi-task learning can be implemented based on the relatedness of training tasks. In this survey, we review recent advances of multi-task learning methods in NLP, with the aim of summarizing them into two general multi-task training methods based on their task relatedness: (i) joint training and (ii) multi-step training. We present examples in various NLP downstream applications, summarize the task relationships and discuss future directions of this promising topic.
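The survey above separates joint training from multi-step training. Below is a minimal sketch of the joint variant, assuming a shared encoder with one head per task and uniform task sampling; the model sizes, task names, and data are placeholders rather than anything taken from the cited survey.

```python
# Hypothetical joint multi-task training loop: a shared encoder, task-specific
# heads, and batches interleaved across tasks in a single optimization loop.
# Dimensions, task names, and the random "data" are illustrative only.
import random
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        num_labels = {"ner": 5, "cls": 3}                                # assumed tasks
        self.encoder = nn.Sequential(nn.Linear(300, hidden), nn.ReLU())  # stand-in for a PTM
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in num_labels.items()})

    def forward(self, task, x):
        return self.heads[task](self.encoder(x))

model = MultiTaskModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Fake per-task batches of (features, labels).
batches = {
    "ner": [(torch.randn(8, 300), torch.randint(0, 5, (8,))) for _ in range(10)],
    "cls": [(torch.randn(8, 300), torch.randint(0, 3, (8,))) for _ in range(10)],
}

for step in range(20):                       # joint training: tasks share every update
    task = random.choice(list(batches))      # uniform task sampling
    x, y = random.choice(batches[task])
    loss = loss_fn(model(task, x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A multi-step variant would instead run this loop to completion on the auxiliary task(s) first and only then fine-tune on the target task.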
... We observe that many concepts that have multiple words appear fewer than 100 times in the corpus (as depicted by the red outlined nodes in Figure 1), and around half of the terms cannot be found in the corpus (see Table 1 in Section 4). Concept extraction tools [38] often fail to rank them near the top of a list of over half a million concept candidates, and there is insufficient data to learn their embedding vectors. The incompleteness of concepts is a critical challenge in taxonomy completion and has not yet been properly studied. ...
Automatic construction of a taxonomy supports many applications in e-commerce, web search, and question answering. Existing taxonomy expansion or completion methods assume that new concepts have been accurately extracted and their embedding vectors learned from the text corpus. However, one critical and fundamental challenge in fixing the incompleteness of taxonomies is the incompleteness of the extracted concepts, especially for those whose names have multiple words and consequently low frequency in the corpus. To resolve the limitations of extraction-based methods, we propose GenTaxo to enhance taxonomy completion by identifying positions in existing taxonomies that need new concepts and then generating appropriate concept names. Instead of relying on the corpus for concept embeddings, GenTaxo learns the contextual embeddings from their surrounding graph-based and language-based relational information, and leverages the corpus for pre-training a concept name generator. Experimental results demonstrate that GenTaxo improves the completeness of taxonomies over existing methods.
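As a toy illustration of the two stages the GenTaxo abstract describes, the sketch below first flags taxonomy positions that look incomplete and then assembles the surrounding structure a name generator would condition on. The "needs a concept" heuristic, the context format, and the example taxonomy are invented for illustration; they are not GenTaxo's actual architecture, which learns graph- and text-based contextual embeddings and trains a neural concept-name generator.

```python
# Toy illustration only: (1) pick taxonomy positions that look incomplete,
# (2) gather the surrounding graph context a generator would condition on.
# The heuristic, context format, and generate step are invented placeholders.
taxonomy = {
    "machine learning": ["supervised learning", "unsupervised learning"],
    "supervised learning": ["classification", "regression"],
    "unsupervised learning": ["clustering"],          # sparser than its sibling
}

def positions_needing_concepts(tax):
    avg = sum(len(children) for children in tax.values()) / len(tax)
    return [parent for parent, children in tax.items() if len(children) < avg]

def build_context(tax, parent):
    # Surrounding structure the generator would condition on: the parent node
    # and its existing children (siblings of the missing concept).
    return {"parent": parent, "siblings": tax[parent]}

for parent in positions_needing_concepts(taxonomy):
    ctx = build_context(taxonomy, parent)
    print(f"generate a new child of '{ctx['parent']}' given siblings {ctx['siblings']}")
    # A trained generator would emit a concept name here, e.g. something like
    # "dimensionality reduction" for the 'unsupervised learning' position.
```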