
Wei Lu - Wuhan University
About
136 Publications
16,406 Reads
1,331 Citations
Introduction
Current institution
Additional affiliations
September 2002 - present
Publications (136)
Parameter-efficient fine-tuning (PEFT) methods optimize large language models (LLMs) by modifying or introducing a small number of parameters to enhance alignment with downstream tasks. However, they can result in catastrophic forgetting, where LLMs prioritize new knowledge at the expense of comprehensive world knowledge. A promising approach to mi...
Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become essential for tasks such as question answering and content generation. However, their increasing impact on public opinion and information dissemination has made them a critical focus for security research due to inherent vulnerabilities. Previous studies...
Topic analysis aims to study topic evolution and trends in order to help researchers understand the process of knowledge evolution and creation. This paper develops a novel topic evolution analysis framework, which we use to demonstrate, forecast, and explain topic evolution from the perspective of the geometrical motion of topic embeddings generat...
This study quantifies and analyzes scientists’ individual-level abilities in utilizing either an exploration or an exploitation strategy. Specifically, we present a Research Strategy Q model, which untangles the coupling effect of scientists’ research ability (Qα) and research strategy ability (Eαπ) on research performance. Qα indicates scienti...
Even though significant progress has been made in standardizing document layout analysis, complex layout documents like magazines and newspapers still present challenges. Models trained on standardized documents struggle with these complexities, and the high cost of annotating such documents limits dataset availability. To address this, we propose...
The rapid development of LLMs brings both convenience and potential threats. As customized and private LLMs are widely applied, model copyright protection has become important. Text watermarking is emerging as a promising solution to AI-generated text detection and model protection issues. However, current text watermarks have largely ignored the cri...
Purpose
Fine-tuning pre-trained language models (PLMs), e.g., SciBERT, generally requires large amounts of annotated data to achieve state-of-the-art performance on a range of NLP tasks in the scientific domain. However, obtaining fine-tuning data for scientific NLP tasks is still challenging and expensive. In this paper, the authors propose the mix...
The Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2024; https://eeke-workshop.github.io/) and the 4th AI + Informetrics (AII2024; https://ai-informetrics.github.io/) was held in Changchun, China and online, co-located with the iConference2024. The two workshop series are designed to activel...
This paper tackles a key issue in the interpretation of scientific figures: the fine-grained alignment of text and figures. It advances beyond prior research that primarily dealt with straightforward, data-driven visualizations such as bar and pie charts and only offered a basic understanding of diagrams through captioning and classification. We in...
Retrieval-Augmented Generation (RAG) is applied to address the hallucination problems and real-time constraints of large language models, but it also introduces vulnerabilities to retrieval corruption attacks. Existing research mainly explores the unreliability of RAG in white-box and closed-domain QA tasks. In this paper, we aim to reveal the vulnerab...
Tables and figures are usually used to present information in a structured and visual way in scientific documents. Understanding the tables and figures in scientific documents is significant for a series of downstream tasks, such as academic search, scientific knowledge graphs, and so on. Existing studies mainly focus on detecting figures and table...
The swift advancement of Large Language Models (LLMs) and their associated applications has ushered in a new era of convenience, but it also harbors the risks of misuse, such as academic cheating. To mitigate such risks, AI-generated text detectors have been widely adopted in educational and academic scenarios. However, their effectiveness and robu...
Scholar performance evaluation is extremely important in research assessment decisions, such as funding allocation, academic rankings, and academic promotion. In this article, we propose the institution Q model (IQ) and its two variants (IQ-2 and IQ-3), which aim to evaluate the individual-level research ability to publish high-quality scientific p...
Influential scientific papers tend to be primarily based on combinations of prior works. However, assessing the potential impact of a new scientific paper remains a challenging task. In this article, we introduce an innovative framework to investigate the relationship between the embedding of citation networks and a paper’s future citation counts,...
The Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023; https://eeke-workshop.github.io/) and the 3rd AI + Informetrics (AII2023; https://ai-informetrics.github.io/) was held at Santa Fe, New Mexico, USA and online, co-located with the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2...
Fast-growing scientific publications present challenges to the scientific community. In this paper, we describe their implications for researchers. As references form explicit foundations for researchers to conduct a study, we investigate the evolution in reference patterns based on 60.8 million papers published from 1960 to 2015. The results demons...
Fine-tuning pre-trained language models (PLMs), e.g., SciBERT, generally requires large amounts of annotated data to achieve state-of-the-art performance on a range of NLP tasks in the scientific domain. However, obtaining fine-tuning data for scientific NLP tasks is still challenging and expensive. Inspired by recent advances in prompt learnin...
The potential impact of a paper is often quantified by how many citations it will receive. However, most commonly used models may underestimate the influence of newly published papers over time, and fail to encapsulate the dynamics of the citation network in the graph. In this study, we construct hierarchical and heterogeneous graphs for target pape...
The increasingly mature artificial intelligence technologies, such as big data, deep learning, and natural language processing, provide technical support for research on automatic text understanding and bring development opportunities for innovative measurement of scientific communication. Innovation measurement in scientific communication is a cha...
Informal knowledge constantly transitions into formal domain knowledge in the dynamic knowledge base. This article focuses on an integrative understanding of the knowledge role transition from the perspective of knowledge codification. The transition process is characterized by several dynamics involving a variety of bibliometric entities, such as...
Exploration-exploitation behavior in topic choice has consistently attracted a great deal of research attention. In this study, we propose five novel research strategies under exploration and exploitation based on the general but significant features of topics, and present a series of metrics to quantify and identify these s...
Neural text ranking models have witnessed significant advancement and are increasingly being deployed in practice. Unfortunately, they also inherit adversarial vulnerabilities of general neural models, which have been detected but remain underexplored by prior studies. Moreover, the inherited adversarial vulnerabilities might be leveraged by blackhat...
While previous studies of customer service chat systems (CSCS) understood user satisfaction as individuals’ subjective perceptions and depended heavily on self-report methods for satisfaction measurement, this article presents an unobtrusive chat log analysis that followed the established approaches of search log analysis and examined the relationshi...
The rapid explosion of scientific publications has made related work writing increasingly laborious. In this paper, we propose a fully automated approach to generate related work sections by leveraging a seq2seq neural network. In particular, the main goal of our work is to improve the abstractive generation of related work by introducing problem a...
The 3rd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2022) was held online at the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022. The goal of this workshop series (https://eekeworkshop.github.io/) is to engage the related communities in open problems in the extraction and evaluation of know...
Background
Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemi...
As critical building blocks of scientific research, research questions and research methods are put forward to reveal the nature of a publication's scientific novelty. Although existing studies have examined scientific novelty from multiple combination-based views, the temporal and semantic complexity of research questions and methods remains to be...
Machine learning (ML) has drawn great attention from academics as well as industries during the past decades and continues to achieve impressive human-level performance on nontrivial tasks such as image classification, voice recognition, natural language processing, and autopiloting. Both data and algorithms are critical to ensure the performance,...
Previous studies have confirmed that citation mention and location reveal different contributions of the cited articles, and that both are significant in scientific research evaluation. However, traditional citation count prediction only focuses on predicting citation frequency. In this paper, we propose a novel fine-grained citation count predicti...
Automated legal text classification is a prominent research topic in the legal field. It lays the foundation for building an intelligent legal system. Current literature focuses on international legal texts, such as Chinese cases, European cases, and Australian cases. Little attention is paid to text classification for U.S. legal texts. Deep learni...
Emerging topic detection has attracted considerable attention in recent times. While various detection approaches have been proposed in this field, designing a method for accurately detecting emerging topics remains challenging. This paper introduces the perspective of knowledge ecology to the detection of emerging topics and utilizes author-keywor...
Although there is an increasing amount of research on the design and use of conversational agents, it is still difficult for conversational agents to completely replace human service. Therefore, more and more companies have adopted human-AI collaborative systems to deliver customer service. It is important to understand how people obtain infor...
Yong Huang, Wei Lu, Jialin Liu, [...], Yi Bu
This paper studies the transdisciplinary impact of scientific publications with a longitudinal, comprehensive, and large-scale analysis on the Microsoft Academic Graph (MAG) dataset. More specifically, this paper aims to understand to what extent publications in discipline A have impact on discipline B. To this end, we propose a novel method to cha...
Jinqing Yang, Yi Bu, Wei Lu, [...], Li Zhang
Knowledge diffusion is a significant driving force behind discipline development and technological innovation. Keywords form a unique knowledge diffusion trajectory, in which the sleeping beauty phenomenon sometimes appears. In this paper, we first put forward the concept of Keyword Sleeping Beauties (KSBs) on the basis of the scientific literature ph...
Each section header of an article has its distinct communicative function. Citations from distinct sections may be different regarding citing motivation. In this paper, we grouped section headers with similar functions as a structural function and defined the distribution of citations from structural functions for a paper as its citation structure....
In scientific research collaboration, researchers collaborate with different scholars throughout their career stages. Researchers at different career stages may play various roles in science teams. This paper focuses only on the researchers’ roles in their respective science teams, defined as “relative roles” here, rather than comparing roles among...
Chatbots are increasingly thriving in different domains; however, because of unexpected discourse complexity and training data sparseness, potential distrust of them raises serious concern. Recently, Machine-Human Chatting Handoff (MHCH), predicting chatbot failure and enabling human-algorithm collaboration to enhance chatbot quality, has attracted i...
The unprecedented COVID-19 outbreak at the end of 2019 has produced a worldwide health crisis. Scientific research, especially international research collaboration, is crucial to deal successfully with the epidemic. This article aims to review the response modes, and especially the international collaboration characteristic, of the academic communi...
Purpose
This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.
Design/methodology/appro...
Detecting research trends helps researchers and decision makers to promptly identify and analyze research topics. However, due to citation and publication delay, previous studies on trend analysis are more likely to identify ex-post trends. In this study, we employ author-defined keywords to represent topics and propose a simple, effective, and ex-...
Objective
PubMed has suffered from the author ambiguity problem for many years. Existing studies on author name disambiguation (AND) for PubMed only used internal metadata for development. However, some of this metadata is incomplete (e.g., a large number of names are only abbreviated and their full names are not available) or less discriminative. To this en...
Can a chatbot completely replace the human agent? The short answer could be: "it depends...". For some challenging cases, e.g., when a dialogue's topical spectrum spreads beyond the training corpus coverage, the chatbot may malfunction and return unsatisfactory utterances. This problem can be addressed by introducing the Machine-Human Chatting Handof...
In this paper, we present a method to automatically generate a large-scale labeled dataset for author name disambiguation (AND) in the academic world by leveraging authoritative sources, ORCID and DOI. Using the method, we built LAGOS-AND, a large, gold standard dataset for AND, which is substantially different from existing ones. It contains 7.5M...
Yi Bu, Wei Lu, Yifei Wu, [...], Yong Huang
Although scientometricians have focused on the strength of the citation impact of scientific publications, only a few have paid special attention to the width of the citation impact. In this article, we aim to understand the width by establishing our empirical study on a previously built structure, namely ego-centered citation networks (ECCNs). We...
Can a chatbot completely replace the human agent? The short answer could be: "it depends...". For some challenging cases, e.g., when a dialogue's topical spectrum spreads beyond the training corpus coverage, the chatbot may malfunction and return unsatisfactory utterances. This problem can be addressed by introducing the Machine-Human Chatting Handoff...
Author keywords for scientific literature are terms selected and created by authors. Although most studies have focused on how to apply author keywords to represent their research interests, little is known about the process of how authors select keywords. To fill this research gap, this study presents a pilot study on author keyword selection beha...
Purpose
Citation contexts have been found useful in many scenarios. However, existing context-based recommendations ignore the importance of diversity in reducing redundancy and thus cannot cover the broad range of user interests. To address this gap, the paper aims to propose a novel task that can recommend a set of diverse citation c...
The National Institutes of Health (NIH) is the world's largest public funder of biomedical research, investing more than $30 billion to achieve its mission to enhance health, lengthen life, and reduce illness and disability. Here, by leveraging individual‐level characteristics and contextual/time‐dependent features of professional scholarly netwo...
This article defines and explores the direct citations between citing publications (DCCPs) of a publication. We construct an ego-centred citation network for each paper that contains all of its citing papers and itself, as well as the citation relationships among them. By utilising a large-scale scholarly dataset from the computer science field in...
The goal of this workshop is to engage the related communities in open problems in the extraction and evaluation of knowledge entities from scientific documents. The workshop names this cutting-edge, cross-disciplinary direction Extraction and Evaluation of Knowledge Entity (EEKE), highlighting the development of intelligent methods for iden...
This paper explores the relationship between nations or organizations from the perspective of the British Parliament. A co-occurrence network of countries was constructed to detect the characteristics of these countries and the interactions among them. The evolutionary trajectory of the network was also mapped to show its continuous development. Results show that t...
http://ceur-ws.org/Vol-2658/
EEKE 2020: 1st Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents
This paper proposes a keyword-citation-keyword (KCK) network to analyze the knowledge structure of a discipline. Unlike traditional co-word network analysis, the KCK network highlights the importance of keywords assigned in different articles, as well as the semantic relationship between keywords in various articles. In this study, we select comp...
Background: Drug development is still a costly and time-consuming process with a low rate of success. Drug repurposing (DR) has attracted significant attention because of its significant advantages over traditional approaches in terms of development time, cost, and safety. Entitymetrics, defined as bibliometric indicators based on biomedical entiti...
Dividing papers based on their numbers of citations into several groups constitutes one of the most common research practices in bibliometrics and beyond. However, existing dividing methods are both arbitrary and subject to bias. This article proposes a novel approach to partition highly, medium and lowly cited publications based on their citation...
Understanding the process of drug repurposing is critically significant for drug development. In this paper, we employ extracted bio-entities to detect the features of different phases in drug repurposing. We propose a transparent and easy entitymetric indicator for bio-entities, i.e., Popularity Index, to quantify and visualize the dynamic change...
The nonliteral interpretation of a text is hard for machine models to understand due to its high context-sensitivity and heavy usage of figurative language. In this study, inspired by human reading comprehension, we propose a novel, simple, and effective deep neural framework, called Skim and Intensive Reading Model (SIRM), for figuring out impli...
Author-selected keywords have been widely utilized for indexing, information retrieval, bibliometrics and knowledge organization in previous studies. However, few studies exist concerning how author-selected keywords function semantically in scientific manuscripts. In this paper, we investigated this problem from the perspective of term function (T...
Background
Drug development is still a costly and time-consuming process with a low rate of success. Drug repurposing (DR) has attracted significant attention because of its significant advantages over traditional approaches in terms of development time, cost, and safety. Entitymetrics, defined as bibliometric indicators based on biomedical entitie...
Keywords for scientific literature are terms selected and created by authors, and are, in general, considered a core element that summarizes and represents the papers’ content, which are often used for the analysis of research hotspots and trends. Keyword semantic function means the semantic role or specific function that a keyword plays in a scien...
Traditionally, publication citation networks are regarded as acyclic, that is, they contain no loops, because an earlier published article cannot cite a later published article. However, due to the accessibility of pre-print versions of articles, there might be some loops in a publication citation network. This article presents a descriptive statistic...
It is an important and urgent research problem for decentralized eCommerce services, e.g., eBay, eBid, and Taobao, to detect illegal products, e.g., unclassified pornographic products. However, it is a challenging task as some sellers may utilize and change camouflaged text to deceive the current detection algorithms. In this study, we propose a no...
Purpose
Photographs are a kind of cultural heritage and very useful for cultural and historical studies. However, traditional or manual research methods are costly and cannot be applied on a large scale. This paper aims to present an exploratory study for understanding the cultural concerns of libraries based on the automatic analysis of large-scal...
User requirements for result diversification in image retrieval have been increasing with the explosion of image resources. Result diversification requires that image retrieval systems are made capable of handling semantic gaps between image visual features and semantic concepts, and providing both relevant and diversified image results. Context in...
The concept of Big Data is popular in a variety of domains. The purpose of this review was to summarize the features, applications, analysis approaches, and challenges of Big Data in health care. Big Data in health care has its own features, such as heterogeneity, incompleteness, timeliness and longevity, privacy, and ownership. These features brin...
This article investigates the lengths of time that publications with different numbers of citations take to receive their first citation (the beginning stage), and then compares the lengths of time to receive two or more citations after receiving the first citation (the accumulative stage) in the field of computer science. We find that in the begin...