Songlin Hu
  • PhD
  • Professor at Chinese Academy of Sciences

About

247 Publications
19,107 Reads
3,259 Citations
Introduction
His research areas include big data storage and intelligent processing, knowledge graphs, and large-scale distributed systems. He has published more than one hundred papers in reputed conferences and journals, including SIGMOD, AAAI, ACL, VLDB, IJCAI, OOPSLA, ICDE, DAC, and ACM/IEEE Transactions.
Current institution
Chinese Academy of Sciences
Current position
  • Professor

Publications

Publications (247)
Preprint
Full-text available
Large Language Model (LLM)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query enc...
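The setup this abstract describes (documents pre-encoded offline, queries encoded at search time, relevance scored by vector similarity) can be illustrated with a minimal dense-retrieval sketch in Python. This is illustrative only, not code from the paper; the function and variable names are assumptions.

```python
import numpy as np

def topk_by_similarity(query_vec, doc_vecs, k=5):
    """Rank pre-encoded document vectors against an online-encoded
    query vector by cosine similarity; return the top-k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # one cosine score per document
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy usage: 1,000 documents encoded offline into 128-d dense vectors;
# the query is encoded online into the same space.
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 128))
query = rng.standard_normal(128)
idx, sims = topk_by_similarity(query, docs)
print(idx, sims)
```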
Article
Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous heterogeneous fine-tuning collections from different domains. However, the discussion about its training data distribution is still minimal. Previous studies rely on empirically assigned dataset choices or sampling ratios, which inevitably lead to sub-optimal retrieval perf...
Article
The rapid development of social platforms exacerbates the dissemination of misinformation, which has stimulated research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and re...
Article
Social media platforms like X (Twitter) and Reddit are vital to global communication. However, advancements in Large Language Model (LLM) technology give rise to social media bots with unprecedented intelligence. These bots adeptly simulate human profiles, conversations, and interactions, disseminating large amounts of false information and posing s...
Article
This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language mode...
Preprint
The rapid development of social platforms exacerbates the dissemination of misinformation, which has stimulated research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and re...
Preprint
This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language mode...
Preprint
Full-text available
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach...
Preprint
Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework fo...
Preprint
Social media platforms like X (Twitter) and Reddit are vital to global communication. However, advancements in Large Language Model (LLM) technology give rise to social media bots with unprecedented intelligence. These bots adeptly simulate human profiles, conversations, and interactions, disseminating large amounts of false information and posing s...
Preprint
Full-text available
Large language models (LLMs) have demonstrated impressive capabilities in role-playing tasks. However, there is limited research on whether LLMs can accurately simulate user behavior in real-world scenarios, such as social media. This requires models to effectively analyze a user's history and simulate their role. In this paper, we introduce \textb...
Preprint
The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant safety vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts...
Preprint
Full-text available
Large language models (LLMs) have recently attracted much attention from the community due to their remarkable performance on all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities, but also significantly increases the computational complexity. Mixture-of-Experts (MoE) models a...
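Several entries on this list concern Mixture-of-Experts (MoE) models, which scale capacity by running only a few expert sub-networks per token. Below is a minimal sketch of top-k expert routing in PyTorch; it is illustrative only, not code from any listed paper, and the layer shapes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal Mixture-of-Experts layer: a learned router sends each
    token to its top-k experts, so only k of n expert FFNs run per token."""
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (num_tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)   # top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE(dim=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Only k experts run per token, so compute grows with k rather than with the total parameter count, which is the scaling property these abstracts describe.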
Chapter
The Transformer with self-attention was initially crafted to model language sequences, where discrete tokens (i.e., words) exhibit high semantic density. However, when applied to time series token inputs (i.e., datapoints) with weak-density semantics and temporal redundancy, it faces challenges, as these time-domain tokens impede its ability to capture...
Preprint
Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches, however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Additionally, most of these methods are limited to single-turn red teaming, failing...
Preprint
Full-text available
Jailbreak vulnerabilities in Large Language Models (LLMs) refer to methods that extract malicious content from the model by carefully crafting prompts or suffixes, which has garnered significant attention from the research community. However, traditional attack methods, which primarily focus on the semantic level, are easily detected by the model....
Preprint
Full-text available
Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous heterogeneous fine-tuning collections from different domains. However, the discussion about its training data distribution is still minimal. Previous studies rely on empirically assigned dataset choices or sampling ratios, which inevitably lead to sub-optimal retrieval per...
Preprint
Multimodal fact verification is an under-explored and emerging field that has gained increasing attention in recent years. The goal is to assess the veracity of claims that involve multiple modalities by analyzing the retrieved evidence. The main challenge in this area is to effectively fuse features from different modalities to learn meaningful mu...
Preprint
Full-text available
Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training toke...
Preprint
Many fake news detection studies have achieved promising performance by extracting effective semantic and structure features from both content and propagation trees. However, it is challenging to apply them to practical situations, especially when using the trained propagation-based models to detect news with no propagation data. Towards this scena...
Preprint
Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, adversarial images often struggle to deceive all prompts effectively in the context of cross-prompt migration attacks, as the probability distribution of the tokens in...
Preprint
This paper proposes an information-theoretic representation learning framework, named conditional information flow maximization, to extract noise-invariant sufficient representations for the input data and target task. It encourages the learned representations to have good feature uniformity and sufficient predictive ability, which can enhance the gener...
Preprint
Recently, relational metric learning methods have received great attention in the recommendation community, inspired by the translation mechanism in knowledge graphs. Different from knowledge graphs, where entity-to-entity relations are given in advance, historical interactions lack explicit relations between users and items in recom...
Preprint
As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion ($\approx 4.7\%$) of key reasoning s...
Preprint
Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers' generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they stru...
Preprint
Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose...
Preprint
As an important multimodal sentiment analysis task, Joint Multimodal Aspect-Sentiment Analysis (JMASA), which aims to jointly extract aspect terms and their associated sentiment polarities from given text-image pairs, has gained increasing attention. Existing works encounter two limitations: (1) multi-level modality noise, i.e., instance- and featur...
Article
Social bot detection is essential for maintaining the safety and integrity of online social networks (OSNs). Graph neural networks (GNNs) have emerged as a promising solution. Mainstream GNN-based social bot detection methods learn rich user representations by recursively performing message passing along user–user interaction edges, where users are...
Article
This paper presents a new supervised representation learning framework, namely structured probabilistic coding (SPC), to learn compact and informative representations from input related to the target task. SPC is an encoder-only probabilistic coding technology with a structured regularization from the target space. It can enhance the generalization...
Chapter
Frequent disk failures affect the reliability of storage systems, causing performance jitter or even data loss for services and thus seriously threatening quality of service. Although a host of machine (deep) learning-based disk failure prediction approaches have been proposed to prevent system breakdown due to unexpe...
Preprint
Full-text available
ChatGPT has gained significant interest due to its impressive performance, but people are increasingly concerned about its potential risks, particularly around the detection of AI-generated content (AIGC), which is often difficult for untrained humans to identify. Current datasets utilized for detecting ChatGPT-generated text primarily center aroun...
Preprint
Full-text available
In this paper, we systematically study the potential of pre-training with Large Language Model(LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e. query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for...
Article
Dense passage retrieval aims to retrieve the relevant passages of a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple ye...
Chapter
Conversational recommender systems (CRS) can dynamically capture user fine-grained preference by directly asking whether a user likes an attribute or not. However, like traditional recommender systems, accurately comprehending users’ preferences remains a critical challenge for CRS to make effective conversation policy decisions. While there have b...
Preprint
Full-text available
Dialogue response selection aims to select an appropriate response from several candidates based on a given user and system utterance history. Recent studies have been improving the accuracy of dialogue response selection through post-training, mostly relying on naive masked language modeling methods. However, the recently developed generative meth...
Preprint
Extracting generalized and robust representations is a major challenge in emotion recognition in conversations (ERC). To address this, we propose a supervised adversarial contrastive learning (SACL) framework for learning class-spread structured representations. The framework applies contrast-aware adversarial training to generate worst-case sample...
Preprint
This paper describes our system designed for SemEval-2023 Task 12: Sentiment analysis for African languages. The challenge faced by this task is the scarcity of labeled data and linguistic resources in low-resource settings. To alleviate these, we propose a generalized multilingual system SACL-XLMR for sentiment analysis on low-resource languages....
Preprint
Full-text available
News recommendation aims to predict click behaviors based on user behaviors. How to effectively model the user representations is the key to recommending preferred news. Existing works are mostly focused on improvements in the supervised fine-tuning stage. However, there is still a lack of PLM-based unsupervised pre-training methods optimized for u...
Preprint
Full-text available
Passage retrieval aims to retrieve relevant passages from large collections of the open-domain corpus. Contextual Masked Auto-Encoding has been proven effective in representation bottleneck pre-training of a monolithic dual-encoder for passage retrieval. Siamese or fully separated dual-encoders are often adopted as basic retrieval architecture in t...
Chapter
Aspect-Opinion Pair Extraction (AOPE) is an emerging combination task of fine-grained sentiment analysis. Traditional works devise pipeline frameworks for AOPE, which potentially suffer from error propagation. To solve this problem, numerous joint methods have been proposed recently. However, these joint methods have not simultaneously considered t...
Chapter
Learning distributed representations of events is an indispensable but challenging task for event understanding. Existing studies address this problem by either composing the embeddings of event arguments as well as their attributes, or exploiting various relations between events like co-occurrence and discourse relations. In this paper we argue th...
Chapter
Extending the forecasting horizon is a crucial demand for real applications in time series forecasting with multiple exogenous series (TFME). Previous studies adopt Transformer to effectively capture long-term dependency coupling between output and input in a sequence. However, the potential entanglement in multi-dimensional feature space still pre...
Preprint
Growing techniques have been emerging to improve the performance of passage retrieval. As an effective representation bottleneck pretraining technique, the contextual masked auto-encoder utilizes contextual embedding to assist in the reconstruction of passages. However, it only uses a single auto-encoding pre-task for dense representation pre-train...
Conference Paper
Full-text available
Capturing emotions within a conversation plays an essential role in modern dialogue systems. However, the weak correlation between emotions and semantics brings many challenges to emotion recognition in conversation (ERC). Even for semantically similar utterances, the emotion may vary drastically depending on contexts or speakers. In this paper, we pro...
Article
Aspect Sentiment Triplet Extraction (ASTE) is an emerging task of fine-grained sentiment analysis, which aims to extract aspect terms, associated opinion terms, and sentiment polarities in the form of triplets. Thus, ASTE involves two groups of subtasks: aspect/opinion term extraction and aspect-opinion-pair sentiment classification. Due to the hig...
Article
Multimodal fake news detection has obtained increasing attention recently. Existing works generally encode multimodal contents into a deterministic point in semantic subspaces, and then fuse multimodal features by simple concatenation or attention mechanisms. However, most methods struggle to adapt to noisy multimodal contents since they neglect...
Preprint
This paper presents a pre-training technique called query-as-context that uses query prediction to improve dense retrieval. Previous research has applied query prediction to document expansion in order to alleviate the problem of lexical mismatch in sparse retrieval. However, query prediction has not yet been studied in the context of dense retriev...
Chapter
Deception occurring in multi-turn question answering (QA) circumstances, such as interviews, court depositions, and online marketplaces, can cause serious consequences. Due to the lack of proper datasets and the difficulty of finding deceptive signals, existing deception detection methods have not utilized QA contexts to detect deception. Previous me...
Article
Full-text available
Knowledge graph embedding has been proposed to embed entities and relations into continuous vector spaces, which can benefit various downstream tasks such as question answering and recommender systems. A common assumption of existing knowledge graph embedding models is that the relation is a translation vector connecting the embedded head ent...
Preprint
Full-text available
Capturing emotions within a conversation plays an essential role in modern dialogue systems. However, the weak correlation between emotions and semantics brings many challenges to emotion recognition in conversation (ERC). Even for semantically similar utterances, the emotion may vary drastically depending on contexts or speakers. In this paper, we pro...
Preprint
Video language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers from inter-modal redundancy: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions,...
Preprint
Contrastive learning has been extensively studied in sentence embedding learning, which assumes that the embeddings of different views of the same sentence are closer. The constraint brought by this assumption is weak, and a good sentence representation should also be able to reconstruct the original sentence fragments. Therefore, this paper propos...
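The in-batch contrastive objective this abstract builds on is commonly implemented as an InfoNCE loss: two views of the same sentence form the positive pair, and the other sentences in the batch serve as negatives. A minimal sketch in PyTorch follows, illustrative of the standard objective rather than the paper's proposed reconstruction-augmented method.

```python
import torch
import torch.nn.functional as F

def info_nce(view1, view2, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss for sentence embeddings:
    row i of view1 and row i of view2 embed the same sentence."""
    z1 = F.normalize(view1, dim=-1)
    z2 = F.normalize(view2, dim=-1)
    logits = z1 @ z2.T / temperature       # (batch, batch) similarities
    labels = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 16 sentences embedded twice into 256-d vectors.
loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```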
Preprint
Dense passage retrieval aims to retrieve the relevant passages of a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple ye...
Article
The success of convolutional neural networks (CNNs) has made low-latency inference services on Graphics Processing Units (GPUs) a hot research topic. However, GPUs are hardware processors with high power consumption. To minimize energy consumption while meeting latency Service-Level Objectives (SLOs), batching strategies and dynamic voltage freque...
Article
Automatic rumor detection is critical for maintaining a healthy social media environment. The mainstream methods generally learn rich features from information cascades by modeling the cascade as a tree or graph structure where edges are built based on interactions between a tweet and retweets. Some psychology studies have empirically shown that us...
Conference Paper
Full-text available
The emotion recognition in conversation (ERC) task aims to predict the emotion label of an utterance in a conversation. Since the dependencies between speakers are complex and dynamic, consisting of intra- and inter-speaker dependencies, the modeling of speaker-specific information plays a vital role in ERC. Although existing researchers have propo...
