Zibin Zheng
  • PhD
  • Professor at Sun Yat-sen University

About

664
Publications
764,503
Reads
36,547
Citations
Current institution
Sun Yat-sen University
Current position
  • Professor
Additional affiliations
April 2018 - August 2019
Sun Yat-sen University
Position
  • Professor

Publications

Publications (664)
Preprint
Code generation with large language models (LLMs) is highly sensitive to token selection during decoding, particularly at uncertain decision points that influence program logic. While standard strategies like greedy and beam search treat all tokens uniformly, they overlook code-specific uncertainty patterns, leading to suboptimal performance. This...
Article
Real-world graphs are typically complex, exhibiting heterogeneity in the global structure, as well as strong heterophily within local neighborhoods. While a growing body of literature has revealed the limitations of graph neural networks (GNNs) in handling homogeneous graphs with heterophily, little work has been conducted on investigating the hete...
Article
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for fai...
Article
In recent years, blockchain technology has developed rapidly and received widespread attention. However, its pseudonymous and decentralized nature has also attracted many criminal activities. Ponzi schemes, a kind of classic financial scam, also hide their true face in smart contracts, causing massive financial losses to blockchain users. Although...
Preprint
Full-text available
The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks are proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current bench...
Preprint
Solana is a rapidly evolving blockchain platform that has attracted an increasing number of users. However, this growth has also drawn the attention of malicious actors, with some phishers extending their reach into the Solana ecosystem. Unlike platforms such as Ethereum, Solana has distinct designs of accounts and transactions, leading to the emer...
Preprint
In recent years, graph neural networks (GNNs) have shown great potential in addressing various graph structure-related downstream tasks. However, recent studies have found that current GNNs are susceptible to malicious adversarial attacks. Given the inevitable presence of adversarial attacks in the real world, a variety of defense methods have been...
Preprint
Graph neural networks have been widely utilized to solve graph-related tasks because of their strong learning power in utilizing the local information of neighbors. However, recent studies on graph adversarial attacks have proven that current graph neural networks are not robust against malicious attacks. Yet much of the existing work has focused o...
Preprint
Graph Transformers (GTs), which simultaneously integrate message-passing and self-attention mechanisms, have achieved promising empirical results in some graph prediction tasks. Although these approaches show the potential of Transformers in capturing long-range graph topology information, issues concerning the quadratic complexity and high computi...
Preprint
Code intelligence is an emerging domain in software engineering, aiming to improve the effectiveness and efficiency of various code-related tasks. Recent research suggests that incorporating contextual information beyond the basic original task inputs (i.e., source code) can substantially enhance model performance. Such contextual signals may be ob...
Preprint
Full-text available
With the development of Solidity contracts on Ethereum, more developers are reusing them on other compatible blockchains. However, developers may overlook differences in blockchain system designs, such as the Gas Mechanism and Consensus Protocol, leading to the same contracts on different blockchains being unable to achieve cons...
Preprint
Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their ability to comprehend and effectively leverage diverse types of feedback remains insufficiently understood. To bridg...
Preprint
Cross-chain technology enables seamless asset transfer and message-passing within decentralized finance (DeFi) ecosystems, facilitating multi-chain coexistence in the current blockchain environment. However, this development also raises security concerns, as malicious actors exploit cross-chain asset flows to conceal the provenance and destination...
Preprint
Full-text available
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts. While retrieval-augmented generation (RAG) frameworks are widely adopted, the effectiveness of different retrieved information sources-contextual code, APIs, and similar snippets-has...
Preprint
Full-text available
Ensuring the robustness of code generated by large language models (LLMs) is crucial for real-world reliability. However, existing evaluations predominantly focus on correctness, often neglecting key robustness concerns such as missing input validation and insufficient error handling. In this paper, we present the first empirical study on the robus...
Preprint
Full-text available
Large language models (LLMs) have performed well in function-level code translation without repository-level context. However, the performance of LLMs in repository-level context code translation remains suboptimal due to complex dependencies and context, hindering their adoption in industrial settings. In this work, we propose a novel LLM-based code...
Preprint
Full-text available
Large Language Models (LLMs) have become pivotal tools for automating code generation in software development. However, these models face significant challenges in producing version-aware code for rapidly evolving languages like Rust, where frequent Application Programming Interfaces (API) changes across versions lead to compatibility issues and co...
Preprint
Full-text available
In recent years, the Ethereum platform has witnessed a proliferation of smart contracts, accompanied by exponential growth in total value locked (TVL). High-TVL smart contracts often require complex numerical computations, particularly in mathematical financial models used by many decentralized applications (DApps). Improper calculations can introd...
Article
Smart contracts are programs that are permanently stored and automatically executed on blockchain systems such as Ethereum. Due to the non-tamperable nature of the underlying blockchain, smart contracts are difficult to update once deployed, which requires redeploying the contracts and migrating the data. This means that the observation of smart contrac...
Preprint
Large language models (LLMs) have demonstrated remarkable in-context reasoning capabilities across a wide range of tasks, particularly with unstructured inputs such as language or images. However, LLMs struggle to handle structured data, such as graphs, due to their lack of understanding of non-Euclidean structures. As a result, without additional...
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offl...
Preprint
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets-an aspect crucial for robust model performance-remains a significant challenge. In this paper, we i...
Preprint
Full-text available
Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters. Althoug...
Preprint
A decompiler is a specialized reverse engineering tool extensively employed in program analysis tasks, particularly in program comprehension and vulnerability detection. However, current Solidity smart contract decompilers face significant limitations in reconstructing the original source code. In particular, the bottleneck of SOTA decompiler...
Article
In the field of heterogeneous federated learning (FL), the key challenge is to efficiently and collaboratively train models across multiple clients with different data distributions, model structures, task objectives, computational capabilities, and communication resources. This diversity leads to significant heterogeneity, which increases the comp...
Article
Full-text available
Federated learning (FL) has gained substantial attention as a promising solution to the need for client privacy in mobile edge computing (MEC). However, FL suffers from instability of accuracy because of the invalid clients who become stragglers caused by frequent fluctuation of available resources in MEC. To tackle this challenge, most of the fram...
Article
Code comment plays an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To get a better effect of generating comments, many studies try to extract a variety of information (e.g., code tokens, AST traverse sequence, APIs call sequence) from source code as model inp...
Article
As decentralized applications (DApps) proliferate, the increased complexity and usage of smart contracts have heightened their susceptibility to security incidents and financial losses. Although various vulnerability detection tools have been developed to mitigate these issues, they often suffer poor performance in detecting vulnerabilities, as the...
Article
Permissioned blockchains play a significant role in various application scenarios. Applications built on heterogeneous permissioned blockchains need to migrate data from one chain to another, aiming to keep their competitiveness and security. Thus, data migration across heterogeneous chains is a building block of permissioned blockchains. However,...
Article
Full-text available
Numerous blockchain simulators have been proposed to allow researchers to simulate mainstream blockchains. However, we have not yet found a testbed that enables researchers to develop and evaluate their new consensus algorithms or new protocols for blockchain sharding systems. To fill this gap, we developed BlockEmulator, which is designed as an ex...
Article
Vertical federated learning (VFL) allows parties to build robust shared machine learning models based on learning from distributed features of the same samples, without exposing their own data. However, current VFL solutions are limited in their ability to perform inference on non-overlapping samples, and data stored on clients is often subject to...
Article
Full-text available
Virtual Reality (VR) has accelerated its prevalent adoption in emerging metaverse applications, but it is not a fundamentally new technology. On the one hand, most VR operating systems (OS) are based on off-the-shelf mobile OS (e.g., Android OS). As a result, VR apps also inevitably inherit privacy and security deficiencies from conventional mobile...
Article
In recent years, graph neural networks (GNNs) have gained significant attention due to their outstanding performance on graph-related tasks by utilizing neighborhood aggregation. However, traditional GNNs are primarily designed based on the homophily assumption, which means that they show poor performance on heterophilic networks where dissimilar n...
Article
State-of-the-art blockchain sharding solutions, such as Monoxide, can cause severely imbalanced distribution of transaction (TX) workloads across all blockchain shards due to the deployment policy of their accounts. Imbalanced TX distributions then produce hot shards, in which the cross-shard TXs may experience an unlimited confirmation latency. T...
Article
In recent years, the Ethereum platform has witnessed a proliferation of smart contracts, accompanied by exponential growth in total value locked (TVL). High-TVL smart contracts often require complex numerical computations, particularly in mathematical financial models used by many decentralized applications (DApps). Improper calculations can introd...
Article
Federated learning provides a privacy-preserving modeling schema for distributed data, which coordinates multiple clients to collaboratively train a global model. However, data stored in different clients may be collected from diverse domains, and the resulting feature shift is prone to the degraded performance of global model. In this paper, we pr...
Article
Federated Learning (FL) is a promising distributed machine learning framework that allows clients to collaboratively train a global model without data leakage. The synchronous FL suffers from the inefficient training caused by the slow-speed clients, which are called stragglers. Though asynchronous FL can well address the efficiency challenge, it i...
Article
Full-text available
Large Language Models (LLMs) have drawn widespread attention and research due to their astounding performance in text generation and reasoning tasks. Derivative products, like ChatGPT, have been extensively deployed and highly sought after. Meanwhile, the evaluation and optimization of LLMs in software engineering tasks, such as code generation, ha...
Preprint
Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of L...
Preprint
AI agents are systems capable of perceiving their environment, autonomously planning, and executing tasks. Recent advancements in LLMs have introduced a transformative paradigm for AI agents, enabling them to interact with external resources and tools through prompts. In such agents, the workflow integrates developer-written code, which manages frame...
Preprint
Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing at...
Article
Comments play a crucial role in code comprehension and maintenance. This is particularly vital when the code is changed, as comments should be promptly updated to maintain consistency between the code and the comments. Existing comment update methods usually treat code as natural language text, ignore the information of code structure, and often fa...
Preprint
Full-text available
State-of-the-art blockchain sharding solutions such as Monoxide, can cause severely imbalanced distribution of transaction (TX) workloads across all blockchain shards due to the deployment policy of their accounts. Imbalanced TX distributions then produce hot shards, in which the cross-shard TXs may experience an unlimited confirmation latency. Thu...
Preprint
Large language models (LLMs) have been shown to memorize and reproduce content from their training data, raising significant privacy concerns, especially with web-scale datasets. Existing methods for detecting memorization are largely sample-specific, relying on manually crafted or discretely optimized memory-inducing prompts generated on a per-sam...
Article
Graph contrastive learning defines a contrastive task to pull similar instances close and push dissimilar instances away. It learns discriminative node embeddings without supervised labels, which has aroused increasing attention in the past few years. Nevertheless, existing methods of graph contrastive learning ignore the differences between divers...
Article
Blockchains store states in Log-Structured Merge (LSM) tree-based databases. Due to blockchain traceability, the growing ancient states are inevitably stored in the databases. Unfortunately, by default, this process mixes current and ancient states in the data layout, increasing unnecessary disk I/O access and slowing transaction execution. This...
Preprint
Full-text available
Recent advances in large language models (LLMs) have shown significant capabilities in code translation, often evaluated using benchmarks like CodeTransOcean. However, these evaluations typically focus on simple, function-level translations without considering dependencies, which does not reflect the complexities of real-world software development....
Preprint
In recent years, security incidents stemming from centralization defects in smart contracts have led to substantial financial losses. A centralization defect refers to any error, flaw, or fault in a smart contract's design or development stage that introduces a single point of failure. Such defects allow a specific account or user to disrupt the no...
Article
With the development of blockchain technology, smart contracts have become an important component of blockchain applications. Despite their crucial role, the development of smart contracts may introduce vulnerabilities and potentially lead to severe consequences, such as financial losses. Meanwhile, large language models, represented by ChatGPT, ha...
Article
Mashup technology enables developers to create new applications more readily by combining existing services. As its popularity grows, research on service recommendation for mashup creation has gained increasing attention. Existing recommendation methods have the following limitations: either they are susceptible to data sparsity problems, or they e...
Chapter
Existing risk control techniques have primarily been developed from the perspectives of de-anonymizing address clustering and illicit account classification. However, these techniques cannot be used to ascertain the potential risks for all accounts and are limited by specific heuristic strategies or insufficient label information. These constraints...
Preprint
Code search is crucial for code reuse, enabling developers to efficiently locate relevant snippets. Current methods rely on encoder-based models, which suffer from limitations such as poor generalization and restricted input lengths. Decoder-only large language models (LLMs), with their extensive pre-training, larger size, and longer input capabili...
Chapter
Federated learning enables multiple clients to collaboratively train a global model without revealing their local data. However, conventional federated learning often overlooks the fact that data stored on different clients may originate from diverse domains, and the resulting domain shift problem can significantly impair the performance of the glo...
Article
The rapid advancement of Intelligent Connected Vehicles (ICVs) in the automotive sector has significantly intensified security and privacy issues. Particularly, previous studies have indicated that ICV users (owners) are deeply concerned about the extensive data gathered by these vehicles. However, current research into vehicle security predominant...
Article
Unsupervised multi-source domain transfer in federated scenarios has become an emerging research direction, which can help an unlabeled target domain obtain an adapted model from source domains while preserving privacy. However, when the local data are graphs, the difference between domains (or data heterogeneity) mainly originates from the difference in no...
Preprint
Graph autoencoders (GAEs) are self-supervised learning models that can learn meaningful representations of graph-structured data by reconstructing the input graph from a low-dimensional latent space. Over the past few years, GAEs have gained significant attention in academia and industry. In particular, the recent advent of GAEs with masked autoenc...
Article
Being the largest Initial Coin Offering project, EOSIO has attracted great interest in cryptocurrency markets. Despite its popularity and prosperity (e.g., 26,311,585,008 token transactions occurred from June 8, 2018 to Aug. 5, 2020), there is almost no work investigating the EOSIO token ecosystem. To fill this gap, we are the first to conduct a sy...
Article
Recommendation systems play a remarkable role in solving the problem of information overload on the Internet. Existing research demonstrates that a recommended list accompanied by appropriate explanations can enhance the transparency of the system and encourage users to make decisions. Although existing works have achieved effective results, they st...
Article
Graph neural networks (GNNs) have shown intrinsic topology bias inherited from graph-structured data, where a majority of nodes are associated with specific sensitive attributes (e.g., age, and race). In this regard, GNNs make discriminatory decisions toward certain groups defined by the sensitive attribute. Over the past few years, efforts have be...
Article
Global navigation satellite systems (GNSS) provide efficient positioning services for location-aware Internet of Things (IoT) devices. However, GNSS non-line-of-sight (NLOS) signals can result in severe positioning errors in urban canyon areas. Existing deep learning-based NLOS signal classification methods cannot appropriately model the spatiotemp...
Preprint
Code generation aims to automatically generate code from input requirements, significantly enhancing development efficiency. Recent large language models (LLMs) based approaches have shown promising results and revolutionized code generation task. Despite the promising performance, LLMs often generate contents with hallucinations, especially for th...
Preprint
Despite the increasing popularity of Decentralized Applications (DApps), they are suffering from various vulnerabilities that can be exploited by adversaries for profits. Among such vulnerabilities, Read-Only Reentrancy (called ROR in this paper), is an emerging type of vulnerability that arises from the complex interactions between DApps. In the r...
Preprint
The emergence of Large Language Models (LLMs) has significantly influenced various aspects of software development activities. Despite their benefits, LLMs also pose notable risks, including the potential to generate harmful content and being abused by malicious developers to create malicious code. Several previous studies have focused on the abili...
Preprint
In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-...
Preprint
Decentralized bridge applications are important software that connects various blockchains and facilitates cross-chain asset transfer in the decentralized finance (DeFi) ecosystem, which currently operates in a multi-chain environment. Cross-chain transaction association identifies and matches unique transactions executed by bridge DApps, which is i...
Article
AI-Generated Content (AIGC) services are essential in developing the Metaverse, providing various digital content to build shared virtual environments. The services can also offer personalized content with user assistance, making the Metaverse more human-centric. However, user-assisted content creation requires significant communication resources t...
Article
An increasing number of investors are active on Ethereum, resulting in numerous transactions. These historical transactions can be applied to complete contract testing. For example, it can be used for gas optimization or contract repair to verify that improved contracts meet expectations. Most existing methods deploy private chains to use non-real...
Article
Federated learning, as a privacy-preserving learning paradigm, restricts the access to data of each local client, for protecting the privacy of the parties. However, in the case of heterogeneous data settings, the different data distributions among clients usually lead to the divergence of learning targets, which is an essential challenge for feder...
Preprint
The rapid advancement of blockchain platforms has significantly accelerated the growth of decentralized applications (DApps). Similar to traditional applications, DApps integrate front-end descriptions that showcase their features to attract users, and back-end smart contracts for executing their business logic. However, inconsistencies between the...
Preprint
Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with an increase in high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made re...
Preprint
With the advent of large language models (LLMs), numerous software service providers (SSPs) are dedicated to developing LLMs customized for code generation tasks, such as CodeLlama and Copilot. However, these LLMs can be leveraged by attackers to create malicious software, which may pose potential threats to the software ecosystem. For example, the...
Article
Graph neural networks (GNNs) have found successful applications in various graph-related tasks. However, recent studies have shown that many GNNs are vulnerable to adversarial attacks. In a vast majority of existing studies, adversarial attacks on GNNs are launched via direct modification of the original graph such as adding/removing links, which m...
Preprint
Code generation aims to automatically generate code snippets that meet given natural language requirements and plays an important role in software development. Although Code LLMs have shown excellent performance in this domain, their long generation time poses a significant limitation in practical use. In this paper, we first conduct an in-depth p...
Preprint
Repository-level code completion aims to generate code for unfinished code snippets within the context of a specified repository. Existing approaches mainly rely on retrieval-augmented generation strategies due to limitations in input sequence length. However, traditional lexical-based retrieval methods like BM25 struggle to capture code semantics,...
Preprint
Graph neural networks (GNNs) have recently emerged as an effective approach to model neighborhood signals in collaborative filtering. Along this research line, graph contrastive learning (GCL) demonstrates robust capabilities to address the supervision label shortage issue through generating massive self-supervised signals. Despite its effectiven...
Preprint
Smart contract developers frequently seek solutions to developmental challenges on Q&A platforms such as Stack Overflow (SO). Although community responses often provide viable solutions, the embedded code snippets can also contain hidden vulnerabilities. Integrating such code directly into smart contracts may make them susceptible to malicious atta...
