Haoyu Wang

Haoyu Wang
Huazhong University of Science and Technology | hust

Doctor of Philosophy

About

212
Publications
33,137
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
3,938
Citations

Publications

Publications (212)
Preprint
The rapid development of large language models (LLMs) with advanced programming capabilities has paved the way for innovative approaches in software testing. Fuzz testing, a cornerstone for improving software reliability and detecting vulnerabilities, often relies on manually written fuzz drivers, limiting scalability and efficiency. To address thi...
Preprint
Impersonation tactics, such as app squatting and app cloning, have posed longstanding challenges in mobile app stores, where malicious actors exploit the names and reputations of popular apps to deceive users. With the rapid growth of Large Language Model (LLM) stores like GPT Store and FlowGPT, these issues have similarly surfaced, threatening the...
Preprint
Vision-language pre-training (VLP) models, trained on large-scale image-text pairs, have become widely used across a variety of downstream vision-and-language (V+L) tasks. This widespread adoption raises concerns about their vulnerability to adversarial attacks. Non-universal adversarial attacks, while effective, are often impractical for real-time...
Preprint
Code Large Language Models (LLMs) have demonstrated remarkable capabilities in generating, understanding, and manipulating programming code. However, their training process inadvertently leads to the memorization of sensitive information, posing severe privacy risks. Existing studies on memorization in LLMs primarily rely on prompt engineering tech...
Article
Large language models (LLMs) have revolutionized language processing, but face critical challenges with security, privacy, and generating hallucinations — coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content contradicting ground truth facts. Addressing FCH is difficult due to t...
Preprint
In recent years, Large Language Models (LLMs) have gained widespread use, accompanied by increasing concerns over their security. Traditional jailbreak attacks rely on internal model details or have limitations when exploring the unsafe behavior of the victim model, limiting their generalizability. In this paper, we introduce PathSeeker, a novel bl...
Article
Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early stages. To bridge this gap, we...
Preprint
Declarative UI frameworks have gained widespread adoption in mobile app development, offering benefits such as improved code readability and easier maintenance. Despite these advantages, the process of translating UI designs into functional code remains challenging and time-consuming. Recent advancements in multimodal large language models (MLLMs)...
Preprint
The proliferation of pre-trained models (PTMs) and datasets has led to the emergence of centralized model hubs like Hugging Face, which facilitate collaborative development and reuse. However, recent security reports have uncovered vulnerabilities and instances of malicious attacks within these platforms, highlighting growing security concerns. Thi...
Preprint
The exponential growth of open-source package ecosystems, particularly NPM and PyPI, has led to an alarming increase in software supply chain poisoning attacks. Existing static analysis methods struggle with high false positive rates and are easily thwarted by obfuscation and dynamic code execution techniques. While dynamic analysis approaches offe...
Preprint
With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encounte...
Preprint
Ransomware attacks have emerged as one of the most significant cybersecurity threats. Despite numerous proposed detection and defense methods, existing approaches face two fundamental limitations in large-scale industrial applications: intolerable system overheads and notorious alert fatigue. To address these challenges, we propose CanCal, a real-t...
Preprint
The swift advancement of large language models (LLMs) has profoundly shaped the landscape of artificial intelligence; however, their deployment in sensitive domains raises grave concerns, particularly due to their susceptibility to malicious exploitation. This situation underscores the insufficiencies in pre-deployment testing, highlighting the urg...
Preprint
Full-text available
WebAssembly (Wasm), as a compact, fast, and isolation-guaranteed binary format, can be compiled from more than 40 high-level programming languages. However, vulnerabilities in Wasm binaries could lead to sensitive data leakage and even threaten their hosting environments. To identify them, symbolic execution is widely adopted due to its soundness a...
Preprint
Full-text available
The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in a variety of application domains, including cybersecurity. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabiliti...
Preprint
Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them...
Preprint
In contemporary software development, the widespread use of indirect calls to achieve dynamic features poses challenges in constructing precise control flow graphs (CFGs), which further impacts the performance of downstream static analysis tasks. To tackle this issue, various types of indirect call analyzers have been proposed. However, they do not...
Preprint
Full-text available
The thriving mobile app ecosystem encompasses a wide range of functionalities. However, within this ecosystem, a subset of apps provides illicit services such as gambling and pornography to pursue economic gains, collectively referred to as "underground economy apps". While previous studies have examined these apps' characteristics and identificati...
Preprint
OpenAI's advanced large language models (LLMs) have revolutionized natural language processing and enabled developers to create innovative applications. As adoption grows, understanding the experiences and challenges of developers working with these technologies is crucial. This paper presents a comprehensive analysis of the OpenAI Developer Forum,...
Preprint
Image captioning (IC) systems, such as Microsoft Azure Cognitive Service, translate image content into descriptive language but can generate inaccuracies leading to misinterpretations. Advanced testing techniques like MetaIC and ROME aim to address these issues but face significant challenges. These methods require intensive manual labor for detail...
Preprint
Deep Neural networks (DNNs), extensively applied across diverse disciplines, are characterized by their integrated and monolithic architectures, setting them apart from conventional software systems. This architectural difference introduces particular challenges to maintenance tasks, such as model restructuring (e.g., model compression), re-adaptat...
Article
With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of “glitch tokens”, which are anomalous tokens produced by established tokenizers and could...
Preprint
Full-text available
LLM app stores have seen rapid growth, leading to the proliferation of numerous custom LLM apps. However, this expansion raises security concerns. In this study, we propose a three-layer concern framework to identify the potential security risks of LLM apps, i.e., LLM apps with abusive potential, LLM apps with malicious intent, and LLM apps with ex...
Preprint
When mobile meets LLMs, mobile app users deserve to have more intelligent usage experiences. For this to happen, we argue that there is a strong need to appl LLMs for the mobile ecosystem. We therefore provide a research roadmap for guiding our fellow researchers to achieve that as a whole. In this roadmap, we sum up six directions that we believe...
Preprint
WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development, offering a compact binary format that allows high-performance applications to run at near-native speeds in web browsers. Despite its advantages, Wasm's binary nature presents significant challenges for developers and researchers, particularly regarding readability when...
Preprint
The Android ecosystem faces a notable challenge known as fragmentation, which denotes the extensive diversity within the system. This issue is mainly related to differences in system versions, device hardware specifications, and customizations introduced by manufacturers. The growing divergence among devices leads to marked variations in how a give...
Article
NFT rug pull is one of the most prominent type of NFT scam, whose definition is that the developers of an NFT project abandon it and run away with investors' funds. Although they have drawn attention from our community, to the best of our knowledge, the NFT rug pulls have not been systematically measured. To fill the void, this paper presents the f...
Article
Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. However, the shroud of opacity surrounding the decision-making process of these detectors makes it difficult for security analysts to comprehend. To address this, various explanation approaches have been proposed to explain the predictions by...
Preprint
Full-text available
In Ethereum, the practice of verifying the validity of the passed addresses is a common practice, which is a crucial step to ensure the secure execution of smart contracts. Vulnerabilities in the process of address verification can lead to great security issues, and anecdotal evidence has been reported by our community. However, this type of vulner...
Preprint
Full-text available
Maximal Extractable Value (MEV) drives the prosperity of the blockchain ecosystem. By strategically including, excluding, or reordering transactions within blocks, block producers/validators can extract additional value, which in turn incentivizes them to keep the decentralization of the whole blockchain platform. Before The Merge of Ethereum in Se...
Preprint
Full-text available
The rapid advancements in Large Language Models (LLMs) have revolutionized natural language processing, with GPTs, customized versions of ChatGPT available on the GPT Store, emerging as a prominent technology for specific domains and tasks. To support academic research on GPTs, we introduce GPTZoo, a large-scale dataset comprising 730,420 GPT insta...
Preprint
In the digital era, Online Social Networks (OSNs) play a crucial role in information dissemination, with sharing cards for link previews emerging as a key feature. These cards offer snapshots of shared content, including titles, descriptions, and images. In this study, we investigate the construction and dissemination mechanisms of these cards, foc...
Preprint
As a pivotal extension of the renowned ChatGPT, the GPT Store serves as a dynamic marketplace for various Generative Pre-trained Transformer (GPT) models, shaping the frontier of conversational AI. This paper presents an in-depth measurement study of the GPT Store, with a focus on the categorization of GPTs by topic, factors influencing GPT popular...
Article
Full-text available
Cryptocurrency wallets, acting as fundamental infrastructure to the blockchain ecosystem, have seen significant user growth, particularly among browser-based wallets (i.e., browser extensions). However, this expansion accompanies security challenges, making these wallets prime targets for malicious activities. Despite a substantial user base, there...
Article
Beyond an emerging popular web applications runtime supported in almost all commodity browsers, WebAssembly (WASM) is further regarded to be the next-generation execution environment for blockchain-based applications. Indeed, many popular blockchain platforms such as EOSIO and NEAR have adopted WASM-based execution engines. Most recently, WASM has...
Article
Smart contracts are automated programs stored on a blockchain, featuring unique attributes such as permissionlessness, trustlessness, immutability, and transparency. These properties underpin an array of unprecedented decentralized services. Compiled into bytecodes, Ethereum smart contracts are executed within the Ethereum Virtual Machine (EVM). Et...
Article
NFT rug pull is one of the most prominent type of NFT scam, whose definition is that the developers of an NFT project abandon it and run away with investors' funds. Although they have drawn attention from our community, to the best of our knowledge, the NFT rug pulls have not been systematically measured. To fill the void, this paper presents the f...
Chapter
The vigorous growth of cryptocurrencies has infiltrated into social media, manifested by extensive advertisement campaigns popping up on platforms such as Telegram and Twitter. This new way of promotion also introduces an additional attack surface on the blockchain community, with a number of recently-identified scam projects distributed via this c...
Chapter
Binary rewriting is a widely adopted technique in software analysis. WebAssembly (Wasm), as an emerging bytecode format, has attracted great attention from our community. Unfortunately, there is no general-purpose binary rewriting framework for Wasm, and existing effort on Wasm binary modification is error-prone and tedious. In this paper, we prese...
Article
WebAssembly (abbreviated WASM) has emerged as a promising language of the Web and also been used for a wide spectrum of software applications such as mobile applications and desktop applications. These applications, named as WASM applications, commonly run in WASM runtimes. Bugs in WASM runtimes are frequently reported by developers and cause the c...
Preprint
Full-text available
Large Language Models (LLMs) have significantly impacted numerous domains, notably including Software Engineering (SE). Nevertheless, a well-rounded understanding of the application, effects, and possible limitations of LLMs within SE is still in its early stages. To bridge this gap, our systematic literature review takes a deep dive into the inter...
Preprint
Full-text available
WebAssembly (Wasm) is an emerging binary format that draws great attention from our community. However, Wasm binaries are weakly protected, as they can be read, edited, and manipulated by adversaries using either the officially provided readable text format (i.e., wat) or some advanced binary analysis tools. Reverse engineering of Wasm binaries is...
Preprint
Provenance-Based Endpoint Detection and Response (P-EDR) systems are deemed crucial for future APT defenses. Despite the fact that numerous new techniques to improve P-EDR systems have been proposed in academia, it is still unclear whether the industry will adopt P-EDR systems and what improvements the industry desires for P-EDR systems. To this en...
Preprint
Full-text available
Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) services due to their exceptional proficiency in understanding and generating human-like text. LLM chatbots, in particular, have seen widespread adoption, transforming human-machine interactions. However, these LLM chatbots are susceptible to "jailbreak" attacks, where ma...
Preprint
Full-text available
Smart contracts play a vital role in the Ethereum ecosystem. Due to the prevalence of kinds of security issues in smart contracts, the smart contract verification is urgently needed, which is the process of matching a smart contract's source code to its on-chain bytecode for gaining mutual trust between smart contract developers and users. Although...
Article
Deep learning (DL) has become a key component of modern software. In the “ big model ” era, the rich features of DL-based software (i.e., DL software) substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on the powerful cloud with large datasets. Hence, training effective DL models has beco...
Preprint
Full-text available
NFT rug pull is one of the most prominent type of scam that the developers of a project abandon it and then run away with investors' funds. Although they have drawn attention from our community, to the best of our knowledge, the NFT rug pulls have not been systematically explored. To fill the void, this paper presents the first in-depth study of NF...
Preprint
Full-text available
Binary rewriting is a widely adopted technique in software analysis. WebAssembly (Wasm), as an emerging bytecode format, has attracted great attention from our community. Unfortunately, there is no general-purpose binary rewriting framework for Wasm, and existing effort on Wasm binary modification is error-prone and tedious. In this paper, we prese...
Preprint
Full-text available
Although existing techniques have proposed automated approaches to alleviate the path explosion problem of symbolic execution, users still need to optimize symbolic execution by applying various searching strategies carefully. As existing approaches mainly support only coarse-grained global searching strategies, they cannot efficiently traverse thr...
Article
Serverless computing is a popular cloud computing paradigm that frees developers from server management. Function-as-a-Service (FaaS) is the most popular implementation of serverless computing, representing applications as event-driven and stateless functions. However, existing studies report that functions of FaaS applications severely suffer from...
Article
The rapid growth of Decentralized Finance (DeFi) boosts the blockchain ecosystem. At the same time, attacks on DeFi applications (apps) are increasing. However, to the best of our knowledge, existing smart contract vulnerability detection tools cannot directly detect DeFi attacks. That's because they lack the capability to recover and understand hi...
Article
Being able to automatically detect the performance issues in apps can significantly improve apps’ quality as well as having a positive influence on user satisfaction. A pplication P erformance M anagement (APM) libraries are used to locate the apps’ performance bottleneck, monitor their behaviors at runtime, and identify potential security ri...