Haipeng Cai
University at Buffalo, The State University of New York | SUNY Buffalo · Department of Computer Science and Engineering
Doctor of Philosophy
About
Publications: 126
Reads: 24,226
Citations: 1,979
Introduction
Researcher in Software Engineering with a focus on program analysis for systems reliability and code security
Additional affiliations
August 2016 - August 2024
August 2015 - August 2016
August 2012 - August 2015
Education
August 2012 - July 2015
Publications (126)
With the rise of large language models, such as ChatGPT, non-decisional models have been applied to various tasks. Moreover, ChatGPT has drawn attention to the traditional decision-centric task of Android malware detection. Despite effective detection methods proposed by scholars, they face low interpretability issues. Specifically, while these met...
The Amazon Alexa marketplace has grown rapidly in recent years due to third-party developers creating large amounts of content and publishing directly to a skills store. Despite the growth of the Amazon Alexa skills store, there have been several reported security and usability concerns, which may not be identified during the vetting phase. However...
Timely and effective vulnerability patching is essential for cybersecurity defense, for which various approaches have been proposed yet still struggle to generate valid and correct patches for real-world vulnerabilities. In this paper, we leverage the power and merits of pre-trained large language models (LLMs) to enable automated vulnerability pat...
Detecting vulnerabilities is a crucial task for maintaining the integrity, availability, and security of software systems. Utilizing DL-based models for vulnerability detection has become commonplace in recent years. However, such deep learning-based vulnerability detectors (DLVD) suffer from a shortage of sizable datasets to train effectively. Dat...
With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation and vulnerability detection. However, they also exhibit numerous limitations and shortcomings. LLM-based agents,...
A growing number of studies have shown bugs in multi-language software to be a critical loophole in modern software quality assurance, especially those induced by language interactions (i.e., multilingual bugs). Yet existing tool support for bug detection/localization remains largely limited to single-language software, despite the long-standing prevalence of...
The prosperity of software applications brings fierce market competition to developers. Employing third-party libraries (TPLs) to add new features to projects under development and to reduce the time to market has become a popular practice in the community. However, given the tremendous number of TPLs ready for use, it is challenging for developers to effectively...
With the emergence of smartphones, Android has become a widely used mobile operating system. However, it is vulnerable when encountering various types of attacks. Every day, new malware threatens the security of users’ devices and private data. Many methods have been proposed to classify malicious applications, utilizing static or dynamic analysis...
Intrusion Detection Systems (IDSs) are an essential element of modern cyber defense, alerting users to when and where cyber-attacks occur. Machine learning can enable IDSs to further distinguish between benign and malicious behaviors, but it comes with several challenges, including lack of quality training data and high false positive rates. Genera...
Developing software projects that incorporate multiple languages has been a prevalent practice for many years. However, the issues encountered by developers during the development process, the underlying challenges causing these issues, and the solutions provided to developers remain unknown. In this paper, our objective is to provide answers...
For many years now, modern software is known to be developed in multiple languages (hence termed as multilingual or multi-language software). Yet to this date we still only have very limited knowledge about how multilingual software systems are constructed. For instance, it is not yet really clear how different languages are used, selected together...
Artificial intelligence (AI) for software engineering (SE) tasks has recently achieved promising performance. In this paper, we investigate to what extent the pre-trained language model truly understands those SE tasks such as code search, code summarization, etc. We conduct a comprehensive empirical study on a broad set of AI for SE (AI4SE) tasks...
Fragmentation is a serious problem in the Android ecosystem, which is mainly caused by the fast evolution of the system itself and the various system customizations. Many efforts have attempted to mitigate its impact via approaches to automatically pinpointing compatibility issues in Android apps. We conducted a literature review to identify all th...
Software vulnerabilities are a major source of cybersecurity threats. Therefore, it is of paramount importance to defend against (e.g., detect and repair) them. Data-driven approaches, especially those based on machine/deep learning (ML/DL), have demonstrated a great potential to that end. To achieve practical efficacy, these approaches rely on a l...
Developing a software project using multiple languages together has been a dominant practice for years. Yet it remains unclear what issues developers encounter during the development, which challenges cause the issues, and what solutions developers receive. In this paper, we aim to answer these questions via a study on developer discussions on Stac...
Traditional dynamic dependence analysis approaches have limited utility for continuously running distributed systems (i.e., distributed services) because of their low cost-effectiveness. A recent technique, SEADS, was developed to improve the cost-effectiveness by adjusting analysis configurations on the fly using a general Q-learning algorithm....
Building new, powerful data-driven defenses against prevalent software vulnerabilities needs sizable, quality vulnerability datasets, as does large-scale benchmarking of existing defense solutions. Automatic data generation would promisingly meet the need, yet there is little work aimed at generating much-needed quality vulnerable samples. Meanwhile,...
Security of Android devices is now paramount, given their wide adoption among consumers. As researchers develop tools for statically or dynamically detecting suspicious apps, malware writers regularly update their attack mechanisms to hide malicious behavior implementation. This poses two problems to current research techniques: static analysis app...
Software construction using multiple languages has long been a norm, yet it is still unclear if multilingual code construction has significant security implications and real security consequences. This paper aims to address this question with a large-scale study of popular multi-language projects on GitHub and their evolution histories, enabled by...
The availability of large-scale, realistic vulnerability datasets is essential for both benchmarking existing techniques and developing effective new ones, especially those using data-driven (e.g., machine/deep-learning based) approaches, for software security. Yet such datasets are critically lacking. A promising solution is to generate such datas...
Today's software systems are mostly developed in multiple languages (i.e., multi-language software), yet tool support for understanding and assuring these systems is rare. To facilitate future research on multi-language software engineering, this paper presents PolyFax, a toolkit that offers automated means for dataset collection from GitHub and tw...
Open science is a practice that makes scientific research publicly accessible to anyone, hence is highly beneficial. Given the benefits, the software engineering (SE) community has been diligently advocating open science policies during peer reviews and publication processes. However, to this date, there have been few studies that look into the stat...
Analyzing multilingual code holistically is key to systematic quality assurance of real-world software which is mostly developed in multiple computer languages. Toward such analyses, state-of-the-art approaches propose an almost-fully language-agnostic methodology and apply it to dynamic dependence analysis/slicing of multilingual code, showing gre...
Fragmentation is a serious problem in the Android ecosystem. This problem is mainly caused by the fast evolution of the system itself and the various customizations independently maintained by different smartphone manufacturers. Many efforts have attempted to mitigate its impact via approaches to automatically pinpoint compatibility issues in Andro...
Despite the fact that most real-world software systems today are written in multiple programming languages, existing program analysis based security techniques are still limited to single-language code. In consequence, security flaws (e.g., code vulnerabilities) at and across language boundaries are largely left out as blind spots. We present POLYC...
Data-oriented attacks manipulate non-control data to alter a program’s benign behavior without violating its control-flow integrity. It has been shown that such attacks can cause significant damage even in the presence of control-flow defense mechanisms. However, these threats have not been adequately addressed. In this survey article, we first map...
As modern software systems are increasingly developed for running in distributed environments, it is crucial to provide fundamental techniques such as dependence analysis for checking, diagnosing, and evolving those systems. However, traditional dependence analysis is either inapplicable or of very limited utility for distributed programs due to th...
Dynamic information flow analysis (DIFA) supports various security applications such as malware analysis and vulnerability discovery. Yet traditional DIFA approaches have limited utility for distributed software due to applicability, portability, and scalability barriers. We present FLOWDIST, a DIFA for common distributed software that overcomes th...
Cryptographic protocols are often expected to be provably secure. However, this security guarantee often falls short in practice due to various implementation flaws. We propose a new paradigm called cryptographic program analysis (CPA) which prescribes the use of program analysis to detect these implementation flaws at compile time. The principal...
Context: Memory error vulnerabilities have been consequential and several well-known, open-source memory error vulnerability detectors exist, built on static and/or dynamic code analysis. Yet there is a lack of assessment of such detectors based on rigorous, quantitative accuracy and efficiency measures while not being limited to specific applicatio...
Malware detection at scale in the Android realm is often carried out using machine learning techniques. State-of-the-art approaches such as DREBIN and MaMaDroid are reported to yield high detection rates when assessed against well-known datasets. Unfortunately, such datasets may include a large portion of duplicated samples, which may bias recorded...
Bug reports (BR) contain vital information that can help triaging teams prioritize and assign bugs to developers who will provide the fixes. However, studies have shown that BR fields often contain incorrect information that needs to be reassigned, which delays the bug fixing process. There exist approaches for predicting whether a BR field should b...
A playtest is the process in which human testers are recruited to play video games and to reveal software bugs. Manual testing is expensive and time-consuming, especially when there are many mobile games to test and every software version requires extensive testing before being released. Existing testing frameworks (e.g., Android Monkey) are li...
We envision visual semantics learning (VSL), a novel methodology that derives high-level functional description of given software from its visual (graphical) outputs. By visual semantics, we mean the semantic description about the software’s behaviors that are exhibited in its visual outputs. VSL works by composing this description based on visual...
Distributed software systems are increasingly developed and deployed today. Many of these systems are supposed to run continuously. Given their critical roles in our society and daily lives, assuring the quality of distributed systems is crucial. Analyzing runtime program dependencies has long been a fundamental technique underlying numerous tool s...
Machine learning–based classification dominates current malware detection approaches for Android. However, due to the evolution of both the Android platform and its user apps, existing such techniques are widely limited by their reliance on new malware samples, which may not be timely available, and constant retraining, which is often very costly....
With the rise of the mobile computing market, Android has received tremendous attention from both academia and industry. Application programming in Android is known to have unique characteristics, and Android apps are known to be particularly vulnerable to various security attacks. In response, numerous solutions for particular security issues have been propose...
Context: The constant evolution of the Android platform and its applications has imposed significant challenges both to understanding and securing the Android ecosystem. Yet, despite the growing body of relevant research, it remains unclear how Android apps evolve in terms of their run-time behaviors in ways that impede our gaining consistent empir...
As in other software domains, information flow security is a fundamental aspect of code security in distributed systems. However, most existing solutions to information flow security are limited to centralized software. For distributed systems, such solutions face multiple challenges, including technique applicability, tool portability, and analysi...
The rapid expansion of the Android ecosystem is accompanied by continuing diversification of platforms and devices, resulting in increasing incompatibility issues which damage user experiences and impede app development productivity. In this paper, we conducted a large-scale, longitudinal study of compatibility issues in 62,894 benign apps develope...
Most existing Android malware detection and categorization techniques are static approaches, which suffer from evasion attacks such as obfuscation. By analyzing program behaviors, dynamic approaches are potentially more resilient against these attacks. Yet existing dynamic approaches mostly rely on characterizing system calls which are subject to s...
Context: Requirement traceability (RT) is defined as the ability to describe and follow the life of a requirement. RT helps developers ensure that relevant requirements are implemented and that the source code is consistent with its requirement with respect to a set of traceability links called trace links. Previous work leverages Parts Of Speech (...
Machine learning-based malware detection dominates current security defense approaches for Android apps. However, due to the evolution of Android platforms and malware, existing such techniques are widely limited by their need for constant retraining that is costly, and reliance on new malware samples that may not be timely available. As a result,...
Today, computing on various Android devices is pervasive. However, growing security vulnerabilities and attacks in the Android ecosystem constitute various threats through user apps. Taint analysis is a common technique for defending against these threats, yet it suffers from challenges in attaining practical simultaneous scalability and effectiven...
Approaches to Android malware detection built on supervised learning are commonly subject to frequent retraining, or the trained classifier may fail to detect newly emerged or emerging kinds of malware. This work targets a sustainable Android malware detector that, once trained on a dataset, can continue to effectively detect new malware without re...
The runtime permission model of Android enhances security yet also constitutes a source of incompatibility issues that impede the productivity of mobile developers. This paper presents a novel analysis that detects the incompatible permission uses in a given app and repairs them when found, hence automatically adapting the app to the runtime permi...
We present ICC-INSPECT, a tool for understanding Android app behaviors exhibited at runtime via inter-component communication (ICC). Through lightweight Intent profiling, ICC-INSPECT streams run-time ICC information to a dynamic visualization framework which depicts interactive ICC call graphs along with informative ICC statistics. This framework a...
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers are increasingly required to have a deep understanding of malware. There is thus a need to provide a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of ma...
Most existing research for Android focuses on particular security issues, yet there is little broad understanding of Android application run-time characteristics and their implications. To mitigate this gap, we present the first systematic dynamic characterization study of Android apps that targets a broad understanding of application behaviors in...
As the Android app market keeps growing, there is a pressing need for automated tool support to empower Android developers to produce quality apps with higher productivity. Yet existing tools for Android mostly aim at security and privacy protection, primarily targeting end users and security analysts. Towards filling this gap, we present DROIDFAX...
Inter-component communication (ICC) serves as a key element of any Android app's implementation. Specifically, an Android app uses Intents as the main mechanism for ICC to complete tasks such as switching between different user interfaces, starting background services, communicating to other apps on the Android device, and saving or retrieving data...