LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing
Hongxiang Zhang∗
hxxzhang@ucdavis.edu
University of California, Davis
Davis, California, USA
Yuyang Rong
Advanced Micro Devices, Inc. and UC Davis
Davis, California, USA
Yifeng He
University of California, Davis
Davis, California, USA
Hao Chen
University of California, Davis
Davis, California, USA
Figure 1: The overview of fuzzing with LLAMAFUZZ. Each data point in the fine-tuning dataset is represented as a pair (S_i, S_i'), which denotes a seed before and after a successful mutation. The mutation process within LLAMAFUZZ is dual-layered, featuring both traditional fuzzing mutations and LLM-based mutations. These two mutation processes are asynchronous. Four grey boxes (Execution, Behavior monitoring, Mutation, and Seeds queue) on the right represent the fuzzing loop.
ABSTRACT
Greybox fuzzing has achieved success in revealing bugs and vulnerabilities in programs. However, randomized mutation strategies have limited the fuzzer's performance on structured data. Specialized fuzzers can handle complex structured data, but they require additional effort to specify grammars and suffer from low throughput.

In this paper, we explore the potential of utilizing Large Language Models to enhance greybox fuzzing for structured data. We utilize the pre-trained knowledge of LLMs about data conversion and formats to generate new valid inputs. We further fine-tune them with paired mutation seeds to learn structured formats and mutation strategies effectively. Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLMs to understand and mutate structured data into fuzzing. We conduct experiments on the standard bug-based benchmark Magma and a wide variety of real-world programs. LLAMAFUZZ outperforms our top competitor by 41 bugs on average. We also identified 47 unique bugs across all trials. Moreover, LLAMAFUZZ demonstrated consistent performance on both bugs triggered and bugs reached. Compared to AFL++, LLAMAFUZZ achieved 27.19% more branches in real-world program sets on average. We also present a case study to explain how LLMs enhance the fuzzing process in terms of code coverage.
CCS CONCEPTS
• Security and privacy → Software security engineering; • Software and its engineering → Software testing and debugging.
KEYWORDS
Fuzzing, Large Language Model, binary structured data
1 INTRODUCTION
Fuzz testing, also known as fuzzing, is an automated software testing technique that generates test seeds to discover vulnerabilities in target programs or applications. In the past few years, greybox fuzzing has drawn much attention because of its effectiveness in discovering new vulnerabilities in many programs. As software systems continue to grow in complexity and evolve at an accelerated pace, the need for adapted test inputs has become increasingly important. Randomized mutation [9] has achieved a lot, but it has reached a bottleneck: traditional greybox fuzzers struggle to effectively generate structured data.
General-purpose greybox fuzzers employ bit-level mutation with high throughput. AFL++ [9], one of the state-of-the-art greybox
fuzzers, combines multiple mutation strategies and scheduling strategies, taking fuzzing to a new level. However, when dealing with applications that require structured input, blind random bit-level mutation can be problematic. Such mutations often disrupt the integrity of data formats, resulting in inefficient seeds. As a result, converging to high and stable coverage that reaches bugs takes a very long time.
Therefore, to speed up this process, honggfuzz [32] proposes sharing the file corpus so that the fuzzer can run multiprocess and multithreaded, improving throughput to generate more test cases in limited time. However, naively increasing throughput and adding more random mutation strategies hits a bottleneck when mutating structured seeds because of their complex structural requirements. AFL++ and honggfuzz need an excessive number of attempts to mutate a valid structured seed. Moreover, using randomized strategies, fuzzers demonstrate unstable results. To mitigate such uncertainty, evaluating fuzz testing requires repeated trials for a fair comparison. Nevertheless, real-world bugs are scarce, and even ten repeated trials cannot ensure bug detection.
While boosting throughput helps, it is not sufficient, as fuzzers relying on randomness and coverage information lack structural awareness of the test seed. To generate valuable structured binary seeds, specialized fuzzers have been proposed that use predefined grammars to create structured data. Gramatron [31] restructures the grammar to enable unbiased sampling from the input state space and permits more aggressive mutation operations. Gramatron combines search-based testing with grammar-based fuzzing to co-evolve both aspects of test case generation. However, Gramatron requires additional specifications in pre-defined Chomsky Normal Form and Greibach Normal Form to construct grammar automata. Meanwhile, Gramatron focuses on the JSON format. Another approach is the chunk-based mutator. WEIZZ [8] proposes a technique for automatically generating and mutating inputs for unknown chunk-based binary formats. Nevertheless, WEIZZ struggles to handle grammar-based formats such as JSON, XML, and programming languages.
Therefore, fuzzer developers face a trade-off between employing general-purpose fuzzers and specialized ones. General-purpose fuzzers, while versatile, often struggle to handle structured seeds effectively. On the other hand, specialized fuzzers can produce high-quality structured seeds, but this specialization can limit their flexibility and applicability. Additionally, the reliance on grammar rules for seed generation requires extensive domain knowledge, which can be a barrier to their widespread use. Thus, there is a need for a better approach that leverages the strengths of both generic and specialized fuzzers.
To tackle the aforementioned problems, we propose using Large Language Models (LLMs) to enhance the mutation process in fuzzing. Figure 1 provides an overview of the LLAMAFUZZ architecture. By pre-training LLMs on diverse datasets, LLMs can learn intricate patterns for data conversion and data format information, which are crucial for structured data mutation. Additionally, we fine-tune LLMs to learn specific structured seed patterns and mutate structured seeds, aiming to find a balance between generic and specialized fuzzers.
We implemented our prototype LLAMAFUZZ and conducted evaluations on two benchmarks. To evaluate bug-finding performance, we compare LLAMAFUZZ with the state-of-the-art fuzzers AFL++, MOptAFL, Honggfuzz, and Fairfuzz on the bug-based benchmark Magma [11]. LLAMAFUZZ outperforms our top competitor by 41 bugs on average. We also find 47 unique bugs across all trials. In addition, we examine how LLAMAFUZZ performs on real-world programs with different structured data formats. The experimental results show LLAMAFUZZ outperforming AFL++ in 10 out of 15 fuzzing targets, with 27.19% higher coverage on average. Last but not least, we present a case study to visually explain how LLM-mutated seeds augment the fuzzing process in terms of code coverage.
In summary, this paper makes the following contributions.
• We propose an LLM-enhanced mutation strategy that can be applied to both binary-based and text-based data formats with only a few steps of fine-tuning.
• We provide a middle ground between the generic fuzzer and the specialized fuzzer that can learn structured seed patterns and mutate structured seeds.
• We provide empirical evidence that LLMs can augment the mutation process to benefit fuzzing and improve code coverage.
• We provide experimental explanations to illustrate how the LLM augments the fuzzing process.
• We design a lightweight asynchronous method to harness LLMs and the fuzzer, allowing LLAMAFUZZ to be easily deployed on either a single GPU or multiple GPUs.
2 BACKGROUND AND MOTIVATION
In this section, we start by introducing the background of mutation-based fuzzing and coverage-guided greybox fuzzing. We then review current solutions for structured data and background on large language models to motivate our solution.
2.1 Mutation-based fuzzing
The goal of mutation strategies is to generate new test cases from a given seed to uncover previously unexplored areas. Mutations can be random, making arbitrary changes to the seed. AFL++, a state-of-the-art fuzzer, employs three phases of mutation. The deterministic phase involves bit-level flips of varying lengths, addition and subtraction of small integers, and insertion of known interesting integers. In the havoc phase, AFL++ randomly selects mutation operators multiple times and applies them to various random positions within the seed. The splicing phase combines segments from two distinct seeds to create a new test case, which is then subjected to further mutation in the havoc phase. A minimal sketch of these operators is given below.
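To make the three phases concrete, here is a minimal, illustrative Python sketch of havoc-style mutation and splicing. All names and operator choices are our own simplifications of AFL++'s C implementation, not its actual code.

```python
import random

INTERESTING_BYTES = [0x00, 0x01, 0x7F, 0x80, 0xFF]  # "known interesting" values

def havoc(seed: bytes, rounds: int = 8) -> bytes:
    # Apply a few randomly chosen operators at random positions.
    # Assumes a non-empty seed.
    buf = bytearray(seed)
    for _ in range(rounds):
        pos = random.randrange(len(buf))
        op = random.choice(("bitflip", "arith", "interesting"))
        if op == "bitflip":
            buf[pos] ^= 1 << random.randrange(8)         # flip a single bit
        elif op == "arith":
            buf[pos] = (buf[pos] + random.randint(-16, 16)) % 256
        else:
            buf[pos] = random.choice(INTERESTING_BYTES)  # plant a known value
    return bytes(buf)

def splice(a: bytes, b: bytes) -> bytes:
    # Combine the head of one seed with the tail of another, then havoc.
    # Assumes both seeds contain at least two bytes.
    return havoc(a[:random.randrange(1, len(a))] + b[random.randrange(1, len(b)):])
```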
The mutation-based strategy explores the input space broadly, but its inherent randomness introduces uncertainty and inefficiency into the fuzzing process. Consequently, traditional greybox fuzzers are highly unlikely to produce valid inputs, requiring exponentially more time to probe deeper into the program. We propose leveraging LLMs to mutate seeds based on existing ones, as LLMs can understand the structure of the input seed and modify it while preserving its validity. Our approach has shown effectiveness in accelerating bug discovery, improving overall edge coverage, and increasing the total number of bugs found.
2.2 Coverage-guided greybox fuzzing
To overcome the inherent randomness challenges in mutation-based fuzzing, researchers suggest using a bitmap to record coverage information as feedback to guide the fuzzing process more effectively [41]. Since vulnerabilities cannot be detected on uncovered paths, focusing on expanding the coverage of execution paths is a reasonable step toward improving the performance of fuzzing techniques.
Given a program under test and a set of initial seeds, the coverage-guided greybox fuzzing process mainly consists of four stages (sketched in code below):
(1) Seeds queue: a seed is selected from the seed pool for mutation.
(2) Seed mutation: the selected seed is mutated by various mutation strategies to generate new test seeds.
(3) Execution: the current seed is executed against the program.
(4) Behavior monitoring: each new seed is fed into the instrumented program for execution and evaluated by the coverage metric; if the seed triggers new coverage, it is added to the seeds queue for further fuzzing.
As the fuzzing loop continues, more code branches are reached, which holds the potential to trigger a bug [35].
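The loop below is a minimal Python sketch of these four stages; `mutate` and `run_target` are assumed callbacks standing in for the real mutation engine and the instrumented execution harness.

```python
import random
from typing import Callable, Iterable, Set

def fuzz_loop(initial_seeds: Iterable[bytes],
              mutate: Callable[[bytes], bytes],
              run_target: Callable[[bytes], Set[int]],
              iterations: int = 100_000):
    queue = list(initial_seeds)            # (1) seeds queue
    global_coverage: Set[int] = set()
    for _ in range(iterations):
        seed = random.choice(queue)        # pick a seed for mutation
        candidate = mutate(seed)           # (2) seed mutation
        edges = run_target(candidate)      # (3) execution on the target
        if not edges <= global_coverage:   # (4) behavior monitoring: new coverage?
            global_coverage |= edges
            queue.append(candidate)        # keep the interesting seed
    return queue, global_coverage
```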
2.3 Structured seed mutator
Coverage-guided greybox fuzzing has been effective in identifying vulnerabilities in many real-world programs. However, with the increasing complexity of software development, many programs use highly structured data in special formats, which poses significant challenges for traditional fuzzing techniques. Traditional fuzzers primarily perform mutations at the bit level, requiring excessive attempts to mutate such structured data effectively.
Grammar-based fuzzing provides a solution: it generates well-structured seeds from a human-specified grammar, which guarantees that the generated inputs are syntactically valid while remaining diverse. Three grammar-aware mutation operators have been found to be particularly effective in uncovering deep bugs [1, 31]: random mutation, which selects a random non-leaf non-terminal node and creates a new context-free grammar derivation subtree; random recursive unrolling, which finds recursive production rules and expands them up to n times; and splicing, which combines two inputs while preserving their syntactic validity. A toy sketch of the first operator follows.
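The sketch below illustrates the random-mutation operator on derivation trees using a tiny JSON-like grammar. The grammar and tree encoding are our own simplification for illustration, not Gramatron's automaton representation.

```python
import random

# Tiny JSON-like context-free grammar (illustrative only).
GRAMMAR = {
    "VALUE": [["NUMBER"], ["[", "VALUE", "]"], ["{", "STRING", ":", "VALUE", "}"]],
    "NUMBER": [["0"], ["1"], ["2"]],
    "STRING": [['"a"'], ['"b"']],
}

def grow(symbol, depth=4):
    # Expand a symbol into a derivation tree: [symbol, children] for
    # non-terminals, [terminal] for leaves. depth bounds the recursion.
    if symbol not in GRAMMAR:
        return [symbol]
    rule = GRAMMAR[symbol][0] if depth <= 0 else random.choice(GRAMMAR[symbol])
    return [symbol, [grow(s, depth - 1) for s in rule]]

def nonterminal_paths(tree, path=()):
    # Yield the tree path of every non-terminal node.
    if len(tree) == 2:
        yield path
        for i, child in enumerate(tree[1]):
            yield from nonterminal_paths(child, path + (i,))

def random_mutation(tree):
    # Pick a random non-terminal node and regrow a fresh subtree in place.
    node = tree
    for i in random.choice(list(nonterminal_paths(tree))):
        node = node[1][i]
    node[:] = grow(node[0])
    return tree

def flatten(tree) -> str:
    return tree[0] if len(tree) == 1 else "".join(flatten(c) for c in tree[1])

# e.g. tree = grow("VALUE"); print(flatten(random_mutation(tree)))
```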
2.4 Large Language Models
In recent studies, pre-trained Large Language Models (LLMs) have shown impressive performance on natural language tasks, including natural language understanding, reasoning, natural language generation, multilinguality, and factuality [4].
Utilizing unsupervised learning, Large Language Models are pre-trained on extensive textual data, equipping them with a broad range of knowledge. Additionally, with billions or trillions of parameters, LLMs can not only capture patterns in context but also understand textual data at a deeper level, such as format and chunk information within files. Such capabilities have enabled LLMs to exhibit remarkable competencies beyond traditional Natural Language Processing tasks; evidence of their versatility includes visual classification [23], protein sequence generation [21], and code generation [17].
Building upon this versatile foundation, the inherent capability to interpret and process different data structures renders LLMs particularly effective in the mutation stage of fuzzing. CHATFUZZ [14] employs LLMs to directly generate seeds, though its application is limited to text-based target programs such as JSON and XML. Moreover, Pérez et al. [25] demonstrate that Compressed-Language Models can understand files compressed in standard compressed file formats. In our experiments, LLMs can produce valuable test seeds that are instrumental in navigating new paths, thereby facilitating enhanced edge coverage.
3 METHODOLOGY
3.1 Architecture
We introduce LLAMAFUZZ, an LLM-based greybox fuzzer designed to efficiently mutate structured data. As illustrated in Figure 1, our approach consists of two primary stages. First, we utilize paired structured data to fine-tune the LLM, enabling it to understand the underlying structure and mutation transformations. Second, we integrate the fuzzer with the LLM, which generates structured seeds based on existing inputs. A crucial aspect of our approach is the fine-tuning stage, which empowers the LLM to understand the target data structure and mutation conversion, allowing LLAMAFUZZ to adapt to various data formats through fine-tuning.
Our workflow includes three parts:
(1) Fine-tuning preparation. Our training data were collected from a variety of sources, including FuzzBench [24] experiment data and AFL++ experiment data. We also introduce a data conversion method allowing the LLM to generate various data formats.
(2) Fine-tuning the LLM for mutation. We introduce the approach to fine-tuning the LLM and then leveraging the LLM to perform structure-aware mutation.
(3) Integrating the fuzzer and the LLM. We demonstrate an asynchronous approach for integrating the fuzzer and the LLM, enabling asynchronous communication between the two components.
3.2 Fine-tuning preparation
We follow the standard LLM training process: generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on specific tasks. For the base model, we selected llama-2-7b-chat-hf, which has been pre-trained on approximately 2 trillion tokens [33]. The fine-tuning data was collected from real-world fuzzing processes [24]. We use this data to teach the LLM the patterns and mutations of structured data so that it can modify a given seed to generate valuable seeds while keeping the original structure.
3.2.1 Fine-tuning data collection. We expect the LLM to be able to understand the structure of data and generate structured seeds for testing, so we first need to collect a training set. Specifically, we collect valuable seeds from FuzzBench [24] experiment data and AFL++ fuzzing data that satisfy at least one of the following: (1) they find new paths, (2) they have different hit-counts, or (3) they trigger crashes. The reasons are intuitive. Improving coverage helps the fuzzer explore target programs to find vulnerabilities on unvisited paths, since bugs cannot be found on undiscovered paths. Seeds with different hit-counts may not directly improve coverage, but they execute the program in a different way, which can reveal vulnerabilities on already visited paths. Finally, the goal of fuzzing is to find vulnerabilities, so seeds that trigger crashes can be regarded as valuable. Note that the dataset does not include any experiment data from Magma. This avoids the LLM simply replaying memorized seeds to trigger bugs. A sketch of these selection criteria follows Figure 2's caption below.

Figure 2: PNG example of a sample input and the LLM-mutated result. The left section shows the raw data of the example files in the 010 Editor [30] view. The two tables on the right highlight the modifications introduced by the LLM, with changes marked in red. Notably, in this example, the PNG file includes the signature, IHDR chunk, gAMA chunk, and data, represented in blue, green, yellow, and white, respectively. The LLM targeted its modifications at the gAMA chunk and the data while keeping the format intact and mutating effectively.
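As a concrete illustration of the three selection criteria, here is a hedged Python sketch; the Record structure and the hit-count fingerprinting are our own assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List

@dataclass
class Record:
    data: bytes
    edges: FrozenSet[int]        # branches covered by this execution
    hit_counts: Dict[int, int]   # edge -> (bucketed) hit count
    crashed: bool

def is_valuable(rec: Record, seen_edges: set, seen_hits: set) -> bool:
    if rec.crashed:                            # (3) triggers a crash
        return True
    if not rec.edges <= seen_edges:            # (1) finds a new path
        return True
    fingerprint = frozenset(rec.hit_counts.items())
    return fingerprint not in seen_hits        # (2) new hit-count profile

def collect(records: Iterable[Record]) -> List[Record]:
    seen_edges, seen_hits, dataset = set(), set(), []
    for rec in records:
        if is_valuable(rec, seen_edges, seen_hits):
            dataset.append(rec)
        seen_edges |= rec.edges
        seen_hits.add(frozenset(rec.hit_counts.items()))
    return dataset
```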
3.2.2 Data conversion and pre-processing. To construct a generic seed mutation model, we follow the mechanism of converting binary input files to uniform hex representations [25]. The reasons for this conversion are as follows. First, one of our goals is to make LLAMAFUZZ able to deal with various data formats, but it is impractical to implement format-specific readers for all of them; thus, we need a uniform method to read data from the training set. Second, traditional fuzzers operate at the bit level on binary seeds, but LLMs typically take natural language as input and generate responses; therefore, it is essential to convert the training data into a format that LLMs can understand. Third, the data conversion is expected to be efficient and fast, as slow conversion would directly impact fuzzing throughput. Compared to other encoding schemes like base64, the hex representation is more intuitive, easy to implement, and can be converted to and from binary cheaply. Note that this data conversion only operates on binary-based data. For text-based data, we only add prompts to the seeds, since LLMs have been shown to be able to process text-based seeds [14].
As illustrated in Figure 3, our approach to data conversion involves the following steps. Initially, the binary seed file is converted into a hex representation. Subsequently, every two contiguous hexadecimal digits are compiled into one token, thereby reducing the token length of the input string; this is a necessary step since most current LLMs have a limited maximum input length. Finally, we add the prompt to each fine-tuning data point. In addition to data conversion, we incorporate noise data into our training set to mitigate the risk of overfitting and training-data replay. Each piece of training data is characterized by a pair of seeds: the original seed and its corresponding mutated version. This setup is designed to aid the LLM in learning not only the mutation transformation but also the underlying structure of the data formats. A minimal sketch of this conversion is given below, after Figure 3's caption.

Figure 3: The workflow of dataset pre-processing. Each pair in the binary seed pairs represents the original binary seed and the mutated binary seed, where S_1 and S_1' indicate the seed before and after mutation. Pair 1 to Pair n on the right represent the data in the fine-tuning dataset, corresponding to Pair 1 to Pair n in Figure 1.
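The following is a minimal sketch of that pipeline under our own assumptions: the prompt wording is illustrative, and the 4096-character cap (see Section 4) is applied here by naive truncation, whereas a real implementation would more likely filter out oversized seeds.

```python
MAX_HEX_CHARS = 4096  # maximum length of a hex sequence (see Section 4)

def seed_to_hex(seed: bytes) -> str:
    # b'\x89PNG' -> '89 50 4e 47': one two-digit token per byte.
    hex_str = seed.hex()
    tokens = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]
    return " ".join(tokens)[:MAX_HEX_CHARS]

def hex_to_seed(text: str) -> bytes:
    return bytes.fromhex(text.replace(" ", ""))

def make_training_example(fmt: str, before: bytes, after: bytes) -> dict:
    # One (S_i, S_i') pair wrapped in an illustrative prompt that
    # emphasizes the format keyword (e.g. "PNG").
    return {
        "prompt": f"Mutate the following {fmt} seed: {seed_to_hex(before)}",
        "completion": seed_to_hex(after),
    }
```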
3.3 Fine-tuning LLM for mutation
In this section, we describe the approach to fine-tuning the LLM and then leveraging the LLM to perform structure-aware mutation.
3.3.1 Fine-tuning. Fine-tuning pre-trained models is a common paradigm for achieving proficiency in a specific downstream task [16, 34]. Similarly, supervised fine-tuning is necessary when utilizing a general-purpose LLM for structured data mutation. This process builds upon the pre-trained model's general understanding and adapts it to specific tasks through supervised learning. During supervised fine-tuning, the LLM adjusts its weights based on the gradients derived from the task-specific loss. Therefore, we can use supervised fine-tuning to teach the LLM to comprehend input syntax and output mutated patterns.
The first step towards fine-tuning is preparing valid prompts. Pair 1 to Pair n on the right of Figure 3 provide examples, for the model, of the structured data to mutate and the corresponding mutation result. In this prompt, the fuzzer provides the current structured data and the desired mutated result in hex representation. Subsequently, we emphasize the format keyword, allowing the LLM to apply the general understanding of the format from its pre-trained knowledge to the mutation. Note that the LLM may occasionally produce stochastic outputs, such as 'Ox00Ox00'. This behavior was relatively rare and does not have a significant impact on the overall fuzzing process: as these responses were often void, they could be easily discarded by the fuzzer without affecting the results.
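To make the fine-tuning stage concrete, here is a hedged sketch using the PEFT/TRL libraries the paper cites [13, 34]. Argument names vary across trl versions, and all hyperparameters and the example pair below are illustrative placeholders, not the paper's settings.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

BASE = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Each example concatenates the prompt (seed S_i in hex, with a format
# keyword) and the completion (mutated seed S_i') into one training string.
pairs = [{"text": "Mutate the following PNG seed: 89 50 4e 47 ... "
                  "Mutated seed: 89 50 4e 47 ..."}]
dataset = Dataset.from_list(pairs)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,              # freeze base weights, train LoRA adapters
    dataset_text_field="text",     # keyword used by older trl releases
    args=TrainingArguments(output_dir="llamafuzz-sft", fp16=True,
                           per_device_train_batch_size=1, num_train_epochs=1),
)
trainer.train()
```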
Figure 2 presents an example of a PNG seed mutated by the LLM, where the bytes highlighted in red are the mutations made by the LLM. In this example, the PNG file is structured into the PNG signature, IHDR, gAMA, and raw data chunks. The signature, often referred to as the file header, consists of a fixed eight bytes that mark the beginning of the file. The IHDR chunk, which immediately follows the signature, is crucial as it contains essential image information such as width, height, bit depth, color type, compression method, filter method, and interlace method. The gAMA chunk specifies the relationship between the image and the desired display output. Subsequent chunks are typically ancillary in nature; following the IHDR, the PNG file contains several ancillary chunks.
In this example, the LLM selectively modifies only the gAMA and data chunks. It not only preserves the original format's integrity but also introduces valid modifications that enhance the seed's potential to expose vulnerabilities.
3.3.2 Generation. The next step is to feed the current seed from the fuzzer, combined with the prompt, into the LLM. This prompt guides the LLM to produce mutations consistent with the format used during its fine-tuning phase. Once the LLM generates a response, the output is parsed and converted back into binary format. We expect the generated output to be a modified variant of the provided seed that maintains its structural integrity. A hedged parsing sketch is shown below.
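Here is a hedged sketch of such response handling, assuming the hex encoding of Section 3.2.2: anything that is not a clean sequence of two-digit hex tokens (e.g. the 'Ox00Ox00' outputs mentioned above) is discarded.

```python
import re
from typing import Optional

HEX_BODY = re.compile(r"^(?:[0-9a-fA-F]{2}\s*)+$")

def parse_response(text: str) -> Optional[bytes]:
    body = text.strip()
    if not body or not HEX_BODY.match(body):
        return None        # void or garbled output (e.g. 'Ox00Ox00'): drop it
    return bytes.fromhex(re.sub(r"\s+", "", body))
```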
3.4 Integrating the fuzzer and the LLM
Speed is paramount for greybox fuzzers, which are capable of executing hundreds or even thousands of seeds per second [40, 43]. Any additional process integrated into a greybox fuzzer, such as LLM-based mutation, could impair overall throughput and negatively impact fuzzing performance. In particular, LLM generation is slower and more resource-intensive, primarily requiring substantial GPU resources.
To address this speed mismatch, we design an asynchronous approach for communication between the LLM and the fuzzer. As discussed in Section 3.2.2, conversion between binary and hex is fast and robust. Specifically, the workflow involves the following steps: initially, the current seed is converted into a hexadecimal representation; then, the fuzzer sends the current seed to the LLM while also trying to receive seeds from the LLM; once the LLM receives a seed, it performs the mutation and sends the newly generated seed back to the fuzzer for further testing. The whole asynchronous process eliminates any waiting time, allowing the fuzzer to continue processing at high speed without waiting for the LLM to complete its mutation tasks. By separating the fuzzing process from the slower LLM mutation, we ensure that the integration of the LLM enhances the fuzzer's capabilities without compromising its efficiency. The sketch below illustrates this hand-off.
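The following is a minimal sketch of the asynchronous hand-off, using an in-process bounded queue as a stand-in for the real fuzzer-to-LLM channel (the actual system bridges a C fuzzer and a GPU process; the queue bound of 30 is taken from Section 4).

```python
import queue

to_llm = queue.Queue(maxsize=30)   # bounded, per Section 4
from_llm = queue.Queue()

def llm_worker(mutate_with_llm):
    # Runs in its own thread/process next to the GPU; consumes seeds slowly.
    while True:
        hex_seed = to_llm.get()
        from_llm.put(mutate_with_llm(hex_seed))

def fuzzer_step(current_seed_hex: str):
    # Called from the fast fuzzing loop; never blocks on the LLM.
    try:
        to_llm.put_nowait(current_seed_hex)   # hand off the current seed
    except queue.Full:
        pass         # LLM is saturated; a real queue might evict stale seeds
    try:
        return from_llm.get_nowait()          # harvest a finished mutation
    except queue.Empty:
        return None                           # nothing ready; keep fuzzing

# Start the worker with e.g.:
# threading.Thread(target=llm_worker, args=(my_llm_mutator,), daemon=True).start()
```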
4 EXPERIMENT
To evaluate the potential of LLMs in addressing the limitations of traditional fuzzing approaches for structured data, we implement LLAMAFUZZ by extending AFL++ (version 61e27c6). As described in Section 3.4, we develop an asynchronous approach for communication between the fuzzer and the LLM. Moreover, we revise the AFL++ source code (e.g., afl-fuzz-bitmap.c) to incorporate LLAMAFUZZ's functionalities, such as the message queue and seed evaluation. To address the generation speed mismatch between the fuzzer and the LLM, we limit the message queue length to 30. This adjustment ensures the LLM always mutates the most recent seeds, maintaining the relevancy and effectiveness of mutations. The source code and related artifacts of LLAMAFUZZ will be made publicly available for the research community upon acceptance.
In order to collect sufficient and diverse fine-tuning data while satisfying the LLM's token limit, we set the maximum length of each hex pair to 4096, corresponding to the maximum token capacity of the LLM. In a preliminary experiment, we observed that many structured data formats occupy large file sizes, e.g., PDF files. Moreover, increasing the maximum token length allows the inclusion of a more diverse range of data, enhancing the model's learning and application potential.
As for the LLM, we employ llama-2-7b-chat-hf [33], one of the state-of-the-art LLMs that is powerful yet efficient on the hardware. Following previous work [27, 36], we select a relatively low temperature of 1.0 to mutate the structured data for precise and factual responses. In addition, we adopt model quantization [26] to mixed-precision 16-bit floating point (fp16) and enable LoRA [13] to freeze some of the parameters, increasing training and inference speed without sacrificing too much accuracy.
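A hedged sketch of this inference setup with Hugging Face transformers and peft: the base model is loaded in fp16 and the fine-tuned LoRA adapters are applied on top; the adapter path and generation length are placeholders.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto")  # fp16; needs accelerate
model = PeftModel.from_pretrained(model, "llamafuzz-sft")  # placeholder adapters

def llm_mutate(prompt: str, max_new_tokens: int = 2048) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=1.0,
                         max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, not the echoed prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```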
We investigate the following research questions to reveal the power of LLAMAFUZZ:
(1) RQ1: Is LLAMAFUZZ state-of-the-art? How does it perform on standard benchmarks such as Magma?
(2) RQ2: How does LLAMAFUZZ perform against similar fuzzers on real-world open-source programs?
(3) RQ3: How does LLAMAFUZZ augment the integrated fuzzer?
To answer these questions, we first design experiments to examine whether LLAMAFUZZ is state-of-the-art. We employ Magma V1.2 [11], a ground-truth fuzzing benchmark suite based on real programs with real bugs. Table 1 outlines the details of the fuzz targets; its columns indicate the project and version, the fuzz target, and the expected file format.
Table 1: Target information. The programs tested under Magma are utilized in their default versions as provided by Magma. For programs included in the real-world bench, we specify the exact versions used by listing the Git SHA (Secure Hash Algorithm) identifiers behind each project name.

Project & version  | Fuzz target                          | File format
--- Magma V1.2 ---
libpng             | libpng_read_fuzzer                   | PNG
libsndfile         | sndfile_fuzzer                       | Audio
libtiff            | tiff_read_rgba_fuzzer, tiffcp        | TIFF
libxml2            | xml_read_memory_fuzzer, xmllint      | XML
lua                | lua                                  | Lua
openssl            | asn1, asn1parse, bignum, server, client, x509 | Binary blobs
php                | json                                 | JSON
                   | exif                                 | EXIF
                   | unserialize                          | Serialized object
                   | parser                               | PHP
poppler            | pdf_fuzzer, pdfimages, pdftoppm      | PDF
sqlite3            | sqlite3_fuzz                         | SQL query
--- Real-world bench ---
bloaty 34f4a66     | fuzz_target                          | ELF, Mach-O, WebAssembly
zlib 0f51fb4       | zlib_uncompress_fuzzer               | Zlib compressed
binutils 7320840   | fuzz_nm, fuzz_objcopy, fuzz_readelf  | ELF
                   | fuzz_strings                         | String
grok b9286c2       | grk_decompress_fuzzer                | JPEG 2000
kamailio 3f774f3   | fuzz_parse_msg                       | sip_msg
                   | fuzz_uri                             | URI
libavc 828cdb7     | avc_dec_fuzzer                       | AVC
                   | mvc_dec_fuzzer                       | MVC
                   | svc_dec_fuzzer                       | SVC
openh264 1c23887   | decoder_fuzzer                       | H.264/MPEG-4 AVC
libhevc d0897de    | hevc_dec_fuzzer                      | HEVC
freetype2 cd02d35  | ftfuzzer                             | TTF, OTF, WOFF
We choose Magma for several reasons. First, Magma involves a wide range of popular programs in real-world environments, including 9 libraries and 21 objects. Second, unlike LAVA-M [6], which primarily employs synthetic bugs and magic byte comparisons, Magma offers a diverse range of real vulnerabilities, with a total of 138 bugs spanning integer errors, divide-by-zero faults, memory overflows, use-after-free, double-free, and null-pointer dereference scenarios. It incorporates real-world bugs from older versions of software into their latest releases, ensuring the benchmark's relevance and practical applicability. Third, Magma includes programs that process diverse structured data formats such as images, audio files, XML, programming languages, and PDFs.

In the first experiment, we compare LLAMAFUZZ with AFL++, MOptAFL, Honggfuzz, and Fairfuzz. Except for AFL++, all baseline fuzzers are provided in Magma. We used a more recent version of AFL++ (version 61e27c6) than the one provided in Magma, ensuring that we have access to the latest enhancements.
AFL++ was selected as the reference competitor for its standing as a state-of-the-art greybox fuzzer that incorporates numerous improvements and functional enhancements over the original AFL. Moreover, since LLAMAFUZZ was developed on top of AFL++, every observed difference between LLAMAFUZZ and AFL++ can be attributed to our implementation of LLM mutation.
To answer RQ2 (how does LLAMAFUZZ perform against similar fuzzers on real-world open-source programs?), we conducted an evaluation using a selection of real-world programs sourced from OSS-Fuzz [28]. For fairness and consistency in the evaluation process, we selected a common subset of the projects in OSS-Fuzz and in FuzzBench [24]. The specific applications chosen for this study are detailed in Table 1. The chosen benchmark encompasses diverse open-source programs that process different structured data in their latest versions. Our selection follows three criteria. First, the benchmark should cover diverse structured formats. Second, the programs should handle complex structured data. Third, the programs should be popular and important. To ensure a fair comparison among fuzzing tools, we utilized FuzzBench [24], an evaluation framework that employs Docker containers to standardize the testing environment for each fuzzer. This setup guarantees fairness, as all fuzzers operate under identical conditions, thus ensuring the comparability of results.
Regarding the build process for the targets we selected from OSS-Fuzz, FuzzBench, and Magma, we follow the standard instructions provided by the benchmark developers. For real-world programs tested under OSS-Fuzz, we utilized the default initial seed corpus as outlined by OSS-Fuzz [28]. Similarly, during our experiments with the Magma benchmark, we selected the default initial seeds specified by the benchmark developers. This approach ensures our experiments align with the recommended practices and maintains consistency across all tests.
4.1 Variables and Measures
To evaluate the effectiveness of LLAMAFUZZ against the baseline fuzzers, we adopt the settings of the Magma benchmark and use the number of bugs and the time to reach/trigger bugs as indicators. Additionally, we use branch coverage as the code coverage indicator, generated by afl-cov, the default in AFL++. As suggested by Magma [11], each experiment lasts 24 hours. LLAMAFUZZ is repeated 3 times due to limited GPU resources, while the other fuzzers are repeated 10 times.
4.2 Experimental results on Magma
We utilized the bug-based benchmark Magma V1.2 to evaluate LLAMAFUZZ. We ran LLAMAFUZZ for 3 repetitions of 24 hours each, comparing against 10 repetitions of the SOTA fuzzers (AFL++, MOptAFL, Honggfuzz, and Fairfuzz) over the same duration.
4.2.1 Performance on bugs triggered. Simply covering the branches in the vulnerable code does not mean that the program is in the correct state to trigger the bug. Hence, we assessed LLAMAFUZZ's performance against other popular fuzzers by checking whether it is able to discover more bugs on Magma.
Figure 4: Bugs triggered by each fuzzer among trials and fuzz targets. (a) Distribution of the average number of bugs triggered over 24 hours by LLAMAFUZZ, AFL++, MOptAFL, Honggfuzz, and Fairfuzz; the Y-axis shows the number of bugs triggered. (b) Arithmetic mean number of bugs identified for each project per trial (libpng, libsndfile, libtiff, libxml2, lua, openssl, php, poppler, sqlite3); the black line denotes the 95% confidence interval.
The results are presented in Figure 4a, which illustrates the distribution of bugs triggered by LLAMAFUZZ, AFL++, MOptAFL, Honggfuzz, and Fairfuzz at the end of the 24-hour trials. According to Figure 4a, LLAMAFUZZ outperforms all other fuzzers in terms of the average number of bugs triggered per trial. The result can be attributed to the LLM's pre-training knowledge about the overall structure of data formats. Further enhancement comes from the fine-tuning process, where the LLM learns specific data formats and mutation conversions for the needs of fuzzing, thereby advancing LLAMAFUZZ in bug triggering. These results highlight LLAMAFUZZ's competitiveness and robustness relative to the SOTA in bug-triggering capabilities.
To further investigate the performance of LLAMAFUZZ across different fuzzing targets, Figure 4b presents the arithmetic mean number of bugs identified for each project per trial per day. According to the results, LLAMAFUZZ triggers the most unique bugs among the evaluated fuzzers. It discovered 47 unique bugs in Magma, while AFL++, MOptAFL, Honggfuzz, and Fairfuzz found 46, 42, 37, and 31, respectively. Vulnerabilities were found in 9 tested implementations and encompass various types of memory vulnerabilities, including use-after-free, buffer overflow, and memory leaks. Notably, SQL003, XML006, and XML002 were never found by any other fuzzer. In terms of library-specific performance, LLAMAFUZZ ranks #1 on libsndfile, libtiff, libxml2, lua, php, and sqlite3; #2 on libpng; and #3 on openssl and poppler.
Compared to Honggfuzz, LLAMAFUZZ found three unique bugs in libpng. We investigated the missing bug, PNG001: it can be triggered by falsely calculating the row_factor, leading to a memory leak, especially for large, high-dimensional, or multi-channel images. In poppler, LLAMAFUZZ found only six unique bugs, while AFL++ and MOptAFL found seven. This is because each PDF seed is large, and most PDF seeds exceed the maximum token length that the LLM can handle.
To understand the contributions of the LLM mutation, we conducted a more detailed investigation. Bug XML006 (CVE-2017-9048) demonstrates that randomness-only mutation is insufficient; a comprehensive understanding of structure is necessary. XML006 is a stack-based buffer overflow vulnerability in libxml2; to trigger it, the mutator must recursively dump the element content definition into a char buffer buf of size size. At the end of the routine, the mutator appends two more characters to exceed the size. In our experiment, only LLAMAFUZZ triggered this bug, demonstrating that LLAMAFUZZ can equip host fuzzers with the capability of finding bugs.
4.2.2 Performance on bug trigger time. Consequently, we list all the unique bugs triggered, including the bug ID and the expected time to trigger each, in Figure 5. The reported time accounts for missed measurements (where the fuzzer only triggers a bug in M out of N campaigns) by fitting the distribution of time-to-bug samples to an exponential distribution [11].
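For intuition, the sketch below shows one standard estimator consistent with that description: a maximum-likelihood fit of an exponential distribution with right-censoring, where the N-M campaigns that never triggered the bug are censored at the 24-hour budget. This approximates, but is not necessarily identical to, Magma's exact procedure.

```python
def expected_time_to_bug(trigger_times, n_campaigns, budget=24 * 3600):
    # MLE for an exponential rate with right-censored samples:
    # lambda_hat = M / (sum of observed times + (N - M) * budget),
    # so the expected time is 1 / lambda_hat.
    m = len(trigger_times)          # campaigns that triggered the bug (M)
    if m == 0:
        return float("inf")         # never triggered within the budget
    censored = (n_campaigns - m) * budget
    return (sum(trigger_times) + censored) / m

# Example: a bug seen after 2 h and 5 h in 2 of 3 campaigns:
# expected_time_to_bug([7200, 18000], 3) -> (25200 + 86400) / 2 = 55800 s
```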
Compared to AFL++, LLAMAFUZZ triggers a greater number of bugs and significantly speeds up bug discovery; in Figure 5, deeper blue shades indicate faster bug triggering. Specifically, LLAMAFUZZ achieved significant speed-ups on 29 of the 43 bugs triggered by both LLAMAFUZZ and AFL++, with the remaining bugs exhibiting similar trigger times. In comparison to MOptAFL, Honggfuzz, and Fairfuzz, LLAMAFUZZ triggered bugs faster in 25, 23, and 21 cases, respectively. Overall, the results indicate a substantial advantage of LLAMAFUZZ over AFL++, MOptAFL, Honggfuzz, and Fairfuzz in exploring bugs.
In summary, LLAMAFUZZ reached 85 unique bugs, triggered 47 unique bugs, and triggered 41 bugs on average on Magma, the most compared with the other state-of-the-art fuzzers. In addition, LLAMAFUZZ triggers bugs faster. Therefore, we can answer RQ1 with confidence: LLAMAFUZZ is state-of-the-art.
Figure 5: Heatmap of expected bug trigger times achieved by LLAMAFUZZ, AFL++, MOptAFL, Honggfuzz, and Fairfuzz at the end of each 24-hour trial. Within each block, more intense blue shades denote shorter trigger times.
4.3 Experimental results on real-world programs
While performing well on Magma is sufficient to claim that LLAMAFUZZ is state-of-the-art, we are committed to further validating its efficacy on real-world applications. To this end, we have selected a series of open-source programs currently in production to conduct a comprehensive evaluation. This step is crucial for demonstrating the practical effectiveness of LLAMAFUZZ's methodologies and techniques on various file formats under real-world conditions. To maintain consistency and fairness in our comparisons, all branch coverage metrics reported are generated using afl-cov. This uniform approach ensures that discrepancies in branch counting across different tools do not affect the integrity of our results.
Figure 6 shows the distribution of branch coverage achieved by LLAMAFUZZ and the baseline AFL++ over 24 hours. In 10 out of 15 targets, LLAMAFUZZ shows significant improvement over AFL++ in terms of code coverage. Further detail is provided in Table 2, which reports the average branch coverage and the percentage improvement in average branch coverage over the same timeframe (see column Improv.).
Specifically, LLAMAFUZZ shows significant coverage improvements on bloaty (4.37%), binutils-fuzz_nm (54.81%), binutils-fuzz_objcopy (78.64%), binutils-fuzz_readelf (48.19%), binutils-fuzz_strings (21.65%), grok (62.12%), kamailio-fuzz_parse_msg (39.06%), libavc-mvc_dec (8.94%), libavc-svc_dec (84.48%), and freetype2-ftfuzzer (5.45%). These results underscore LLAMAFUZZ's effectiveness in enhancing coverage across a variety of applications.
Table 2: FuzzBench branch coverage

Fuzz target | Fuzz object              | LLAMAFUZZ (avg) | AFL++ (avg) | Improv.
bloaty      | fuzz_target              | 5972            | 5722        | 4.37%
zlib        | zlib_uncompress_fuzzer   | 384             | 385         | -0.47%
binutils    | fuzz_nm                  | 13958           | 9017        | 54.81%
            | fuzz_objcopy             | 22318           | 12494       | 78.64%
            | fuzz_readelf             | 6576            | 4437        | 48.19%
            | fuzz_strings             | 6442            | 5295        | 21.65%
grok        | grk_decompress_fuzzer    | 3750            | 2313        | 62.12%
kamailio    | fuzz_parse_msg           | 3743            | 2692        | 39.06%
            | fuzz_uri                 | 1392            | 1391        | 0.04%
libavc      | avc_dec_fuzzer           | 9872            | 9838        | 0.35%
            | mvc_dec_fuzzer           | 6463            | 5933        | 8.94%
            | svc_dec_fuzzer           | 11812           | 6403        | 84.48%
openh264    | decoder_fuzzer           | 7394            | 7396        | -0.03%
libhevc     | hevc_dec_fuzzer          | 15154           | 15122       | 0.21%
freetype2   | freetype2-ftfuzzer       | 10521           | 9978        | 5.45%
Average     | -                        | -               | -           | 27.19%
However, performance on the other objects showed little to no advantage, for example on zlib, kamailio-fuzz_uri, and openh264. One possible reason is that the seed sizes exceed what the LLM can effectively process, hindering its ability to generate useful mutations. Another factor might be a deficiency in training data specific to those targets, which could limit the LLM's ability to learn and apply effective mutations in those formats.
In summary, LLAMAFUZZ outperforms the state-of-the-art fuzzer AFL++ in terms of code coverage, reaching 27.19% more branches on average. Thus, we can answer RQ2 with confidence: LLAMAFUZZ is state-of-the-art on real-world open-source programs.
4.4 Case study: how the LLM augments fuzzing
We have demonstrated the superiority of LLAMAFUZZ on a standard benchmark and on real-world programs. Moving forward, we would like to investigate how the LLM augments the fuzzing process. In the fuzzing process, seeds that trigger new behavior are regarded as valuable and used for further fuzzing. Therefore, understanding the relationship between these seeds and code coverage improvements is crucial for optimizing the fuzzing process. Figure 7 displays the code coverage and highlights seeds that originated from LLM-generated seeds. The black triangles mark seeds directly generated by the LLM. The subsequent generations of seeds, which are sourced from these LLM seeds, are indicated by red vertical lines.
Figure 7a shows the growth of coverage over time. Within the initial 10,000 seconds, seeds mutated by the LLM directly enhance coverage. As the fuzzing process progresses, seeds derived from LLM-mutated seeds further augment the fuzzing coverage. This indicates that LLM-generated seeds not only directly impact the fuzzing process but also have a profound, indirect influence on its development. When compared to the grey area, which represents the 95% confidence interval of AFL++ coverage, LLAMAFUZZ achieves both higher and faster coverage.
Figure 6: The distribution of final branch coverage achieved by LLAMAFUZZ and AFL++ at the conclusion of each 24-hour trial for each fuzzing object (bloaty-fuzz_target, zlib-zlib_uncompress_fuzzer, binutils-fuzz_nm, binutils-fuzz_objcopy, binutils-fuzz_readelf, binutils-fuzz_strings, grok-grk_decompress_fuzzer, kamailio-fuzz_parse, kamailio-fuzz_uri, libavc-avc_dec_fuzzer, libavc-mvc_dec_fuzzer, libavc-svc_dec_fuzzer, openh264-decoder_fuzzer, libhevc-hevc_dec_fuzzer, freetype2-ftfuzzer). The Y-axis represents the final branch coverage. Within each box, the short bold lines indicate the median final branch coverage.
This observation aligns with the outcomes from the previous experiments conducted on the Magma benchmark.
Additionally, Figure 7b illustrates the coverage dynamics in a different manner. A significant coverage jump can be observed around 17,000 seconds, marked by several LLM-generated seeds indicated with black triangles. These seeds directly contribute to substantial coverage gains. Approaching the second plateau, numerous seeds sourced from LLM seeds further enhance coverage, indicating the lasting benefits of the LLM in the fuzzing process.
In summary, two visual examples explain how the LLM benefits the fuzzing process, in terms of code coverage, by mutating seeds. Therefore, RQ3 can be answered.
5 RELATED WORK
5.1 Fuzzing
Fuzzing is an automated random software testing technique to discover vulnerabilities and bugs in target programs or applications. Traditional fuzzers can be categorized into black-box, white-box, and greybox fuzzers depending on whether the fuzzer is aware of the program structure. A black-box fuzzer treats the target as a black box and is unaware of the program structure. Usually, a black-box fuzzer has a high execution volume since it randomly generates test inputs, but it only scratches the surface. YARPGen [18] applies random mutations while rigorously following language specifications to ensure the validity of test cases for testing C and C++ compilers. Similarly, Csmith [39] generates programs that cover a large subset of C while avoiding undefined and unspecified behaviors.
White-box fuzzers utilize program analysis to improve code coverage and explore certain code regions, which can be efficient for revealing vulnerabilities in complex logic. WhisperFuzz [2] introduces a static analysis method designed specifically to detect and locate timing vulnerabilities in processors. The tool focuses on evaluating the coverage of microarchitectural timing behaviors, providing a targeted and comprehensive assessment that aids in identifying potential security risks associated with timing flaws. However, program analysis and defining specialized seed-generation grammars can be extremely time-consuming. Greybox fuzzers combine the effectiveness of white-box fuzzers and the efficiency of black-box fuzzers. They leverage instrumentation to get feedback from target programs, leading fuzzers to generate more valuable seeds and achieve higher code coverage. Greybox fuzzers are usually combined with mutation strategies that rely on iterative modifications of existing seeds to produce novel fuzzing inputs. In addition to basic mutations, recent researchers have developed complex transformations to maintain type consistency [3, 15], add historical bug-triggering code snippets [12, 42], and use coverage feedback [1, 9] for improved testing efficiency. American Fuzzy Lop (AFL) [41] and its variations [5, 9, 20] employ genetic algorithms with a fitness function to prioritize fuzzing inputs for further mutations aimed at enhancing coverage, concentrating on byte-level changes.
Figure 7: Coverage improvement over experimental time. (a) Coverage growth over experimental time on the binutils-nm target. (b) Coverage growth over experimental time on the kamailio-fuzz_parse_msg target. In both panels, the X-axis is relative time in seconds and the Y-axis is code branch coverage; the black triangles indicate seeds generated by the LLM, the red vertical lines highlight seeds sourced from LLM seeds (nth-iteration descendants), and the grey background indicates the 95% interval of AFL++ coverage among all experiments.
5.2 Fuzzing for structured data
In applications that require structured input, the aforementioned methods might utilize an artificially constructed dictionary or automatically generated corpora to create test cases that meet the format requirements. However, blind random mutation strategies often disrupt the consistency of data formats, leading to the generation of numerous inefficient and ineffective test cases.
Grammar-guided fuzzers can accurately identify the target input format and generate test cases that maintain format consistency. This approach ensures that the generated test cases are not only valid but also effective in triggering and exploring potential vulnerabilities or issues within the application. LangFuzz [12] combines grammar-based fuzz testing with reusing project-specific, issue-related fragments, maintaining format integrity and having a higher chance of causing new problems than random input. QuickFuzz [10] leverages Haskell's QuickCheck and the Hackage package repository to fuzz structured data. This integration, combined with conventional bit-level mutational fuzzers, negates the need for an external set of input files and eliminates the requirement to develop specific models for the file types being tested.
5.3 Augmenting fuzzing through machine learning
Current research primarily concentrates on two directions: employing machine-learning models as generators and leveraging machine-learning models to guide the fuzzing process.
Pérez et al. [25] explored the ability of Compressed-Language Models (CLMs) to interpret files compressed by standard file formats. Their findings revealed that CLMs are capable of understanding the semantics of compressed data directly from the byte streams, opening a new path for processing raw compressed files. In a related study, CHATFUZZ [14] investigates the mutation capabilities of LLMs on text-based seeds, achieving a 12.77% edge coverage improvement over the SOTA greybox fuzzer (AFL++). Similarly, SmartSeed [19] uses deep learning models to generate new inputs for evaluating 12 different applications.
Prior work [7] integrates an LLM-based mutator with a reinforcement learning approach, utilizing the Term Frequency-Inverse Document Frequency technique to develop a weighted coverage map. This method capitalizes on coverage feedback to enhance the effectiveness of the mutation process. Similarly, Xia et al. [37] introduce an auto-prompting phase that employs LLMs to produce and mutate test cases across six programming languages. Their findings indicate that LLMs can surpass the coverage achieved by cutting-edge tools.
Additionally, WhiteFox [38] employs dual LLMs within its framework: one analyzes low-level optimization source code to inform optimization strategies, while the other generates test programs based on this analysis. CHATAFL [22] utilizes LLMs to understand protocol message types and assesses their ability to identify "states" in stateful protocol implementations. LLM4FUZZ [29] leverages LLMs to guide fuzzers towards more critical code areas and input sequences that are more likely to reveal vulnerabilities, showcasing the potential of LLMs in prioritizing and refining fuzzing efforts.
6 CONCLUSION
Mutating input seeds is a crucial step of greybox fuzzing that directly affects fuzzing performance. Although randomized bit-level mutations are effective in many cases, we identify that state-of-the-art mutation-based greybox fuzzers struggle to deal with structured data. This is because current mutation-based fuzzers require exceedingly many attempts to mutate highly structured data while keeping it valid, and the mutation heavily relies on randomness. In this paper, we propose utilizing Large Language Models to learn the patterns of structured data and mutate the seeds. We build LLAMAFUZZ based on this idea and demonstrate that LLMs are effective and efficient at structure-aware mutation. We evaluate LLAMAFUZZ on a ground-truth fuzzing benchmark, Magma, and on a varied set of real-world programs that process structured data. The results are highly promising. LLAMAFUZZ covered 27.19% more code than the state-of-the-art greybox fuzzer AFL++. Furthermore, LLAMAFUZZ outperforms its top competitor by 41 bugs on average and finds 47 unique bugs across all trials.
REFERENCES
[1] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for Deep Bugs with Grammars. In NDSS.
[2] Pallavi Borkar, Chen Chen, Mohamadreza Rostami, Nikhilesh Singh, Rahul Kande, Ahmad-Reza Sadeghi, Chester Rebeiro, and Jeyavijayan Rajendran. 2024. WhisperFuzz: White-Box Fuzzing for Detecting and Locating Timing Vulnerabilities in Processors. arXiv preprint arXiv:2402.03704 (2024).
[3] Stefanos Chaliasos, Thodoris Sotiropoulos, Diomidis Spinellis, Arthur Gervais, Benjamin Livshits, and Dimitris Mitropoulos. 2022. Finding typing compiler bugs. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 183–198.
[4] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
[5] Addison Crump, Andrea Fioraldi, Dominik Maier, and Dongjia Zhang. 2023. LIBAFL LIBFUZZER: LIBFUZZER on Top of LIBAFL. In 2023 IEEE/ACM International Workshop on Search-Based and Fuzz Testing (SBFT). IEEE, 70–72.
[6] Brendan Dolan-Gavitt, Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan. 2016. LAVA: Large-scale automated vulnerability addition. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 110–121.
[7] Jueon Eom, Seyeon Jeong, and Taekyoung Kwon. 2024. CovRL: Fuzzing JavaScript Engines with Coverage-Guided Reinforcement Learning for LLM-based Mutation. arXiv abs/2402.12222 (2024). https://api.semanticscholar.org/CorpusID:267750648
[8] Andrea Fioraldi, Daniele Cono D'Elia, and Emilio Coppa. 2020. WEIZZ: Automatic grey-box fuzzing for structured binary formats. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 1–13.
[9] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association.
[10] Gustavo Grieco, Martín Ceresa, and Pablo Buiras. 2016. QuickFuzz: An automatic random fuzzer for common file formats. ACM SIGPLAN Notices 51, 12 (2016), 13–20.
[11] Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. 2020. Magma: A Ground-Truth Fuzzing Benchmark. Proc. ACM Meas. Anal. Comput. Syst. 4, 3, Article 49 (Dec. 2020), 29 pages. https://doi.org/10.1145/3428334
[12] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code fragments. In 21st USENIX Security Symposium (USENIX Security 12). 445–458.
[13] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[14] Jie Hu, Qian Zhang, and Heng Yin. 2023. Augmenting greybox fuzzing with generative AI. arXiv preprint arXiv:2306.06782 (2023).
[15] Vivek Jain, Sanjay Rawat, Cristiano Giuffrida, and Herbert Bos. 2018. TIFF: Using input type inference to improve fuzzing. In Proceedings of the 34th Annual Computer Security Applications Conference. 505–517.
[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[17] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
[18] Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–25.
[19] Chenyang Lyu, Shouling Ji, Yuwei Li, Junfeng Zhou, Jianhai Chen, and Jing Chen. 2018. SmartSeed: Smart seed generation for efficient fuzzing. arXiv preprint arXiv:1807.02606 (2018).
[20] Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, Yu Song, and Raheem Beyah. 2019. MOPT: Optimized mutation scheduling for fuzzers. In 28th USENIX Security Symposium (USENIX Security 19). 1949–1966.
[21] Ali Madani, Ben Krause, Eric R Greene, Subu Subramanian, Benjamin P Mohr, James M Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z Sun, Richard Socher, et al. 2023. Large language models generate functional protein sequences across diverse families. Nature Biotechnology 41, 8 (2023), 1099–1106.
[22] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS).
[23] Sachit Menon and Carl Vondrick. 2022. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183 (2022).
[24] Jonathan Metzman, László Szekeres, Laurent Maurice Romain Simon, Read Trevelin Sprabery, and Abhishek Arya. 2021. FuzzBench: An Open Fuzzer Benchmarking Platform and Service. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 1393–1403. https://doi.org/10.1145/3468264.3473932
[25] Juan C Pérez, Alejandro Pardo, Mattia Soldan, Hani Itani, Juan Leon-Alcazar, and Bernard Ghanem. 2024. Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration. arXiv preprint arXiv:2405.17146 (2024).
[26] Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668 (2018).
[27] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[28] Kostya Serebryany. 2017. OSS-Fuzz - Google's continuous fuzzing service for open source software. (2017).
[29] Chaofan Shou, Jing Liu, Doudou Lu, and Koushik Sen. 2024. LLM4Fuzz: Guided Fuzzing of Smart Contracts with Large Language Models. arXiv preprint arXiv:2401.11108 (2024).
[30] SweetScape Software. [n. d.]. 010 Editor - Pro Text/Hex Editor | Edit 160+ Formats | Fast & Powerful. https://www.sweetscape.com/010editor/
[31] Prashast Srivastava and Mathias Payer. 2021. Gramatron: Effective grammar-aware fuzzing. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 244–256.
[32] Robert Swiecki. [n. d.]. Honggfuzz: A general-purpose, easy-to-use fuzzer with interesting analysis options. https://github.com/google/honggfuzz
[33] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[34] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl
[35] Jinghan Wang, Yue Duan, Wei Song, Heng Yin, and Chengyu Song. 2019. Be sensitive and collaborative: Analyzing impact of coverage metrics in greybox fuzzing. In 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019). 1–15.
[36] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
[37] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal fuzzing with large language models. Proc. IEEE/ACM ICSE (2024).
[38] Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2023. White-box compiler fuzzing empowered by large language models. arXiv preprint arXiv:2310.15991 (2023).
[39] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 283–294.
[40] Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. 2018. QSYM: A practical concolic execution engine tailored for hybrid fuzzing. In 27th USENIX Security Symposium (USENIX Security 18). 745–761.
[41] M. Zalewski. 2016. American Fuzzy Lop - Whitepaper. (2016). https://lcamtuf.coredump.cx/afl/technical_details.txt
[42] Yingquan Zhao, Zan Wang, Junjie Chen, Mengdi Liu, Mingyuan Wu, Yuqun Zhang, and Lingming Zhang. 2022. History-driven test program synthesis for JVM testing. In Proceedings of the 44th International Conference on Software Engineering. 1133–1144.
[43] Yaowen Zheng, Ali Davanian, Heng Yin, Chengyu Song, Hongsong Zhu, and Limin Sun. 2019. FIRM-AFL: High-Throughput greybox fuzzing of IoT firmware via augmented process emulation. In 28th USENIX Security Symposium (USENIX Security 19). 1099–1114.