Automated Fuzzing Harness Generation for Library APIs and Binary Protocol
Parsers
Chaitanya Rahalkar
School of Cybersecurity and Privacy
Georgia Institute of Technology
cr@gatech.edu
Abstract—Fuzzing is a widely used software security testing
technique that is designed to identify vulnerabilities in systems
by providing invalid or unexpected input. Continuous fuzzing
systems like OSS-FUZZ have been successful in finding security
bugs in many different software systems. The typical process
of finding security bugs using fuzzing involves several steps:
first, the “fuzz-worthy” functions that are likely to contain vul-
nerabilities must be identified; second, the setup requirements
for the API must be understood before it can be called; third,
a fuzzing harness must be written and bound to a coverage-
guided fuzzer like LLVM’s LibFuzzer; and finally, the security
bugs discovered by the fuzzing harness must be triaged and
checked for reproducibility.
This project focuses on automating the first two steps in this
process. In particular, we present an automated system that
can generate fuzzing harnesses for library APIs and binary
protocol parsers by analyzing unit tests. This allows for the
scaling of the fuzzing infrastructure in proportion to the
growth of the codebase, without the need for manual coding
of harnesses. Additionally, we develop a metric to assess the
“fuzz-worthiness” of an API, enabling us to prioritize the most
promising targets for testing.
1. Introduction
The concept of software fuzzing has been around since
the late 1980s, when it was first developed as a way of
testing the robustness of UNIX utilities against random
input. In the 1990s, the use
of fuzzing expanded to include testing of specific software
applications, such as operating systems and web browsers.
Over the past decade, the development of sophisticated
fuzzing algorithms and tools has enabled organizations to
use fuzzing to test a wider range of software systems,
including mobile and cloud-based applications [1].
There are several key principles and techniques that are com-
monly used in software fuzzing. One of the main principles
of fuzzing is the idea of “randomness” or “fuzziness,” which
refers to the use of randomly generated data as inputs to the
system being tested. This can help to uncover vulnerabilities
that may not be found through more traditional, determin-
istic testing methods.
Another key principle of fuzzing is the idea of “coverage,”
which refers to the extent to which the fuzzer is able
to test the different parts of the system being tested. By
generating a large number of inputs and monitoring the
system’s behavior, the fuzzer can help to identify areas of
the system that may be vulnerable to attack [1].
In terms of techniques, there are several different approaches
to software fuzzing. One common technique is mutation-
based fuzzing, which involves modifying existing inputs to
the system in order to generate new fuzz data. This approach
can help to uncover vulnerabilities that may not be apparent
when using the original inputs.
Another technique is generation-based fuzzing, which in-
volves generating new inputs to the system from scratch,
without using existing inputs as a starting point [2]. This
can help to uncover vulnerabilities that may not be found
when using modified versions of existing inputs.
In addition to these techniques, there are also specialized
fuzzing approaches that focus on specific types of soft-
ware or systems. For example, protocol-aware fuzzing is
a technique that is designed to test the specific protocols
or standards that a system uses to communicate with other
systems. This can help to identify vulnerabilities in the way
that the system communicates with other systems, which
may not be apparent when testing the system in isolation
[3].
Overall, software fuzzing is a powerful technique for test-
ing the security and reliability of software systems. It can
help organizations to identify and fix vulnerabilities in their
software before they are exploited, improving the security
and reliability of their systems. In this research paper, we
will explore the concept of automated software fuzzing at
scale, which involves introducing automation in the end-to-
end fuzzing process. This allows us to scale fuzzing with
the codebase and uncover memory corruption bugs early and
proportionally as the codebase grows.
Some key terminologies related to software fuzzing include:
1) Fuzzer: This is the tool or software that is used
to generate and inject the random data, or “fuzz,”
into the system being tested. Examples of popular
fuzzers include AFL, LibFuzzer, Honggfuzz, and
Peach.
2) Fuzzing algorithm: This is the set of rules or
instructions that the fuzzer uses to generate the fuzz
data. The algorithm may be designed to mimic
specific patterns of data or to test the system in
particular ways.
3) Fuzzing target: This is the specific software or
system that is being tested using the fuzzer.
4) Fuzzing seed: This is the initial data that is used to
generate the fuzz data. The seed may be taken from
existing inputs to the system, or may be generated
randomly.
5) Fuzzing result: This is the output or behavior of the
system being tested when it is subjected to the fuzz
data. The result may be used to identify potential
vulnerabilities or flaws in the system.
6) Fuzzing harness: A fuzzing harness is a piece of
code that is used to run software fuzzing tests.
The fuzzing harness provides the infrastructure for
generating and injecting the fuzz data into the
system being tested, as well as for monitoring
the system’s behavior and collecting the results of
the tests. The fuzzing harness may also include
additional features, such as support for distributed
computing or intelligent algorithms, to improve the
efficiency and effectiveness of the fuzzing tests. For
example, the paper “Fuzzing@Home: Distributed
Fuzzing on Untrusted Heterogeneous Clients” [4]
talks about using distributed computing for fuzzing
and software security testing.
There are various tools and techniques that can be used for
software fuzzing. Some common types of software fuzzing
include:
1) Mutation-based fuzzing: This is a technique in
which the fuzzer modifies existing inputs to the
system in order to generate new fuzz data. The
modifications may involve changing, deleting, or
adding data to the inputs (a minimal sketch follows
this list).
2) Generation-based fuzzing: This is a technique in
which the fuzzer generates new inputs to the system
from scratch, without using existing inputs as a
starting point. The inputs may be generated ran-
domly or according to a specific algorithm.
3) Protocol-aware fuzzing: This is a technique that
focuses on testing the specific protocols or stan-
dards that a system uses to communicate with other
systems. The fuzzer generates fuzz data that is
designed to mimic or violate the protocols in order
to test the system’s behavior.
4) Intelligent fuzzing: This is a technique that uses
machine learning or artificial intelligence algo-
rithms to generate fuzz data that is more likely to
uncover vulnerabilities in the system. The fuzzer
may learn from previous test results and adjust its
fuzzing strategy accordingly [5].
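To make mutation-based fuzzing concrete, the following is a minimal sketch (our illustration, not a component of the system described in this paper) of a mutator that derives a new test case from an existing seed input:

#include <cstdint>
#include <random>
#include <vector>

// Derive a new test case from a seed by applying small random edits:
// flip one bit, overwrite one byte, and occasionally truncate the tail.
std::vector<uint8_t> mutate(std::vector<uint8_t> input, std::mt19937 &rng) {
  if (input.empty()) return input;
  std::uniform_int_distribution<size_t> pos(0, input.size() - 1);
  std::uniform_int_distribution<int> byte(0, 255);
  input[pos(rng)] ^= static_cast<uint8_t>(1u << (byte(rng) % 8)); // bit flip
  input[pos(rng)] = static_cast<uint8_t>(byte(rng));              // byte overwrite
  if (byte(rng) < 32) input.resize(pos(rng) + 1);                 // rare truncation
  return input;
}

A coverage-guided fuzzer applies such operators repeatedly, keeping any mutated input that exercises new code as a fresh seed.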
2. Motivation
Automated software fuzzing is a technique for finding
vulnerabilities in software by feeding it large amounts of
random data, or “fuzz,” and monitoring its behavior. This
approach can help identify potential security flaws that may
not be discovered through traditional testing methods.
As software systems continue to grow in complexity and
scale, the need for effective testing strategies becomes even
more important. Manual testing can be time-consuming
and labor-intensive, and is not always able to keep pace
with the rapid development of new software. Automated
fuzzing, on the other hand, can provide a more efficient and
comprehensive approach to testing, enabling organizations
to identify and fix vulnerabilities before they are exploited.
Moreover, automated fuzzing at scale has the potential
to uncover previously unknown vulnerabilities in large,
complex software systems. By running multiple fuzzing
tests concurrently, organizations can test a wider range of
potential inputs and scenarios, increasing the chances of
finding previously undiscovered vulnerabilities. Liang et al.
have talked about the effectiveness of large scale fuzzing in
their paper on using scaled fuzzing for mobile app testing
[6]. This approach can also be used to test software that
is distributed across multiple systems, such as cloud-based
services, further increasing its effectiveness.
In addition to its potential for improving the security of
software systems, automated fuzzing at scale also has the
potential to save organizations time and resources. By
automating the testing process, organizations can reduce
the need for manual testing and free up their staff to focus
on other tasks. This can help organizations to be more agile
and responsive to changing market conditions, enabling
them to release new software more quickly and efficiently.
Furthermore, automated fuzzing at scale has the potential to
benefit the wider community by helping to identify and fix
vulnerabilities in open-source software. By running large-
scale fuzzing tests on open-source software, organizations
can help to improve the security of this software, making
it more robust and reliable for all users.
In general, software APIs are fuzzed with a coverage-guided
fuzzer like LLVM’s LibFuzzer. Before fuzzing them, we
typically write a test harness that performs any required
initializations, such as database setup, configuration setup,
or server setup in a client-server system. In short, the
harness acts as an entry point before fuzzing begins. In the
case of fuzzing a binary protocol parser, the harness models
the binary input produced by the fuzzer according to the
protocol specification and then supplies it to the parser so
that the input is considered valid. In short, the harness
bridges the gap between the mutated binary input produced
by the fuzzer and the structured data input required by the
function.
The primary goal of this project is to introduce automation
in the end-to-end fuzzing approach. This project includes
two kinds of automated harness generation: one for library
APIs and one for binary protocol parsers.
From an industry standpoint, large software
corporations tend to have massive codebases with millions
of APIs, with new ones being added every day. Writing
fuzzing harnesses for such APIs by hand becomes a major
challenge as the system grows and scales. Before writing a
fuzzing harness, we need
to understand the setup requirements for the API. Unit tests
are the best way to understand these requirements. Usually,
we analyze and understand the setup code of unit tests and
replicate the setup behavior of a unit test in a fuzzing
harness. Here, we aim to develop an automated system to
generate fuzzing harnesses by analyzing unit tests so that
the fuzzing infrastructure can be scaled proportionally to the
codebase growth without manually writing these harnesses.
While we can write a scalable system that automatically
generates fuzzing harnesses for every API in the codebase,
this would be highly undesirable and inefficient. Not every
API that is written can be considered “fuzz worthy”. Some
library APIs are straightforward and perform simple CRUD
operations on data. Fuzzing these APIs has a very low
chance of finding memory corruption issues unless one
exists in the underlying frameworks or systems used to
perform these operations.
Therefore, another big challenge with fuzzing is
determining which library APIs are ideal candidates for
fuzzing. Typically, we are interested in fuzzing APIs that
implement some kind of parsing logic, low-level memory
manipulation etc. Scoping and choosing useful API targets
requires manual code audit from a security standpoint.
Manual scoping through millions of API endpoints can be
one of the reasons for an imbalance between the growth of
the codebase (new APIs) and the number of automatically
generated useful fuzzing targets. Hence, we also aim to
develop a metric to help us understand the likelihood of
finding security bugs in the given API. By analyzing unit
tests and the code of the underlying library API, we’ll be
able to quantify the fuzz-worthiness of an API thereby
making a decision of whether it should be considered
for continuous fuzzing. Coupling this with the automatic
harness generator, we can narrow down to the API targets
that can potentially lead to memory corruption bugs.
Moving on to binary protocol parsers: in order to fuzz
them for memory corruption bugs, we must first ensure that
the binary input from the fuzzer conforms, at least to some
extent, to the protocol specification. For instance, in the case
of an image fuzzer, a basic requirement is to produce an
image with a valid header. Writing fuzzing harnesses
for binary protocol parsers requires manual reviews of the
RFC specifications to understand how the protocol works.
We can use a declarative binary protocol language like
Kaitai Struct to express the protocol definition in a
structured format and thereby automatically generate a
harness to fuzz binary protocols from this definition [7].
By automating all these tasks, the only part in this end-to-
end fuzzing approach that remains manual is the process of
triaging these bugs and understanding the impact of these
security issues discovered by fuzzing. Through this project,
we attempt to significantly shorten the end-to-end fuzzing
cycle and allow security engineers to focus on the more
important phases of this system.
3. Related Work
In recent years, there has been a growing body of
research on the use of automated software fuzzing at scale.
This research has focused on the development of new algo-
rithms and tools for running multiple fuzzing tests concur-
rently, in order to test a wider range of inputs and scenarios.
One key area of research in this field has been the de-
velopment of intelligent fuzzing algorithms that can learn
from previous test results and adjust their fuzzing strategy
accordingly. For example, one study proposes a machine learning-based
approach to intelligent fuzzing, in which the fuzzer uses
reinforcement learning to learn from previous test results
and adapt its fuzzing strategy in real-time. This approach
can help to improve the efficiency and effectiveness of
fuzzing by focusing on inputs that are more likely to uncover
vulnerabilities [8].
Another area of research has focused on the use of dis-
tributed computing techniques for running fuzzing tests at
scale. For example, one such effort proposes a distributed fuzzing framework
that uses a cluster of computers to run multiple fuzzing tests
concurrently. This approach can help to increase the speed
and throughput of fuzzing, enabling organizations to test
larger and more complex software systems more quickly.
In addition to these research efforts, there has also been a
focus on developing new tools and frameworks for running
fuzzing tests at scale. For example, the American Fuzzy Lop
(AFL) fuzzer is a popular open-source tool that has been
used in a variety of research studies on software fuzzing
[9]. AFL is a mutation-based fuzzer that uses a variety of
algorithms and techniques to generate new inputs and test
the behavior of the system being tested.
Overall, this research on automated software fuzzing at
scale has demonstrated the potential benefits and challenges
of this approach. By using intelligent algorithms and dis-
tributed computing techniques, organizations can run mul-
tiple fuzzing tests concurrently in order to test a wider
range of inputs and scenarios. This can help to identify
and fix vulnerabilities in software systems more quickly and
efficiently, improving the security and reliability of these
systems. In this research paper, we will explore these issues
in more detail and propose new approaches for automated
software fuzzing at scale.
Few papers have explored the idea of adding automation
to the fuzzing process as a whole. Mingrui Zhang et al., in
their paper “IntelliGen: Automatic Driver Synthesis for
Fuzz Testing,” discuss a framework that automatically
generates fuzzing harnesses through hierarchical parameter
replacement and type inference [10]. The effectiveness of
IntelliGen was evaluated on real-world programs, and it was
found to cover more basic blocks and paths than existing
methods, perform on par with manually written drivers, and
find more bugs. Another paper, by Kyriakos Ispoglou et al.,
presents a tool called FuzzGen that automatically
synthesizes fuzzers by analyzing complex codebases. It uses
whole-system analysis to infer the library’s interface and
generates fuzzers specifically for that library. FuzzGen was
evaluated on several libraries and found 17 previously
unpatched vulnerabilities, achieving an average code
coverage of 54.94%, an improvement over manually written
fuzzers. Fioraldi et al. have discussed automatic harness
generation for binary formats, but limit their approach to
specific sets of binary protocols [11]. They also do not
consider declarative binary protocol languages, which
would standardize the process of generating a harness for
potentially any protocol and can handle very complex
conditions and structures. Most of these papers focus on
automation for research purposes and introduce it in such
specific and niche parts of the fuzzing system that they
cannot be scaled to massive codebases. This paper attempts
to solve these research problems at scale, tackling two areas
of the end-to-end fuzzing process that can dramatically
improve the automation of the fuzzing system.
4. Methodology
The system is divided into four components, three of
which work together: the Structured Input Mutator, the Unit
Test Analyzer, and the Fuzz-Worthiness algorithm. The
binary protocol-based harness generator is a separate
component that is specifically used to produce fuzzing
harnesses for binary protocol parsers. The system flow is
illustrated by the flowchart in fig. 1, which shows how the
three components work collectively to introduce automation
into the fuzzing process. The fuzz-worthiness algorithm
scans the entire codebase for bug-worthy characteristics,
which are explained later in this section. It then decides
whether a particular piece of code is fuzz-worthy based on
its evaluation score (between 0 and 1). We set a threshold
that allows us to classify a piece of code as fuzz-worthy or
not.
4.1. Structured Input Mutator or the Fuzz Data
Producer
In order to generate a fuzzing harness based on an
existing unit test, we need to analyze which APIs or library
functions are called in the unit test and what the input
parameters to these functions are. Once we identify these
input parameters, we need a structured data mutator to
convert the binary input generated by the fuzzer into the
parameter-specific input accepted by the APIs or functions.
For example, if a library function accepts an integer, a
string, and a boolean, we need a mutator that first turns the
binary input produced by the fuzzer into an integer, a string,
and a boolean. Once we have the structured input, we can
call the APIs or library functions through the fuzzing
harness, as shown in the data flow diagram in fig. 2 and in
the sketch below.
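As a concrete illustration of this step, LLVM’s FuzzedDataProvider header, which ships with LibFuzzer, carves typed values out of the raw byte buffer in exactly this way. The sketch below assumes a hypothetical library_function taking an integer, a string, and a boolean; it is an example of the pattern, not the system’s actual code:

#include <cstdint>
#include <string>
#include <fuzzer/FuzzedDataProvider.h>

// Hypothetical fuzzing target taking structured parameters.
extern void library_function(int count, const std::string &name, bool strict);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  FuzzedDataProvider fdp(Data, Size);

  // Convert the fuzzer's raw bytes into the parameter types the API expects.
  int count = fdp.ConsumeIntegralInRange<int>(0, 65535);
  bool strict = fdp.ConsumeBool();
  std::string name = fdp.ConsumeRemainingBytesAsString();

  library_function(count, name, strict);
  return 0;
}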
Figure 1. Flowchart of the Fuzzing System in Action
Figure 2. Data Flow Diagram of the Fuzzing System
4.2. Unit Test Analyzer
In this project, we use a unit test analyzer tool to auto-
matically generate fuzzing harnesses for C/C++ codebases
that use the GTest framework. The unit test analyzer takes
in the C++ unit test code as input and generates an Abstract
Syntax Tree (AST) using the ANTLR parser generator. This
AST is then used to extract information about the library
functions and APIs that are called in the unit test, as well
as the variables that are used as input to these functions
[12].
Once the relevant information has been extracted from the
AST, the unit test analyzer uses this information to generate
the fuzzing harness code. The generated harness code uses
input from the structured data mutator to fuzz the library
functions and APIs that were identified in the unit test code.
This approach allows us to automatically generate fuzzing
harnesses that are tailored to the specific codebase being
tested, without the need for manual coding or configuration.
Overall, the unit test analyzer is a crucial component of this
project, as it enables us to automatically generate effective
and efficient fuzzing harnesses for C/C++ codebases that use
the GTest framework. By using the AST generated by the
ANTLR parser generator, we are able to extract the relevant
information from the unit test code and use it to generate the
fuzzing harness code. This approach simplifies the process
of writing fuzzing harnesses and allows us to quickly and
easily test the robustness of C/C++ code.
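As an illustrative sketch of this pipeline, the analyzer can be thought of as reducing each unit test to a small record and expanding that record into harness code. The record layout and emitter below are our assumptions for exposition, not the tool’s actual data model:

#include <string>
#include <vector>

// Information extracted from a unit test's AST: the library call under
// test, its argument types, and the setup statements that precede it.
struct FuzzCandidate {
  std::string callee;                  // e.g. "ParseConfig"
  std::vector<std::string> arg_types;  // e.g. {"int", "string"}
  std::vector<std::string> setup;      // setup statements copied verbatim
};

// Emit a LibFuzzer harness: replay the setup, then replace each literal
// argument from the unit test with the matching FDP utility call.
std::string emit_harness(const FuzzCandidate &fc) {
  std::string out =
      "extern \"C\" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {\n"
      "  MutateInput(Data, Size, seed(0));\n";
  for (const std::string &stmt : fc.setup) out += "  " + stmt + "\n";
  out += "  " + fc.callee + "(";
  for (size_t i = 0; i < fc.arg_types.size(); ++i) {
    out += (fc.arg_types[i] == "int") ? "fdp.get_int(0, 65535)"
                                      : "fdp.get_string()";
    if (i + 1 < fc.arg_types.size()) out += ", ";
  }
  out += ");\n  return 0;\n}\n";
  return out;
}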
Listings 1 and 2 show the unit test analyzer at work.
The first listing is a standard GTest-based unit test written
for an example HTTP handler. Lines 11, 12, 20, and 21
contain the variables in the unit test that are potential
candidates for fuzzing. The unit test analyzer goes through
the unit test and identifies these candidates. Once that is
done, it looks at the data type of each variable and
accordingly generates the fuzzing harness shown in the
second listing. Notice that these variables are correctly
replaced in the harness by the output of the fuzz data
producer (FDP), or structured input mutator, with the
correct data type. fdp.get_string and
fdp.get_int(0, 65535) are utility functions written for
the FDP that allow us to obtain mutated integer and string
values. These utility functions also accept arguments to
support ranged mutation, strings restricted to a specific
set of characters, and so on.
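For concreteness, the following is a minimal sketch of how such FDP utilities could be implemented; the class shape mirrors the fdp.get_int / fdp.get_string usage above but is our assumption, not the actual implementation:

#include <cstdint>
#include <cstring>
#include <string>

// A fuzz data producer consumes the fuzzer's raw byte buffer and carves
// typed values out of it on demand.
class FDP {
 public:
  FDP(const uint8_t *data, size_t size) : data_(data), remaining_(size) {}

  // Return an integer in [min, max] (assumes min <= max), derived from
  // the next bytes of the buffer.
  int get_int(int min, int max) {
    uint32_t raw = 0;
    size_t n = remaining_ < sizeof(raw) ? remaining_ : sizeof(raw);
    std::memcpy(&raw, data_, n);
    consume(n);
    uint64_t range = static_cast<uint64_t>(max) - min + 1;
    return min + static_cast<int>(raw % range);
  }

  // Return a string of up to max_len bytes taken from the buffer.
  std::string get_string(size_t max_len = 64) {
    size_t n = remaining_ < max_len ? remaining_ : max_len;
    std::string s(reinterpret_cast<const char *>(data_), n);
    consume(n);
    return s;
  }

 private:
  void consume(size_t n) { data_ += n; remaining_ -= n; }
  const uint8_t *data_;
  size_t remaining_;
};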
4.3. The Fuzz-Worthiness Algorithm
The fuzz-worthiness algorithm determines whether a
library API or function should be considered for fuzzing.
This allows us to be selective and helps us narrow down to
the points in the codebase that can lead to potential memory
corruption bugs. Fuzzing each and every library API and
function cannot scale, computationally or time-wise, as the
codebase grows. The algorithm uses a scoring system to
grade several factors, such as the memory manipulation
functions used, the code coverage gained by the code, the
cyclomatic complexity of the code, and any parsing logic
written. A single metric is derived using a weighted average
formula that takes into consideration the score of each of
these factors. A threshold value is set, and any function or
library API whose metric is equal to or above the threshold
is considered a potential candidate for fuzzing. This library
function’s associated unit test is retrieved and sent to the
unit test analyzer to generate the fuzzing harness. We
studied a few algorithms in the computer science literature
that use scoring systems to see how their statistical formulas
are designed and the intuition behind them [13] [14] [15]
[16] [17]. Using a weighted average formula seemed
appropriate because the weights can be adaptive and can be
tweaked as necessary.
The fuzz-worthiness algorithm uses a weighted average
formula with each factor having its own weight. The current
fuzz-worthiness algorithm uses the following parameters:
1) Code coverage gained
2) Cyclomatic complexity
3) Significant lines of code out of the total lines that
are written to develop parsing logic (if any)
4) Significant lines of code out of the total lines that
are written to do memory manipulations using mal-
loc, calloc, realloc etc.
5) Total significant lines of code
6) Significant lines of code out of the total lines that
perform static memory accesses to data structures
like arrays, stacks, queues, etc.
The challenge here is to understand the ideal weights for the
metric formula. Since it is adaptive, we constantly update
the weights to see which is the most effective option.
Programmatically, the memory manipulation functions are
detected by scanning the code for calls to malloc, calloc,
and other memory manipulation API functions. Cyclomatic
complexity is calculated using an open-source program
called Lizard [18]. Code coverage is calculated using GCov,
a built-in tool in GCC [19]. The weighted average formula
is shown below:

\[
\text{weighted average} = \frac{(c_1 \cdot w_1) + (c_2 \cdot w_2) + \cdots + (c_n \cdot w_n)}{w_1 + w_2 + \cdots + w_n} \tag{1}
\]
where c_1, c_2, ..., c_n are the values for code coverage
gained, cyclomatic complexity, significant lines of code for
parsing logic, memory manipulation, total significant lines
of code, and static memory access, respectively, and
w_1, w_2, ..., w_n are the corresponding weights for these
properties. The weights must be non-negative and their sum
must be non-zero.
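A minimal sketch of the scoring computation follows; the six factor slots mirror the list above, while the default threshold of 0.5 is an illustrative assumption:

#include <array>
#include <cstddef>

// Weighted-average fuzz-worthiness score over the six factors listed
// above, each pre-normalized to [0, 1]. Weights are non-negative and
// must not all be zero.
struct FuzzWorthiness {
  std::array<double, 6> c;  // factor scores: coverage, cyclomatic complexity,
                            // parsing SLoC, memory-manipulation SLoC,
                            // total SLoC, static-memory-access SLoC
  std::array<double, 6> w;  // corresponding adaptive weights

  double score() const {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
      num += c[i] * w[i];
      den += w[i];
    }
    return num / den;  // equation (1)
  }

  bool fuzz_worthy(double threshold = 0.5) const {
    return score() >= threshold;
  }
};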
4.4. Binary Protocol-based Harness Generator
In this project, we use binary protocol definitions written
in YAML to automatically generate fuzzing harnesses for
binary protocols. A binary protocol is a set of rules and
conventions that govern the exchange of data between two
or more computer systems, typically over a network. These
protocols often have complex and detailed specifications,
which can make it difficult to write effective fuzzing har-
nesses for them.
To address this challenge, we use a declarative language
called Kaitai Struct to represent the binary protocol defini-
tion in a structured and easily understandable format. This
language allows us to describe the various components of
the binary protocol, such as the data types, field sizes, and
encoding rules, in a clear and concise manner [20]. By
using this language, we can automatically generate a fuzzing
harness that is specifically tailored to the binary protocol
being tested.
The fuzzing harness generator program takes the binary
protocol definition written in YAML as input and leverages
the structured input mutator to generate valid binary inputs
that conform to the protocol specification.
 1  #include "gtest/gtest.h"
 2  #include "pistache/endpoint.h"
 3  #include "fuzzer/FDP.h"
 4
 5  using namespace Pistache;
 6  using namespace std;
 7  using namespace fdp;
 8
 9  TEST(HttpHandlerTest, Request) {
10    Address addr;
11    addr.port = 8080;
12    addr.host = "localhost";
13
14    Http::listenAndServe<HttpServer::VulnerableHandler>(addr.to_string());
15
16    Http::Client client;
17    auto opts = Http::Client::options().threads(1).maxConnectionsPerHost(8);
18    client.init(opts);
19
20    shared_ptr<Http::Request> request(new Http::Request(Http::Port, "/"));
21    request->body() = "Hello World!";
22    request->headers().add<Http::Header::ContentType>(MIME(Text, Plain));
23    client.send(request).then([](Http::Response response) {
24      EXPECT_EQ(response.code(), Http::Code::Ok);
25    });
26  }

Listing 1. GTest unit test for a vulnerable HTTP handler
#include "gtest/gtest.h"
#include "pistache/endpoint.h"
#include "fuzzer/FDP.h"

using namespace Pistache;
using namespace std;
using namespace fdp;

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  MutateInput(Data, Size, seed(0));

  Address addr;
  addr.port = fdp.get_int(0, 65535);
  addr.host = fdp.get_string();

  Http::listenAndServe<HttpServer::VulnerableHandler>(addr.to_string());

  Http::Client client;
  auto opts = Http::Client::options().threads(1).maxConnectionsPerHost(8);
  client.init(opts);

  string endpoint = fdp.get_string();
  shared_ptr<Http::Request> request(new Http::Request(Http::Port, endpoint));
  request->body() = fdp.get_string();
  request->headers().add<Http::Header::ContentType>(MIME(Text, Plain));
  client.send(request).then([](Http::Response response) {
    EXPECT_EQ(response.code(), Http::Code::Ok);
  });
  return 0;
}

Listing 2. Fuzzing harness for a vulnerable HTTP handler
This is accomplished by parsing the YAML definition and
invoking the corresponding input producer for each data
type encountered. This modular and flexible approach
allows us to easily adapt the fuzzing harness generator to
support different binary protocols and data types. For
example, the following YAML listing is the definition of the
GIF binary format. While this definition only captures
enough of the format to produce a GIF with a valid header
that can bypass basic GIF checks, it is sufficient to explain
the idea.
---
meta:
  id: gif
  title: GIF (Graphics Interchange Format)
  file-extension: gif
  xref:
    forensicswiki: GIF
    justsolve: GIF
    loc: fdd000133 # GIF 89a
    mime: image/gif
    pronom:
      - fmt/3 # GIF 87a
      - fmt/4 # GIF 89a
    wikidata: Q2192
  license: CC0-1.0
  endian: le

seq:
  - id: hdr
    type: header
  - id: logical_screen_descriptor
    type: logical_screen_descriptor_struct

types:
  header:
    seq:
      - id: magic
        contents: 'GIF'
      - id: version
        type: str
        size: 3
        encoding: ASCII
---
The harness generator analyzes this YAML structure and
attempts to produce a harness that generates valid GIFs
from the definition. This harness can then be used to fuzz a
protocol parser that parses GIFs. Protocol parsers are
always good candidates for fuzzing because of the
numerous edge cases and conditions a protocol definition
has. In order to fuzz a protocol parser effectively, we need
to generate valid instances of the protocol that get past its
basic validation checks, thereby allowing us to dig deeper
into the protocol structure and exercise the edge cases that
the parser has not handled correctly. Such parsers are
therefore likely places to find security bugs. Ordinarily, an
engineer would need to read through the entire RFC to
understand how the protocol is designed and then write a
harness that generates valid GIFs. Because this automated
generator relies on an already available YAML definition, it
saves the time of writing the harness manually. Listing 3
shows an example of a harness generated from the YAML
definition of the binary protocol.
#include "gtest/gtest.h"
#include "fuzzer/FDP.h"

using namespace fdp;
using namespace std;

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  MutateInput(Data, Size, seed(0), f);

  // GIF Header
  f.begin_header(seq = true);
  f.assign_chunk("GIF", magic_bytes = true, type = str);
  f.assign_chunk("version", size = 3, type = u2);
  f.end_header();

  // GIF Body
  f.begin_body("logical_screen", seq = true);
  f.assign_chunk("image_width", type = u2);
  f.assign_chunk("image_height", type = u1);
  f.assign_chunk("flags", type = u1);
  f.assign_chunk("bg_color_index", type = u1);
  f.assign_chunk("pixel_aspect_ratio", type = u1);
  f.end_body();

  f.produce(valid_header = true);
  return 0;
}

Listing 3. Harness code generated from the GIF protocol definition
5. Evaluation
The structured input mutator, fuzz-worthiness algorithm
and unit-test analyzer were put to work in order to test
the system against two massive C++ projects - Facebook’s
Folly and OpenCV. Folly is an open-source library of C++
components designed for efficiency and performance, and
OpenCV is a widely known computer vision library [21].
After running on both projects, the fuzz-worthiness
algorithm picked out 3067 potential fuzz-worthy candidates
(C++ functions that could potentially contain a memory
corruption bug) for Folly and 6932 for OpenCV. The
associated unit tests for
these functions were analyzed by the unit test analyzer
to generate the associated fuzzing harnesses. To produce
visual results through coverage graphs, we hand-picked the
top 15 harnesses, covering the C++ functions with the
highest likelihood of yielding security issues. In order to test the
efficacy of the automated harness generation system, we
wrote 15 harnesses for the same functions that were picked
by the algorithm. The fuzzing infrastructure was set up on
an Amazon Web Services (AWS) Elastic Compute Cloud
(EC2) instance, and LibFuzzer on OSS-FUZZ was configured
to run these 30 harnesses in parallel (15 manually written
and 15 automatically generated). The results showed that,
out of the 15 harness pairs, 10 of the automated fuzzing
harnesses outperformed their manually written counterparts
for the Folly library, and 11 for the OpenCV library.
LibFuzzer was selected as the primary fuzzer for the
evaluation because several papers have described it as a
state-of-the-art coverage-guided fuzzer with strong mutation
algorithms.
To prevent the harness generator from producing invalid
harnesses (harnesses that gain no coverage on execution or
take a long time to gain initial coverage), we added a check
that runs each harness for one minute and observes whether
it gains any coverage. If the harness gains no coverage, it
will not be useful and is discarded. Any harness that gains
some coverage in the initial check is capable of further gains and
finding potential security bugs. For Folly, we discarded 1021
harnesses and for OpenCV we discarded 1412 harnesses for
being invalid as per our explained criteria.
Initially, the coverage gain was quite similar for the
auto-generated and the manually written harnesses. In the
end, 10 auto-generated harnesses achieved higher coverage
than their manually written counterparts. The entire
infrastructure ran for 10 days. Preliminary results were
captured at 24 hours, and the same conclusions can be
drawn from the results captured after 10 days: the metrics
scaled linearly, and there was no significant difference in
the conclusions, indicating that the results are consistent.
Fig. 3 shows all the harnesses (manual as well as
automated) that ran on the OSS-FUZZ fuzzing
infrastructure for 24 hours. The harnesses highlighted with
arrows are the manually written ones. Most of them sit at
the bottom of the graph, which indicates that their initial
coverage gains are smaller than those of the automatically
generated harnesses. Eventually, coverage begins to saturate
for both harness types, which is the expected behavior.
However, early coverage is crucial for finding bugs as soon
as possible: the higher the coverage, the better the chance
of finding security bugs, so the automatically generated
harnesses are outperforming the manually written ones.
Similar behavior can be seen for OpenCV in fig. 4.
Although some of the manually written harnesses perform
better, the overall trend favors the automatically generated
counterparts.
Fig. 5 and fig. 6 show the median coverage reached by
the automatically and manually written harnesses for both
Folly and OpenCV. These graphs also indicate that the
automatic harnesses outperform the manual ones on
average. Median values represent the coverage information
better than means, since averages can be skewed when the
values are widely distributed.
Fig. 7 and fig. 8 represent a heatmap of the p-values of
Figure 3. Coverage Gains for Automated and Manually Written Fuzzers for Folly
Figure 4. Coverage Gains for Automated and Manually Written Fuzzers for OpenCV
pairwise Mann-Whitney U tests [22] [23] for both harness
types for Folly and OpenCV. Green cells indicate that the
reached coverage distributions of a given pair of fuzzing
harnesses are significantly different. The last five row
entries and the first five column entries are for the manually
written fuzzing harnesses, while the remaining six are for
the automatically generated ones.
Table 1 summarizes the metrics for the two projects: the
crashes found over a period of 10 days, the number of
fuzz-worthy candidates identified by the algorithm in each
codebase, and the average coverage percentage over 10
days. As discussed before, the 10-day trend is consistent
with the trend observed in the data collected over 24 hours.
While many of these crashes may be duplicates or invalid
bugs that cannot be triaged or reproduced outside simulated
environments, a certain number appear to be legitimate
bugs that could potentially lead to CVE reports. Further
investigation of these bugs is out of the scope of this project
and is left as future work.
The “fuzz-worthiness” metric helps us understand how
effective it will be to fuzz a given library API: it provides a
quantitative value that, compared against a predetermined
threshold, guides that decision. Since there is no standard
way to benchmark a metric like this, the best way to
evaluate its efficacy is through qualitative analysis. We
compared human-audited security results with the results
produced by the “fuzz-worthiness” metric. We manually
audited 64 small-scale open-source projects for security
issues, picked from GitHub by applying filters to match the
required criteria (under 10,000 LoC and written in C/C++).
Project | LoC       | Harnesses Generated | Crashes Found (10 days) | Fuzz-Worthy Candidates | Average Coverage (%) (10 days)
OpenCV  | 3,320,160 | 7812                | 5221                    | 6932                   | 38.55
Folly   |   564,477 | 4141                | 3426                    | 3067                   | 41.43

TABLE 1. Fuzzing metrics for benchmark projects
Figure 5. Median Reached Coverage by Automated and Manually Written
Harnesses for Folly
Figure 6. Median Reached Coverage by Automated and Manually Written
Harnesses for OpenCV
We ran the fuzz-worthiness algorithm on these projects,
and it identified 561 potential fuzz-worthy candidates,
compared to the 395 we found manually. Some of the
candidates picked by the algorithm were repeated instances
of the same issue triggered by the same input at different
places; such instances were discarded during the manual
audits. Overall, the algorithm picked significantly more
candidates than the manual audits, even accounting for
repeats, which shows that it managed to effectively pick out
interesting snippets.
Figure 7. Pairwise Mann-Whitney U test Results for Automated and
Manually Written Harnesses - Folly
Figure 8. Pairwise Mann-Whitney U test Results for Automated and
Manually Written Harnesses - OpenCV
6. Discussion
The primary goal of this project was to introduce au-
tomation in the end-to-end fuzzing process. This included
automating the generation of fuzzing harnesses for both
library APIs and binary protocol parsers. Automating this
process helped to scale the fuzzing infrastructure propor-
tionally to the growth of the codebase, without the need for
manual writing of harnesses. Additionally, the automated
harnesses proved to be more performant than their manual
counterparts. Moreover, the project introduced automation
into another step, target scoping, through a metric that helps
determine the “fuzz-worthiness” of a given API in order to
narrow down the number of potential fuzzing targets. With
these tasks automated, the only part of the fuzzing process
that remains manual is bug triaging.
7. Limitations and Future Work
In this project, we use an automated fuzzing technique
to test C/C++ codebases that use GTest (the Google Test
framework).
However, there are several limitations to this project. First,
since it is an experimental project, it is not designed to work
with large codebases with more than 10 million lines of
code. The current implementation is only suitable for small-
to-medium sized codebases, and may not be effective on
larger codebases. This is because larger codebases typically
have more complex and interdependent code, which can
make it difficult to generate effective test inputs and interpret
the results of the testing. In addition, larger codebases may
also have more potential vulnerabilities, which can make it
harder to identify and prioritize the most important areas to
test.
Second, the project is currently limited to C/C++ codebases
that use the GTest framework. While this framework is
widely used and provides many useful features for testing,
there are many other popular testing frameworks and pro-
gramming languages that are not supported by the current
implementation. For example, other popular testing frame-
works for C/C++ include Boost.Test [24], CppUnit, and
Catch, while other popular programming languages include
Java, Python, and C#. In order to make the project more
useful and applicable to a wider range of codebases, fur-
ther improvements would be needed to support these other
frameworks and languages.
Third, given the time constraints and scope of the project,
not all edge cases and corner cases have been covered by
the fuzz-worthiness algorithm. This means that there may be
some cases where the algorithm fails to identify potentially
vulnerable areas of the code, leading to incomplete testing.
In order to improve the effectiveness of the algorithm, more
time and resources would be needed to test a wider range
of inputs and scenarios. This could include testing the code
with different combinations of inputs, testing the code with
inputs of different sizes and complexity, and testing the code
with inputs that are designed to stress the most vulnerable
areas of the code.
Fourth, the fuzz-worthiness algorithm used in this project is
a qualitative, weights-based system that adaptively changes
its weights based on previously seen code. This means that
the output produced by the algorithm may vary slightly from
one run to the next, depending on the input code and the
previous weights used by the algorithm. This variability is a
natural result of the adaptive nature of the algorithm, but
it may make it difficult to reproduce results and compare
the effectiveness of different versions of the algorithm. To
address this limitation, the algorithm could be modified
to use more stable and consistent weights, or to provide
additional mechanisms for comparing and evaluating the
results of different runs of the algorithm.
Overall, while this project represents an interesting and
potentially useful approach to automated software fuzzing,
it is limited in its scope and effectiveness by the constraints
described above. In order to make the project more use-
ful and applicable to a wider range of codebases, further
improvements and enhancements would be needed to ad-
dress these limitations. This could include optimizing the
algorithm for larger codebases, supporting additional testing
frameworks and programming languages, and improving the
stability and reproducibility of the algorithm.
8. Conclusion
In conclusion, this project has developed an automated
system for generating fuzzing harnesses and scoping out
fuzz-worthy functions in order to improve the efficiency
and scalability of the fuzzing process. The system is able
to automatically generate valid harnesses for library APIs
and binary protocols by analyzing unit tests and protocol
definitions, which eliminates the need for manual analy-
sis and reduces the time and effort required for fuzzing.
Additionally, the automatically generated harnesses
outperformed their manually written counterparts in terms
of coverage and bugs found per kLoC, which is another
strong reason to introduce automation into the process of
software fuzzing. Moreover, the project also introduces a
metric for determining the fuzz-worthiness of an API, which
can help to identify potential targets for fuzzing and improve
the efficiency of the process.
References
[1] Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. Fuzzing:
a survey for roadmap. ACM Computing Surveys (CSUR), 54(11s):1–
36, 2022.
[2] Patrice Godefroid, Adam Kiezun, and Michael Y Levin. Grammar-
based whitebox fuzzing. In Proceedings of the 29th ACM SIGPLAN
conference on programming language design and implementation,
pages 206–215, 2008.
[3] Takahisa Kitagawa, Miyuki Hanaoka, and Kenji Kono. Aspfuzz:
A state-aware protocol fuzzer based on application-layer protocols.
In The IEEE symposium on Computers and Communications, pages
202–208. IEEE, 2010.
[4] Daehee Jang, Ammar Askar, Insu Yun, Stephen Tong, Yiqin Cai,
and Taesoo Kim. Fuzzing@Home: Distributed fuzzing on untrusted
heterogeneous clients. In Proceedings of the 25th International
Symposium on Research in Attacks, Intrusions and Defenses, pages
1–16, 2022.
[5] Zhihui Li, Hui Zhao, Jianqi Shi, Yanhong Huang, and Jiawen Xiong.
An intelligent fuzzing data generation method based on deep adver-
sarial learning. IEEE Access, 7:49327–49340, 2019.
[6] Chieh-Jan Mike Liang, Nicholas D Lane, Niels Brouwers, Li Zhang,
Börje F Karlsson, Hao Liu, Yan Liu, Jun Tang, Xiang Shan, Ranveer
Chandra, et al. Caiipa: Automated large-scale mobile app testing
through contextual fuzzing. In Proceedings of the 20th annual
international conference on Mobile computing and networking, pages
519–530, 2014.
[7] Mikhail Yakshin. Kaitai struct, Mar 2021.
[8] Jared DeMott. The evolving art of fuzzing. Def Con, 14, 2006.
[9] Michal Zalewski. American fuzzy lop, 2017.
[10] Mingrui Zhang, Jianzhong Liu, Fuchen Ma, Huafeng Zhang, and
Yu Jiang. Intelligen: Automatic driver synthesis for fuzz testing. In
2021 IEEE/ACM 43rd International Conference on Software Engi-
neering: Software Engineering in Practice (ICSE-SEIP), pages 318–
327. IEEE, 2021.
[11] Andrea Fioraldi, Daniele Cono D’Elia, and Emilio Coppa. Weizz:
Automatic grey-box fuzzing for structured binary formats. In Pro-
ceedings of the 29th ACM SIGSOFT International Symposium on
Software Testing and Analysis, pages 1–13, 2020.
[12] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. Superion:
Grammar-aware greybox fuzzing. In 2019 IEEE/ACM 41st Interna-
tional Conference on Software Engineering (ICSE), pages 724–735.
IEEE, 2019.
[13] Xiaogang Zhu and Marcel Böhme. Regression greybox fuzzing. In
Proceedings of the 2021 ACM SIGSAC Conference on Computer and
Communications Security, pages 2169–2182, 2021.
[14] Manuel González-Hidalgo, Sebastia Massanet, Arnau Mir, and Daniel
Ruiz-Aguilera. Edge image aggregation method using ordered
weighted averaging functions. In 2016 IEEE International Conference
on Fuzzy Systems (FUZZ-IEEE), pages 1355–1362. IEEE, 2016.
[15] Björn Mathis, Rahul Gopinath, and Andreas Zeller. Learning input
tokens for effective fuzzing. In Proceedings of the 29th ACM
SIGSOFT International Symposium on Software Testing and Analysis,
pages 27–37, 2020.
[16] Ronald R Yager. On ordered weighted averaging aggregation opera-
tors in multicriteria decisionmaking. IEEE Transactions on systems,
Man, and Cybernetics, 18(1):183–190, 1988.
[17] Hao Zhang, Weiyu Dong, and Liehui Jiang. Zokfuzz: Detection of
web vulnerabilities via fuzzing. In 2022 2nd International Conference
on Consumer Electronics and Computer Engineering (ICCECE),
pages 281–287. IEEE, 2022.
[18] Terry Yin. Lizard: An extensible cyclomatic complexity analyzer.
Astrophysics Source Code Library, pages ascl–1906, 2019.
[19] Ram Chandra Bhushan and DD Yadav. Number of test cases required
in achieving statement, branch and path coverage using ‘gcov’: An
analysis. In 7th International Workshop on Computer Science and
Engineering (WCSE 2017) Beijing, China, pages 176–180, 2017.
[20] AA Evgin, MA Solovev, and VA Padaryan. A model and declarative
language for specifying binary data formats. Programming and
Computer Software, 48(7):469–483, 2022.
[21] Gary Bradski. The opencv library. Dr. Dobb’s Journal: Software
Tools for the Professional Programmer, 25(11):120–123, 2000.
[22] Patrick E McKnight and Julius Najab. Mann-Whitney U test. The
Corsini Encyclopedia of Psychology, pages 1–1, 2010.
[23] Nadim Nachar et al. The Mann-Whitney U: A test for assessing
whether two independent samples come from the same distribution.
Tutorials in Quantitative Methods for Psychology, 4(1):13–20, 2008.
[24] Boris Schäling. The Boost C++ Libraries. Boris Schäling, 2011.