Test Suites for Benchmarks of Static Analysis Tools
Shinichi SHIRAISHI, Veena MOHAN, and Hemalatha MARIMUTHU
Toyota InfoTechnology Center, U.S.A., Inc.
465 Bernardo Avenue
Mountain View, CA 94043
sshiraishi, vmohan, hmarimuthu@us.toyota-itc.com
Abstract—This paper proposes test suites for the benchmarks
of static analysis tools. In our test suites, a wide variety of
common defects are intentionally implemented. In addition, we
also propose several criteria for the evaluation of static analysis
tools. By using them, we can compare the performance of static
analysis tools in a quantitative manner. Our test suites are
available in the public domain. Moreover, this paper provides
the evaluation results of the top performance tools as references.
We anticipate that our test suites and reference data will drive further improvement of static analysis tools.
Keywords: static analysis, software assurance, automotive software
I. INTRODUCTION
E/E (electrical and electronic) components are gaining dominance in automotive systems. For example, most contemporary vehicles have more than 50 ECUs (Electronic Control Units) [1]. This implies that the software content of today's automotive systems is already large. Moreover, ADAS (Advanced Driver Assistance Systems) have become a competitive area among car makers and Tier 1 suppliers, and such new systems are inevitably built on large-scale software.
Meanwhile, static analysis tools are becoming increasingly common in software development. For example, the automotive functional safety standard ISO 26262 [2] clearly recommends the use of static analysis to reduce runtime errors in software. Driven by this strong demand, more than 100 vendors offer static analysis tools, both free and commercial.
From an industrial viewpoint, we would like to choose the best tools in a quantitative manner. Toward this end, we took the following steps:
1) Develop test suites (codebases) where a wide variety of
defects are intentionally created;
2) Apply static analysis tools to the test suites and collect
the analysis results;
3) Define evaluation criteria and compare the performance
of the tools.
Based on our comprehensive evaluation of various tools, we decided to provide the evaluation results of the top-performing tools, for which we obtained explicit permission to publish from their vendors, as references. In this paper, we also provide the details of our latest test suites,
which are available in the public domain; therefore, anyone can replicate our evaluation results by using them if necessary.

Fig. 1. Assurance Case of Software Dependability in GSN. (Figure: a GSN goal structure in which the top-level claim of software dependability is supported by sub-claims such as standards conformance, correctness of software, hazard mitigation, satisfaction of functional and non-functional requirements, and freedom from exceptions, with test results, profiling (dynamic analysis) results, and results of static analysis serving as evidence.)
II. TEST SUITES BUILDING
A. Requirements for Static Analysis
The main objective of using static analysis tools is the
realization of high-level software assurance. Figure 1 shows
a top-level software assurance case in GSN (Goal Structuring
Notation) [3]. The assurance case indicates that we should
expect static analysis tools to eliminate defects that can lead
to runtime exceptions. This kind of defect is a common problem across all types of software. Therefore, we define a set of common defects as test suites, which enable us to evaluate how well tools detect them. In addition to detection performance, we also need to consider false alarms, which waste the time and effort of engineers. In summary, we have the following questions about the performance of static analysis tools:
• How many common defects can be detected by static analysis tools? (The more, the better.)
• How many false alarms from static analysis tools need to be ruled out? (The fewer, the better.)
B. Specification
To enumerate the common defects that should be detected
by static analysis tools, we surveyed existing literature such as [4], [5], and [6]. In addition, we also investigated the brochures of well-known tools. From this survey, we identified 51 different types of defects, shown in Table I (hereafter called defect subtypes). The 51 defect subtypes can be
TABLE I
SPECIFICATION OF TEST SUITES - DEFECT SUBTYPES
(Columns: #, Defect Subtype, Defect Type, # of Variations w/ Defects, # of Variations w/o Defects)
1 Static buffer overrun Static memory 54 54
2 Static buffer underrun Static memory 13 13
3 Cross thread stack access Stack-related 6 6
4 Stack overflow Stack-related 7 7
5 Stack underrun Stack-related 7 7
6 Invalid memory access to already freed area Resource management 17 17
7 Memory allocation failure Resource management 16 16
8 Memory leakage Resource management 18 18
9 Return of a pointer to a local variable Resource management 2 2
10 Uninitialized memory access Resource management 15 15
11 Double free Resource management 12 12
12 Free non dynamically allocated memory Resource management 16 16
13 Free NULL pointer Pointer-related 14 14
14 Bad cast of a function pointer Pointer-related 15 15
15 Dereferencing a NULL pointer Pointer-related 17 17
16 Incorrect pointer arithmetic Pointer-related 2 2
17 Uninitialized pointer Pointer-related 16 16
18 Wrong arguments passed to a function pointer Pointer-related 18 18
19 Comparison NULL with function pointer Pointer-related 2 2
20 Power related errors Numerical 29 29
21 Integer sign lost because of unsigned cast Numerical 19 19
22 Division by zero Numerical 16 16
23 Bit shift bigger than integral type or negative Numerical 17 17
24 Integer precision lost because of cast Numerical 19 19
25 Data overflow Numerical 25 25
26 Data underflow Numerical 12 12
27 Unintentional endless loop Misc 9 9
28 Useless assignment Misc 1 1
29 Bad extern type for global variable Misc 6 6
30 Non void function does not return value Misc 4 4
31 Uninitialized variable Misc 15 15
32 Contradict conditions Inappropriate code 10 10
33 Dead code Inappropriate code 13 13
34 Return value of function never checked Inappropriate code 16 16
35 Improper error handling Inappropriate code 4 4
36 Improper termination of block Inappropriate code 4 4
37 Redundant conditions Inappropriate code 14 14
38 Unused variable Inappropriate code 7 7
39 Assign small buffer for structure Dynamic memory 11 11
40 Memory copy at overlapping areas Dynamic memory 2 2
41 Dynamic buffer overflow Dynamic memory 32 32
42 Dynamic buffer underrun Dynamic memory 39 39
43 Deletion of data structure sentinel Dynamic memory 3 3
44 Live lock Concurrency 1 1
45 Locked but never unlock Concurrency 9 9
46 Race condition Concurrency 8 8
47 Long lock Concurrency 3 3
48 Unlock without lock Concurrency 8 8
49 Dead lock Concurrency 5 5
50 Double lock Concurrency 4 4
51 Double release Concurrency 6 6
Subtotal 638 638
Total 1,276
roughly classified into 9 defect types shown in Table II. We
implemented all the defect types and defect subtypes with many variations by considering different kinds of complexity (e.g., see Listing 1). Eventually, we created 638 variations of the 51 different types of defects, as shown in the columns entitled "w/ Defects" in Tables I and II.
Listing 1. Static Buffer Overrun Across Multiple Functions.
int overrun_st_017_func_001 ()
{
    return 5;
}

void overrun_st_017 ()
{
    int buf[5];
    buf[overrun_st_017_func_001()] = 1; /* Tool should detect this line as error */ /* ERROR: buffer overrun */
}
In addition to the implementation of codebases with defects, we also prepared codebases without defects, shown in the rightmost columns of Tables I and II. The latter codebases are similar to the former ones; however, there are no defects in them. While the codebases with defects are used for evaluating true positives from the analysis results, the codebases without defects are used for evaluating false positives.

TABLE II
SPECIFICATION OF TEST SUITES - DEFECT TYPES
# | Defect Type | Defect Subtype | # of Subtypes | # of Variations w/ Defects | # of Variations w/o Defects
1 | Static Memory Defects | Static buffer overrun, etc. | 2 | 67 | 67
2 | Stack-related Defects | Stack overflow, etc. | 5 | 20 | 20
3 | Dynamic Memory Defects | Dynamic buffer overrun, etc. | 3 | 87 | 87
4 | Numerical Defects | Division by zero, data overflow, etc. | 7 | 137 | 137
5 | Resource Management Defects | Double free, etc. | 7 | 96 | 96
6 | Pointer-related Defects | NULL pointer dereference, etc. | 7 | 84 | 84
7 | Concurrency Defects | Dead lock, live lock, etc. | 8 | 44 | 44
8 | Inappropriate Code | Dead code, etc. | 7 | 68 | 68
9 | Misc | Unintentional infinite loop, etc. | 5 | 35 | 35
Subtotal | | | 51 | 638 | 638
Total | | | 51 | 1,276
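To illustrate how the paired codebases differ, below is a minimal sketch of what a defect-free counterpart of Listing 1 could look like; the _nodefect function names are our own illustrative choice and not necessarily the naming used in the published test suites.

int overrun_st_017_func_001_nodefect ()
{
    return 4; /* largest valid index of a 5-element buffer */
}

void overrun_st_017_nodefect ()
{
    int buf[5];
    buf[overrun_st_017_func_001_nodefect()] = 1; /* no defect: index 4 is within bounds */
}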
Our test suites are quite simple, as shown in Listing 1; however, we prepared a wide variety of defects while considering the characteristics of automotive software, e.g., the intensive use of global variables. Thus, we believe that our test suites are appropriate for evaluating tools for the purpose of automotive software development. Moreover, our test suites are available in the public domain (see https://github.com/regehr/itc-benchmarks). Hence, anybody can use them if necessary.
III. EVALUATION CRITERIA
A. Primary Evaluation Criteria
In the previous section, we built the test suites that implement the common defects. Now, we need evaluation criteria for applying static analysis tools to the test suites.
Table III shows the primary evaluation criteria. The rates listed
in the fourth column of Table III are calculated using the
following equations together with the statistics of the test
suites in Table II:
DR: Detection Rate [%] = (# of Detected Variations / # of Variations w/ Defects) × 100,   (1)

FPR: False Positive Rate [%] = (# of Detected Variations / # of Variations w/o Defects) × 100.   (2)
In the calculation above, we focus only on the implemented
defects, i.e., the commented lines and commented defect
subtypes in Listing 1. Therefore, even if the tools show alerts
elsewhere or alerts for different types of defects, we will
discard them.
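As a minimal illustration of Eqs. (1) and (2) (our own sketch, not part of the benchmark harness), the two rates can be computed from already-tallied counts as follows; the counts passed in main() are hypothetical:

#include <stdio.h>

/* Detection rate per Eq. (1): detected variations over variations w/ defects. */
static double detection_rate(int detected, int variations_with_defects)
{
    return 100.0 * detected / variations_with_defects;
}

/* False positive rate per Eq. (2): flagged variations over variations w/o defects. */
static double false_positive_rate(int flagged, int variations_without_defects)
{
    return 100.0 * flagged / variations_without_defects;
}

int main(void)
{
    /* Hypothetical counts for one defect type (not actual results). */
    printf("DR  = %.1f%%\n", detection_rate(54, 67));
    printf("FPR = %.1f%%\n", false_positive_rate(3, 67));
    return 0;
}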
B. Advanced Evaluation Criteria
By using the primary criteria shown in Eqs. (1) and (2), we can understand the tool performance from two different viewpoints. Here, we will try to integrate these two different criteria into a unified one for further evaluation.

TABLE III
PRIMARY EVALUATION CRITERIA OF ANALYSIS RESULTS
Test Suites | Analysis Result | True/False | Rate
w/ Defects | Unsafe (Positive) | True | Detection Rate (DR)
w/o Defects | Unsafe (Positive) | False | False Positive Rate (FPR)

Fig. 2. Cost Efficiency: Productivity vs. Price. (Figure: a scatter plot of hypothetical Tools a-f, with Price on the horizontal axis and Productivity on the vertical axis, together with a regression line Productivity = α × Price + β, where α is the slope and β is the offset.)
1) Productivity: In order to evaluate both aspects (true positives and false positives) at the same time, we use the following metric to quantify the performance of tools:

ME #1: Productivity = √(DR × (100 − FPR)).   (3)

Regarding ME #1 (productivity), an ideal tool would detect all defects in the codebases and produce no false positives in its analysis results; in this case, ME #1 equals 100.
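For reference, ME #1 can be computed as in the following minimal sketch (our own illustration; DR and FPR are assumed to be percentages in [0, 100]):

#include <math.h>

/* ME #1 (productivity) per Eq. (3): 100 for an ideal tool (DR = 100, FPR = 0). */
static double productivity(double dr, double fpr)
{
    return sqrt(dr * (100.0 - fpr));
}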
2) Cost Efficiency: To further discuss the performance of the tools, we try to combine ME #1 (productivity) with the pricing information of the tools. Tool vendors have different licensing policies, e.g., annual, perpetual, and floating licenses. Therefore, we simply use the minimum license fee of each tool. Although the prices cannot be compared in a fully straightforward manner, we nevertheless compare them against ME #1.
TABLE IV
BENCHMARKS OF STATIC ANALYSIS TOOLS [%]
Defect Type | CodeSonar (GrammaTech) DR / 100−FPR | Code Prover (MathWorks) DR / 100−FPR | Bug Finder (MathWorks) DR / 100−FPR
Static Memory Defects | 100 / 100 | 97 / 100 | 97 / 100
Dynamic Memory Defects | 89 / 100 | 92 / 95 | 90 / 100
Stack-Related Defects | 0 / 100 | 60 / 70 | 15 / 85
Numerical Defects | 48 / 100 | 55 / 99 | 41 / 100
Resource Management Defects | 61 / 100 | 20 / 90 | 55 / 100
Pointer-Related Defects | 52 / 96 | 69 / 93 | 69 / 100
Concurrency Defects | 70 / 77 | 0 / 100 | 0 / 100
Inappropriate Code | 46 / 99 | 1 / 97 | 28 / 94
Misc | 69 / 100 | 83 / 100 | 69 / 100
Figure 2 shows a hypothetical relationship between the productivity of several tools and their prices. By applying linear regression analysis to this graph, we can obtain the following equation (see the regression line in Fig. 2):

Productivity = α × Price + β.   (4)
By observing Fig. 2, we can easily see that Tools a, e, and f are above the average cost efficiency. For a further quantitative discussion of cost efficiency, we can calculate the following metric:

ME #2: Cost Efficiency = (Productivity − β) / Price.   (5)

Combining Eqs. (4), (5), and Fig. 2, we can determine the cost-efficiency ranking (the higher the value, the better).
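The cost-efficiency ranking can be computed as in the following minimal sketch (our own illustration): an ordinary least-squares fit of Eq. (4) over (price, productivity) pairs, followed by Eq. (5) for each tool. The prices and productivity values in main() are hypothetical.

#include <stdio.h>

/* Ordinary least-squares fit of Productivity = alpha * Price + beta, Eq. (4). */
static void fit_regression(const double price[], const double prod[], int n,
                           double *alpha, double *beta)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (int i = 0; i < n; i++) {
        sx  += price[i];
        sy  += prod[i];
        sxx += price[i] * price[i];
        sxy += price[i] * prod[i];
    }
    *alpha = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *beta  = (sy - *alpha * sx) / n;
}

/* ME #2 (cost efficiency) per Eq. (5). */
static double cost_efficiency(double prod, double price, double beta)
{
    return (prod - beta) / price;
}

int main(void)
{
    /* Hypothetical tools: prices and productivity values are illustrative only. */
    const double price[] = { 1000.0, 2000.0, 4000.0, 8000.0 };
    const double prod[]  = {   55.0,   60.0,   75.0,   80.0 };
    const int n = 4;
    double alpha, beta;

    fit_regression(price, prod, n, &alpha, &beta);
    for (int i = 0; i < n; i++)
        printf("Tool %d: cost efficiency = %.4f\n", i,
               cost_efficiency(prod[i], price[i], beta));
    return 0;
}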
C. Benchmarks of Top Performance Tools
This section provides some of the actual evaluation results of static analysis tools. We conducted a comprehensive evaluation using the latest versions of many static analysis tools as of December 2013. In our evaluation, we identified three top-performing tools: CodeSonar by GrammaTech, and Code Prover and Bug Finder by MathWorks. These tool vendors agreed to the publication of the evaluation results; thus, we provide the actual evaluation results of only these top-performing tools here.
Table IV compares the detection rates and false positive rates of the three tools, and Figure 3 compares their productivity values. As shown in the table and figure, well-known defects, e.g., static and dynamic memory defects, were successfully detected by all three tools. However, some defects remain difficult even for the top-performing tools. For example, although CodeSonar handles concurrency defects better than Code Prover, they are still difficult to detect. Conversely, Code Prover handles stack-related defects better than CodeSonar, but they, too, are still difficult to detect. Thus, we anticipate extensive improvement of the tools in their weak areas.
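For instance, applying Eq. (3) to the concurrency row of Table IV gives a productivity of √(70 × 77) ≈ 73 for CodeSonar, whereas Code Prover and Bug Finder, whose concurrency detection rates are 0, both score 0 on this metric.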
IV. CONCLUSION
In this paper, we discussed the benchmarks of static analysis
tools. For this purpose, we developed test suites and evaluation
criteria. We also provided some of the actual evaluation results
of the top-performing tools as references. The results showed that some defects are still difficult to detect even for the top-performing tools.
The proposed test suites are available in the public domain. Thus, anyone can use and improve them if necessary. We anticipate that our test suites will drive better tool performance in the future.
V. ACKNOWLEDGMENT
GrammaTech and MathWorks agreed to the publication of the evaluation results of their tools. We would like to express our great appreciation for their understanding.
REFERENCES
[1] M. Aoyama, "Computing for the next-generation automobile," Computer, vol. 45, no. 6, pp. 32–37, 2012.
[2] ISO 26262, "Road vehicles – Functional safety," ISO, November 2011.
[3] T. Kelly and R. Weaver, "The goal structuring notation – a safety argument notation," in Proc. of the Dependable Systems and Networks 2004 Workshop on Assurance Cases, 2004.
[4] K. Kratkiewicz, "Using a diagnostic corpus of C programs to evaluate buffer overflow detection by static analysis tools," p. 5, 2005.
[5] P. E. Black, M. Kass, M. Koo, and E. Fong, "Source code security analysis tool functional specification version 1.1," NIST Special Publication 500-268 v1.1, February 2011.
[6] V. Okun, A. Delaitre, and P. E. Black, "Report on the third static analysis tool exposition (SATE 2010)," NIST Special Publication 500-283, October 2011.
0"
10"
20"
30"
40"
50"
60"
70"
80"
90"
100"
Sta/c"memory"
defects"
Dynamic"
memory"defects"
Stack"related"
defects"
Numerical"
defects"
Resource"
management"
defects"
Pointer"related"
defects"
Concurrency"
defects"
Inappropriate"
code"
Misc"defects"
0"
10"
20"
30"
40"
50"
60"
70"
80"
90"
100"
Sta/c"memory"
defects"
Dynamic"
memory"defects"
Stack"related"
defects"
Numerical"
defects"
Resource"
management"
defects"
Pointer"related"
defects"
Concurrency"
defects"
Inappropriate"
code"
Misc"defects"
(a) CodeSonar - GrammaTech (b) Code Prover - MathWorks (c) Bug Finder - MathWorks
Fig. 3. Comparison of Productivity.