Nicolas Harrand’s research while affiliated with KTH Royal Institute of Technology and other places

Publications (26)


Coverage-Based Debloating for Java Bytecode
Article · Full-text available · July 2022 · 56 Reads · 14 Citations
ACM Transactions on Software Engineering and Methodology
Nicolas Harrand

Software bloat is code that is packaged in an application but is actually not necessary to run the application. The presence of software bloat is an issue for security, for performance, and for maintenance. In this paper, we introduce a novel technique for debloating, which we call coverage-based debloating. We implement the technique for one single language: Java bytecode. We leverage a combination of state-of-the-art Java bytecode coverage tools to precisely capture what parts of a project and its dependencies are used when running with a specific workload. Then, we automatically remove the parts that are not covered, in order to generate a debloated version of the project. We successfully debloat 211 library versions from a dataset of 94 unique open-source Java libraries. The debloated versions are syntactically correct and preserve their original behavior according to the workload. Our results indicate that 68.3% of the libraries’ bytecode and 20.3% of their total dependencies can be removed through coverage-based debloating. For the first time in the literature on software debloating, we assess the utility of debloated libraries with respect to client applications that reuse them. We select 988 client projects that either have a direct reference to the debloated library in their source code or whose test suite covers at least one class of the libraries that we debloat. Our results show that 81.5% of the clients, with at least one test that uses the library, successfully compile and pass their test suite when the original library is replaced by its debloated version.
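The core idea of the abstract above, keeping only what a workload covers, can be illustrated with a minimal sketch. It is not the authors' tool: the class-level granularity, the function names `debloat` and `removed_share`, and the library data are all assumptions made for illustration, and real coverage would come from a bytecode coverage tool rather than a hand-written set.

```python
def debloat(class_sizes, covered):
    """Partition classes into kept and removed sets.

    class_sizes: dict mapping class name -> bytecode size in bytes
    covered: set of class names executed by the workload
    """
    kept = {name: size for name, size in class_sizes.items() if name in covered}
    removed = {name: size for name, size in class_sizes.items() if name not in covered}
    return kept, removed


def removed_share(class_sizes, removed):
    """Fraction of the total bytecode size that debloating removes."""
    total = sum(class_sizes.values())
    return sum(removed.values()) / total if total else 0.0


# Hypothetical library: class names and sizes are made up for illustration.
sizes = {"lib.Parser": 4000, "lib.Writer": 3000, "lib.LegacyCodec": 3000}
covered_by_workload = {"lib.Parser", "lib.Writer"}

kept, removed = debloat(sizes, covered_by_workload)
print(sorted(removed))
print(removed_share(sizes, removed))
```

The result is only as good as the workload: a class that the workload never exercises is removed even if some client needs it, which is why the abstract evaluates the debloated libraries against client test suites.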

Characteristics of the 161 repositories considered as study subjects.
RQ2 experimental results per rule. SORALDBOT generates 54 patches that fix 80 newly introduced violations over 350 days. The numbers in the last row represent the number of unique items in the corresponding column. For example, there are 21 unique projects with new violations of at least one of the considered rules.
Sorald: Automatic Patch Suggestions for SonarQube Static Analysis Violations

January 2022 · 101 Reads · 9 Citations
IEEE Transactions on Dependable and Secure Computing
Nicolas Yves Maurice Harrand · Simon Larsen · [...]

Previous work has shown that early resolution of issues detected by static code analyzers can prevent major costs later on. However, developers often ignore such issues for two main reasons. First, many issues should be interpreted to determine if they correspond to actual flaws in the program. Second, static analyzers often do not present the issues in a way that is actionable. To address these problems, we present Sorald: a novel system that uses metaprogramming templates to transform the abstract syntax trees of programs and suggests fixes for static analysis warnings. Thus, the burden on the developer is reduced from interpreting and fixing static issues, to inspecting and approving full-fledged solutions. Sorald fixes violations of 10 rules from SonarJava, one of the most widely used static analyzers for Java. We evaluate Sorald on a dataset of 161 popular repositories on GitHub. Our analysis shows the effectiveness of Sorald as it fixes 65% (852/1,307) of the violations that meet the repair preconditions. Overall, our experiments show it is possible to automatically fix notable violations of the static analysis rules produced by the state-of-the-art static analyzer SonarJava.


Fig. 2. Generic architecture for a Library Substitution Framework that targets a set of applications and a library reservoir. Bridges abstract the API used by the clients from the concrete implementation; the facade defines a common API for all libraries; a wrapper maps the abstract API to the concrete functions of an existing library of the reservoir.
Fig. 7. Distribution of the number of successful substitution per client. We regroup the 329 clients of our dataset based on the number of JSON library substitutions (out of 20) that passed their test suite.
Automatic Diversity in the Software Supply Chain

November 2021 · 84 Reads · 1 Citation

Despite its obvious benefits, the increased adoption of package managers to automate the reuse of libraries has opened the door to a new class of hazards: supply chain attacks. By injecting malicious code in one library, an attacker may compromise all instances of all applications that depend on the library. To mitigate the impact of supply chain attacks, we propose the concept of a Library Substitution Framework. This novel concept leverages one key observation: when an application depends on a library, it is very likely that there exist other libraries that provide similar features. The key objective of a Library Substitution Framework is to enable the developers of an application to harness this diversity of libraries in their supply chain. The framework lets them generate a population of application variants, each depending on a different alternative library that provides similar functionalities. To investigate the relevance of this concept, we develop ARGO, a proof-of-concept implementation of this framework that harnesses the diversity of JSON suppliers. We study the feasibility of library substitution and its impact on a set of 368 clients. Our empirical results show that for 195 of the 368 Java applications tested, we can substitute the original JSON library used by the client with at least 15 other JSON libraries without modifying the client's code. These results show the capacity of a Library Substitution Framework to diversify the supply chain of the client applications of the libraries it targets.
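The facade and wrapper roles described in the Fig. 2 caption can be sketched as follows. This is an illustrative sketch only, not ARGO: the class names `JsonFacade`, `StdlibWrapper`, and `CompactWrapper`, the client function, and the payload are hypothetical, and both wrappers delegate to Python's standard `json` module with different settings as a stand-in for two distinct suppliers.

```python
import json
from abc import ABC, abstractmethod


class JsonFacade(ABC):
    """Common API that every library in the reservoir is wrapped to provide."""

    @abstractmethod
    def parse(self, text): ...

    @abstractmethod
    def serialize(self, value): ...


class StdlibWrapper(JsonFacade):
    """Wrapper mapping the facade to the standard json module defaults."""

    def parse(self, text):
        return json.loads(text)

    def serialize(self, value):
        return json.dumps(value)


class CompactWrapper(JsonFacade):
    """Stand-in for a second supplier: same parser, compact output format."""

    def parse(self, text):
        return json.loads(text)

    def serialize(self, value):
        return json.dumps(value, separators=(",", ":"))


def client_total(lib, payload):
    """Client code: it only sees the facade, never a concrete library."""
    order = lib.parse(payload)
    return sum(item["qty"] for item in order["items"])


def passing_variants(reservoir, payload, expected):
    """Names of the substitutions for which the client's check still passes."""
    return [name for name, lib in reservoir.items()
            if client_total(lib, payload) == expected]


reservoir = {"stdlib": StdlibWrapper(), "compact": CompactWrapper()}
payload = '{"items": [{"qty": 2}, {"qty": 3}]}'
print(passing_variants(reservoir, payload, expected=5))
```

Because the client depends only on the facade, swapping the supplier requires no change to client code; the client's own tests decide which substitutions are acceptable.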


API beauty is in the eye of the clients: 2.2 million Maven dependencies reveal the spectrum of client-API usages

November 2021 · 38 Reads · 13 Citations
Journal of Systems and Software

Hyrum’s law states a common observation in the software industry: “With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody”. Meanwhile, recent research results seem to contradict this observation when they state that “for most APIs, there is a small number of features that are actually used”. In this work, we perform a large-scale empirical study of client-API relationships in the Maven ecosystem, in order to investigate this seeming paradox between the observations in industry and the research literature. We study the 94 most popular libraries in Maven Central, as well as the 829,410 client artifacts that declare a dependency on these libraries and that are available in Maven Central, summing up to 2.2M dependencies. Our analysis indicates the existence of a wide spectrum of API usages: with enough clients, most API types end up being used at least once. Our second key observation is that, for all libraries, there is a small set of API types that are used by the vast majority of its clients. The practical consequences of this study are two-fold: (i) it is possible for API maintainers to find an essential part of their API on which they can focus their efforts; (ii) API developers should limit the public API elements to the set of features for which they are ready to have users.



A comprehensive study of bloated dependencies in the Maven ecosystem

May 2021 · 1,669 Reads · 82 Citations
Empirical Software Engineering

Build automation tools and package managers have a profound influence on software development. They facilitate the reuse of third-party libraries, support a clear separation between the application’s code and its external dependencies, and automate several software development tasks. However, the wide adoption of these tools introduces new challenges related to dependency management. In this paper, we propose an original study of one such challenge: the emergence of bloated dependencies. Bloated dependencies are libraries that are packaged with the application’s compiled code but that are actually not necessary to build and run the application. They artificially grow the size of the built binary and increase maintenance effort. We propose DepClean, a tool to determine the presence of bloated dependencies in Maven artifacts. We analyze 9,639 Java artifacts hosted on Maven Central, which include a total of 723,444 dependency relationships. Our key result is as follows: 2.7% of the dependencies directly declared are bloated, 15.4% of the inherited dependencies are bloated, and 57% of the transitive dependencies of the studied artifacts are bloated. In other words, it is feasible to reduce the number of dependencies of Maven artifacts to 1/4 of its current count. Our qualitative assessment with 30 notable open-source projects indicates that developers pay attention to their dependencies when they are notified of the problem. They are willing to remove bloated dependencies: 21/26 answered pull requests were accepted and merged by developers, removing 140 dependencies in total: 75 direct and 65 transitive.
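The three categories reported in the abstract above (direct, inherited, transitive) suggest a simple classification once the used dependencies are known. The sketch below assumes that usage information is already available as a set; it does not reproduce how DepClean computes it, and the function names and artifact coordinates are hypothetical.

```python
from collections import defaultdict


def bloated_by_kind(declared, used):
    """Group unused dependencies by how they enter the dependency tree.

    declared: dict mapping dependency name -> 'direct', 'inherited', or 'transitive'
    used: set of dependency names actually referenced by the artifact
    """
    bloated = defaultdict(list)
    for name, kind in declared.items():
        if name not in used:
            bloated[kind].append(name)
    return {kind: sorted(names) for kind, names in bloated.items()}


def bloat_rate(declared, used, kind):
    """Share of the dependencies of a given kind that are bloated."""
    of_kind = [name for name, k in declared.items() if k == kind]
    if not of_kind:
        return 0.0
    return sum(1 for name in of_kind if name not in used) / len(of_kind)


# Hypothetical artifact: the coordinates below are made up for illustration.
declared = {
    "org.a:core": "direct",
    "org.b:utils": "direct",
    "org.c:logging": "transitive",
    "org.d:xml": "transitive",
    "org.e:parent-lib": "inherited",
}
used = {"org.a:core", "org.c:logging"}

print(bloated_by_kind(declared, used))
print(bloat_rate(declared, used, "direct"))
```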


The Behavioral Diversity of Java JSON Libraries

April 2021 · 69 Reads

JSON is a popular file and data format that is precisely specified by the IETF in RFC 8259. Yet, this specification implicitly and explicitly leaves room for many design choices when it comes to parsing and generating JSON. This yields the opportunity of diverse behavior among independent implementations of JSON libraries. A thorough analysis of this diversity can be used by developers to choose one implementation or to design a resilient multi-version architecture. We present the first systematic analysis and comparison of the input/output behavior of 20 JSON libraries in Java. We analyze the diversity of architectural choices among libraries, and we execute each library with well-formed and ill-formed JSON files to assess their behavior. We first find that the data structure selected to represent JSON objects and the encoding of numbers are the main design differences, which influence the behavior of the libraries. Second, we observe that the libraries behave in a similar way with regular, well-formed JSON files. However, there is a remarkable behavioral diversity with ill-formed files, or corner cases such as large numbers or duplicate data.
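The corner cases named above (duplicate data, large numbers, ill-formed input) are easy to probe in any single parser. The sketch below uses Python's standard `json` module purely as an illustration; it is not one of the 20 Java libraries studied, and the helper name `probe` is hypothetical. A comparison in the spirit of the paper would run the same inputs through several libraries and compare the outcomes.

```python
import json
import math


def probe(text):
    """Parse text and report ('ok', value) or ('error', exception name)."""
    try:
        return ("ok", json.loads(text))
    except json.JSONDecodeError as exc:
        return ("error", type(exc).__name__)


duplicate_keys = probe('{"a": 1, "a": 2}')      # this parser keeps the last value
huge_float = probe("1e400")                     # overflows to infinity here
big_integer = probe("123456789012345678901234567890")  # kept exactly as an int
single_quotes = probe("{'a': 1}")               # ill-formed: rejected
trailing_comma = probe("[1,]")                  # ill-formed: rejected
nan_literal = probe("NaN")                      # not in RFC 8259, accepted by default

for name, outcome in [
    ("duplicate keys", duplicate_keys),
    ("huge float", huge_float),
    ("big integer", big_integer),
    ("single quotes", single_quotes),
    ("trailing comma", trailing_comma),
    ("NaN literal", nan_literal),
]:
    print(name, outcome)
```

Each of these outcomes is a design choice that the specification leaves open or that a parser makes leniently, which is where independent implementations can diverge.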


Characteristics of the 161 repositories considered as study subjects.
RQ2 experimental results per rule. SORALDBOT generates a proper number of patches (19) that are effective and fix all newly introduced violations over 60 days (25 out of 25).
The pull requests submitted to real open-source projects to collect developer feedback about SORALD.
Sorald: Automatic Patch Suggestions for SonarQube Static Analysis Violations

March 2021 · 179 Reads

Previous work has shown that early resolution of issues detected by static code analyzers can prevent major cost later on. However, developers often ignore such issues for two main reasons. First, many issues should be interpreted to determine if they correspond to actual flaws in the program. Second, static analyzers often do not present the issues in a way that makes it apparent how to fix them. To address these problems, we present Sorald: a novel system that adopts a set of predefined metaprogramming templates to transform the abstract syntax trees of programs to suggest fixes for static issues. Thus, the burden on the developer is reduced from both interpreting and fixing static issues, to inspecting and approving solutions for them. Sorald fixes violations of 10 rules from SonarQube, one of the most widely used static analyzers for Java. We also implement an effective mechanism to integrate Sorald into development workflows based on pull requests. We evaluate Sorald on a dataset of 161 popular repositories on GitHub. Our analysis shows the effectiveness of Sorald as it fixes 94% (1,153/1,223) of the violations that it attempts to fix. Overall, our experiments show it is possible to automatically fix violations of static analysis rules produced by the state-of-the-art static analyzer SonarQube.


Coverage-Based Debloating for Java Bytecode

August 2020 · 172 Reads

Software bloat is code that is packaged in an application but is actually not used and not necessary to run the application. The presence of bloat is an issue for software security, for performance, and for maintenance. In recent years, several works have proposed techniques to detect and remove software bloat. In this paper, we introduce a novel technique to debloat Java bytecode through dynamic analysis, which we call trace-based debloat. We have developed JDBL, a tool that automates the collection of accurate execution traces and the debloating process. Given a Java project and a workload, JDBL generates a debloated version of the project that is syntactically correct and preserves the original behavior, modulo the workload. We evaluate the feasibility and the effectiveness of trace-based debloat with 395 open-source Java libraries for a total of 10M+ lines of code. We demonstrate that our approach significantly reduces the size of these libraries while preserving the functionalities needed by their clients.


Java Decompiler Diversity and its Application to Meta-decompilation

May 2020 · 151 Reads

During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, decompilation, which aims at producing source code from bytecode, relies on strategies to reconstruct the information that has been lost. Different Java decompilers use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled have a direct impact on the quality of the source code produced by decompilers. In this paper, we assess the strategies of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion and semantic equivalence modulo inputs. Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. The highest ranking decompiler in this study produces syntactically correct output for 84% of the classes in our dataset and semantically equivalent output for 78%. Our results demonstrate that each decompiler correctly handles a different set of bytecode classes. We propose a new decompiler called Arlecchino that leverages the diversity of existing decompilers. To do so, we merge partial decompilations into a new one based on compilation errors. Arlecchino handles 37.6% of bytecode classes that were previously handled by no decompiler. We publish the sources of this new bytecode decompiler.


Citations (16)


... Soto-Valero et al. [49,50] found that 57% of Java dependencies are bloated and 89.2% of them remain bloated over time. To tackle this issue, several studies have been done using reachability [51,52], coverage [53], and dynamic analysis [54] to debloat applications. This has been found to impact the security of the applications [55]-[58]. ...

Reference:

Less Is More: A Mixed-Methods Study on Security-Sensitive API Calls in Java for Better Dependency Selection
Coverage-Based Debloating for Java Bytecode

ACM Transactions on Software Engineering and Methodology

... In this work, we compare LLMs based on the quality of their generated source code from different aspects. Both LLMs and quality assurance are already integral parts of software engineering both in academic [10], [11] and industrial fields [12]. ...

Sorald: Automatic Patch Suggestions for SonarQube Static Analysis Violations

IEEE Transactions on Dependable and Secure Computing

... While the format is thoroughly specified in RFC 8259, the specification leaves significant room for choice when implementing a specific library to process JSON. Harrand et al. [5] observed a remarkable behavioral diversity between Java JSON libraries with ill-formed files, or corner cases such as large numbers or duplicate data. ...

The Behavioral Diversity of Java JSON Libraries
  • Citing Conference Paper
  • October 2021

... These Maven artifacts will be the clients that we use for analysis. Hence, for the purpose of this research, we will examine the dependencies specified in these clients and assess the individual impact of updating each dependency, following a similar approach as conducted in a prior study [21]. ...

API beauty is in the eye of the clients: 2.2 million Maven dependencies reveal the spectrum of client-API usages
  • Citing Article
  • November 2021

Journal of Systems and Software

... As a novel security concern, the multiple dependencies at the micro-level for most software and applications have been understood in terms of the lack of transparency and oversight by purchasing entities (Ellison et al. 2010). While, however, the risks inherent in these dependencies have received some attention, they have been investigated so far primarily in terms of vulnerability to mistakes (Cox 2019) or malicious attacks (Ohm et al. 2020; Harrand et al. 2021). We argue, however, that locating distributed digital supply chains also illustrates how technological developments and evolutions work in tandem with digital infrastructures to shape strategic dependencies. ...

Automatic Diversity in the Software Supply Chain

... Previous research has extensively explored vulnerability sources (Cox et al. 2015; Gkortzis et al. 2021; Wang et al. 2020; Xia et al. 2013), vulnerability propagation (Liu et al. 2022; Wu et al. 2023), and the scope of vulnerability influence (Decan et al. 2018; Imtiaz et al. 2021; Soto-Valero et al. 2021; Zapata et al. 2018). However, existing works either perform coarse-grained package-level analysis, which can produce many false positives, or require manual effort, which is not scalable. ...

A comprehensive study of bloated dependencies in the Maven ecosystem

Empirical Software Engineering

... Based on the templates defined in Section 3.1, TemVUR determines which templates should be applied to the buggy statements. Unlike template-based APR approaches at the source level [30], [32], [41], which typically rely on AST-based matching, it is challenging to construct an AST from bytecode without decompilation [54]. Therefore, TemVUR performs a bytecode-based matching approach. ...

Java decompiler diversity and its application to meta-decompilation

Journal of Systems and Software

... This file has details about the top 999 Java artifacts published on Maven Central, as well as the dependencies between them. We retrieve this data file from previous work [32], [33]. We use GEPHI to produce a graph from this data, and to manipulate its layout, as illustrated in Figure 4. Finally, we export the resulting graph in PDF, PNG, and SVG formats, before exiting the application. ...

The Maven Dependency Graph: A Temporal Graph-Based Representation of Maven Central

... Decompilation or lifting from low-level binary code to a structured, high-level representation is a problem with a substantial history and practical significance in a variety of settings [Cifuentes 1994; Hamilton and Danicic 2009; Harrand et al. 2019; Kruegel et al. 2004]. In the context of programmable blockchains, decompilation has found a new application domain, with difficult technological considerations but intense demand. ...

The Strengths and Behavioral Quirks of Java Bytecode Decompilers

... The idea is to determine the parts of the test suite relevant to each hot method, so that only those rather than the whole suite need to be run during the GI search. A similar approach was used by Harrand et al. (2019) when selecting a location to make edits. They suggested edits should only be made in areas covered by a test case otherwise dead code will be edited. ...

A journey among Java neutral program variants

Genetic Programming and Evolvable Machines