Article

On the Impact of Inter-language Dependencies in Multi-language Systems: Empirical Case Study on Java Native Interface Applications


Abstract

Nowadays, developers often use multiple programming languages to exploit the advantages of each language and to reuse code. However, dependency analysis across multiple languages is more challenging than in mono-language systems. In this paper, we introduce two approaches for multi-language dependency analysis: S-MLDA (Static Multi-language Dependency Analyzer) and H-MLDA (Historical Multi-language Dependency Analyzer). We apply them to ten open-source multi-language systems to empirically analyze the prevalence of dependencies across languages, i.e., inter-language dependencies, and their impact on software quality and security. Our main results show that: the more inter-language dependencies there are, the higher the risk of bugs and vulnerabilities being introduced, while this risk remains constant for intra-language dependencies; the percentage of bugs within inter-language dependencies is three times higher than the percentage of bugs identified in intra-language dependencies; and the percentage of vulnerabilities within inter-language dependencies is twice the percentage of vulnerabilities introduced in intra-language dependencies.


... Modern software systems, such as Apache Ambari and Spark, are usually written in multiple programming languages (PLs). One main reason for adopting multiple PLs is to reuse existing code with required functionalities (Grichi et al., 2021). Another main reason is to take advantage of specific PLs to implement certain features, to meet various software quality needs, and to improve software development efficiency (Grichi et al., 2021; Ray et al., 2014; Mayer and Bauer, 2015; Kochhar et al., 2016; Mayer, 2017; Mayer et al., 2017; Abidi et al., 2019a). ...
... One main reason for adopting multiple PLs is to reuse existing code with required functionalities (Grichi et al., 2021). Another main reason is to take advantage of specific PLs to implement certain features, to meet various software quality needs, and to improve software development efficiency (Grichi et al., 2021; Ray et al., 2014; Mayer and Bauer, 2015; Kochhar et al., 2016; Mayer, 2017; Mayer et al., 2017; Abidi et al., 2019a). Nowadays, multi-programming-language (MPL) software development is increasingly prevalent with advances in technology (Kontogiannis et al., 2006; Kochhar et al., 2016; Abidi et al., 2021). ...
... In 2020, Grichi et al. performed a case study on the impact of inter-language dependencies in MPL systems (Grichi et al., 2021). They found that the risk of bug introduction gets higher when there are more inter-language dependencies, while this risk remains constant for intra-language dependencies; the percentage of bugs found in inter-language dependencies is three times higher than the percentage of bugs identified in intra-language dependencies. ...
Article
Full-text available
Context: Modern software systems (e.g., Apache Spark) are usually written in multiple programming languages (PLs). There is little understanding of the phenomenon of multi-programming-language commits (MPLCs), which involve modified source files written in multiple PLs. Objective: This work aims to explore MPLCs and their impacts on development difficulty and software quality. Methods: We performed an empirical study on eighteen non-trivial Apache projects with 197,566 commits. Results: (1) the most commonly used PL combination consists of all the four PLs, i.e., C/C++, Java, JavaScript, and Python; (2) 9% of the commits from all the projects are MPLCs, and the proportion of MPLCs in 83% of the projects goes to a relatively stable level; (3) more than 90% of the MPLCs from all the projects involve source files in two PLs; (4) the change complexity of MPLCs is significantly higher than that of non-MPLCs; (5) issues fixed in MPLCs take significantly longer to be resolved than issues fixed in non-MPLCs in 89% of the projects; (6) MPLCs do not show significant effects on issue reopen; (7) source files undergoing MPLCs tend to be more bug-prone; and (8) MPLCs introduce more bugs than non-MPLCs. Conclusions: MPLCs are related to increased development difficulty and decreased software quality.
... Intuitively, this prevalence has to do with the benefits of combining the best functional capabilities of different languages [3,16,54]. Yet the decisions on language selection may not fully depend on functionality considerations: As earlier works initially suggested [4,40,46], the decisions may come with security [35,90] consequences. ...
... Of the few works on multi-language software, most were focused on prevalence [53,85] and good/bad practices of developers [3], or only limited to JNI (Java-C) programs [40,46,90]. Grichi et al. [35] reported likely greater security risks of multilingual code than single-language ones, but still for JNI code only and based on only 10 projects. Individual languages were found to have little association with bug proneness [10,75]. ...
... Earlier studies revealed cross-language links as possible points of high risks in multi-language systems in general [53] and interlanguage dependencies as contributors to the vulnerabilities of JNI programs in particular [35]. We thus examine the effects of language interfacing on the vulnerability proneness of multilingual code that uses corresponding interfacing mechanisms, as a way to justify/explain the security relevance of language selection. ...
Conference Paper
Full-text available
Software construction using multiple languages has long been a norm, yet it is still unclear if multilingual code construction has significant security implications and real security consequences. This paper aims to address this question with a large-scale study of popular multi-language projects on GitHub and their evolution histories, enabled by our novel techniques for multilingual code characterization. We found statistically significant associations between the proneness of multilingual code to vulnerabilities (in general and of specific categories) and its language selection. We also found this association is correlated with that of the language interfacing mechanism, not that of individual languages. We validated our statistical findings with in-depth case studies on actual vulnerabilities, explained via the mechanism and language selection. Our results call for immediate actions to assess and defend against multilingual vulnerabilities, for which we provide practical recommendations.
... Modern software systems, such as Apache Spark and Ambari, are usually written in multiple programming languages (PLs). One of the main reasons for adopting multiple PLs is to reuse existing code with required functionalities [1]. Another main reason is to take advantage of specific PLs to implement certain features, to meet various software quality needs, and to improve software development efficiency [1], [2], [3], [4], [5], [6], [7]. ...
... One of the main reasons for adopting multiple PLs is to reuse existing code with required functionalities [1]. Another main reason is to take advantage of specific PLs to implement certain features, to meet various software quality needs, and to improve software development efficiency [1], [2], [3], [4], [5], [6], [7]. Nowadays, multi-programming-language (MPL) software development is increasingly prevalent with advances in technology [8], [4], [9]. ...
... In 2020, Grichi et al. performed a case study on the impact of inter-language dependencies in MPL systems [1]. They found that the risk of bug introduction gets higher when there are more inter-language dependencies, while this risk remains constant for intra-language dependencies; the percentage of bugs found in inter-language dependencies is three times higher than the percentage of bugs identified in intra-language dependencies. ...
Preprint
Full-text available
Modern software systems, such as Spark, are usually written in multiple programming languages (PLs). Besides benefiting from code reuse, such systems can also take advantage of specific PLs to implement certain features, to meet various quality needs, and to improve development efficiency. In this context, a change to such systems may need to modify source files written in different PLs. We define a multi-programming-language commit (MPLC) in a version control system (e.g., Git) as a commit that involves modified source files written in two or more PLs. To our knowledge, the phenomenon of MPLCs in software development has not been explored yet. In light of the potential impact of MPLCs on development difficulty and software quality, we performed an empirical study to understand the state of MPLCs, their change complexity, as well as their impact on open time of issues and bug proneness of source files in real-life software projects. By exploring the MPLCs in 20 non-trivial Apache projects with 205,994 commits, we obtained the following findings: (1) 9% of the commits from all the projects are MPLCs, and the proportion of MPLCs in 80% of the projects goes to a relatively stable level; (2) more than 90% of the MPLCs from all the projects involve source files written in two PLs; (3) the change complexity of MPLCs is significantly higher than that of non-MPLCs in all projects; (4) issues fixed in MPLCs take significantly longer to be resolved than issues fixed in non-MPLCs in 80% of the projects; and (5) source files that have been modified in MPLCs tend to be more bug-prone than source files that have never been modified in MPLCs. These findings provide practitioners with useful insights on the architecture design and quality management of software systems written in multiple PLs.
... Yet cross-language bugs are not limited to one language combination (e.g., Java-C) or one interfacing mechanism (e.g., JNI) [46], although the few prior relevant works available all targeted that particular case (i.e., Java-C with JNI) [11,36,37,43]. For instance, recently Li et al. [48] demonstrated multiple cases of high-severity security vulnerabilities of different kinds that happen across Python and C code in popular open-source projects such as NumPy [58]. ...
Conference Paper
Full-text available
Analyzing multilingual code holistically is key to systematic quality assurance of real-world software which is mostly developed in multiple computer languages. Toward such analyses, state-of-the-art approaches propose an almost-fully language-agnostic methodology and apply it to dynamic dependence analysis/slicing of multilingual code, showing great promises. We investigated this methodology through a technical analysis followed by a replication study applying it to 10 real-world multilingual projects of diverse language combinations. Our results revealed critical practicality (i.e., having the levels of efficiency/scalability, precision, and extensibility to various language combinations for practical use) challenges to the methodology. Based on the results, we reflect on the underlying pitfalls of the language-agnostic design that leads to such challenges. Finally, looking forward to the prospects of dynamic analysis for multilingual code, we identify a new research direction towards better practicality and precision while not sacrificing extensibility much, as supported by preliminary results. The key takeaway is that pursuing fully language-agnostic analysis may be both impractical and unnecessary, and striving for a better balance between language independence and practicality may be more fruitful.
Preprint
Recent data breaches raise awareness about security vulnerabilities and their frequent presence in all types of software systems. Indeed, security bugs are one of the principal causes of security vulnerabilities in software systems, as they can be exploited to gain unauthorized access within an information system. In this paper, we revisited one of the strategies for analyzing and predicting security bugs based on the use of software quality metrics. We conducted an empirical study on twelve open-source software systems to show that it is possible to develop an effective predictive model of security bugs. Indeed, we observed that Random Forest succeeded in predicting security bugs with an accuracy of 99.98%. In addition, we found that the metric values of the files involving security bugs are on average 2.3 times higher than the metric values of the files without security bugs. This finding indicates that machine learning models combined with software metrics can be used to build a better forecasting model, thereby aiding software developers in developing secure software systems and potentially spotting security vulnerabilities at an early stage.
Article
Full-text available
As a software project ages, its source code is modified to add new features, restructure existing ones, and fix defects. These source code changes often induce changes in the build system, i.e., the system that specifies how source code is translated into deliverables. However, since developers are often not familiar with the complex and occasionally archaic technologies used to specify build systems, they may not be able to identify when their source code changes require accompanying build system changes. This can cause build breakages that slow development progress and impact other developers, testers, or even users. In this paper, we mine the source and test code changes that required accompanying build changes in order to better understand this co-change relationship. We build random forest classifiers using language-agnostic and language-specific code change characteristics to explain when code-accompanying build changes are necessary based on historical trends. Case studies of the Mozilla C++ system, the Lucene and Eclipse open source Java systems, and the IBM Jazz proprietary Java system indicate that our classifiers can accurately explain when build co-changes are necessary with an AUC of 0.60-0.88. Unsurprisingly, our highly accurate C++ classifiers (AUC of 0.88) derive much of their explanatory power from indicators of structural change (e.g., was a new source file added?). On the other hand, our Java classifiers are less accurate (AUC of 0.60-0.78) because roughly 75% of Java build co-changes do not coincide with changes to the structure of a system, but rather are instigated by concerns related to release engineering, quality assurance, and general build maintenance.
Article
Full-text available
Context: Most companies, independently of their size and activity type, are facing the problem of managing, maintaining and/or replacing (part of) their existing software systems. These legacy systems are often large applications playing a critical role in the company’s information system and with a non-negligible impact on its daily operations. Improving their comprehension (e.g., architecture, features, enforced rules, handled data) is a key point when dealing with their evolution/modernization. Objective: The process of obtaining useful higher-level representations of (legacy) systems is called reverse engineering (RE), and remains a complex goal to achieve. So-called Model Driven Reverse Engineering (MDRE) has been proposed to enhance more traditional RE processes. However, generic and extensible MDRE solutions potentially addressing several kinds of scenarios relying on different legacy technologies are still missing or incomplete. This paper proposes to make a step in this direction. Method: MDRE is the application of Model Driven Engineering (MDE) principles and techniques to RE in order to generate relevant model-based views on legacy systems, thus facilitating their understanding and manipulation. In this context, MDRE is practically used in order to 1) discover initial models from the legacy artifacts composing a given system and 2) understand (process) these models to generate relevant views (i.e., derived models) on this system. Results: Capitalizing on the different MDRE practices and our previous experience (e.g., in real modernization projects), this paper introduces and details the MoDisco open source MDRE framework. It also presents the underlying MDRE global methodology and architecture accompanying this proposed tooling. Conclusion: MoDisco is intended to ease the design and building of model-based solutions dedicated to legacy systems RE. As empirical evidence of its relevance and usability, we report on its successful application in real industrial projects and on the concrete experience we gained from that.
Conference Paper
Full-text available
Requirements traceability (RT) links requirements to the corresponding source code entities, which implement them. Information Retrieval (IR) based RT links recovery approaches are often used to automatically recover RT links. However, such approaches exhibit low accuracy, in terms of precision, recall, and ranking. This paper presents an approach (CoChaIR), complementary to existing IR-based RT links recovery approaches. CoChaIR leverages historical co-change information of files to improve the accuracy of IR-based RT links recovery approaches. We evaluated the effectiveness of CoChaIR on three datasets, i.e., iTrust, Pooka, and SIP Communicator. We compared CoChaIR with two different IR-based RT links recovery approaches, i.e., vector space model and Jensen-Shannon divergence model. Our study results show that CoChaIR significantly improves precision and recall by up to 12.38% and 5.67% respectively; while decreasing the rank of true positive links by up to 48% and reducing false positive links by up to 44%.
Article
Full-text available
Type safety is a promising approach to enhancing software security. Programs written in type-safe programming languages such as Java are type-safe by construction. However, in practice, many complex applications are heterogeneous, i.e., they contain components written in different languages. The Java Native Interface (JNI) allows type-safe Java code to interact with unsafe C code. When a type-safe language interacts with an unsafe language in the same address space, in general, the overall application becomes unsafe. In this work, we propose a framework called Safe Java Native Interface (SafeJNI) that ensures type safety of heterogeneous programs that contain Java and C components. We identify the loopholes of using JNI that would permit C code to bypass the type safety of Java. The proposed SafeJNI system fixes these loopholes and guarantees type safety when native C methods are called. The overall approach consists of (i) retro-fitting the native C methods to make them safe, and (ii) developing an enhanced system that captures additional invariants that must be satisfied to guarantee safe interoperation. The SafeJNI framework is implemented through a combination of static and dynamic checks on the C code. We have measured our system's effectiveness and performance on a set of benchmarks. During our experiments on the Zlib open source compression library, our system identified one vulnerability in the glue code between Zlib and Java. This vulnerability could be exploited to crash a large number of commercially deployed Java Virtual Machines (JVMs). The performance impact of SafeJNI on Zlib, while considerable, is less than reimplementing the C code.
Conference Paper
Full-text available
Code reviews with static analysis tools are today recommended by several security development processes. Developers are expected to use the tools' output to detect the security threats they themselves have introduced in the source code. This approach assumes that all developers can correctly identify a warning from a static analysis tool (SAT) as a security threat that needs to be corrected. We have conducted an industry experiment with a state-of-the-art static analysis tool and real vulnerabilities. We have found that average developers do not correctly identify the security warnings and only developers with specific experience are better than chance in detecting the security vulnerabilities. Specific SAT experience more than doubled the number of correct answers, and a combination of security experience and SAT experience almost tripled the number of correct security answers.
Conference Paper
Full-text available
Change impact analysis aims at identifying software artefacts that are being affected by a change. It helps developers to assess their change efforts and perform more adequate changes. Several approaches have been proposed to aid in impact analysis. However, to the best of our knowledge, none of these approaches have been used to study the scope of changes in a program. We present a metaphor inspired by seismology and propose a mapping between the concepts of seismology and change propagation, to study the scope of change propagation. We perform three case studies on Pooka, Rhino, and Xerces-J to observe change propagation. We use ANOVA and Duncan statistical tests to assess the statistical significance of our observations, which show that changes propagate to a limited scope.
Conference Paper
Full-text available
It is well known that the use of native methods in Java defeats Java's guarantees of safety and security, which is why the default policy of Java applets, for example, does not allow loading non-local native code. However, there is already a large amount of trusted native C/C++ code that comprises a significant portion of the Java Development Kit (JDK). We have carried out an empirical security study on a portion of the native code in Sun's JDK 1.6. By applying static analysis tools and manual inspection, we have identified in this security-critical code previously undiscovered bugs. Based on our study, we describe a taxonomy to classify bugs. Our taxonomy provides guidance to construction of automated and accurate bug-finding tools. We also suggest systematic remedies that can mediate the threats posed by the native code.
Conference Paper
Full-text available
The literature describes several approaches to identify the artefacts of programs that change together to reveal the (hidden) dependencies among these artefacts. These approaches analyse historical data, mined from version control systems, and report co-changing artefacts, which hint at the causes, consequences, and actors of the changes. We introduce the novel concepts of macro co-changes (MCC), i.e., of artefacts that co-change within a large time interval, and of dephase macro co-changes (DMCC), i.e., macro co-changes that always happen with the same shifts in time. We describe typical scenarios of MCC and DMCC and we use the Hamming distance to detect approximate occurrences of MCC and DMCC. We present our approach, Macocha, to identify these concepts in large programs. We apply Macocha and compare it in terms of precision and recall with UMLDiff (file stability) and association rules (co-changing files) on four systems: ArgoUML, FreeBSD, SIP, and XalanC. We also use external information to validate the (approximate) MCC and DMCC found by Macocha. We thus answer two research questions showing the existence and usefulness of these concepts and explaining scenarios of hidden dependencies among artefacts.
Article
Full-text available
Software developers are often faced with modification tasks that involve source code spread across a code base. Some dependencies between source code, such as those between source code written in different languages, are difficult to determine using existing static and dynamic analyses. To augment existing analyses and to help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns - sets of files that were changed together frequently in the past - from the change history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying the approach to the Eclipse and Mozilla open source projects and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.
Conference Paper
Nowadays, it is common to see software development teams combine multiple programming languages when developing a new software system. Most non-trivial software systems are developed using components written in different languages and technologies. This mixed-language programming approach allows developers to reuse existing code and libraries instead of implementing the code from scratch. It also allows them to leverage the strengths and benefits of each language. However, poor integration of different components of multi-language systems can lead to inconsistencies, dependency issues, and even bugs. To help developers make proper use of components and libraries written in different programming languages, researchers and practitioners have formulated multiple catalogs of design practices for multi-language systems. However, there is no evidence that these design practices are considered relevant by developers. Through this paper, we aim to assess the perception of developers regarding the relevance of multi-language design practices and their impact on software quality. To achieve this goal, we extracted a set of good and bad practices related to multi-language systems that are discussed in the literature and developers' documentation. We surveyed 93 developers about their usage of the practices and their perception of the benefits of the practices for software quality. Our results show that the proposed design practices are not equally prevalent in industry. Among the practices frequently used by developers, some have a positive impact on the understandability, reusability, and simplicity of multi-language systems. We summarise the reported strengths and limitations of the studied design practices and provide guidelines for practitioners. We also formulate recommendations for researchers interested in improving the quality of multi-language systems.
Conference Paper
The use of the Java Native Interface (JNI) allows taking advantage of existing libraries written in different programming languages for code reuse, performance, and security. Despite the importance of JNI in development, practices on its usage are not well studied yet. In this paper, we investigated the usage of JNI in 100 open-source systems collected from OpenHub and GitHub, around 8k source code files combined between Java and C/C++, including the Java class libraries part of the JDK v9. We identified the state of the practice in JNI systems by semi-automatically and manually analyzing the source code. Our qualitative analysis shows eleven JNI practices, mainly related to loading libraries, implementing native methods, exception management, return types, and local/global references management. Based on our findings, we provide some suggestions and recommendations to developers to facilitate the debugging tasks of JNI in multi-language systems, which can also help them to deal with Java and C memory.
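To make the first two practice areas above (loading libraries and declaring native methods) concrete, here is a minimal Java-side JNI sketch. The class name, library name, and method are hypothetical examples, not code from the studied systems; the C implementation is omitted and the library-load failure is caught so the class remains loadable without it:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical Java side of a JNI binding: the body of checksum() lives in a
// native C/C++ library that is loaded when the class is initialized.
class NativeChecksum {

    static {
        try {
            // Loads libchecksum.so (Linux) / checksum.dll (Windows);
            // the library name "checksum" is an assumption for this sketch.
            System.loadLibrary("checksum");
        } catch (UnsatisfiedLinkError e) {
            // A missing or misnamed native library is a common JNI failure
            // mode; a real system would handle this rather than crash.
            System.err.println("native library not available: " + e.getMessage());
        }
    }

    // Declared in Java, implemented in C with the JNI naming convention:
    //   JNIEXPORT jlong JNICALL
    //   Java_NativeChecksum_checksum(JNIEnv *env, jobject self, jbyteArray data);
    public native long checksum(byte[] data);

    public static void main(String[] args) throws Exception {
        // Inspect the declaration without invoking the (absent) native body.
        Method m = NativeChecksum.class.getDeclaredMethod("checksum", byte[].class);
        System.out.println("native? " + Modifier.isNative(m.getModifiers()));
    }
}
```

Calling `checksum()` before the library loads would throw `UnsatisfiedLinkError`, which is why the load is conventionally placed in a static initializer so it happens exactly once, at class-initialization time.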
Conference Paper
Software quality has become a necessity and no longer an advantage. In fact, with the advancement of technologies, companies must provide software with good quality. Many studies present the use of design patterns as improving software quality and discuss the presence of occurrences of design defects as decreasing software quality. Code smells include low-level problems in source code: poor coding decisions that are symptoms of the presence of anti-patterns in the code. Most of the studies present in the literature discuss the occurrences of design defects for mono-language systems. However, nowadays most systems are developed using a combination of several programming languages, in order to use particular features of each of them. As the number of languages increases, so does the number of design defects. They generally do not prevent the program from functioning correctly, but they indicate a higher risk of future bugs and make the code less readable and harder to maintain. We analysed open-source systems, developers' documentation, bug reports, and programming language specifications and extracted bad practices related to multi-language systems. We encoded these practices in the form of code smells. We report in this paper 12 code smells.
Conference Paper
During software maintenance, program slicing is a useful technique to assist developers in understanding the impact of their changes. While different program-slicing techniques have been proposed for traditional software systems, program slicing for dynamic web applications is challenging since the client-side code is generated from the server-side code and data entities are referenced across different languages and are often embedded in string literals in the server-side program. To address those challenges, we introduce WebSlice, an approach to compute program slices across different languages for web applications. We first identify data-flow dependencies among data entities for PHP code based on symbolic execution. We also compute SQL queries and a conditional DOM that represents client-code variations and construct the data flows for embedded languages: SQL, HTML, and JavaScript. Next, we connect the data flows across different languages and across PHP pages. Finally, we compute a program slice for a given entity based on the established data flows. Running WebSlice on five real-world, open-source PHP systems, we found that, out of 40,670 program slices, 10% cross languages, 38% cross files, and 13% cross string fragments, demonstrating the potential benefit of tool support for cross-language program slicing in dynamic web applications.
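The core step the WebSlice abstract describes, computing a slice over data flows that span languages, can be sketched over an explicit edge list. This is a hypothetical illustration, not the tool's implementation: entity names and the language-prefix convention are invented for the example.

```python
def backward_slice(flows, target):
    """Entities that can influence `target`, following data-flow
    edges (src, dst) transitively backwards."""
    sources = {}
    for src, dst in flows:
        sources.setdefault(dst, set()).add(src)
    result, stack = set(), [target]
    while stack:
        node = stack.pop()
        for src in sources.get(node, ()):
            if src not in result:
                result.add(src)
                stack.append(src)
    return result

# Hypothetical flows spanning PHP, SQL, and HTML entities;
# the "lang:" prefix marks which language each entity lives in.
flows = [("php:$id", "sql:SELECT_user"),
         ("sql:SELECT_user", "php:$row"),
         ("php:$row", "html:span.name")]
slice_ = backward_slice(flows, "html:span.name")
crosses = len({e.split(":")[0] for e in slice_ | {"html:span.name"}}) > 1
print(crosses)  # True: this slice crosses language boundaries
```

Classifying slices this way is how one would reproduce statistics such as "10% of slices cross languages".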
Conference Paper
Developers are often faced with a natural language change request (such as a bug report) and tasked with identifying all code elements that must be modified in order to fulfill the request (e.g., fix a bug or implement a new feature). In order to accomplish this task, developers frequently and routinely perform change impact analysis. This formal demonstration paper presents ImpactMiner, a tool that implements an integrated approach to software change impact analysis. The proposed approach estimates an impact set using an adaptive combination of static textual analysis, dynamic execution tracing, and mining software repositories techniques. ImpactMiner is available from our online appendix http://www.cs.wm.edu/semeru/ImpactMiner/
Article
Software maintenance accounts for the largest part of the costs of any program. During maintenance activities, developers implement changes (sometimes simultaneously) on artifacts in order to fix bugs and to implement new requirements. To reduce this part of the costs, previous work proposed approaches to identify the artifacts of programs that change together. These approaches analyze historical data, mined from version control systems, and report change patterns, which point to the causes, consequences, and actors of changes to source code files. They also introduce so-called change patterns that describe typical change dependencies among files. In this paper, we introduce two novel change patterns: the asynchrony change pattern, corresponding to macro co-changes (MC), that is, files that co-change within a large time interval (change periods), and the dephase change pattern, corresponding to dephased macro co-changes (DC), that is, MC that always happen with the same shifts in time. We present our approach, named Macocha, to identify these two change patterns in large programs. We use the k-nearest neighbor algorithm to group changes into change periods. We also use the Hamming distance to detect approximate occurrences of MC and DC. We apply Macocha and compare its performance in terms of precision and recall with UMLDiff (file stability) and association rules (co-changing files) on seven systems: ArgoUML, FreeBSD, JFreeChart, Openser, SIP, XalanC, and XercesC, developed in three different languages (C, C++, and Java). These systems have a size ranging from 532 to 1693 files and, during the study period, underwent 1555 to 23,944 change commits. We use external information and static analysis to validate (approximate) MC and DC found by Macocha. Through our case study, we show the existence and usefulness of these novel change patterns to ease software maintenance and, potentially, reduce related costs.
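The Hamming-distance step above can be sketched as follows. This is a simplified, hypothetical reconstruction (the file names and the `max_distance` threshold are invented): each file is encoded as a bit vector over change periods, and pairs within a small Hamming distance are flagged as approximate macro co-changes.

```python
def hamming(a, b):
    """Number of change periods in which exactly one of the two files changed."""
    return sum(x != y for x, y in zip(a, b))

def approximate_macro_cochanges(change_vectors, max_distance=1):
    """Pairs of files whose per-period change vectors differ in at most
    max_distance periods (an approximate macro co-change)."""
    files = sorted(change_vectors)
    return [(f, g) for i, f in enumerate(files) for g in files[i + 1:]
            if hamming(change_vectors[f], change_vectors[g]) <= max_distance]

# Each bit says whether the file changed in that change period.
history = {
    "parser.c": [1, 0, 1, 1, 0],
    "lexer.c":  [1, 0, 1, 1, 0],   # identical vectors: distance 0
    "ui.java":  [0, 1, 0, 1, 0],
}
print(approximate_macro_cochanges(history))  # [('lexer.c', 'parser.c')]
```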
Article
As a software system evolves, programmers make changes that sometimes cause problems. We analyze CVS archives for fix-inducing changes—changes that lead to problems, indicated by fixes. We show how to automatically locate fix-inducing changes by linking a version archive (such as CVS) to a bug database (such as BUGZILLA). In a first investigation of the MOZILLA and ECLIPSE history, it turns out that fix-inducing changes show distinct patterns with respect to their size and the day of week they were applied.
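The first step of locating fix-inducing changes, linking the version archive to the bug database, is usually done by matching bug identifiers in commit messages. A minimal sketch under that assumption (the regex, commit hashes, and messages are illustrative, not from the study):

```python
import re

# Match message fragments such as "Fix #1042" or "fixes bug 1042".
BUG_REF = re.compile(r'(?:bug|fix(?:es|ed)?)\s*#?(\d+)', re.IGNORECASE)

def link_fix_commits(commits, bug_ids):
    """Map bug id -> commits whose message references it; fix-inducing
    changes are then found by tracing the lines these commits repair."""
    links = {}
    for sha, message in commits:
        for bug in BUG_REF.findall(message):
            if int(bug) in bug_ids:
                links.setdefault(int(bug), []).append(sha)
    return links

commits = [("a1b2", "Fix #1042: NPE in parser"),
           ("c3d4", "Refactor build scripts"),
           ("e5f6", "fixes bug 1042 regression")]
print(link_fix_commits(commits, {1042}))  # {1042: ['a1b2', 'e5f6']}
```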
Conference Paper
When interacting with version control systems, developers often commit unrelated or loosely related code changes in a single transaction. When analyzing the version history, such tangled changes will make all changes to all modules appear related, possibly compromising the resulting analyses through noise and bias. In an investigation of five open-source Java projects, we found up to 15% of all bug fixes to consist of multiple tangled changes. Using a multi-predictor approach to untangle changes, we show that on average at least 16.6% of all source files are incorrectly associated with bug reports. We recommend better change organization to limit the impact of tangled changes.
Conference Paper
Today software systems are built with heterogeneous languages such as Java, C, C++, XML, Perl, or Python, to name a few. This introduces new challenges in both software analysis and program evolution, as programmers are forced to cope with a variety of programming paradigms and languages. We believe there is a need for views that support developers in effectively coping with complexity and that facilitate program comprehension and analysis of such heterogeneous systems. Furthermore, the heterogeneity of these systems is not limited to language but also affects component licensing. In fact, licensing is another type of heterogeneity introduced by the large-scale reuse of open-source code. This in turn raises challenges such as how to legally combine different licenses in the same system and how changes to the software can create license violations.
Conference Paper
Software dependency analysis is an important step in determining the potential impact of changes. Existing tool support for conducting dependency analysis does not sufficiently support systems written in more than one language. Tools based on semantic analyses are expensive to create for combinations of multiple languages, while lexical tools provide poor accuracy and rely heavily on developer skill. This paper reports on an investigation into the application of a series of incrementally-better island grammars to an industrial, open-source polylingual system to determine the cost-to-accuracy relationship involved in developing and applying island grammars for dependency analysis. The results of our study suggest the effort-cost in writing richer island grammars rises faster than the resulting accuracy.
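An island grammar parses only the fragments of interest ("islands") and skips everything else ("water"). A minimal, hypothetical illustration of the lexical end of the cost-accuracy spectrum discussed above, here extracting only Java-style import statements from mixed source text:

```python
import re

# A tiny "island": recognize only import statements; all other
# text is "water" the pattern deliberately ignores.
ISLAND = re.compile(r'^\s*import\s+([\w.]+)\s*;', re.MULTILINE)

def lexical_dependencies(source: str) -> list:
    """Dependencies found by the island pattern, in order of appearance."""
    return ISLAND.findall(source)

mixed = """
// water: arbitrary code the grammar ignores
import com.example.db.Schema;
int x = 42; /* more water */
import com.example.ui.Widget;
"""
print(lexical_dependencies(mixed))
# ['com.example.db.Schema', 'com.example.ui.Widget']
```

Richer islands (e.g., also recognizing reflective class loading or FFI declarations) raise accuracy, but, as the study observes, the effort cost grows faster than the accuracy gained.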
Conference Paper
We apply data mining to version histories in order to guide programmers along related changes: "Programmers who changed these functions also changed. . . ". Given a set of existing changes, such rules (a) suggest and predict likely further changes, (b) reveal item coupling that is undetectable by program analysis, and (c) prevent errors due to incomplete changes. After an initial change, our ROSE prototype can correctly predict 26% of further files to be changed, and 15% of the precise functions or variables. The topmost three suggestions contain a correct location with a likelihood of 64%.
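The rule-mining idea can be sketched with pairwise support and confidence over commit transactions. This is a hypothetical simplification of ROSE's approach (file names and thresholds are invented; the real tool works at finer granularity and with general itemsets):

```python
from itertools import combinations
from collections import Counter

def cochange_rules(transactions, min_support=2):
    """Mine pairwise rules 'changed A => also change B' with their
    confidence, from a list of co-changed file sets (one per commit)."""
    pair_count, item_count = Counter(), Counter()
    for files in transactions:
        item_count.update(set(files))
        pair_count.update(combinations(sorted(set(files)), 2))
    rules = {}
    for (a, b), support in pair_count.items():
        if support >= min_support:
            rules[(a, b)] = support / item_count[a]  # confidence of A => B
            rules[(b, a)] = support / item_count[b]  # confidence of B => A
    return rules

history = [["Db.java", "DbSchema.java"],
           ["Db.java", "DbSchema.java", "Ui.java"],
           ["Ui.java"]]
rules = cochange_rules(history)
print(rules[("Db.java", "DbSchema.java")])  # 1.0: always co-changed so far
```

Given a fresh change to `Db.java`, the rules with the highest confidence become the tool's "also change ..." suggestions.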
Conference Paper
In this paper, we propose a formal model and a platform for software change management. The model is based on graph rewriting and deals with both multi-language source code and heterogeneous database schemas. These are represented as software components linked by meaningful relationships. The change impact analysis is done using a knowledge-based system that includes impact propagation rules preserving software consistency. This is implemented in an integrated platform including a multi-language parsing tool and a software change management module.
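The propagation idea can be illustrated as a transitive closure over reverse dependency edges. This is a minimal sketch under invented component names, not the paper's rule-based system, which also encodes consistency-preserving conditions on each rule:

```python
from collections import deque

def impact_set(dependencies, changed):
    """Transitively propagate a change over reverse dependency edges:
    if A depends on B and B changes, A is impacted."""
    # dependencies: component -> set of components it depends on
    dependents = {}
    for comp, deps in dependencies.items():
        for d in deps:
            dependents.setdefault(d, set()).add(comp)
    impacted, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for comp in dependents.get(current, ()):
            if comp not in impacted:
                impacted.add(comp)
                queue.append(comp)
    return impacted

# Components from different languages and a database schema.
deps = {"ui.py": {"service.java"},
        "service.java": {"schema.sql"},
        "report.py": {"schema.sql"}}
print(impact_set(deps, "schema.sql"))  # all three other components
```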
Static code analysis of multilanguage software systems
A. Shatnawi, H. Mili, M. Abdellatif, Y.-G. Guéhéneuc, N. Moha, G. Hecht, G. E. Boussaidi, and J. Privat, "Static code analysis of multilanguage software systems," arXiv preprint arXiv:1906.00815, 2019.
Safe Java Native Interface
G. Tan, A. W. Appel, S. Chakradhar, A. Raghunathan, S. Ravi, and D. Wang, "Safe Java Native Interface," in Proceedings of the IEEE International Symposium on Secure Software Engineering, vol. 97, 2006, p. 106.