Project

DISSE: Data-Intensive Software System Evolution

Goal: The main goal of this project is to study and understand the database usage practices in data-intensive software systems (DISS), how these practices evolve over time, and how to improve upon these practices. During the proposed project we will develop a theoretical framework that will be empirically validated and calibrated through statistical analysis of the interaction between programs (source and executable code) and databases, in order to identify particular trends, trend breaks, usage patterns, bad practices and good practices, and in order to identify the most popular technologies and paradigm switches at a given point in time. We will also study the co-evolution between programs and databases, in order to identify evolutionary patterns and migration scenarios to different database technologies, and to provide recommendations when carrying out particular database evolution tasks.

Methods: statistical analysis, software visualization, empirical research, static program analysis, software repository mining, dynamic program analysis

Date: 1 July 2013 - 30 June 2017


Project log

Tom Mens
added 2 research items
Software development projects frequently rely on testing-related libraries to test the functionality of the software product automatically and efficiently. Many such libraries are available for Java, and developers have a hard time deciding which libraries are most appropriate for their project, or when to migrate to a competing library. We empirically analysed the usage of eight testing-related libraries in 4,532 open source Java projects hosted on GitHub. We studied how frequently specific libraries, and specific pairs of libraries, are used over time. We also identified if and when library usages are replaced by competing ones during a project's lifetime. We found that some libraries are considerably more popular than their competitors, while some libraries become more popular over time. We observed that many projects tend to use multiple libraries together. We also observed permanent and temporary migrations between competing libraries. These findings may pave the way for recommendation tools that allow project developers to choose the most appropriate library for their needs, and to be informed of better alternatives.
Open source cloud computing solutions, such as CloudStack and Eucalyptus, have become increasingly popular in recent years. Despite this popularity, the factors influencing user adoption are not yet well understood and remain under active research. For example, increased project agility may lead to solutions that remain competitive in a rapidly evolving market, while keeping the software quality under control. Like any software system that is subject to frequent evolution, cloud computing solutions are subject to errors and quality problems, which may affect user experience and require frequent bug fixes. While prior comparisons of cloud platforms have focused most often on their provided services and functionalities, the current paper provides an empirical comparison of CloudStack and Eucalyptus, focusing on quality-related software development aspects. More specifically, we study the change history of the source code and its unit tests, as well as the history of bugs in the Jira issue tracker. We found that CloudStack has a higher and more rapidly increasing test coverage than Eucalyptus. CloudStack contributors are more likely to participate in both development and testing. We also observed differences between the two projects pertaining to the bug life cycle and bug fixing time.
Mathieu Goeminne
added 2 research items
This article presents an empirical study of how the use of relational database access technologies in open source Java projects evolves over time. Our observations may help project managers make more informed decisions on which technologies to introduce into an existing project and when. We selected 2,457 Java projects on GitHub using the low-level JDBC technology and higher-level object-relational mappings such as Hibernate XML configuration files and JPA annotations. At a coarse-grained level, we analysed the probability of introducing such technologies over time, as well as the likelihood that multiple technologies co-occur within the same project. At a fine-grained level, we analysed to what extent these different technologies are used within the same set of project files. We also explored how the introduction of a new database technology in a Java project impacts the use of existing ones. We observed that, contrary to what could have been expected, object-relational mapping technologies do not tend to replace existing ones but rather complement them.
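As an illustration of the fine-grained, file-level analysis described above, the following is a minimal sketch of how one might flag which database access technologies a single project file uses. It is not the detection procedure of the study: the class name DbTechSketch, the method detect, and the heuristics (an import of java.sql for JDBC, an import of javax.persistence for JPA annotations, and a file name ending in .hbm.xml for a Hibernate XML mapping) are all assumptions made for this example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: flags database access technologies used by one file.
// The heuristics below are assumptions, not the method used in the study.
public class DbTechSketch {
    public enum Tech { JDBC, HIBERNATE, JPA }

    // Returns the set of technologies that the given file appears to use.
    public static Set<Tech> detect(String fileName, String content) {
        Set<Tech> found = EnumSet.noneOf(Tech.class);
        if (fileName.endsWith(".java")) {
            // Assumed heuristic: JDBC code imports types from java.sql.
            if (content.contains("import java.sql.")) {
                found.add(Tech.JDBC);
            }
            // Assumed heuristic: JPA annotations come from javax.persistence.
            if (content.contains("import javax.persistence.")) {
                found.add(Tech.JPA);
            }
        } else if (fileName.endsWith(".hbm.xml")) {
            // Assumed heuristic: Hibernate XML mappings use the .hbm.xml suffix.
            found.add(Tech.HIBERNATE);
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "import java.sql.Connection;\n"
                + "import javax.persistence.Entity;\n"
                + "@Entity class Customer {}";
        // A file using both JDBC and JPA is a co-occurrence at file level.
        System.out.println(detect("Customer.java", source));
    }
}
```

Applying such a check to every file in every revision of a project would give, per file and per point in time, the set of technologies in use, which is the kind of data a co-occurrence analysis needs. A real analysis would have to parse the files rather than match substrings.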
Tom Mens
added a research item
This document contains the final report of the research activities carried out by research partners Université de Mons and Université de Namur in the context of F.R.S.-FNRS research project T.0022.13 entitled “Empirical Analysis of the Co-Evolution and Social Interaction in Data-Intensive Software Systems”. During a four-year period from July 2013 till June 2017, two postdoctoral researchers and two PhD students were employed on the project. Several other researchers, not funded by the project, also actively contributed to the research goals. The project was highly successful, resulting in common case studies, automated software analysis tools, empirical studies and associated data, and numerous peer-reviewed scientific publications reporting on all the above.
Tom Mens
added an update
Tom Mens
added a research item
This chapter presents the research advancements in the field of data-intensive software system evolution, 5 years after the publication of our IEEE Computer column presenting the challenges in this field. We present the state-of-the-art in this research domain, and report on research on the evolution of open source Java projects relying on relational database technologies. We empirically analyse how the use of Java database technologies evolves over time. We report on a coarse-grained source-code analysis carried out over several thousands of Java projects, and complement this by a fine-grained longitudinal analysis of the co-evolution between database schema changes and source code changes within three large Java projects. The presented results are a first step towards a recommendation system supporting developers in writing database-centered code.
Mathieu Goeminne
added 2 research items
Since the 1970s, software development has experienced exponential growth. The number of developed software products, their size and their complexity have become so large that understanding how they function and managing their evolution have become very hard today. Open source software (OSS) does not escape this growth and the problems it raises. For more than a decade, OSS systems have been the subject of increasing interest from the academic community, individuals and the software industry at large, and their development is booming because of their low cost of use (OSS systems are generally freely available), their low barriers to entry for developers, their low cost of development (they may be built by reusing other OSS systems), and the large quantity of easily available historical data. Contrary to traditional commercial and proprietary software, OSS is typically developed by a group of people dispersed all over the world. This geographical distribution forces contributors to use tools that allow asynchronous communication and information exchange over large distances. The public availability of the historical data handled by these tools facilitates the analysis of OSS evolution. Initially, empirical analysis of OSS project evolution was limited to the study of source code evolution only. Later, other software development artefacts were taken into account as well. For instance, the first analyses of OSS project mailing lists date to 2002 [157]. However, the main factor that drives the evolution of a software project is the people contributing to it. Hence, in order to better comprehend how OSS projects evolve, one needs to gain a better insight into the socio-technical aspects that surround them.
In order to get a more accurate model of the interaction between the project contributors, one needs to consider development artefacts that contain information about its social aspects, such as bug reports, e-mail discussions and version commits. Frequently, collections of different projects are developed and evolve together in the same environment. We refer to these collections as software ecosystems. Since the contributors to the projects belonging to these ecosystems work together towards a common goal, they tend to form de facto communities. It is therefore important to study the social aspects not only at the level of individual projects, but also at the level of the entire ecosystem. The goal of this dissertation is to understand the evolution of the social aspects in open source ecosystems. More precisely, we study how contributors to open source ecosystems can be grouped in different communities that evolve and collaborate in different ways. In doing so, we provide evidence that contributors have specificities that are not taken into account by today’s analysis tools. Becoming aware of these specificities opens up new research and practically relevant questions on how new automated tools can be designed and used to offer better support to the ecosystem’s contributors in their activities. The contributions of this dissertation are manifold. We developed an application framework that allows us to empirically study the evolution of software ecosystems. Focusing on the GNOME ecosystem, we designed a systematic approach for detecting the multiple accounts used by contributors to access the software repositories and used it to gain a better insight into the communities belonging to the ecosystem. We defined objective criteria according to which these contributors can be categorised. In the GNOME history we observed a power law behaviour between the number of contributors and their contributions, in terms of commits submitted, mails sent and bug reports handled.
With further statistical analyses we established correlations and trends between the contributors’ effort, their favourite means of communication and the activity types in which they are involved. For example, we observed that contributors tend to restrict themselves to a limited number of activity types, but the more active contributors are, the more they tend to spread their effort over different types of activity. When studying the evolution of GNOME contributors, we observed a tendency of specialisation towards fewer activity types. We also observed that, during the last years, the effort in each of the studied activity types is decreasing.
We present a dataset of the open source software ecosystem GNOME from a social point of view. We have collected historical data about the contributors to all GNOME projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types with the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.
Loup Meurice
added 2 research items
Understanding the links between application programs and their database is useful in various contexts such as migrating information systems towards a new database platform, evolving the database schema, or assessing the overall system quality. In the case of Java systems, identifying which portion of the source code accesses which portion of the database may prove challenging. Indeed, Java programs typically access their database in a dynamic way. The queries they send to the database server are built at runtime, through String concatenations, or Object-Relational Mapping frameworks like Hibernate and JPA. This paper presents a static analysis approach to program-database links recovery, specifically designed for Java systems. The approach allows developers to automatically identify the source code locations accessing given database tables and columns. It focuses on the combined analysis of JDBC, Hibernate and JPA invocations. We report on the use of our approach to analyse three real-life Java systems.
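To make the difficulty concrete, here is a minimal sketch of a query assembled at runtime by String concatenation, followed by a naive way of guessing which tables it touches. This is not the static analysis approach of the paper: the class QueryLinkSketch, the methods buildQuery and referencedTables, the customer table and the regular expression are all hypothetical, chosen only to show why the accessed tables are not visible in a single string literal.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the approach presented in the paper.
public class QueryLinkSketch {

    // A query built at runtime: its final text depends on the control flow,
    // so no single string literal in the source contains the whole query.
    public static String buildQuery(boolean onlyActive) {
        String sql = "SELECT id, name FROM customer";
        if (onlyActive) {
            sql = sql + " WHERE active = 1";
        }
        return sql;
    }

    // Assumed heuristic: a table name is the identifier that directly
    // follows FROM, JOIN, INTO or UPDATE. Real SQL needs a proper parser.
    private static final Pattern TABLE = Pattern.compile(
            "\\b(?:FROM|JOIN|INTO|UPDATE)\\s+([A-Za-z_][A-Za-z0-9_]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the table names referenced by an already assembled query.
    public static Set<String> referencedTables(String sql) {
        Set<String> tables = new LinkedHashSet<>();
        Matcher matcher = TABLE.matcher(sql);
        while (matcher.find()) {
            tables.add(matcher.group(1).toLowerCase());
        }
        return tables;
    }

    public static void main(String[] args) {
        // Both possible query strings access the same table.
        System.out.println(referencedTables(buildQuery(true)));
        System.out.println(referencedTables(buildQuery(false)));
    }
}
```

The sketch inspects the query only after it has been assembled, which requires running the code. A static approach has to reconstruct the possible query strings from the source code itself, without executing it, and must additionally handle queries generated by ORM frameworks.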
Most modern relational DBMS have the ability to monitor and enforce referential integrity constraints (RICs). In contrast to new applications, however, heavily evolved legacy information systems may not make use of this important feature if their design predates its availability. The detection of RICs in legacy systems has been a long-term research topic in the DB reengineering community, and a variety of methods have been proposed that analyze schema, application code and data. However, empirical evidence on their application for reengineering large-scale industrial systems is scarce, and all too often "problems" (case studies) are carefully selected to fit a particular "solution" (method), rather than the other way around. This paper takes a different approach. We analyze in detail the issues posed in reengineering a complex, mission-critical information system to support RICs. In our analysis, we find that many of the assumptions typically made in DB reengineering methods do not readily apply. Based on our findings, we design a process and tools for detecting RICs in the context of our real-world problem and provide preliminary results on their effectiveness.
Tom Mens
added 2 project references
Anthony Cleve
added 10 project references
Tom Mens
added 4 project references
Tom Mens
added a project goal