Analysis of Open Source Software Development Iterations by Means of Burst Detection Techniques

Bruno Rossi, Barbara Russo, and Giancarlo Succi
CASE – Center for Applied Software Engineering
Free University of Bolzano-Bozen
Via Della Mostra 4, 39100 Bolzano, Italy
{brrossi, brusso, gsucci}@unibz.it
WWW home page: http://www.case.unibz.it
Abstract. A highly efficient bug fixing process and quick release cycles are
considered key properties of the open source software development
methodology. In this paper, we study the relation between code activities (such
as lines of code added per commit), bug fixing activities, and software release
dates in a subset of open source projects. To study the phenomenon, we
gathered a large data set about the evolution of 5 major open source projects.
We compared activities by means of a burst detection technique to discover
temporal peaks in time-series. We found quick adaptation of issue tracking
activities in proximity of releases, and a distribution of coding activities across
releases. Results show the importance of the application type/domain for the
evaluation of the development process.
1 Introduction
The availability of source code and large communities of motivated and restless
developers are two key factors behind the success of the open source software
movement. The open source development methodology has thus attracted considerable
interest, and it is often compared critically with traditional software development
practices in the quest for an answer to the growing number of failing software
projects. The methodology is mostly based on informal and distributed practices
that seem to perform fairly well in domains characterized by turbulent and
continuously changing requirements [11]. Practices such as web-based
collaboration, peer reviews, short cycle iterations, and quick releases, among
others, reduce the project management overhead and lead to a leaner development
process.
In particular, considering both the bug fixing process, and version release cycles,
many researchers claim that the open source methodology allows a faster bug-fixing
process and higher release velocity than proprietary software [1, 6, 13].
Well-known empirical studies in this context report conflicting results. The Apache
project was found to be very reactive to bug fixing requests, as well as to provide
many iterative software releases [9]. This conclusion contrasts with the
development process of the Mozilla web browser, which was found to be
equivalent to traditional development practices in terms of adaptability and release
cycles [10]. For the FreeBSD system, similar to Apache in terms of number of core
developers and reporters of failures, not enough evidence could be collected to
confirm or reject the same hypotheses [5]. FreeBSD in the operating system domain,
and Apache in the web server domain, were found to provide higher velocity in
bug-fixing activities than other equivalent proprietary applications. In contrast with
this view, Gnome, in the graphical user interfaces domain, was found to be less efficient
in terms of bug-fixing speed compared to a proprietary application [8]. What emerges
from these studies is that open source software does have a faster bug fixing
process, but only for particular applications and domains.
With this paper, we investigate the reactivity of the open source software
development process when fixing code defects and when approaching version
releases. We used a burst detection technique in time-series analysis to compare the
evolution of peak activities during the projects’ development.
The paper is structured as follows: in Section 2 we state the research question;
in Section 3 we present the heuristic for project selection and the data collection
process; Section 4 is devoted to the method; Section 5 presents the analysis of the
datasets; and Sections 6, 7, and 8 present, respectively, the discussion of the
results, limitations and future work, and conclusions.
2 Research Question
Our general research question is whether there is an increase in
activities involving open source software repositories during version releases and
bug-fixing activities. To investigate the research question, we set up five
hypotheses connected to it (Table 1).
Table 1. Low-level hypotheses under investigation
Hypothesis | Rationale
H1. There is an increase in code-related activities as there is an increase in the creation of bug-reports | immediate response from the developers as soon as large numbers of bug reports are inserted into issue tracking systems
H2. There is an increase in code-related activities as there is an increase in the closed bug-reports | a late reaction to users requests will lead to more code activities
H3. There is an increase in bug opening activities in the proximity of a software release date | a rapid increase in bug reports opened as there is a software release
H4. There is an increase in bug closing activities in the proximity of a software release date | a rapid increase in the bug closing process as a release date is approaching
H5. There is an increase in code-related activities in the proximity of a software release | code activities intensify as a release date is approaching
The first two hypotheses refer specifically to the velocity of the bug-fixing
process. We consider coding activities of developers correlated both to the
bug-opening process (H1) and to the bug-closing process (H2). We expect that a fast
bug-fixing process adapts quickly to the opened bug reports, whereas a late reaction to
user requests will lead to a correlation between coding activities and bug-closing
activities.
The open source community provides constant feedback to developers. We
expect this effect to lead to an increase in bug reports opened at the time of a
software release (H3). We also presume an increase in bug closing activities
around a software release (H4), and a synchronization of code activities
with release dates, showing greater coding effort in proximity of version releases
(H5).
3 Project Selection
To gather knowledge about the current open source landscape, we mined the
Ohloh (https://www.ohloh.net/) repository. The repository is not only a large
collection of metrics and evolutionary data about open source projects, but also a
social networking opportunity for developers. As of November 2008, 20,590 projects
with aggregated reports are listed on the website. We used the
Ohloh API (https://www.ohloh.net/api) to acquire data from the projects, and then
focused the analysis on a subset of 5 projects.
We considered OpenOffice.org, an office automation suite; KDE and Gnome,
window managers; Wine-project, a cross-platform implementation of the Microsoft
Windows API; and the Eclipse platform for integrated software development. We
selected these projects because, apart from being rather popular, they are among the
open source projects with the largest number of source lines of code (SLOCs) and of
yearly contributing developers. Our heuristic for project selection was
complemented by limiting the number of yearly contributors to a
maximum of 500 developers. This boundary excluded projects such as the Linux
Kernel 2.6 and the Android project, which we considered outliers compared to other
open source projects. We then specifically selected projects based on expectations of
obtaining interesting and useful results [2]. According to this rationale, we preferred
projects outside the main cluster in terms of SLOCs and yearly contributors, and
with a wide diffusion in the community (Figure 1).
The selected projects range from 76 yearly contributors (Eclipse) to 480 (KDE),
with a size in terms of SLOCs from 1.6M (Wine-project) to 16.4M (Gnome). The main
languages for 4 out of 5 projects are C and C++, with Java as the main language
only for the Eclipse project. Information from the Ohloh repository covers a
period from 3 to 15 years (Table 2).
Table 2. Descriptive statistics for the projects
Project         12-month contributors  SLOCs       Main language  Information start date  Information end date
OpenOffice.org  149                    8,721,053   C/C++          2000-07-01              2008-09-01
KDE             480                    4,530,775   C              2005-05-01              2008-09-01
Gnome           406                    16,467,663  C/C++          1997-01-01              2007-11-01
Wine            218                    1,644,154   C              1993-06-01              2008-09-01
Eclipse         76                     7,352,744   Java           2001-04-01              2008-09-01
Fig. 1. Projects considered in terms of total number of SLOCs and yearly contributors
(9,020 projects from the Ohloh repository)
3.1 Process of Data Collection
For the analysis we used 3 different sources of information, respectively for code
activities, bug reports, and release dates.
First, we retrieved the activity level for each project; we gathered three indicators
of development activities during a time-span of one month:
code-activities = LOCs added + LOCs removed;
comments-activities = comments added + comments removed;
blanks-activities = blanks added + blanks removed;
The rationale behind this choice is that our analysis needed a pure
indicator of activity inside the projects, not an indicator of the growth of the
projects over time. Had growth been the focus of the analysis, considering only
added lines would have been more appropriate.
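As an illustration only, the three indicators can be computed from monthly line counts as in the sketch below. The `MonthlyChange` record, its field names, and the sample numbers are hypothetical; they do not reproduce the layout of the data returned by Ohloh.

```python
from dataclasses import dataclass


@dataclass
class MonthlyChange:
    """Line counts for one project-month (hypothetical record layout)."""
    month: str
    code_added: int
    code_removed: int
    comments_added: int
    comments_removed: int
    blanks_added: int
    blanks_removed: int


def activity_indicators(change):
    """Return the three activity indicators as added + removed line counts."""
    return {
        "code": change.code_added + change.code_removed,
        "comments": change.comments_added + change.comments_removed,
        "blanks": change.blanks_added + change.blanks_removed,
    }


# Invented sample values, used only to show the arithmetic.
example = MonthlyChange("2008-09", 1200, 800, 150, 50, 90, 60)
indicators = activity_indicators(example)
print(indicators)
```

Summing additions and removals treats a month with heavy refactoring as active even when the net size of the code base does not change, which is the point of using an activity indicator rather than a growth indicator.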
Second, we retrieved information about opened and closed bug reports from the
issue tracking systems of the selected projects; for each project we counted the bug
reports that were opened and closed at a certain date, by considering the opening and
closing dates of bugs tagged as CLOSED, FIXED, and not marked as enhancements.
Third, to retrieve the release dates of each project, we mainly used the official
website of each project and, where this was not sufficient, we integrated it with
third-party sources such as blogs and wikis.
In Figure 2, we show the typical result of aggregation of the three different data
sources, by representing on the same time-line the aggregation of code-activities
(code, comments, and blanks), bug-reports closed, and release dates. The figure
refers to the OpenOffice.org project.
Fig. 2. OpenOffice.org activities, release dates, and bug reports closing dates (log scale)
For the OpenOffice.org application, we see a constant trend in all the activities,
characterized by some periods with bursts, followed by periods of reduced
activities. After this first analysis, we decided to drop the comments-activities
and blanks-activities indicators, because a non-parametric correlation analysis
showed them to be highly correlated with the code-activities indicator (Table
3). For this reason, in the remainder of this paper we refer to code activity simply as
lines of code added + lines of code removed.
Table 3. Spearman Rank Order correlation between code-activities, comments-activities, and
blanks-activities, significant at 0.01 two-tailed
                Correlation with Code Activities
Project         Code_comments  Code_blanks
OpenOffice.org  0.7147         0.70408
KDE             0.9064         0.9702
Gnome           0.7821         0.8615
Wine            0.8930         0.9286
Eclipse         0.8578         0.8159
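For readers unfamiliar with the statistic, a minimal sketch of a Spearman rank-order correlation follows. It uses only the Python standard library, computes the coefficient as the Pearson correlation of the ranks, and does not compute the significance level. The two monthly series are invented for illustration and are not taken from the projects.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined for constant series


# Invented monthly values for two indicators of the same project.
code_activity = [120, 340, 90, 560, 410, 75]
comments_activity = [15, 40, 22, 70, 38, 9]
rho = spearman(code_activity, comments_activity)
print(round(rho, 4))
```

A rank-based coefficient is a natural choice here because monthly activity counts are heavily skewed, and ranks are insensitive to the size of individual bursts.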
Aggregated data collected for the projects (Table 4) show the differences and
similarities of the projects in terms of code activities, total commits, and total
bug reports closed. KDE is the most active project on a monthly basis (588
KSLOCs per month), followed by Gnome (443), Eclipse (276), OpenOffice.org (245),
and Wine (35).
Table 4. Aggregated data for projects considered during the analysis
Project         Code activity (KSLOCs)  Total commits  Total bug reports closed in the period  Number of months
OpenOffice.org  24,284                  168,121        12,692                                  99
KDE             24,111                  92,433         5,824                                   41
Gnome           62,475                  258,989        4,812                                   141
Wine            6,560                   50,148         5,559                                   185
Eclipse         24,898                  167,893        32,350                                  90
4 Method of Analysis
Once we gathered all information from the three different sources, we started
investigating the relation between code activities, releases, and bug-fixing activities.
We used a burst detection technique, the same technique used in [12] to identify
similarities between temporal online queries and similar to the one used in [3, 4] to
analyze developers’ behavior.
The technique identifies in a systematic way peaks in the temporal evolution and
compares them with peaks in other time-series. Given a time-series defined as a
sequence of time-points $t=(t_1,t_2,\ldots,t_n)$, a burst or peak is defined as a set of
points that exceed the observed behavior in other points of the time-series. More
formally, the approach is as follows:
1. calculate the moving average $MA_l$ of the time-series, where $l>0$ is the lag of the
moving average;
2. calculate the cut-off value for a given $l$ as $\mathrm{mean}(MA_l) + x \cdot \mathrm{std}(MA_l)$; this
gives a threshold to use for peak detection. In our case we considered $x=0.5$ as an
appropriate value for the detection technique applied to our dataset; we must
consider that higher $x$ values will increase the cut-off level and thus lead to a
detection of only the strongest peaks in the time-series;
3. determine bursts with $MA_l^i > \text{cut-off}$, where $i$ is the time interval considered;
4. project the bursts on a temporal line; this is to identify the time points
corresponding to the bursts;
5. compare the overlap of the bursts for the different activities on the temporal line.
Figure 3 illustrates visually the steps 1-4, by showing a random dataset, the
moving average, the cut-off point, and the peaks identified.
Fig. 3. Burst-detection method. Random dataset with peaks, moving averages, cut-off line,
and bursts areas identified (plotted series: random data, $MA_l$, cut-off; time axis from
15-Feb-2009 to 19-May-2009)
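Steps 1-4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Java and Matlab implementation used for the study: the moving average is taken as a trailing window that is shortened at the start of the series, and the standard deviation is the sample standard deviation. Neither choice is specified in the text.

```python
from statistics import mean, stdev


def moving_average(series, lag):
    """Trailing moving average; early points use the values available so far
    (assumption: the text does not say how the window edges are handled)."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - lag + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def detect_bursts(series, lag, x=0.5):
    """Return (burst indices, cut-off value) following steps 1-4."""
    ma = moving_average(series, lag)                       # step 1
    cutoff = mean(ma) + x * stdev(ma)                      # step 2 (sample std assumed)
    bursts = {i for i, v in enumerate(ma) if v > cutoff}   # steps 3-4
    return bursts, cutoff


# Invented series with one obvious peak.
series = [1, 1, 1, 1, 10, 10, 1, 1, 1, 1]
bursts, cutoff = detect_bursts(series, lag=2)
print(sorted(bursts), round(cutoff, 3))
```

Because the burst test is applied to the moving average rather than to the raw values, a burst extends to the periods right after a spike: in the example the spike covers indices 4 and 5, while the detected burst covers indices 4 to 6. A larger lag widens this effect.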
The cut-off line indicates which data points to consider as part of a
peak. Once peaks have been identified, they are plotted on a line to ease the
comparison with peaks identified in other time series (Figure 4). For this
comparison we needed a way to quantify the overlap of the identified bursts, so
we defined the following metrics:
- number of peaks identified by the approach: we denote all peaks as $t'$; this gives
a first evaluation of the burstiness of the time-series;
- number of intersections between peaks, computed as $|t'_A \cap t'_B|$: this gives a raw
indication about the intersection of peaks, but it is not a reliable measure, as we
can have a hypothetical time-series with peaks that span over the entire period
that gets a perfect intersection with any time-series with at least one peak. For this
reason we defined the next metric;
- a measure of recall, defined as $\frac{|t'_A \cap t'_B|}{|t'_A|}$, that gives information about the
specific coverage of peaks that intersect between the two time-series.
Fig. 4. Overlapping of bursts areas detected between two time-series
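The three metrics reduce to set operations once bursts are represented as sets of time indices. The sketch below assumes that representation; the function name and the index values are invented for illustration.

```python
def burst_metrics(peaks_a, peaks_b):
    """Number of peaks, number of intersections, and recall with respect to A."""
    a, b = set(peaks_a), set(peaks_b)
    shared = a & b
    return {
        "peaks_a": len(a),
        "peaks_b": len(b),
        "intersections": len(shared),
        # Recall is normalized by the peaks of the first series only.
        "recall": len(shared) / len(a) if a else 0.0,
    }


# Invented burst indices (months) for two activities.
t_a = {3, 4, 5, 10, 11, 12, 20, 21}
t_b = {4, 5, 6, 11, 12, 13, 21, 22, 30, 31, 32, 33}
metrics = burst_metrics(t_a, t_b)
print(metrics)

# A series that is bursting in every period intersects all peaks of t_a,
# yet only a small share of its own peaks is matched.
always_bursting = set(range(40))
print(burst_metrics(always_bursting, t_a)["recall"])
```

Recall is not symmetric: swapping the two series changes the denominator, so the order of the arguments decides whose peaks are being "explained" by the other series.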
The whole process has been implemented as a Java application that mines the
projects’ repository and generates CSV files. Such files are then the input of a Matlab
(http://www.mathworks.com) script that handles calculation and plotting of the
overlapping regions. Information about version releases and bug reports is still saved
manually to CSV files and then used as an additional data source for the script.
From this point we will label code-activities as CA, the activities of bug reports
opening as BO, and the bug reports closing activities as BC. We will also label the
peaks for opening bugs activities as $t'_{BO}$, closing bugs activities as $t'_{BC}$, and
code-activities as $t'_{CA}$.
5 Analysis
We ran burst detection on the CA, BO, and BC time-series for all projects. We
used a window size of l=8, as we found heuristically that this parameter worked better
with our dataset than a shorter window of l=2: the number of peaks identified for BC
and CA was 25.9% and 25% higher, respectively. A better fit was also confirmed by
visual inspection of the generated burst regions.
5.1 Bug Reports Opening and Closing versus Code-activities
First we ran a comparison between CA and BC (Table 5). In 3 out of 5 projects, the
process shows higher burstiness for code-related activities than for bug-closing
activities.
Table 5. Comparison between code activities and bug closing activities, l=8, x=0.5
Project         $t'_{BC}$  $t'_{CA}$  $|t'_{BC} \cap t'_{CA}| / |t'_{BC}|$
OpenOffice.org  28         20         0.00
KDE             8          12         0.62
Gnome           8          57         0.87
Wine            9          46         0.0
Eclipse         34         25         0.0
In only 2 projects, KDE and Gnome, there is an increase in CA that seems related
to BC activities. This specifically means that for these two projects, peaks in code
activities are highly correlated to the activities of closing bug reports in the same
period.
We ran the same comparison, this time taking into account the BO activities
(Table 6). In two projects, Gnome and Wine, peaks in bug report opening are
related to peaks in code-activities, while for the other projects the behavior is
different.
Table 6. Comparison between code activities and bug opening activities, l=8, x=0.5
Project         $t'_{BO}$  $t'_{CA}$  $|t'_{BO} \cap t'_{CA}| / |t'_{BO}|$
OpenOffice.org  23         20         0.09
KDE             11         12         0.09
Gnome           29         57         0.86
Wine            40         46         0.45
Eclipse         35         25         0.0
According to the relation between code and bug report activities, we can
group the projects into 4 categories: in the first category there is no apparent
connection in bursts between the time-series BC-CA and BO-CA (OpenOffice.org,
Eclipse); in the second category there is a relation in bursts BC-CA (KDE); in the third
category there is a relation BO-CA (Wine); and in the final category there is a strong
relation in bursts between the time-series BC-CA and BO-CA (Gnome).
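The grouping can be reproduced mechanically from the recall values in Tables 5 and 6. In the sketch below the threshold of 0.4 is a hypothetical value chosen only because it separates the reported values in the same way as the text; the paper does not state a numeric threshold.

```python
# project: (BC-CA recall from Table 5, BO-CA recall from Table 6)
RECALL = {
    "OpenOffice.org": (0.00, 0.09),
    "KDE": (0.62, 0.09),
    "Gnome": (0.87, 0.86),
    "Wine": (0.0, 0.45),
    "Eclipse": (0.0, 0.0),
}

THRESHOLD = 0.4  # hypothetical cut; not a value given in the paper


def categorize(bc_ca, bo_ca, threshold=THRESHOLD):
    """Assign one of the four categories from the two recall values."""
    related_bc = bc_ca >= threshold
    related_bo = bo_ca >= threshold
    if related_bc and related_bo:
        return "BC-CA and BO-CA"
    if related_bc:
        return "BC-CA only"
    if related_bo:
        return "BO-CA only"
    return "no apparent relation"


categories = {project: categorize(*values) for project, values in RECALL.items()}
for project, category in categories.items():
    print(f"{project}: {category}")
```

Wine sits close to the assumed threshold (0.45), so its category is the one most sensitive to where the cut is placed.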
5.2 Software Releases versus Code, Bug-closing, Bug-opening Activities
As a next step, we compared software releases with CA, BO, and BC activities
(Table 7). We wanted to evaluate the behavior of developers in proximity of
software releases. In all projects, to different degrees, releases are related to
increases in bug report opening, bug closing, and, to a minor extent, code-activities.
Table 7. Comparison of software releases (R) with bug report opening, bug report closing,
and code activities, l=8, x=0.5
Project         $|R \cap t'_{BO}| / |R|$  $|R \cap t'_{BC}| / |R|$  $|R \cap t'_{CA}| / |R|$
OpenOffice.org  0.38                      0.31                      0.23
KDE             0.33                      0.33                      0.33
Gnome           0.5                       0.06                      0.5
Wine            1.0                       0.22                      0.09
Eclipse         1.0                       1.0                       0.0
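With monthly data, the share of releases that fall inside a burst can be computed by mapping each release date to its month. The sketch below assumes bursts are stored as (year, month) pairs and that a release counts as covered when its month is a burst month; the release dates and burst months are invented.

```python
from datetime import date


def month_key(d):
    """Map a calendar date to the (year, month) bucket used by the time-series."""
    return (d.year, d.month)


def release_recall(release_dates, burst_months):
    """Share of release months that coincide with a burst month."""
    release_months = {month_key(d) for d in release_dates}
    if not release_months:
        return 0.0
    return len(release_months & set(burst_months)) / len(release_months)


# Invented release dates and burst months for one activity.
releases = [date(2008, 3, 27), date(2008, 6, 12), date(2008, 10, 13)]
bursts = {(2008, 3), (2008, 4), (2008, 10)}
share = release_recall(releases, bursts)
print(round(share, 2))
```

Two releases in the same month collapse into one bucket in this sketch, which is one of the consequences of working at monthly granularity.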
Bug report opening activities are related to new version releases: as a new
version is released there are peaks in the generation of bug reports. This effect
seems especially relevant for the Wine and Eclipse projects.
Bug closing activities are also stronger in the presence or proximity of a release
date. This effect seems particularly strong for the Eclipse project.
Comparing the two effects, we can state that in the presence of a release, bug
opening activities are more bursty than bug closing activities. This can be an
indication that, while bug closing activities are more gradual in time, bug opening
activities are more subject to bursts at a software release date.
Peaks in code activities are also related to the proximity of a release date: in
proximity of a release date there is a burst in code development
activities. The Wine project and, again, the Eclipse project do not follow this
behavior.
A question that remains open is how skewed the peaks of code activities and bug
closing activities are with respect to each other. One approach would be to consider
the distance between lagged time series, but this would compare all periods without
focusing on the peaks. Remaining in the context of our approach, it would mean
finding the number of lag periods $k$ that maximizes the overlap of peaks,
$\max_k |t'_A \cap t'_B|$, with time-series lagging as validation.
We leave this step as future work, since a significant result requires a finer
granularity of data analysis, based on data points on a daily scale.
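Since this step is left as future work, the following is only a sketch of what such a lag search could look like, not something carried out in the study. It assumes bursts are sets of integer period indices, shifts the second series by every lag in a bounded range, and keeps the lag with the largest overlap; the function name, the tie-breaking rule, and the sample indices are invented.

```python
def best_lag(peaks_a, peaks_b, max_lag):
    """Return (k, overlap) maximizing the overlap between A and B shifted by k
    periods, for -max_lag <= k <= max_lag. Ties favor the smaller |k|."""
    a = set(peaks_a)
    best_k, best_overlap = 0, len(a & set(peaks_b))
    for k in range(-max_lag, max_lag + 1):
        shifted = {t + k for t in peaks_b}
        overlap = len(a & shifted)
        if overlap > best_overlap or (overlap == best_overlap and abs(k) < abs(best_k)):
            best_k, best_overlap = k, overlap
    return best_k, best_overlap


# Invented burst indices: B shows the same pattern as A, two periods earlier.
t_a = {10, 11, 12, 30, 31}
t_b = {8, 9, 10, 28, 29}
k, overlap = best_lag(t_a, t_b, max_lag=3)
print(k, overlap)
```

On monthly data a lag of one period already spans several weeks, which is why a finer granularity matters before reading much into the value of k.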
6 Discussion
Returning to our initial research hypotheses, we can state the following:
H1. There is an increase in code-related activities as there is an increase in the
creation of bug-reports: we did not get enough evidence to support this hypothesis;
for only two projects out of five did we derive some evidence of a relation between
code activities and bug report creation. According to our rationale, this means that the
projects do not adapt quickly to the new reports that are issued;
H2. There is an increase in code-related activities as there is an increase in the
closed bug-reports: also in this case, we could not find evidence to support the
hypothesis. Only for two out of five projects there is indication that in coincidence
with peak code activities there is a peak activity in the closed bug reports;
H3. There is an increase in bug opening activities in the proximity of a software
release date: we report that this hypothesis is supported by all the projects, with two
out of five projects where the behavior is particularly evident;
H4. There is an increase in bug closing activities in the proximity of a software
release date: we can state that for most of the projects analyzed there are bursts of
bug closing activities coinciding with software releases;
H5. There is an increase in code-related activities in the proximity of a software
release: this holds for three out of five projects, where there is a more or less limited
coincidence of peaks in code activities with software releases.
In accordance with our initial observation about the many contrasting empirical
results on how quickly the development process responds to bug-fixing requests,
a generalization of results across projects is difficult to obtain (Table 8).
Table 8. Comparison of projects according to hypothesis of higher development speed (+
supports the hypothesis, - is against the research hypothesis)
Project         H1  H2  H3  H4  H5
OpenOffice.org  -   -   +   +   +
KDE             -   +   +   +   +
Gnome           +   +   +   -   +
Wine            +   -   ++  +   -
Eclipse         -   -   ++  ++  -
We see that each application has a different pattern in answering our hypotheses;
we suspect that this is due to the different application types and domains.
Our interpretation of the general findings is that the open source projects examined
are subject to limited peaks in code development activities during the early phases of
the bug reporting process (H1). As soon as a user/developer issues a bug report, the
activity starts to be frenetic in order to solve the issue. Solving many issues is then
incremental: there are limited bursts in activities as the bug reports are closed (H2).
The fixing of defects in code is a process that is distributed in time.
From another viewpoint, release dates are connected with peaks in the bug
opening and bug closing processes (H3/H4): approaching a software release,
bug reports are closed in bursts, and with a new release
users tend to increase their reporting activities. Confirming H1, bursts of code
activities are not detected, or only slightly detected, when there is a code release (H5).
This seems to confirm a more distributed effort during the development process.
7 Limitations and Future Works
The limitations of the current work are threefold:
- the major limitation is the monthly granularity of the time series considered; a
finer granularity would be more appropriate for the technique used;
- another limitation is the restricted number of projects analyzed, although we
considered relevant projects. We plan to extend the analysis to a larger number of
projects;
- we did not consider the intensity of peaks in this work. Adding this information, once
normalized, to the calculation of the distance between peaks would give more
information about the actual similarities of peaks.
Future work will address these limitations, in particular by
extending the current analysis to other projects and, more importantly, by collecting this
information through data mining of the individual projects. Results from the queries of the
Ohloh repository will be used as a relevant source for support and validation.
Moreover, other aspects can be investigated, such as the impact of the development of
open source components that are the basis for the development of a complex system, and the
techniques used to improve their quality through an open testing process [7].
8 Conclusions
The open source development methodology is considered to be highly efficient in
the bug fixing process and to offer generally quick version release cycles. We
conducted an empirical investigation of this assertion on 5 large and well-known
open source projects by studying the temporal evolution of source code activity,
issue tracking repository activities, and release dates. By using temporal
burst-detection to evaluate peaks in time-series, we compared the periods of highest
activity in the different time-series.
We found that peaks or bursts in code activities are not related to peaks in bug
report closing activities inside issue tracking systems; instead, we found peaks in
bug report opening/closing to be synchronized with version release cycles. Code
activities seem more distributed across version releases.
Our conclusions show that the open source development methodology quickly
adapts to the changing environment, but also that such velocity depends on the
application and specifically on the projects’ domain.
References
[1] Challet, D., Du, Y.L. Closed Source versus Open Source in a Model of Software Bug Dynamics,
Cond-Mat/0306511, June 2003, http://arxiv.org/pdf/cond-mat/0306511.
[2] Christensen, C.M. The ongoing process of building a theory of disruption. Journal of Product
Innovation Management, Vol. 23, Issue 1, 2006, pp. 39-55.
[3] Coman I., Sillitti A. An Empirical Exploratory Study on Inferring Developers’ Activities from
Low-Level Data. Proceedings, 19th International Conference on Software Engineering and
Knowledge Engineering (SEKE 2007), Boston, MA, USA, 9 - 11 July 2007.
[4] Coman I., Sillitti A. Automated Identification of Tasks in Development Sessions. Proceedings 16th
IEEE International Conference on Program Comprehension (ICPC 2008), Amsterdam, The
Netherlands, 10 - 13 June 2008.
[5] Dinh-Trong, T., Bieman, J. Open source software development: a case study of FreeBSD. Software
Metrics, 2004. Proceedings. 10th International Symposium on, 2004, pp. 96-105.
[6] Feller, J., Fitzgerald, B. Understanding Open Source Software Development, Addison-Wesley
Professional, 2001.
[7] Gross H.G., Melideo M., Sillitti A. Self Certification and Trust in Component Procurement,
Journal of Science of Computer Programming, Elsevier, Vol. 56, 2005, pp. 141 - 156.
[8] Kuan, J. Open Source Software as Lead User’s Make or Buy Decision: a Study of Open and Closed
Source Quality, Stanford University, 2002.
[9] Mockus, A., Fielding, R., Herbsleb, J. A case study of open source software development: the
Apache server. Proceedings of the 22nd international conference on Software Engineering,
Limerick, Ireland: ACM, 2000, pp. 263-272.
[10] Mockus, A., Fielding, R., Herbsleb, J. Two case studies of open source software development:
Apache and Mozilla, ACM Trans. Softw. Eng. Methodol., vol. 11, Jul. 2002, pp. 309-346.
[11] Scacchi, W. Is open source software development faster, better, and cheaper than software
engineering, 2nd ICSE Workshop on Open Source Software Engineering, 2002.
[12] Vlachos, M., Meek, C., Vagena, Z., Gunopulos, D. Identifying similarities, periodicities and bursts
for online search queries, Proceedings of the 2004 ACM SIGMOD international conference on
Management of data, Paris, France: ACM, 2004, pp. 131-142.
[13] Weinstock, C.B., Hissam, S.A. Making Lightning Strike Twice? in: J. Feller, B. Fitzgerald, S.
Hissam, K. Lakhani (Eds.), Perspectives on Free and Open Source Software, MIT Press,
Cambridge, MA, 2005, pp. 93-106.
... Many tools have been proposed to support developers in their maintenance tasks and mitigate such risk. The majority of such approaches base their algorithms on the syntactic changes of lines of code, which are useful to understand the changes within individual software components like classes (e.g., [2]), but they are less effective to understand the interactions among such components. Recently, graph theory has been used to identify the principles regulating the evolution of components' dependencies [3][4][5]. ...
... The number of inward (outward) edges of a node is the in-(out-)degree or simply degree if no direction for the edge is considered. Call graphs possess three major properties: (1) scale-free distribution for node degrees, (2) short average path lengths, and (3) high degree of clustering, [13,[18][19][20][21]. Properties (2) and (3) define the so-called small-world networks. Small-world networks tend to contain highly connected sub-structures (e.g., closed motifs). ...
... In particular, motifs 36, 38, 108, 110, and 238 are included in at least one release of any system and all such motifs but motif 238 are common among all releases of all systems, Fig. 10. In addition, the Z-score plot over releases of motifs 110 or / and 108 dominates all other plots 2 The tools we developed are available at https://goo.gl/gm6NaW. in the SUTs, Fig. 10. Looking at each system, we can see that Fig. 10. ...
Conference Paper
Components' interactions in software systems evolve over time increasing in complexity and size. Developers might have hard time to master such complexity during their maintenance activities incrementing the risk to make mistakes. Understanding changes of such interactions helps developer plan their re-factoring activities. In this study, we propose a method to study the occurrence of motifs in call graphs and their role in the evolution of a system. In our settings, motifs are patterns of class calls that can arise for many reasons as, for example, by implementing design choices. By mining motifs of the call graph obtained from each system's release, we were able to profile the evolution of 68 releases of five open source systems and show that 1) systems have common motifs that occur non-randomly and persistently over their releases, 2) motifs can be used to describe the evolution of calls, compare systems and eventually reveal releases that underwent major changes, 3) there are no specific motif types that include design patterns in all systems under study, but each system has motifs that likely include them, motifs that do not include them at all, and motifs that include a design pattern and occur only once in every release. Some of the findings resemble the ones for biological / physical systems and, as such, path the way to study the evolution of call graphs as dynamical systems (i.e., as system regulated by analytic functions).
... Rossi et al. [8] used the ohloh.net repository to fetch evolutionary data and metrics of five open source software projects (OpenOffice.org, ...
... For investigating the relation between the attributes (user, LOC, commits, and contributors) of single and multiple contributor OSS projects, we used the burst detection technique [8]. This technique pinpoints the peaks in the time series and compares them with peaks in other time series. ...
... Once the peaks (data points exceeding the cutoff) are identified for the time series, they are plotted on a line to make comparison with the peaks of other time series easy. For this evaluation, we assume the same metrics as used in [8] to compare the identified bursts. More explanation about the metrics can be found in [8]; here we give the definitions only. ...
... The occurrence of such peaks is known as a burst. Rossi et al. [8] use a burst detection technique to identify relationships between different activities of OSS projects. ...
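As an illustration of the idea in the excerpts above (peaks are data points exceeding a cutoff), here is a minimal Python sketch. The cutoff rule (mean plus `k` standard deviations of the series), the function name `detect_bursts`, and the sample data are assumptions made for illustration; they are not taken from [8], which may define the cutoff differently.

```python
from statistics import mean, pstdev


def detect_bursts(series, k=1.0):
    """Return the indices of points that exceed the cutoff.

    Assumption: cutoff = mean + k * population standard deviation.
    """
    cutoff = mean(series) + k * pstdev(series)
    return [i for i, value in enumerate(series) if value > cutoff]


# Hypothetical weekly commit counts for one project.
commits_per_week = [3, 4, 2, 15, 3, 4, 18, 2]
print(detect_bursts(commits_per_week))  # -> [3, 6]
```

A larger `k` makes the detection stricter, so fewer periods are flagged as bursts.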
... Rossi et al. [8] used the ohloh.net repository to fetch evolutionary data and metrics of five OSS projects (OpenOffice.org, ...
... For investigating the relation between the projects' change-type activity and project attributes (number of committers, number of files and number of lines), we used the burst detection technique [8]. This technique pinpoints the peaks in the time series and compares them with peaks in other time series. ...
Article
Software evolution refers to the phenomenon of continuous software change and growth after its initial development. A version control system records all information about these changes. Several research studies in the past have studied the historical records of changes of open source software (OSS) projects and found them useful for understanding the software evolution process. However, most of them investigate the distributions of change types, change size, and change effort in an isolated manner. There is no work, to the best of our knowledge, which takes a combined view of various dimensions of a change. This study examines the change activity in 106 OSS projects from three points of view: change purpose (type), change size, and change effort. The common patterns in change type, change size, and change effort are highlighted using the burst detection technique. The burst detection technique helps in identifying the peaks in the time series and compares them with the peaks of other time series. The results indicate that the change-type activity of OSS projects is significantly related to change effort and change size for the high- and moderate-activity clusters. For the low-activity cluster, however, this commonality of patterns does not hold for all types of changes.
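Comparing the peaks of one time series with those of another can be sketched as follows. This is a hypothetical illustration only: the overlap measure (share of bursts in the first series that coincide with a burst in the second), the mean-plus-`k`-standard-deviations cutoff, and all names and data are assumptions, not the metrics defined in the cited studies.

```python
from statistics import mean, pstdev


def burst_periods(series, k=1.0):
    """Set of indices whose value exceeds mean + k * std (assumed cutoff)."""
    cutoff = mean(series) + k * pstdev(series)
    return {i for i, value in enumerate(series) if value > cutoff}


def burst_overlap(first, second, k=1.0):
    """Fraction of bursts in `first` that fall in the same period as a burst in `second`."""
    bursts_first = burst_periods(first, k)
    bursts_second = burst_periods(second, k)
    if not bursts_first:
        return 0.0
    return len(bursts_first & bursts_second) / len(bursts_first)


# Hypothetical per-period measurements for one project.
change_size = [10, 12, 90, 11, 9, 85, 10, 12]
change_effort = [1, 1, 7, 1, 1, 1, 1, 8]
print(burst_overlap(change_size, change_effort))  # -> 0.5
```

Here the size series bursts in periods 2 and 5 and the effort series in periods 2 and 7, so half of the size bursts coincide with an effort burst.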
... Table 3 presents a list of problems/challenges detected by applying data analysis techniques in software processes. The problems of the techniques are related to the dependence of the datasets [18,20,27,29,31,32,33], defined parameters of the software project (i.e. amount of projects, project size, development effort) [20,33,34,38,39,41,51], some techniques require combined techniques to improve performance [27,29,36,43], inaccurate performance estimates if model validation techniques are wrongly selected [15], training data selection [17,22,36], building of specialized models [24], low performance compared to other techniques [19,22,33,48], performance of defect prediction models by scattered datasets [15,18], prediction ability [14], and evaluations of some techniques by a set of metrics when assigning bugs automatically [19]. ...
Techniques by software process area (first row truncated):
... [8,21], Bayesian Network [9,24], Case-Based Reasoning [34], Data Farming [34], J48 [44]
Project Effort Estimation: Analogy-Based Estimation [29,32], Association Rules [26], Artificial Neural Network [14,16,27,29,32,33,38], Bayesian Network [23,46], Bayesian Regression [41], Case-Based Reasoning [14], Clustering [35,37], Fuzzy Analogy [31], Fuzzy Clustering [33], Genetic Algorithm [33], K-means [43], Linear Regression [27,29,41], Log Linear Regression [38], Radial Basis Function Networks [27], Regression Analysis [7,18], Regression Trees [27,41], Support Vector Machines [27], Support Vector Regression [41], CART [14,26,29], Fuzzy Logic [38], OLS Regression [43]
Software Requirements: Genetic Algorithm [12]
Programming: Sequential pattern mining algorithm: MG-FSM [48]
Software Testing: Artificial Neural Network [45], Association Rules [50], C4.5 [11], Clustering [39,40], Decision Trees [6,11,17], Linear Discriminant Analysis [20], Genetic Algorithm [13,25], J48 [17], Logistic Regression [11,15,17], Naive Bayes Classifier [11,15,17,20,28], N-gram [50], Random Forest [15,20,45], Support Vector Machine [17,20,45], Time-Series Analysis [42], K-Nearest Neighbor [20,28]
Software Maintenance: Stacked Generalization [19], C4.5 [22], AdaBoost [36], Artificial Neural Network [36], Bagging [36], Bayesian Network [22,36], J48 [36], LogitBoost [36], Naive Bayes Classifier [20,22,30,36], Nnge [36], Random Forest [20,36], Regression Analysis [36], Support Vector Machine [20,22], K-Nearest Neighbor [20], Decision Tree [20]
Software Development/Team communication: Clustering [5]
Software Quality: Bayesian Network [10], Random Forest [10], Subgraph Mining [51], C4.5 [10], Naïve Bayes Classifier [10]
Software Reuse: Association Rules [47]
... Random Forest: It is used for classifying issue reports from software development processes [10,15,20,36,45].
Sequential pattern mining algorithm (MG-FSM): Sequential patterns from large-scale datasets are used as a means of identifying deficiencies in IDE usability by mining frequent usage patterns, due to flaws in design or gaps in developer knowledge in using the IDE [48].
Stacked Generalization: It can yield a higher prediction accuracy than using individual general purpose classifiers when classifying bug reports, especially if it is trained with more recent data, reaching prediction accuracies from 50% to 89% [19].
Subgraph Mining: It promises a strong reduction of time spent on defect localization in software engineering projects [51].
Support Vector Machines: The predictor models based on the SVM algorithm tend to detect more defect-prone modules, and analyze issue reports achieving accuracy of 75-83% depending on the project. SVM with certain kernels can achieve high performance [17,20,22,27,45].
Time-Series Analysis: It allows studying temporal evolution of source code activity, issue tracking repositories activities, and release dates [42].
c. Construct validity: At the beginning of this research, it was difficult to define the aspects that this study should include in order to recover relevant studies. ...
... Based on the review, some studies have focused on the comparison of characteristics of OSS with CSS from their evolutionary behavior point of view. They pointed out that the evolution of OSS and CSS may or may not vary in terms of
• Growth rate (Paulson et al., 2004; Robles et al., 2003; Capiluppi et al., 2004b; Xie et al., 2009; Neamtiu et al., 2013; Ferreira et al., 2011)
• Bug-fixing process and release frequency (Rossi et al., 2009)
Unlike CSS systems, OSS systems do not have a constrained growth due to increasing complexity as they evolve. However, software evolution is a discontinuous phenomenon in both cases. ...
Chapter
Many studies have been conducted to understand the evolution process of Open Source Software (OSS). Researchers have used various techniques for understanding the OSS evolution process from different perspectives. This chapter reports a meta-data analysis of the systematic literature review on the topic in order to understand its current state and to identify opportunities for the future. This research identified 190 studies, selected against a set of questions, for discussion. It categorizes the research studies into nine categories. Based on the results obtained from the systematic review, there is evidence of a shift in the metrics and methods for OSS evolution analysis over time. The results suggest that there is a lack of a uniform approach to analyzing and interpreting the results. There is a need for more empirical work using a standard set of techniques and attributes to verify the phenomenon governing the OSS projects. This will help to advance the field and establish a theory of software evolution.
... At a process level, source code review processes in companies nowadays are similar to the processes adopted in open source projects [6], [9], [10], with a lot of variation in the process steps [6], [11]. Furthermore, source code reviews can be expensive: knowledgeable understanding of large code changes by reviewers is a time-consuming process, and finding the most knowledgeable code reviewers for the source code parts to be reviewed can also be very labor-intensive for developers. ...
Conference Paper
Context: Software code reviews are an important part of the development process, leading to better software quality and reduced overall costs. However, finding appropriate code reviewers is a complex and time-consuming task. Goals: In this paper, we propose a large-scale study to compare the performance of two main source code reviewer recommendation algorithms (RevFinder and a Naive Bayes-based approach) in identifying the best code reviewers for opened pull requests. Method: We mined data from GitHub and Gerrit repositories, building a large dataset of 51 projects, with more than 293K pull requests analyzed, 180K owners and 157K reviewers. Results: Based on the large analysis, we can state that i) no model can be generalized as best for all projects, ii) the usage of a different repository (Gerrit, GitHub) can have an impact on the recommendation results, iii) exploiting sub-projects information available in Gerrit can improve the recommendation results.
Article
Eric Raymond's dictum "Release early, release often," which describes the Linux modus operandi, highlights the importance of release management in open source software development. However, there are very few empirical studies addressing release management in this context. It is already known that most open source software communities adopt either a feature-based or time-based release strategy. Both have their own advantages and disadvantages that are also context-specific. Recent research reports that many prominent open source software projects have overcome a number of recurrent problems by moving from feature-based to time-based release strategies. In this longitudinal case study, we address the release management practices of OpenStack, a large-scale open source project developing cloud computing technologies. We discuss how the release management practices of OpenStack have evolved in terms of chosen strategy and timeframes with close attention to processes and tools. We discuss a number of practical and managerial issues related to release management within the context of large and complex software ecosystems. Our findings also reveal that multiple release management cycles can co-exist in large and complex software ecosystems such as OpenStack.
Article
Contributions to open source software are motivated by many different incentives, some of which provide the basis of supporting institutions, and others of which help mitigate free-riding. In this paper, I consider the idea that programmers are driven to write software for their own use. Own use as a motivation helps explain patterns of open source founding and participation at the industry level. The analysis also predicts that when open source programs are founded, software quality will surpass that of comparable closed source programs. This quality prediction is tested, and supported, using bug resolution rates as a proxy for quality, or quality improvement.
Conference Paper
Task switching occurs frequently during the work of software developers. While there are already several approaches aiming at assisting developers in recovering their contexts of previous tasks, they generally rely on the developer to identify the beginning of each task. We propose a new technique for automatically splitting a development session into task-related subsections, based on the interaction of the developer with the IDE. The technique also shows potential benefits for automatic concern detection and for suggestions for code investigation. We present the technique, the results of a study conducted for its initial validation, and we discuss the additional potential benefits under investigation.
Conference Paper
We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., 'Thanksgiving' or 'Christmas gifts') where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time-series. Finally we propose a simple but effective method for identification of bursts (long or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform 'query-by-burst' on the database of time-series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database.
Conference Paper
According to its proponents, open source style software development has the capacity to compete successfully, and perhaps in many cases displace, traditional commercial development methods. In order to begin investigating such claims, we examine the development process of a major open source application, the Apache web server. By using email archives of source code change history and problem reports we quantify aspects of developer participation, core team size, code ownership, productivity, defect density, and problem resolution interval for this OSS project. This analysis reveals a unique process, which performs well on important measures. We conclude that hybrid forms of development that borrow the most effective techniques from both the OSS and commercial worlds may lead to high performance software processes.
Article
According to its proponents, open source style software development has the capacity to compete successfully, and perhaps in many cases displace, traditional commercial development methods. In order to begin investigating such claims, we examine data from two major open source projects, the Apache web server and the Mozilla browser. By using email archives of source code change history and problem reports we quantify aspects of developer participation, core team size, code ownership, productivity, defect density, and problem resolution intervals for these OSS projects. We develop several hypotheses by comparing the Apache project with several commercial projects. We then test and refine several of these hypotheses, based on an analysis of Mozilla data. We conclude with thoughts about the prospects for high-performance commercial/open source process hybrids.
Article
A unique opportunity to examine the process of building the theory of disruption is discussed. The theory of disruption is built in two major stages: the descriptive stage and the normative stage. The descriptive stage of theory building is a preliminary stage, because researchers generally must pass through it before developing normative theory. The theory-building process will create the opportunity for scholars to improve the crispness of definitions, the salience of the categorization scheme, and the methods for measuring the phenomena and the outcomes of interest.
Article
Component-based software engineering is typically perceived as application development in which existing individual software components are assembled and integrated in order to make up the final product. The main recent technological advances in this field therefore mainly focus on the integration step. This encompasses the syntactic and semantic mapping between components, the development of component wrappers and adapters, and the validation of all pair-wise component interactions. Additionally, prior to integration, components have to be located on a component market place, evaluated for their fitness for the purpose, and selected according to non-functional requirements. These activities are typically referred to as component procurement. Component brokerage platforms provide the support for these early phases of component assembly, and they are indispensable for strengthening the software component market. Although such platforms are good at the provision of components, they are not so good at their certification. This article proposes the combination of two contrasting technologies, component brokerage at one end of the component technology spectrum, and built-in contract testing at its other end, that, combined, may alleviate the efforts involved in component certification. This is achieved through the access mechanisms that built-in contract testing provides for components, and additional tester components through which customers can themselves assess the quality of a candidate component that is coming from a broker. Every such extended component is added to the certification according to well-defined standards that are provided by a third party on behalf of the supplier.
Conference Paper
A common claim is that open source software development produces higher quality software at lower cost than traditional commercial development. To validate such claims, researchers have conducted case studies of "successful" open source development projects. This case study of the FreeBSD project provides further understanding of open source development. The FreeBSD development process is fairly well-defined with proscribed methods for determining developer responsibilities, dealing with enhancements and defects, and for managing releases. Compared to the Apache project, FreeBSD uses a smaller set of core developers that implement a smaller portion of the system, and uses a more well-defined testing process. FreeBSD and Apache have a similar ratio of core developers to (1) people involved in adapting and debugging the system, and (2) people who report problems. Both systems have similar defect densities, and the developers are also users in both systems.