Determinism and Evolution
Israel Herraiz
Universidad Rey Juan Carlos
Madrid, Spain
herraiz@gsyc.es
Jesus M. Gonzalez-Barahona
Universidad Rey Juan Carlos
Madrid, Spain
jgb@gsyc.es
Gregorio Robles
Universidad Rey Juan Carlos
Madrid, Spain
grex@gsyc.es
ABSTRACT
It has been proposed that software evolution follows a Self-
Organized Criticality (SOC) dynamics. This fact is sup-
ported by the presence of long range correlations in the
time series of the number of changes made to the source
code over time. Those long range correlations imply that the current state of the project was determined a long time ago. In other words, the evolution of the software project is governed by a sort of determinism. But this idea seems to contradict
intuition. To explore this apparent contradiction, we have
performed an empirical study on a sample of 3,821 libre
(free, open source) software projects, finding that their evolution is short range correlated. This suggests that
the dynamics of software evolution may not be SOC, and
therefore that the past of a project does not determine its
future except for relatively short periods of time, at least for
libre software.
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance,
and Enhancement—Reverse engineering, Version control;
D.2.9 [Software Engineering]: Management—Life cycle,
Software configuration management, Time estimation
General Terms
Theory
Keywords
software evolution, time series analysis, self-organized criti-
cality, long term process, short term process
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
MSR’08, May 10-11, 2008, Leipzig, Germany.
Copyright 2008 ACM 978-1-60558-024-1/08/05 ...$5.00.
1. INTRODUCTION
Libre1 software development has traditionally been a source of strange cases of software evolution. The first to report one of these cases were Godfrey and Tu [8, 9]. Their findings suggested that Lehman's classical laws of software evolution [15] were not fulfilled in the case of Linux, because it was evolving at a growing rate (a rate which, in fact, was still growing 5 years later [21]).
Those cases raised the question of whether libre software evolves differently from proprietary software, and whether the laws of software evolution are a valid approach towards a universal theory of software evolution.
Although these questions have been addressed many times [14], most of the findings and models presented in those works have failed to provide the theoretical background needed for a proper and universal theory of software evolution. One study that addressed the problem was the PhD thesis by Wu [26], who among other interesting findings proposed that the evolution of libre software was governed by a Self Organized Criticality (SOC) dynamics.
This conclusion was supported by the presence of long range correlated time series in a set of 11 projects. Regardless of the suitability of the selected projects for this kind of study, the limited number of case studies, or even the methodology used, we find the idea of long range correlated processes in software evolution contrary to common intuition. Long range correlation would mean that the current state of the project is determined (or at least, heavily influenced) by events that took place a long time ago. In other words, the evolution of libre software is governed by a sort of determinism.
In order to explore whether this kind of dynamics is a property of libre software, we have selected a large sample of projects (3,821), performing an analysis similar to the one by Wu. We have studied the daily time series of changes, focusing on deciding whether their profiles were short or long range correlated.
The projects were obtained out of the whole population of
projects stored in SourceForge.net, a well known hosting ser-
vice for libre software projects, that provides a web-based in-
tegrated development environment. The data was obtained
using the CVSAnalY SourceForge dataset2, maintained by
our research group.
1 In this paper we will use the term “libre software” to re-
fer both to “free software”, as defined by the Free Software
Foundation, and “open source software”, as defined by the
Open Source Initiative.
2 http://libresoft.es/Results/CVSAnalY SF
The rest of the paper is structured as follows. The next
section presents some background on the question of soft-
ware evolution, and the particular case of libre software evo-
lution. Section three describes the data sources used to ex-
tract the sample of projects that have been studied in this
work, followed by a description of the methodology used, in
section four. In the next section, the results are shown and
discussed. The sixth section includes a brief analysis of the
sensitivity of the results. The next section includes a sum-
mary of the results, to highlight the findings of this study.
The eighth section discusses some threats to the validity of
this study. The ninth section discusses the conclusion based
on the empirical findings of this work. Finally, the last sec-
tion includes some acknowledgments of the research projects
that have funded this work.
2. RELATED WORK
Software evolution started as a field of research thirty years ago, with the seminal work done by Lehman. His aim was, and still is, to obtain a theory of software evolution. The laws of software evolution (first formulated in 1985 [15], and revised during the nineties [17]) were a first step in this direction. These laws were the basis for some evolution models made by Turski [24, 25].
However, both the laws and the assumptions made to obtain the mentioned models have not been verified in some case studies. The most notable is the case of the Linux kernel [8, 9], which evolves at a growing rate. One of the laws of software evolution states precisely that software evolves at a declining rate because of increasing complexity. That case in particular was labeled by Lehman as an anomaly [16].
The truth is that the quest for a theory of software evolution is still open. In a recent book [18], Lehman and Fernandez-Ramil revisit the question of what the requirements of a theory of software evolution are.
Many have tried to obtain a model for software evolution. There are two main approaches, which we label as the statistical and the physical approaches. Among the physical approaches, besides the works by Turski already cited, we mention [1, 6, 22]. These are three different approaches that have tried to model how software evolves based on some theoretical assumptions, building a predictive model on top of them. However, to our knowledge, no empirical validation at large scale has been done (using those models, or any other similar model).
The statistical approach has been more deeply studied.
We can cite again the works by Godfrey and Tu about the
evolution of Linux, or the update of that work by Robles et
al. [21]. Some studies have used time series analysis (the same approach that we use in this paper). For instance, [13] used ARIMA models to predict the monthly number of changes in a software project (ARIMA was also the winning model in the past MSR Challenge [10]). The same approach has
been used by other authors [2, 5]. Indeed, although time
series analysis is not yet the most popular approach in the
empirical software research community, it was proposed as
early as 1985 [28, 29, 30].
Nowadays, the most popular statistical techniques for em-
pirical studies of software research seem to be regression
analysis and principal component analysis (PCA). As an ex-
ample, besides Godfrey and Robles et al., we cite [7, 14] for
studies that use regression analysis and [19] for a study that
uses PCA to model the evolution of some operating systems.
In this sense, the PhD thesis by Wu [26] uses time series
analysis in order to evaluate the kind of dynamics that drive
software evolution. His approach is based on the notion of
the Hurst exponent, whose value may be used to classify a
process as long term or short term correlated. The empirical
study is based on 11 libre software projects. The conclusions
are that the 11 projects present long term correlations when
the time series of the daily number of changes are studied.
We recommend reading that thesis for full details on the notion of long term correlated processes, and to all those interested in a possible model for the evolution of libre software.
His approach was also recently presented at the IEEE International Conference on Software Maintenance [27].
3. DATA SOURCE
In this study we have considered the evolution of 3,821 software projects, which are hosted in SourceForge.net (SF.net). This site is a hosting service that offers a web-based integrated development environment for libre software projects. SF.net is the object of intense study by the FLOSSMole
research project [12]. FLOSSMole parses the web pages of
SF.net, and creates a database including information regard-
ing all the projects hosted in this site. This information is still mainly metadata (such as the license, number of developers, number of releases, etc.) rather than fine-grained information such as change records, source code metrics or defect data.
The database of FLOSSMole has been used by our research group to create a database containing the whole change history of the projects. This has been done for all the projects stored in FLOSSMole that fulfill the requirement of having a CVS repository3. The last dataset available was obtained in June 2006, and it is publicly available.
That dataset was the result of joining the databases created by the CVSAnalY tool, which was executed directly on the CVS repositories of the projects in SF.net. Among other information, that database contains a record for each commit in the CVS repository for all the projects stored in SF.net (including additional information such as the date of the commit, author, files affected, etc.). Therefore, we
could easily measure the daily number of changes for each
project using the CVSAnalY dataset.
We selected the projects according to the criteria shown in
section 4. After the selection process, the sample contained
3,821 projects.
Table 1 shows some statistical properties of the selected
projects. The number of developers is the value stored in
the FLOSSMole database. The project page in SF.net con-
tains a field indicating the number of developers that work
in the project. That field is increased each time a new devel-
oper joins the project. That field is not related to the CVS.
Although joining a project is a prerequisite to get access to
the CVS, being a developer in the project in SF.net does
not imply any participation in the CVS. In other words,
that number is an upper bound for the actual number of
developers working in the CVS.
Regarding the values labeled as “SF.net age” and “CVS
age”, those columns indicate the age of the projects in months.
3 The database is known to contain some errors, but they affect only a few projects. We will ignore the influence of those few projects on the global results.
The first value is the number of months that the project has
been stored in SF.net. The second value is the difference in
months between the dates of the last and first commit in the
CVS.
The next two columns (SLOC and number of files) indi-
cate the size of the project in Source Lines of Code (SLOC)
and in number of source code files. Basically, SLOC excludes blank and comment lines. Both values, SLOC and number of files, were obtained using the tool SLOCCount4. The measured sources were the last checkout (obtained in February 2008) of the CVS repository of each project.
The last column indicates the number of commits that
have occurred in the CVS repository.
All the values approximately correspond to June 2006 (the
exact dates differ from project to project).
4. METHODOLOGY
The methodology had three main steps:
1. Data retrieval
We downloaded the CVSAnalY dataset, and obtained
all the data necessary for the analysis from that database.
2. Time series analysis
After gathering all the change history, we calculated the daily number of changes for each project. For each
project we obtained a time series that required further
processing. We had to apply a smoothing procedure to
remove some noise from the data. After removing the
noise, we calculated the autocorrelation coefficients, in
order to find out whether the project was a short or
long term process.
3. Statistical analysis
In order to quantify how many of the projects could be described as short memory and how many as long memory, we used regression analysis and then analyzed the distribution of the resulting correlation coefficients.
These steps are detailed in the following subsections.
4.1 Data retrieval
Once we downloaded the CVSAnalY dataset, we had to
select the sample of projects for this study. There are two main requirements that the projects must fulfill: they need to be active, and they must be old enough to study their evolution.
For the first requirement, we used the criterion suggested by Capiluppi and Michlmayr [4]. These authors propose that projects with very few developers are probably still at an early stage in their history. Their research has shown that the behaviors of projects at different stages of evolution are different. There are three stages in the evolution of projects. The transition from one phase to another is achieved when the project manages to attract more developers. The first phase in the history of the project is called by Capiluppi and Michlmayr the cathedral phase. The next phase is a transition phase to the third, which is called the bazaar phase. Those labels are chosen to match the terms used by Raymond in his paper about the different dynamics of proprietary and libre software [20].
Summarizing, and using the terminology of the mentioned
paper, we wanted to select all those projects that were likely
4 Available at http://www.dwheeler.com/sloccount
Figure 1: Theoretical profiles of the autocorrelation coefficients of long term and short term processes. This diagram shows the mathematical relationship between the coefficients and the time lags. That equation may be used to obtain the Hurst exponent in the case of long term processes. Note the logarithmic scale of the axes.
to be in the bazaar phase. The empirical criterion that we
used was to select all the projects with at least 3 developers.
We used the value stored in the database of FLOSSMole, which corresponds to the field mentioned in table 1.
The second requirement is having at least one year of his-
tory. In the original work by Wu et al., where they propose SOC as the dynamics mechanism for the evolution of libre software, they study a year of changes extracted from the whole history of the projects. In order to
compare our results with the results of the mentioned work,
we decided to study, at least, one year of history of the
projects.
We have measured the age of the project in the CVS repos-
itory as the difference between the dates of the last and first
commits. Table 1 shows the values of age for the sample of
projects that have been studied. There are two columns:
SF.net age, as explained above, measures the number of
months since the project was registered, and CVS age is
the value mentioned in this paragraph. Because not all the
projects start to use CVS right from the beginning, SF.net
age is an upper bound for CVS age.
This selection procedure gave us a set of 3,821 projects.
We obtained the daily number of changes for those projects.
We counted only changes made to source code. We iden-
tified changes corresponding to source code thanks to the information contained in LibreSoft's dataset. The tool used, CVSAnalY, is able to identify the kind of file that was changed in each commit, and marks that change in the database with a specific type (documentation, images, translation files, source code, etc.).
Thus, for each project we had a time series containing a
point for each day. The value was the number of changes
performed on that day.
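As an illustration of this step, the following sketch builds such a daily series from a per-commit export of the data. The CSV layout (columns project, date and file_type) is a hypothetical stand-in, not the actual schema of the CVSAnalY dataset.

```python
# Sketch: building the daily time series of source code changes for one
# project. Assumes commits have been exported to a CSV with hypothetical
# columns "project", "date" and "file_type"; this is not the real schema
# of the CVSAnalY database.
import pandas as pd

def daily_changes(csv_path: str, project: str) -> pd.Series:
    commits = pd.read_csv(csv_path, parse_dates=["date"])
    # Keep only changes to source code files of the selected project.
    mask = (commits["project"] == project) & (commits["file_type"] == "code")
    code_commits = commits[mask]
    # Count commits per calendar day; days with no activity count as zero.
    daily = code_commits.groupby(code_commits["date"].dt.floor("D")).size()
    return daily.asfreq("D", fill_value=0)
```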
4.2 Time series analysis
One of the main properties of time series is the autocor-
relation coefficients. These are the linear correlation coeffi-
# Developers SF.net age CVS age SLOC # Files # Changes
Min. 3 29 12 10 1 2
Max. 354 92 91 12,028,586 44,174 126,354
Mean 8 64 33 72,903 374 2,417
Median 5 64 29 21,168 142 850
Sd. dev. 12 17 17 341,354 1,165 5,626
Table 1: Statistical properties of the sample of 3,821 projects. SF.net age indicates the number of months that
the project has existed in SF.net. CVS age indicates the number of months that have elapsed between the
first and the last commits in the CVS repository. SLOC measures size in Source Lines of Code (it excludes
blank and comment lines).
Figure 2: Autocorrelation coefficients in a time series. The coefficients are calculated by correlating the series against the same series shifted one position. If the series has n elements, there are n−1 coefficients. In the plot, each circle represents a point in the series. The plot shows how the series is progressively shifted, and how each coefficient is obtained by linear correlation of the original series and its shifted copies.
cients between the time series and the same series shifted one position into the future. Thus, we may obtain up to n−1 coefficients, where n is the number of lags of the series. For instance, if the time series was collected daily, each day will be a lag. Figure 2 shows a diagram that explains how the coefficients are calculated. For instance, r(1) is calculated by correlating the original series against the series shifted one position, and r(2) is the same but with the series shifted two positions. Following this procedure iteratively, we obtain n−1 coefficients, n being the number of points of the original time series (also called lags, as mentioned above).
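A minimal sketch of this computation follows (an illustration, not the exact code used for the study): each coefficient r(k) is the linear correlation of the series with a copy of itself shifted k positions.

```python
# Sketch: autocorrelation coefficients r(1) ... r(n-1) of a series,
# each obtained by correlating the series with a shifted copy of itself.
import numpy as np

def autocorrelation(x) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    n = len(x)
    coeffs = []
    for k in range(1, n):
        a, b = x[:-k], x[k:]          # overlap of the series and its shift
        if a.std() == 0 or b.std() == 0:
            coeffs.append(0.0)        # degenerate overlap, no correlation
        else:
            coeffs.append(np.corrcoef(a, b)[0, 1])
    return np.array(coeffs)           # n-1 coefficients
```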
The autocorrelation coefficients measure the linear pre-
dictability of the series, using only past values of the same series. More importantly, the shape of the plot
of the autocorrelation coefficients gives us information about
the kind of process that we are studying.
Short and long term processes present a very different
profile. Figure 1 shows the theoretical profiles for these pro-
cesses. The plot is in logarithmic scale. Long term processes
present a slow decay of the coefficients. Equation 1 shows
the relationship between the autocorrelation coefficients r(k) and the time lags k for an ideal long term process:

r(k) = C · k^(2·H−1),   k ∈ [1, n−1]   (1)

where H is the Hurst exponent (explained in detail in [26]) and n is the number of time lags (or number of points of the time series); k is an integer.
The ideal long term process presents a linear profile in the logarithmic plot of autocorrelation coefficients against time lags; the slope of the line may be used to obtain the Hurst
exponent. The usual example of a long term process is the
Nile River, because the floods come in cycles, and given the
data of floods in past years, future floods may be predicted.
On the other hand, short term processes present a fast decay of the coefficients, the ideal case being a linear relationship between the coefficients and the lags. Equation 2 shows the linear relationship between the autocorrelation coefficients r(k) and the time lags k for an ideal short term process:

r(k) = C · k,   k ∈ [1, n−1]   (2)
The typical example of a short memory process is the
stock market, where the value of the index today depends
at most on the values of the index during the last few days, but very old values of the index do not affect its current value. A real process would lie somewhere between those extreme profiles.
In his PhD thesis [26], Wu gives more details about these
same examples of short and long range correlated processes.
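As an illustration of how the two profiles could be told apart in practice, the sketch below estimates the Hurst exponent from the log-log fit of equation 1 and checks the linearity of equation 2. The 0.9 threshold is purely illustrative; it is not a value used in this paper.

```python
# Sketch: contrasting the two ideal profiles of the autocorrelation
# coefficients r(k). For a long term process (equation 1) the log-log plot
# of r(k) against k is a straight line whose slope yields the Hurst
# exponent; for a short term process (equation 2) r(k) is linear in k.
import numpy as np

def hurst_from_loglog(r) -> float:
    r = np.asarray(r, dtype=float)
    k = np.arange(1, len(r) + 1)
    ok = r > 0                                   # log requires r(k) > 0
    slope, _ = np.polyfit(np.log(k[ok]), np.log(r[ok]), 1)
    return (slope + 1) / 2                       # from r(k) = C * k^(2H-1)

def looks_short_term(r, threshold: float = 0.9) -> bool:
    r = np.asarray(r, dtype=float)
    k = np.arange(1, len(r) + 1)
    pearson = abs(np.corrcoef(k, r)[0, 1])       # linearity of r(k) vs k
    return pearson >= threshold                  # illustrative threshold
```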
In the case of our sample, the gathered data was noisy,
and the profile of the autocorrelation coefficients was not
clear (the values were dispersed, and could be fitted both to
a curve and to a straight line). In order to obtain a clean
profile, we applied kernel smoothing [23]. The procedure to
obtain the profiles is summarized in figure 3. This kernel
smoothing filter makes the series smoother by introducing
some autocorrelation in the data. If too much smoothing is
done, the data will artificially show a pattern due to the au-
tocorrelation added to the data. We discuss the influence of the degree of smoothing on the results in the section devoted to the threats to validity. In any case, we have validated this smoothing procedure in other cases [10, 11], and in both cases the results were not affected by the smoothing.
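For reference, a minimal Gaussian (Nadaraya-Watson) kernel smoother is sketched below. The paper does not specify the exact kernel or bandwidth used, so both are assumptions in this example.

```python
# Sketch: Gaussian kernel smoothing of the daily series before computing
# autocorrelations. Kernel and bandwidth are illustrative assumptions.
import numpy as np

def kernel_smooth(x, bandwidth: float = 7.0) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    smoothed = np.empty_like(x)
    for i in range(len(x)):
        # Nadaraya-Watson estimate: weighted average of neighbouring days.
        w = np.exp(-0.5 * ((t - i) / bandwidth) ** 2)
        smoothed[i] = np.sum(w * x) / np.sum(w)
    return smoothed
```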
When trying to model (or predict) a time series, the kind
of process (or profile) is important, because one of the pa-
rameters of the model is the memory of the process. The
memory is the number of points in the past that are influ-
encing the current and future values of the series.
Long term processes present a high memory value, while short term processes present a low one. From our experience [11], software
processes (or at least, the study of the evolution of changes in
libre software projects) present a memory of no more than 1
week (7 points for daily collected series, as in the case of this
study). It is even less in some particular cases. For instance,
for Eclipse [10], we think that the memory is no more than 3
days as we presented at the 2007 MSR Prediction Challenge.
The goal of the challenge was to predict the number of changes in Eclipse during February, March and April 2007 (obviously submitting the prediction before those months).
Using a different model for each component of Eclipse (that
Figure 3: Smoothing and profiles plot process. The
original data was very noisy, and smoothing was
needed in order to obtain a clear profile.
is, using a different memory parameter for each module),
our approach won the challenge and no component had a
memory greater than 3 days [10].
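As a hedged illustration of what a short memory model of this kind could look like, the sketch below fits an autoregressive model whose order plays the role of the memory parameter. The 3-day order echoes the memory mentioned above, but the exact model specification used for the challenge is not given here.

```python
# Sketch: a short memory (autoregressive) model of the daily number of
# changes. The order (memory, in days) is illustrative; it is not the
# exact specification used in the MSR Challenge submission.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_changes(daily: pd.Series, memory_days: int = 3, horizon: int = 90):
    model = ARIMA(daily, order=(memory_days, 0, 0)).fit()
    return model.forecast(steps=horizon)   # predicted daily changes
```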
Summarizing, once the autocorrelation coefficients have
been obtained, the plot of those coefficients indicates whether
the process is short or long term. A regression analysis can
help to find out the kind of process in an automatic way.
We describe that step in the next subsection.
4.3 Statistical analysis
After obtaining the time series, and calculating the auto-
correlation coefficients, we used regression analysis to find
the kind of profile of each project.
We correlated the autocorrelation coefficients against the
time lag, obtaining the Pearson coefficient for each project.
The Pearson coefficient measures the linear correlation be-
tween two variables. The value of the coefficient falls in the interval [−1, 1]. The closer its absolute value is to 1,
the stronger the linear relationship is. In other words, if
two variables do not have a linear relationship, the absolute
value of the Pearson coefficient would be much lower than
1.
Therefore, considering equation 2, if the process is close to
an ideal short term process, the Pearson coefficient should
be close to 1. On the other hand, if the process is not short
term, the coefficient should indicate no linear relationship between the autocorrelation coefficients and the time lags.
We repeated the mentioned procedure for the 3,821 projects.
The result was a set of 3,821 Pearson coefficients. As shown
in table 2, the values of the coefficients ranged from 0.32 to
0.99. To quantify how many projects could be classified as
short term or long term, we estimated the probability den-
sity function of the distribution of coefficients, plotted the
boxplot and calculated the quantiles of the sample. This is
explained in the next section.
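The aggregation step could look like the sketch below: one absolute Pearson coefficient per project, followed by the quantiles of the resulting sample (the density estimate and the boxplot of the next section can be obtained, for instance, with scipy.stats.gaussian_kde and matplotlib). Names are illustrative.

```python
# Sketch: per-project |Pearson| coefficient of r(k) against k, and the
# quantiles of the resulting sample (compare with table 3).
import numpy as np

def pearson_of_profile(r) -> float:
    r = np.asarray(r, dtype=float)
    k = np.arange(1, len(r) + 1)
    return abs(np.corrcoef(k, r)[0, 1])

def summarize(profiles):
    # profiles: iterable of per-project autocorrelation profiles r(k)
    values = np.array([pearson_of_profile(r) for r in profiles])
    quantiles = np.percentile(values, range(0, 101, 10))
    return values, quantiles
```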
5. RESULTS
As mentioned before, we obtained 3,821 daily time se-
ries of number of changes. For each series, we calculated
the autocorrelation coefficients r(k) (k being the time lag), and after that calculated the absolute value of the Pearson correlation coefficient of r(k) against k.
This gave us a set of 3,821 values. Table 2 summarizes the statistical properties of this set. At first glance, it seems that the values are concentrated around high values (for instance, r > 0.8), which indicates that the processes are closer to short term processes than to long term ones.
Minimum .3235
Maximum .9998
Mean .8429
Median .8534
Sd. dev. .1238
Table 2: Statistical properties of the set of Pearson
correlation coefficients. The values range from very
low (0.32, long term process) to very high (0.99,
short term process). With only this information,
we can not quantify how many projects could be
classified in each category (short or long term).
Figure 4: Boxplot of the set of Pearson correlation
coefficients. This graph shows that most of the cases (main box in the plot) are around high val-
ues (0.85) of the coefficient. In other words, most
of the projects appear to be short term processes.
We can represent the same data in a boxplot, in order
to obtain a more meaningful description of the data. This
boxplot is shown in figure 4. The results are quite clear. Let
us focus on the low values. The boxplot indicates that the
values below 0.5 (approximately) may be considered out-
liers. Most of the values are in the range between 0.75 and
0.95, and the median is around 0.85. Summarizing, most
of the projects present a high Pearson coefficient, indicat-
ing a strong linear relationship. This would indicate that
most of the projects are evolving like short range correlated
processes.
An estimation of the density function would help us to
quantify how many projects are in a certain range of values (like the approximate ranges mentioned in the above paragraph). The estimation of such a function is shown in figure 5. As that figure shows, it seems that there is a group of projects with very high values, and another group around a value of approximately 0.85 (that is, a group around the value of the mean or the median). Regarding the low end of the distribution (which would represent long term correlated projects), it seems that those projects are only a minority.
A more accurate statistical tool to quantify the number
of projects is the quantiles of the sample. Table 3 contains
the quantiles for the Pearson coefficients. For instance, only
40% of the projects present a correlation coefficient lower
than .8178, and only 20% lower than .7394. In other words,
80% of the projects have a coefficient greater than .7394.
According to the results shown in that table, and taking into account the statistical analysis performed on the Pearson
coefficients, it seems clear that most of the projects are gov-
erned by a short memory dynamics. There are only a few
Figure 5: Density function of the set of Pearson cor-
relation coefficients. It seems that there is a group of projects with coefficients very close to 1, and another group symmetrically distributed around a value of approximately 0.85. Both groups would
correspond to short term processes. Long term pro-
cesses, located in the left tail, are a minority.
Quantile (%) r
0 .3235
10 .6739
20 .7394
30 .7807
40 .8178
50 .8534
60 .8906
70 .9312
80 .9783
90 .9932
100 .9998
Table 3: Quantiles of the sample of Pearson corre-
lation coefficients. Less than 40% of the projects
present a value lower than 0.8178.
projects that present a profile that would indicate a long
memory dynamics.
6. SENSITIVITY ANALYSIS
In the previous section we have shown that most of the
Pearson coefficients show high values, meaning that most of
the projects under study are evolving like short term pro-
cesses. However, those projects are very heterogeneous. For instance, the size of the project varies over a wide range, as do the number of developers, the age, and the other parameters.
It could happen, for instance, that the number of developers or the size of the project influences how the project evolves. In order to find out if the results are sensitive to some of the properties of the projects, we have performed a brief sensitivity analysis.
We have considered all the factors that are shown in ta-
ble 1. We have plotted the values of each one of those prop-
erties against the value of the Pearson coefficient calculated
in the previous section. Thus, we can find out whether there exist patterns when the projects are clustered in homogeneous groups
(for instance, we could find that small projects evolve like
short term processes, but large projects do not).
Figure 6 shows the results of the sensitivity analysis. That
figure contains six plots. Each plot compares the value of
one of the properties shown in table 1 against the value of
the Pearson correlation coefficient. Each point corresponds to a project. Some of the plots have their vertical axis in logarithmic scale. This is because of the kind of distribution of that property. For instance, size follows a Pareto-like distribution (there are a few projects that are very large compared to the rest). That kind of data does not show well on a linear scale, as only some isolated points appear on one side of the plot, with a set of points grouped in a small area on the other. In order to make the plots clearer, we have
selected the logarithmic vertical axis for those properties.
The horizontal axis is in linear scale, and shows the Pearson
correlation coefficient. Values of the coefficient close to 1
correspond to short term processes. Values lower than 0.7
may be considered long term processes.
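As an example, one panel of figure 6 could be reproduced with the sketch below, which plots project size against the Pearson coefficient with a logarithmic vertical axis; variable names are illustrative.

```python
# Sketch: one panel of the sensitivity analysis, size (SLOC) against the
# Pearson coefficient, with the size axis in log scale as in figure 6.
import matplotlib.pyplot as plt

def sensitivity_panel(pearson, sloc, filename="sensitivity_sloc.png"):
    fig, ax = plt.subplots()
    ax.scatter(pearson, sloc, s=5, alpha=0.5)
    ax.set_yscale("log")            # sizes follow a Pareto-like distribution
    ax.set_xlabel("Pearson correlation coefficient")
    ax.set_ylabel("Size (SLOC)")
    fig.savefig(filename)
```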
At first glance, there is no clear pattern in the data.
In other words, in spite of the heterogeneity of the projects,
the character of the process (short or long term) is not re-
lated to any of the properties.
For instance, let us focus on the case of size. In the range of the Pearson coefficient from 0.9 to 1.0, there are small, medium and large projects. If we focus on a range of sizes, we also
find projects with a wide range of values of the Pearson
coefficient.
This behavior is verified for the rest of the properties too.
However, the amount of dispersion is different for each prop-
erty. For instance, the dispersion of the points for the case of
CVS age seems to be bigger than the dispersion for the case
of size. This is probably because the statistical distributions
of those two properties are different. In other words, in the
case of size, there are only a few points for very large and
very small projects in the plots, but that is because indeed
there are only a few projects with those values in the sample.
In any case, the statistical analysis described in section 4
should be repeated using smaller and more coherent groups.
The sensitivity analysis shown here is only a first approach to
find out if the results are sensitive to the different properties
of the projects.
7. SUMMARY OF RESULTS
Section 5 has shown the raw statistical results. Moreover
section 6 has shown how the results are affected by the dif-
ferent statistical properties of the projects. In this section
we include a brief summary of the main results.
First of all, after analyzing the shape of the profiles of
the time series, the main result is that at least 80% of the
projects can be considered short term processes. It seems
that there are two different groups of short memory projects,
which are statistically different. However, we could not identify the members of each group. Regarding those projects evolving like long term processes, we have not found a group with that characteristic. The long term evolving projects seem to be marginal compared to the population of the rest of the projects.
In those results, we have not made any differentiation by
size or any other characteristic of the projects. For instance,
it could happen that large and small projects present dif-
ferent profiles (long or short term), and hence the kind of
process could be a stage in the evolution rather than a dynamical property present throughout the whole life of the project.
[Figure 6: six scatter plots of # Developers, SF.net age (months), CVS age (months), Size (SLOC), # Files and # Changes (vertical axes) against the Pearson correlation coefficient (horizontal axis, 0.3 to 1.0).]
Figure 6: Sensitivity analysis. The plots show the values of each one of the properties shown in table 1
(vertical axis), compared against the Pearson correlation coefficient (horizontal axis). Short term projects
present values close to 1. The vertical axes of some of the plots are in logarithmic scale. The plots show that there is no pattern when dividing the projects into more homogeneous groups. For all the ranges of the
Pearson coefficients, we find projects with values of the properties in a wide range. In the same way, for all
the ranges in each one of the properties, we can find projects with a wide range of values of the Pearson
coefficients.
To find out if any of the properties that we have measured
for the projects (shown in table 1) influenced the results,
we plotted the Pearson coefficients against each one of the
properties. As discussed in section 5, the Pearson coefficient
of short term processes should be close to 1. The plots are
shown in figure 6 and discussed in section 6.
The results show that we can find short and long term
correlated projects in any size range (measured by number
of developers, lines of code, or files), or in any age range
(both for the above mentioned SF.net and CVS ages).
Therefore, although the projects under study do not form
a homogeneous group, the short term evolution dynamics
seem to be present in projects with very different properties, suggesting that this dynamics may be a universal property.
8. THREATS TO VALIDITY
We have two main concerns that may affect the valid-
ity of our results. The first concern regards the smooth-
ing procedure applied to the time series. This procedure
adds some internal autocorrelation, in order to remove some
noise. This makes the series smoother, and makes it easier
to identify a clear profile when plotting the autocorrelation
coefficients. However, if too much smoothing is done, the in-
ternal autocorrelation may hide the real profile of the data.
We are not yet sure about how this smoothing may affect
the shape of the profile. We have done some tests for some
isolated projects, and have not observed any modification of
the profile (except to make it clearer). We have used the
same procedure in previous studies [10, 11], where the re-
sults were compared against real data, and in spite of using
smoothing, the time series performed well. Therefore we
do not think that the smoothing procedure has changed the
profile of the series under study.
The other main concern regards the source for the se-
lected projects. SourceForge.net (SF.net) is known to con-
tain many dead projects, and it has been reported [3] that
SF.net may not be a representative source of the libre soft-
ware development world. In any case, we have studied a
large sample of projects, all of them with at least one year
of active history, and with a minimum of core developers.
Even if those projects have failed, we think that the sample
is likely to be representative of typical libre software devel-
opment (that is, developed in a community, with a large and modular source code base, etc.).
9. CONCLUSIONS AND FURTHER WORK
Software evolution still lacks a theory that explains how
projects are governed over their history. The classical approach by Lehman, summarized in the set of the laws of software evolution, has found some exceptions, many of them
in the libre software community.
The rise of these exceptions has fostered research on software evolution and libre software, and many models have appeared. Some of them are based on statistical considerations and others on theoretical assumptions. To date, only statistical models have been shown to perform well when compared against real data; however, those models cannot provide an explanation of the processes that they are predicting. One proof of this activity in software evolution modelling is the annual MSR Challenge, which proposes to predict the evolution of some selected case study.
One of the statistical approaches in the quest for a theory of software evolution is the proposal of a Self-Organized Criticality (SOC) dynamics for software evolution (restricted to the case of libre software). The original proposal [26, 27] seems to fit well with the concepts of libre software development. However, in our opinion, it has a very important drawback: it is not intuitive, as it states that software projects evolve like long term correlated processes. In other words, the current situation was determined a long time ago: the software project is driven by a sort of determinism.
In order to find out if this behavior was common in libre software projects, we have studied the evolution of a large sample of cases (3,821). Our results show that at least 80% of our projects are strongly short term correlated, and probably less than 20% could be considered non-short term processes.
Our study has been made on a heterogeneous set of projects.
The differences among the projects might influence the re-
sults. For instance, it may happen that small projects present
a short memory while large projects present a long memory.
We have tested the sensitivity of our statistical analysis, and have found that the results also hold for smaller and more homogeneous groups. However, the full analysis should be performed on those smaller groups. We plan to do that in future work.
Furthermore, this study is an example of the possibilities of conducting large scale empirical research when databases are documented and publicly available. FLOSSMole [12] and our dataset are examples of this kind of collaborative research. In any case, this kind of databases and initiatives should be encouraged. The European Commission-funded
project FLOSSMetrics (which has funded in part this work)
will provide databases with metrics and facts for thousands
of libre software projects. As further work, we plan to re-
peat the analysis shown in this paper with the projects that
FLOSSMetrics will include in its databases. That will help
us to find out if the short memory behavior is verified in
other projects not stored in SF.net, thus overcoming one of
the threats to validity mentioned in the previous section.
Finally, our findings suggest that SOC may not be a good
model for a hypothetical theory of software evolution, and
that the quest for such a theory must go on.
10. ACKNOWLEDGMENTS
This work has been funded in part by the European Com-
mission, through projects FLOSSMetrics, FP6-IST-5-033982,
and Qualipso, FP6-IST-034763, and by the Spanish CICyT,
project SobreSalto, TIN2007-66172.
11. REFERENCES
[1] I. Antoniades, I. Samoladas, I. Stamelos, L. Aggelis,
and G. L. Bleris. Dynamical simulation models of the
Open Source development process. In S. Koch, editor,
Free/Open Source Software Development, pages
174–202. Idea Group Publishing, Hershey, PA, 2004.
[2] G. Antoniol, G. Casazza, M. D. Penta, and E. Merlo.
Modeling clones evolution through time series. In
Proceedings of the International Conference on
Software Maintenance, 2001.
[3] K. Beecher, C. Boldyreff, A. Capiluppi, and S. Rank.
Evolutionary success of open source software: an
investigation into exogenous drivers. In Third
International ERCIM Symposium on Software
Evolution. ERCIM, 2007.
[4] A. Capiluppi and M. Michlmayr. Open Source
development, adoption and innovation, chapter From
the Cathedral to the Bazaar: An Empirical Study of
the Lifecycle of Volunteer Community Projects, pages
31–44. IFIP: International Federation for Information
Processing. Springer Boston, 2007.
[5] F. Caprio, G. Casazza, M. D. Penta, and U. Villano.
Measuring and predicting the Linux kernel evolution.
In Proceedings of the International Workshop of
Empirical Studies on Software Maintenance, Florence,
Italy, 2001.
[6] J.-M. Dalle and P. A. David. The allocation of
software development resources in Open
Source production mode. Technical report, SIEPR
Policy paper No. 02-027, SIEPR, Stanford, USA, 2003.
http://siepr.stanford.edu/papers/pdf/02-27.pdf.
[7] A. R. Fasolino, D. Natale, A. Poli, and
A. Alberigi-Quaranta. Metrics in the development and
maintenance of software: an application in a large
scale environment. Journal of Software Maintenance:
Research and Practice, 12:343–355, 2000.
[8] M. Godfrey and Q. Tu. Evolution in Open Source
software: A case study. In Proceedings of the
International Conference on Software Maintenance,
pages 131–142, San Jose, California, 2000.
[9] M. Godfrey and Q. Tu. Growth, evolution, and
structural change in open source software. In
International Workshop on Principles of Software
Evolution, Vienna, Austria, September 2001.
[10] I. Herraiz, J. M. Gonzalez-Barahona, and G. Robles.
Forecasting the number of changes in Eclipse using
time series analysis. In International Workshop on
Mining Software Repositories. IEEE Computer
Society, 2007.
[11] I. Herraiz, J. M. Gonzalez-Barahona, G. Robles, and
D. M. German. On the prediction of the evolution of
libre software projects. In IEEE International
Conference on Software Maintenance, pages 405–414.
IEEE Computer Society, 2007.
[12] J. Howison, M. Conklin, and K. Crowston.
FLOSSMole: a collaborative repository for FLOSS
research data and analyses. International Journal of
Information Technology and Web Engineering,
1(3):17–26, July-September 2006.
[13] C. F. Kemerer and S. Slaughter. An empirical
approach to studying software evolution. IEEE
Transactions on Software Engineering, 25(4):493–509,
1999.
[14] S. Koch. Evolution of Open Source Software systems -
a large-scale investigation. In Proceedings of the 1st
International Conference on Open Source Systems,
Genova, Italy, July 2005.
[15] M. M. Lehman and L. A. Belady, editors. Program
Evolution. Processes of Software Change. Academic
Press Inc., 1985.
[16] M. M. Lehman, J. F. Ramil, and U. Sandler. An
approach to modelling long-term growth trends in
software systems. In International Conference on
Software Maintenance, pages 219–228, Florence, Italy,
November 2001.
[17] M. M. Lehman, J. F. Ramil, P. D. Wernick, D. E.
Perry, and W. M. Turski. Metrics and laws of software
evolution - the nineties view. In METRICS ’97:
Proceedings of the 4th International Symposium on
Software Metrics, page 20, November 1997.
[18] N. H. Madhavji, J. Fernandez-Ramil, and D. E. Perry,
editors. Software Evolution and Feedback. Theory and
Practice. Wiley, 2006.
[19] Y. Peng, F. Li, and A. Mili. Modeling the evolution of
operating systems: An empirical study. The Journal of
Systems and Software, 80(1):1–15, 2007.
[20] E. S. Raymond. The cathedral and the bazaar. First
Monday, 3(3), March 1998.
http://www.firstmonday.dk/issues/issue3_3/raymond/.
[21] G. Robles, J. J. Amor, J. M. Gonzalez-Barahona, and
I. Herraiz. Evolution and growth in large libre
software projects. In Proceedings of the International
Workshop on Principles in Software Evolution, pages
165–174, Lisbon, Portugal, September 2005.
[22] G. Robles, J. J. Merelo, and J. M. Gonzalez-Barahona.
Self-organized development in libre software: a model
based on the stigmergy concept. In Proceedings of the
6th International Workshop on Software Process
Simulation and Modeling (ProSim 2005), St.Louis,
Missouri, USA, May 2005.
[23] R. H. Shumway and D. S. Stoffer. Time Series
Analysis and Its Applications: With R Examples. Springer
Texts in Statistics. Springer, 2006.
[24] W. M. Turski. Reference model for smooth growth of
software systems. IEEE Transactions on Software
Engineering, 22(8):599–600, 1996.
[25] W. M. Turski. The reference model for smooth growth
of software systems revisited. IEEE Transactions on
Software Engineering, 28(8):814–815, 2002.
[26] J. Wu. Open Source Software evolution and its
dynamics. PhD thesis, University of Waterloo, 2006.
[27] J. Wu, R. Holt, and A. E. Hassan. Empirical evidence
for SOC dynamics in software evolution. In IEEE
International Conference on Software Maintenance,
pages 244–254. IEEE Computer Society, 2007.
[28] C. C. H. Yuen. An empirical approach to the study of
errors in large software under maintenance. In
Proceedings of the International Conference on
Software Maintenance, 1985.
[29] C. C. H. Yuen. A statistical rationale for evolution
dynamics concepts. In Proceedings of the International
Conference on Software Maintenance, 1987.
[30] C. C. H. Yuen. On analyzing maintenance process
data at the global and detailed levels. In Proceedings
of the International Conference on Software
Maintenance, pages 248–255, 1988.
9
... Out of 98 primary studies, Linux, Apache, Mozilla and Eclipse open source projects are commonly selected by 15, 15, 13 and 8 studies as shown in Table 10. The major motivation for the selection of open source case studies is their free availability and continuous [73,120,121,126] Conference (ICSE) [40,60,76,82,83] Conference (SEAA) [74] Conference (CSMR) [36,38,49,59,109,112] Conference (WCE) [75] Conference (SEN) [37,114] Conference (PMSE) [77] Journal (IEE Software) [105,106] Journal (JIPS) [78] Conference (ESME) [43,72,113] Conference (WASET) [79] Journal (JSS) [44,99] Conference (MSR) [80] Journal (IST) [45,65] Conference (SSM) [81] Workshop (IWMSR) [48,53] Conference (ESEM) [85] Workshop (IWPSE) [33,107,110,129] Conference (OOPL) [95] Conference (Testing) [46] Workshop (FSE) [96] Journal (Software Quality) [47] Journal (IJOSSP) [97] Conference (ICSTE) [50] Journal (ESE) [98] Conference (WCRE) [51] Conference (IWPSE) [100] Conference (ICPC) [52] Conference (EIS) [101] Conference (OSS) [55] Conference (EAIT) [102] Conference (ICSEA) [56] Journal (ToSEM) [104] Conference (CSSP) [57] Conference (ICSM) [108] Journal (TMIS) [58] Conference (WCSMR) [109] Journal (SCP) [64] Conference (ISSDM) [111] Conference (ICACT) [66] Journal (IJHIT) [118] Conference (ICEMIS) [119] Conference (ISSRE) [127] Conference (ICACCI) ...
... Herraiz et al. [38] used the age of project metric as "SF.net age" and "CVS age" parameters, they indicate the age of project in months. "SF.net age" is the number of months that the project has been stored in SF.net. ...
... Value series metrics [51] Bug fix metrics [43] Repository metrics [37] Age of project metrics [38] No. of clones [50] Structural S/W metrics [119] [119] ...
Article
Full-text available
Open source software (OSS) evolution is an important research domain, and it is continuously getting more and more attention of researchers. A large number of studies are published on different aspects of OSS evolution. Different metrics, models, processes and tools are presented for predicting the evolution of OSS studies. These studies foster researchers for contemporary and comprehensive review of literature on OSS evolution prediction. We present a systematic mapping that covers two contexts of OSS evolution studies conducted so far, i.e., OSS evolution prediction and OSS evolution process support. We selected 98 primary studies from a large dataset that includes 56 conference, 35 journal and 7 workshop papers. The major focus of this systematic mapping is to study and analyze metrics, models, methods and tools used for OSS evolution prediction and evolution process support. We identified 20 different categories of metrics used by OSS evolution studies and results show that SLOC metric is largely used. We found 13 different models applied to different areas of evolution prediction and auto-regressive integrated moving average models are largely used by researchers. Furthermore, we report 13 different approaches/methods/tools in existing literature for the evolution process support that address different aspects of evolution.
... Nowadays in most of the literature, it is widely accepted that continuous changes are the key feature of evolution. It has been associated with the change of code [26] [27] [35], modules [2][62] or architecture [34] [1][32] of a software system typically. In addition, there are also studies on reverse engineering [1][32] at the code and process level. ...
... [3] develops a formal abstract service (service schema) model for service evolution management, which provides an understanding for change tracking, control and impact. In [27], an algorithm for automatically assessing the forward compatibility between two revisions of a service specification is proposed. ...
Article
While culture being the software controlling human mind, computer software development becomes one of the most creative activities that human undertake since the civilisation began. The only limitation in software creation is human imagination, and that limit is often self-imposed. The “Internetware”, referring to a software paradigm, aims to satisfy the need of human kind using Internet as an integrated development and execution platform. Such software systems are composed of entities distributed through the Internetwork, allowing connections that would be impossible or difficult to make otherwise. One of the tasks for the Internetware is to accommodate creativity, to understand the general settings of creative design process and to develop programs that can enhance creativity without necessarily being creative themselves. Therefore, it can be summarized that a development environment needs to be built to best support software creation process of six steps including searching, ideating, specifying, coding, testing and evolving. An E-Health application eco-system is used to illustrate the proposed development process model.
... He explicitly suggested to use regression techniques, autocorrelation plots and time series analysis for the study of software evolution, based on the idea of feedback driven evolution [2]. Interestingly, time series analysis has provided some empirical findings in the field of software evolution, such as accurate forecasting [72], that is based on the fact that software evolution is a short memory process [73] 8 Lehman made these suggestions based on the idea of software evolution as a feedbackdriven process. Feedback causes smooth profiles, with trends that do not vary greatly in the history of the system. ...
... In 1978, to explain this statistically smooth feature, Lehman introduced two more effects besides feedback: inertial and momentum effects. The net result of these three effects is that "for maximum cost-effectiveness, management consideration and judgement should include the entire history of the project with the current state having the strongest, but not exclusive, influence" [8], which is similar to the empirical finding previously mentioned [73], that states that software evolves like a short memory process, where the strongest influence is due to recent events, although some longer term effects may have influence as well (for instance, periodical events with large periods). This property was found using a large sample of software, and most of that sample fulfilled it, suggesting that it is an invariant property of software evolution. ...
Article
Full-text available
After more than 40 years of life, software evolution should be considered as a mature field. However, despite such a long history, many research questions still remain open, and controversial studies about the validity of the laws of software evolution are common. During the first part of these 40 years the laws themselves evolved to adapt to changes in both the research and the software industry environments. This process of adaption to new paradigms, standards, and practices stopped about 15 years ago, when the laws were revised for the last time. However, most controversial studies have been raised during this latter period. Based on a systematic and comprehensive literature review, in this paper we describe how and when the laws, and the software evolution field, evolved. We also address the current state of affairs about the validity of the laws, how they are perceived by the research community, and the developments and challenges that are likely to occur in the coming years.
... After extracting the source code, it is analyzed from various perspectives. We noticed the following techniques for analyzing the software evolution in the review papers: • Using Metrics ▪ Growth Analysis (Godfrey and Tu, 2000;Lehman et al., 2001;Robles et al., 2005;Koch, 2007) ▪ Complexity Analysis (Tahvildari et al., 1999;Stewart et al., 2006;Darcy et al., 2010;Girba et al.,2005b) ▪ Modularity Analysis (Milev et al., 2009;Capiluppi, 2009;Alenezi and Zarour, 2015;Olszak et al., 2015) ▪ Architectural Analysis (Capiluppi, 2004a;LaMantia et al., 2008;Wermilinger et al., 2011;Le et al., 2015;Alenezi and Khellah, 2015) • Topic Models Based Approach (Hassan et al., 2005a;Thomas et al., 2014, Hu et al., 2015 • Complex Systems Theory (Wu et al., 2007;Herraiz et al., 2008;Gorshenev and Pismak, 2003) • Graph/Network Analysis Based Approach (Jenkins and Kirk,2007;Murgia et al., 2009;Wang et al., 2009;Ferreira et al., 2011;Pan et al., 2011;Chaikalis et al., 2015;Kpodjedo et al., 2013) There is need to analyze an OSS system from a wider perspective, i.e. beyond its source code files, to understand and improve the software development process (Robles et al., 2006a). An OSS project management team uses several types of repositories to track the activities of a software project as it progresses (Hassan et al., 2005b). ...
Chapter
Many studies have been conducted to understand the evolution process of Open Source Software (OSS). The researchers have used various techniques for understanding the OSS evolution process from different perspectives. This chapter reports a meta-data analysis of the systematic literature review on the topic in order to understand its current state and to identify opportunities for the future. This research identified 190 studies, selected against a set of questions, for discussion. It categorizes the research studies into nine categories. Based on the results obtained from the systematic review, there is evidence of a shift in the metrics and methods for OSS evolution analysis over the period of time. The results suggest that there is a lack of a uniform approach to analyzing and interpreting the results. There is need of more empirical work using a standard set of techniques and attributes to verify the phenomenon governing the OSS projects. This will help to advance the field and establish a theory of software evolution.
... The change dynamics reappear as long range correlations in the time series of changes, though the findings of Herraiz et al. (2008) suggest that the correlations in the time series of changes are short term. Gorshenev and Pismak (2003) also observed a power law distribution in change size. ...
Article
Due to the dominance of Open Source Software (OSS) in IT and the IT-enabled services industry, various stakeholders are keen to understand the OSS evolution process. Several studies have been conducted in the past in this regard, using various techniques to understand the OSS evolution process from different perspectives. This paper reports a systematic literature review on the topic in order to understand its current state and to identify opportunities for the future. This research identified 190 studies, selected against a set of questions, for discussion. It categorizes the research studies into nine categories. We report the review results in a set of two papers. This paper discusses the research results of the techniques used for OSS evolution analysis only, i.e. one out of the nine categories; a subsequent paper carries the discussion on the remaining categories. Based on the results obtained from the systematic review, there is evidence of a shift in the metrics and methods for OSS evolution analysis over time. OSS systems were found to grow at a super-linear rate in the initial studies, but later studies revealed that branches of an OSS system grow at different rates. However, more studies should be carried out using a repeatable methodology in order to obtain well-formed and generalizable results.
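The contrast drawn in the excerpt above between long-range and short-range correlation can be illustrated, in its simplest form, by inspecting the sample autocorrelation function of a series of change counts. The sketch below uses a synthetic monthly series and is not the estimation procedure of any of the cited studies, which typically rely on more robust techniques such as Hurst exponent estimation.

# Sample autocorrelation of a (synthetic) monthly change-count series.
# A slowly, power-law-like decaying ACF hints at long-range correlation;
# a fast, roughly exponential decay points to short-range memory only.
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 1..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    var = float(np.dot(x, x))
    return [float(np.dot(x[:-lag], x[lag:]) / var) for lag in range(1, max_lag + 1)]

rng = np.random.default_rng(0)
changes = rng.poisson(lam=20, size=120)   # stand-in for 10 years of monthly commits
acf = autocorrelation(changes, max_lag=24)
print([round(v, 2) for v in acf[:6]])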
... These include: • The projects that we surveyed are all open source projects. Furthermore, it has been observed that many SourceForge projects are not active [2, 10]. In our analysis, we use projects where there is activity in terms of downloads. ...
Article
Software architecture is concerned with the structure of software systems and is generally agreed to influence software quality. Even so, little empirical research has been performed on the relationship between software architecture and software quality. Based on 1,141 open source Java projects, we calculate three software architecture metrics (measuring classes per package, normalized distance, and degree of coupling) and analyze to what extent these metrics are related to defect ratio and download rate. We conclude that there are a number of significant relationships. In particular, the number of open defects depends significantly on all our architecture measures.
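The "normalized distance" mentioned in the abstract above is commonly computed as R. C. Martin's distance from the main sequence; whether the cited study uses exactly this definition is an assumption made here purely for illustration.

# Package-level metrics of the kind mentioned above, following R. C. Martin's
# definitions (an assumption about the cited study, made only for illustration).
from dataclasses import dataclass

@dataclass
class PackageStats:
    abstract_classes: int    # abstract classes and interfaces in the package
    concrete_classes: int
    afferent_coupling: int   # Ca: number of packages depending on this one
    efferent_coupling: int   # Ce: number of packages this one depends on

def abstractness(p):
    total = p.abstract_classes + p.concrete_classes
    return p.abstract_classes / total if total else 0.0

def instability(p):
    coupling = p.afferent_coupling + p.efferent_coupling
    return p.efferent_coupling / coupling if coupling else 0.0

def normalized_distance(p):
    """Distance from the 'main sequence': D = |A + I - 1|, in [0, 1]."""
    return abs(abstractness(p) + instability(p) - 1.0)

# Hypothetical package: 2 interfaces, 18 concrete classes, 3 incoming and
# 9 outgoing package dependencies.
print(normalized_distance(PackageStats(2, 18, 3, 9)))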
... However, their results regarding long-term correlations were then challenged by Herraiz et al. [9], albeit on a different dataset. ...
Article
Code changes propagate. Type, frequency, and size of changes typically explain and even predict the impact of changes in software products. What can changes tell us about software processes? In this study, we propose a novel method to render software processes as graphs of linked commits that act as carriers of change information. Mining histories in such commit graphs makes it possible to exploit techniques of graph analysis and coloring that can be used to understand activities in software processes. As an application of our method, we analysed colored commit graphs to investigate the presence of large architectural changes and their likelihood of occurrence in bug fixing. For this, we introduced a new measure of architectural change based on hashing and a linear-time kernel for bit-labeled graphs. We applied our approach to analyse the evolution of changes in Eclipse JDT and the Spring Framework.
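As a starting point for the commit-graph idea described above, the sketch below builds a directed graph of commits from the parent/child relation recorded by git. The cited work links commits through richer change information and colors them, so this is only the scaffolding; the repository path is a hypothetical placeholder and the use of networkx is an assumption.

# Build a directed commit graph (parent -> child) from `git log`.
# networkx is assumed to be installed; any graph library would do.
import subprocess
import networkx as nx

def commit_graph(repo_path):
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--format=%H %P"],
        capture_output=True, text=True, check=True,
    ).stdout
    g = nx.DiGraph()
    for line in out.splitlines():
        sha, *parents = line.split()
        g.add_node(sha)
        for parent in parents:
            g.add_edge(parent, sha)   # edge from parent commit to its child
    return g

g = commit_graph("./some-oss-project")   # hypothetical path
print(g.number_of_nodes(), "commits,", g.number_of_edges(), "parent links")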
... • The past of an OSS project does not determine its future except for relatively short periods of time. The study of large closed source systems [49] is not sufficient to justify or account for the evolutionary pattern and behavior of open source software, since these laws did not consider the community dimension of OSS projects, which is an integral part of the sustainable evolution of open source software. ...
Article
Full-text available
Open Source Software (OSS) is continuously gaining acceptance in commercial organizations. It is in this regard that those organizations strive for a better understanding of the evolutionary aspects of OSS projects. The study of evolutionary patterns of OSS projects and communities has received substantial attention from the research community over the last decade. These efforts have resulted in an ample set of research results for which there is a need for up-to-date comprehensive overviews and literature surveys. This paper reports on a systematic literature survey aimed at the identification and structuring of research on the evolution of OSS projects. In this review we systematically selected and reviewed 101 articles published in relevant venues. The study outcome provides insight into what constitutes the main contributions of the field, identifies gaps and opportunities, and distills several important future research directions.
Conference Paper
Full-text available
Libre (free, open source) software projects are lately getting increasing attention from the research community; for instance, several studies have focused on the inner workings of some successful projects. However, there is still little emphasis on trying to explain the landscape of libre software development at large, maybe due to the distribution of developers, to the (in many cases) non-compulsory nature of their relationships, and to the extreme importance of motivation to attract resources to a project. In this paper we model the relationships among developers (with each other and with the projects they decide to put work in) on the behavior of some social insects performing large-scale works. Specifically, we apply the concept of stigmergy, which considers that communication (by means of stimuli) does not happen directly among entities (in our case developers), but through changes in the environment. Stigmergy makes possible an autocatalytic reaction of the same kind as the one observed in bazaar-like, self-organized libre software projects. We build a model based upon these ideas, test it against quantitative data and results from previous research, and provide results of a simulation. Our conclusion is that libre software development can indeed be modeled as a stigmergic phenomenon, in terms of the allocation of developers to projects and of the further evolution of those projects. An important consequence of this fact is that the individual productivity of developers would not be as important as the total production of a community. This would mean that the exploitation of stigmergic mechanisms would be more efficient for increasing the output of a project than actions oriented towards increasing the productivity of individuals.
Article
Full-text available
This article introduces and expands on previous work on a collaborative project, called FLOSSmole (formerly OSSmole), designed to gather, share, and store comparable data and analyses of free, libre, and open source software (FLOSS) development for academic research. The project draws on the ongoing collection and analysis efforts of many research groups, reducing duplication and promoting compatibility both across sources of FLOSS data and across research groups and analyses. The article outlines difficulties with the current typical quantitative FLOSS research process, uses these to develop requirements, and presents the design of the system.
Article
Full-text available
Software systems are continuously subject to evolution to add new functionalities, to improve quality or performance, to support different hardware platforms and, in general, to meet market requests and/or customer requirements. As a part of a larger study on software evolution, this paper proposes a method to estimate the size and the complexity of a software system, which can be used to improve the software development process. The method is based upon the analysis of historical data by means of time series. The proposed method has been applied to the estimation of the evolution of 68 subsequent stable versions of the Linux kernel in terms of KLOCs, number of functions and average cyclomatic complexity.
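In the spirit of the time-series estimation described above, the following sketch fits a simple ARIMA model to a synthetic series of KLOC measurements and forecasts the next releases. The synthetic series, the ARIMA(1,1,0) order, and the use of statsmodels are illustrative assumptions, not the configuration of the cited method.

# Time-series-based size estimation sketch on a synthetic KLOC series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
# Synthetic KLOC measurements for 68 successive stable versions.
kloc = np.cumsum(rng.normal(loc=50, scale=10, size=68)) + 1000

model = ARIMA(kloc, order=(1, 1, 0)).fit()
print(model.forecast(steps=3))   # estimated size of the next three versions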
Article
Full-text available
Software evolution research has recently focused on new development paradigms, studying whether laws found in more classic development environments also apply. Previous works have pointed out that at least some laws seem not to be valid for these new environments, and even Lehman has labeled those (up to now, few) cases as anomalies and has suggested that further research is needed to clarify this issue. In this line, we consider in this paper a large set of libre (free, open source) software systems featuring a large community of users and developers. In particular, we analyze a number of projects found in the literature up to now, including the Linux kernel. For comparison, we include other libre software kernels from the BSD family, and for completeness we consider a wider range of libre software applications. In the case of Linux and the other operating system kernels we have also studied growth patterns at the subsystem level. We have observed in the studied sample that super-linearity occurs only exceptionally, that many of the systems follow a linear growth pattern, and that smooth growth is not that common. These results differ from the ones generally found in classical software evolution studies. Other behaviors and patterns also give a hint that development in the libre software world could follow laws different from those already known, at least in some cases.
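The linear versus super-linear growth question raised in the abstract above can be approached, in a very rough way, by comparing how well linear and quadratic models fit a size-over-time series. The series below is synthetic; in the cited studies the sizes are measured at actual release or snapshot dates.

# Rough comparison of linear and quadratic growth fits on a synthetic series.
import numpy as np

t = np.arange(1, 61)                              # e.g. 60 monthly snapshots
rng = np.random.default_rng(2)
size = 500 + 40 * t + rng.normal(0, 60, t.size)   # roughly linear growth

linear = np.polyfit(t, size, deg=1)
quadratic = np.polyfit(t, size, deg=2)
rss_lin = float(np.sum((size - np.polyval(linear, t)) ** 2))
rss_quad = float(np.sum((size - np.polyval(quadratic, t)) ** 2))
print(rss_lin, rss_quad)
# A markedly better quadratic fit with a positive leading coefficient would be
# consistent with super-linear growth; here the linear model should suffice.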
Chapter
This chapter discusses attempts to produce formal mathematical models for dynamical simulation of the development process of Free/Open Source Software (F/OSS) projects. First, a brief overview of simulation methods for closed source software development is given. Then, based on empirical facts reported in F/OSS case studies, we describe a general framework for F/OSS dynamical simulation models and discuss its similarities to and differences from closed source software simulation. A specific F/OSS simulation model is introduced. The model is applied to the Apache project and to the gtk+ module of the GNOME project, and simulation outputs are compared to real data. The potential of formal F/OSS simulation models to turn into practical tools used by F/OSS coordinators to predict key project factors is demonstrated. Finally, issues for further research and efforts for improvement of this first-attempt model are discussed.
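To make the idea of a dynamical F/OSS simulation model more tangible, here is a toy sketch in which a project's size grows each month with a fluctuating number of contributors. All parameters and the growth rule are invented for illustration and are not the model described in the chapter.

# Toy dynamical simulation of project growth; every parameter is invented.
import random

def simulate(months=48, contributors=5, join_prob=0.05, leave_prob=0.02,
             commits_per_contributor=8, loc_per_commit=25, seed=0):
    random.seed(seed)
    loc, history = 0, []
    for _ in range(months):
        if random.random() < join_prob:          # a new contributor may join
            contributors += 1
        if contributors > 1 and random.random() < leave_prob:
            contributors -= 1                    # an existing one may leave
        loc += contributors * commits_per_contributor * loc_per_commit
        history.append(loc)
    return history

print(simulate()[-1])   # total simulated LOC after the last month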