Innovation Diffusion in Open Source Software
Preliminary analysis of dependency changes in the Gentoo Portage package database
Remco Bloemen, Chintan Amrit, Stefan Kuhlmann, Gonzalo Ordóñez–Matamoros
University of Twente
PO Box 217, 7500 AE
Enschede, The Netherlands
<remco@coblue.eu> <c.amrit@utwente.nl>, <s.kuhlmann@utwente.nl>,
<h.g.ordonezmatamoros@utwente.nl>
ABSTRACT
In this paper we make the case that software dependencies
are a form of innovation adoption. We then test this on
the time-evolution of the Gentoo package dependency graph.
We find that the Bass model of innovation diffusion fits the
growth of the number of packages depending on a given
library. Interestingly, we also find that low-level packages
have a primarily imitation driven adoption and multimedia
libraries have primarily innovation driven growth.
Keywords
innovation, dependencies, graph, Gentoo
1. INTRODUCTION
Diffusion is the process of market uptake of an innovation;
the users of a particular innovation are called adopters [8].
Taking open source projects as a 'market', these concepts
can be applied to libraries and dependencies. For example,
consider the projects with a graphical user interface (GUI).
These have a demand for a GUI toolkit, and there are several
competing implementations available (Qt, GTK, Wx, etc.). The
introduction and uptake of a new GUI toolkit is a process of
innovation diffusion and the projects that use a particular
toolkit can be considered adopters of that toolkit. When
talking about software projects, such a relation is often called
a dependency. In the next section we describe the Bass [1]
diffusion model. We then describe the results of fitting the
Bass diffusion model to the Gentoo Portage package
dependency graph. This is followed by a discussion of the
results, and we end the paper with our conclusions and some
discussion of possible future work.
2. THE BASS DIFFUSION MODEL
MSR 2014, Hyderabad, India
To model the process of innovation diffusion, Bass [1]
introduces two processes that propagate an innovation. The
first process involves individuals who decide to use an
innovation based on their perception of its merits. The second
process involves the word-of-mouth or bandwagon
effect: individuals adopt the innovation because they hear
of the experiences of previous adopters. In reality,
everyone will be somewhere in between these two extreme
types, but for the sake of modelling it suffices to consider the
relative contribution of both types. It should be noted that,
for historical reasons, Bass (1969) [1] and all later authors use
the following terms: the first type are called "innovators", not
to be confused with those actually inventing the innovation,
and the second type are called "imitators", not to be confused
with those developing imitating offerings. Taking $M$ to stand
for the total market size and $A$ to stand for the total number
of adopters, the Bass diffusion can be modelled with two
simultaneous processes:
Innovators: Market participants who do not yet use the
innovation might decide to adopt it. The rate at
which this happens is $p$, the coefficient of innovation. The
number of users that do not use the innovation is $M - A$, so
the inflow of adopters is $p(M - A)$.
Imitators: Users of the innovation can express their fondness
to market participants who do not yet use the innovation.
This can influence them to adopt the innovation at a rate
$q$, the coefficient of imitation. The number of users that do not
use the innovation is again $M - A$, and the chance of meeting
someone who does use the innovation is proportional to $A/M$,
so the inflow of imitators can be modelled as $q\frac{A}{M}(M - A)$.
When these two effects are combined, the net inflow of
adopters, represented by the time derivative of $A$, can be
modelled as
$$\frac{dA}{dt} = p(M - A) + q\frac{A}{M}(M - A). \tag{1}$$
This first-order ordinary differential equation
can be solved for $A(t)$ to give
$$A(t) = M\,\frac{1 - e^{-(p+q)t}}{1 + \frac{q}{p}e^{-(p+q)t}}. \tag{2}$$
The solution assumes the innovation is introduced at time
zero. To account for a later introduction, the substitution
$t \to t - t_0$ is made. This results in four parameters,
$t_0$, $M$, $p$ and $q$, that can be fit to the empirical data.

Table 1: Results of fitting the Bass model.
Package      p              q              M
git          0.00 ± 0.01    0.73 ± 0.13    746 ± 394
libnotify    0.05 ± 0.08    0.72 ± 0.30    103 ± 9
udev         0.01 ± 0.01    0.50 ± 0.12    200 ± 65
cairo        0.01 ± 0.01    0.43 ± 0.09    249 ± 44
libmad       0.18 ± 0.14    1.13 ± 0.3     55 ± 1
libtheora    0.11 ± 0.09    0.63 ± 0.21    32 ± 1
taglib       0.22 ± 0.03    0.04 ± 0.06    62 ± 12
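As a minimal sketch of how eq. (1) and eq. (2) relate, the Python fragment below evaluates the closed-form curve and compares it with a direct numerical integration of the differential equation. The function names and the parameter values are illustrative assumptions, not part of the original analysis, and returning zero adopters before $t_0$ is a modelling choice made here for convenience.

```python
import math


def bass_adopters(t, p, q, m, t0=0.0):
    """Closed-form Bass curve of eq. (2), shifted so adoption starts at t0.

    Requires p > 0, because eq. (2) divides by p.
    """
    tau = t - t0
    if tau <= 0.0:
        return 0.0  # assumption: no adopters before the introduction time
    decay = math.exp(-(p + q) * tau)
    return m * (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_inflow(a, p, q, m):
    """Right-hand side of eq. (1): innovator inflow plus imitator inflow."""
    return p * (m - a) + q * (a / m) * (m - a)


def integrate_bass(p, q, m, t_end, steps=1000):
    """Integrate eq. (1) from A(0) = 0 with the classical Runge-Kutta scheme."""
    h = t_end / steps
    a = 0.0
    for _ in range(steps):
        k1 = bass_inflow(a, p, q, m)
        k2 = bass_inflow(a + 0.5 * h * k1, p, q, m)
        k3 = bass_inflow(a + 0.5 * h * k2, p, q, m)
        k4 = bass_inflow(a + h * k3, p, q, m)
        a += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return a


# Hypothetical parameter values, chosen only for illustration.
p, q, m = 0.03, 0.4, 100.0
for year in (2, 5, 10, 20):
    closed = bass_adopters(year, p, q, m)
    numeric = integrate_bass(p, q, m, year)
    print(f"t={year:2d}  closed form={closed:8.4f}  RK4={numeric:8.4f}")
```

The two columns should agree closely, which is a quick way to check a transcription of eq. (2) against eq. (1). Note that the closed form is undefined for $p = 0$, so a fitted value of $p$ that rounds to zero needs care.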
3. THE GENTOO PORTAGE DEPENDENCY GRAPH
The empirical data used is the time evolution of the Gentoo
Portage package dependency graph [7]. This dataset contains
the full dependency graph for every month since the project
was initiated in 2000. The resulting graphs have a combined
total of 1.3 million packages and 6.9 million dependency
relations, with the largest graph having 15 thousand packages
and 80 thousand dependency relations.
Special tools were developed to extract the time series of
the number of adopters $A_t$ for a given package. This time
series was then fit to eq. (2) using Mathematica's
NonlinearModelFit. The goodness of fit was analysed using an
ANOVA table and quantified using the adjusted coefficient of
determination $\bar{R}^2$. The parameters were extracted from the
fit, and confidence intervals were calculated under the
assumption of normality. In table 1 the relevant parameters are
presented with their mean value and a 95% confidence interval.
Since normality was assumed, the confidence intervals
ignore the $p \ge 0$ and $q \ge 0$ constraints.
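The fit in this paper was made with Mathematica's NonlinearModelFit on eq. (2). As a rough, dependency-free illustration of parameter estimation, the sketch below swaps in a different technique: the discrete-time regression that follows from expanding eq. (1) into $A_{t+1} - A_t = a + bA_t + cA_t^2$ with $a = pM$, $b = q - p$ and $c = -q/M$. It does not estimate $t_0$, gives no confidence intervals, and yields per-step rather than continuous-time coefficients. All function names are hypothetical and the data is synthetic.

```python
import math


def solve3(mat, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    aug = [row[:] + [b] for row, b in zip(mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_bass_discrete(series):
    """Estimate (p, q, M) from a cumulative adopter series by least squares."""
    scale = max(series)  # rescale to keep the normal equations well conditioned
    xs = [v / scale for v in series[:-1]]
    ys = [(b - a) / scale for a, b in zip(series, series[1:])]
    # Normal equations for the quadratic y = a' + b' x + c' x^2.
    powers = [sum(x ** k for x in xs) for k in range(5)]
    mat = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    a_s, b_s, c_s = solve3(mat, rhs)
    a, b, c = a_s * scale, b_s, c_s / scale  # undo the rescaling
    if c >= 0.0:
        raise ValueError("series does not look like a saturating Bass curve")
    m = (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * c)
    return a / m, -c * m, m


def simulate_discrete(p, q, m, steps):
    """Generate a synthetic series with the discrete form of eq. (1)."""
    series = [0.0]
    for _ in range(steps):
        a = series[-1]
        series.append(a + p * (m - a) + q * (a / m) * (m - a))
    return series


# Synthetic, noise-free data with hypothetical parameters.
p_hat, q_hat, m_hat = fit_bass_discrete(simulate_discrete(0.02, 0.5, 200.0, 30))
print(f"p={p_hat:.4f}  q={q_hat:.4f}  M={m_hat:.1f}")
```

On noise-free synthetic data the regression recovers the generating parameters. On real monthly data a nonlinear fit of eq. (2), as used in this paper, is likely to be the more robust choice.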
The plots were drawn using a thick red line for the model
and shades of red for the prediction bands. The thick red
line was drawn according to eq. (2), with the mean values
used as parameters. Single-value prediction bands
were then calculated for 90%, 95%, 99% and 99.9% confidence
and drawn in progressively darker shades of pink. They
represent a prediction for where a single additional value
would likely fall. According to the model and the uncertainty
introduced by the fit, there is a 90% chance that it will fall
in the innermost band, a 95% chance that it will fall in the
second band, etc. If the data fits the model properly, one
would expect to see 90% of the points in the inner band.
Finally, the empirical data points were plotted as black dots,
each connected to the model curve by a thin vertical line
that shows the residual error of the model.
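The check described above, that roughly 90% of the points should fall in the inner band, can be sketched as follows. This is a simplified stand-in for the prediction bands produced by the fitting software: it assumes normally distributed residuals with constant variance, uses normal rather than Student-t quantiles, and ignores the uncertainty in the fitted parameters, so every band has constant width. The function name and the default of four fitted parameters are assumptions made for illustration.

```python
from statistics import NormalDist


def band_coverage(observed, predicted, n_params=4,
                  levels=(0.90, 0.95, 0.99, 0.999)):
    """Return the fraction of observations inside each prediction band.

    The band at a given level is predicted +/- z * sigma, where sigma is the
    residual standard error and z the two-sided normal quantile.
    """
    residuals = [o - f for o, f in zip(observed, predicted)]
    dof = len(residuals) - n_params  # degrees of freedom left after the fit
    sigma = (sum(r * r for r in residuals) / dof) ** 0.5
    coverage = {}
    for level in levels:
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        half_width = z * sigma
        inside = sum(1 for r in residuals if abs(r) <= half_width)
        coverage[level] = inside / len(residuals)
    return coverage


# Made-up model values and observations, for illustration only.
model = [float(i) for i in range(20)]
data = [f + (1.0 if i % 2 == 0 else -1.0) for i, f in enumerate(model)]
for level, fraction in band_coverage(data, model).items():
    print(f"{level:.1%} band contains {fraction:.0%} of the points")
```

A coverage far below the nominal level in the inner band would suggest that the model or the normality assumption does not describe the data well.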
4. RESULTS
From the entire list of packages, a few well-known (at
least according to the author) packages were selected. The
selection criteria were that the package must not have existed
before (around) 2004, because the Gentoo Portage database
was still too immature then, and that the package must have
gained a considerable number of dependers since its
introduction. In table 1, we list the selected packages along
with their parameters. In all cases the adjusted coefficient of
determination $\bar{R}^2$ was more than 99%.
4.1 Imitator driven growth
Figure 1: The imitation driven adoption of git.
The first package we consider is git, a modern revision
control system that shows an imitator driven adoption. Its
growth can be seen in fig. 1; the corresponding statistics are
in table 1. The package first appeared just before 2005. It
had around ten packages depending on it in 2006, twenty in
2008, and is currently used by almost three hundred packages.
According to the Bass model, it will continue to grow to
approximately 750 users. The innovator inflow is only 0.2%
of the potential market per year, so one would expect
$0.002 \cdot (750 - 300) \approx 1$ user to adopt git out of sheer innovation.
Taking the analogy with persons, if someone from the 450
current non-users were to meet a random person from the
entire market of 750, there is a $300/750 = 40\%$ chance of meeting
a user who can convince him/her to start using git. The
rate at which such a meeting leads to adoption is the imitation
coefficient $q = 0.73$.
Therefore, the total number of users git can expect to gain
from imitation this year is $450 \cdot 40\% \cdot 0.73 \approx 131$. Very much
imitator driven!
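The arithmetic above can be reproduced with the two terms of eq. (1). The sketch below uses the rounded figures quoted in the text (a market of 750, about 300 current users, $p = 0.002$ and $q = 0.73$); the function name is a hypothetical helper introduced for illustration.

```python
def expected_inflow(p, q, market, adopters):
    """Split the yearly inflow of eq. (1) into its two contributions."""
    non_users = market - adopters
    innovators = p * non_users                      # p (M - A)
    imitators = q * (adopters / market) * non_users  # q (A / M) (M - A)
    return innovators, imitators


# Rounded values for git as quoted in the text.
innovators, imitators = expected_inflow(p=0.002, q=0.73, market=750, adopters=300)
print(f"innovator inflow: {innovators:.1f} users per year")
print(f"imitator inflow:  {imitators:.1f} users per year")
```

With these inputs the innovator term is about 0.9 users per year and the imitator term about 131.4, so imitation accounts for nearly all of the expected growth.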
The relative slowness of git's growth and its dependence
on imitators can be explained. Open source projects, and
software projects in general, consist of numerous large textual
files containing source code. Changes made in one place can
hugely and unpredictably affect other places. To complicate
matters further, usually more than one developer works on
the source code at the same time. There are competing
systems such as cvs, subversion, mercurial, etc., but
the basic functionality of maintaining versions is provided by
all of them. Thus two explanations can be derived for git's
growth: first, the revision control system is not a part that
affects the products delivered by the open source project,
and second, there is little incentive to switch unless the new
revision control system is proven to be superior.
libnotify is a library for notifications. In modern desktop
environments applications may want to notify the user of
certain events, for example a battery that is about to go
empty, a new email or an incoming phone call. The adoption
is relatively slow, despite its usefulness. A possible
explanation is that the target applications all have their own custom
solutions, which the developers are keen to keep.
udev is a device manager. Its task is to communicate
closely with the hardware drivers in the Linux kernel to monitor
any changes in the hardware configuration. It represents
an architectural change in a very low-level component, which
might explain its slow imitator driven growth.
Figure 2: The innovation driven adoption of libmad.
Figure 3: The rise and decline of xulrunner.
cairo is a graphics library. It provides facilities for drawing
lines, circles, text and other graphics primitives and is used by
graphics-heavy projects such as user interface libraries.
Much like udev it is an architectural change at a low level,
which might explain its similar growth pattern.
4.2 Innovator driven growth
A typical example of innovator driven growth is given
by libmad. The model is fitted, resulting in fig. 2. Again,
the data is neatly explained by a Bass diffusion process, in
particular the rapid steep growth and the stable user base
afterwards. The name is an acronym for “library for MPEG
Audio Decoding” and the package provides a high quality
mp3 decoder for use in multimedia applications. This might
also explain the rapid growth of its adoption: multimedia
applications can benefit a lot from good quality mp3 support.
libtheora is a library for the Theora video codec. It
implements a multimedia standard for use by video players.
Just as with libmad, there is a strong innovator driven growth.
taglib is a library that processes metadata from multimedia
files. The package allows media players to read and
store information such as artist and title from multimedia
files. Again, like the other multimedia packages, we observe
rapid innovator driven growth.
4.3 Growth and demise
The previous examples are all about projects that start
and undergo a growth phase that can be explained by a
Bass diffusion process. So far, the Bass diffusion model has
appeared to give a very accurate explanation of the adoption
of an open source software library.
A Bass diffusion increases monotonically and never declines.
However, not all packages follow this behaviour.
Project libmad (see fig. 2) illustrates the issue.
The package has an innovator driven growth that
brings it close to its maximum in about two years. After that,
the package's usage remains almost flat for years, and will do
so indefinitely if it is a perfect Bass diffusion process. This
is called the “maturity stage” in product life-cycle parlance.
A real product life-cycle will also include a “decline stage”
where the product begins to become obsolete. The Bass
innovation diffusion model does not account for this. In
a deep sense it would not have to: once ideas spread, they
become part of our collective knowledge and will continue
to be used by the new products being developed. But the
Bass model was not developed for the spreading of ideas;
it was developed in the context of marketing to model the
adoption of products. Extending the Bass model to include
obsolescence would be an interesting direction for future
research.
The package xulrunner in the dataset is a nice example
of a short but complete life cycle, see fig. 3. When the
Bass model is applied naively and a least-squares
best fit is made, the result is a poor fit. If one looks at
the dependency growth of the package, the cause is clear:
the package becomes obsolete, which the Bass model as
presented in eq. (2) does not represent. The decline
of the package from approximately 2011 onwards can be seen
as blue dots in the figure.
Excluding the blue dots from the data results in the
Bass model fit shown in fig. 3. The goodness of fit increases to
$\bar{R}^2 = 99.54\%$ and the parameters have tighter and reasonable
confidence intervals. This is strong evidence that the
initial adoption of the package is a Bass diffusion process.
To explain the last part, the model should be extended with
an obsolescence term.
5. RELATED WORK
In 2008 Crowston et al. [3] published a comprehensive
overview of academic research on open source software
development. Of the 184 articles they cite, the vast majority
are case studies or surveys, with 4% of the articles
describing the development of empirical instruments and/or
measurements. Also, most articles look at the level of a
particular group or project, with 4% looking at the societal
level of interacting projects. Crowston et al. find no articles
that develop instruments or measurements at the level of
OSS packages, which this paper aims to do.
In their 2008 article Zheng et al. [10] analyse the dependency
graph of the Gentoo Portage package database as it
was in February 2007. The global structure of the graph
is analysed in graph theoretic terms of sparsity, clustering
coefficient, degree distribution and degree growth rate. The
authors conclude that the graph cannot be explained well
by existing models of network growth and they propose two
new models instead. Our study differs in two ways. First,
we analyse the changes to the dependency graph over time,
whereas Zheng et al. look at a particular instance in 2007.
Second, we analyse and model the development of a given
software package in the dependency network, instead of the
overall development of the network. It would be interesting to
study how the Bass diffusion model for the development of
individual nodes corresponds to the models of Zheng et al. for the
development of the whole graph.
Haefliger et al. (2008) [5] study code re-use in six open
source projects and conclude that there is extensive code
re-use in open source software. The article proceeds by
identifying the process of code re-use, such as the drivers
for re-using and the tools used to find relevant code. The
study by Haefliger et al. looks at a given project and how it
re-uses existing components, whereas this paper looks at a
given project and how it is being re-used by other projects.
Additionally, we performed our analysis using a large set of
automatically collected and processed empirical data.
Dedrick and West (2004) [6] and Chen (2006) [2] study the
adoption process of open source software by (commercial) end
users. Their focus is on the competitive economic strengths
of open source software versus commercial software. The
conclusion is that cost is the most important driver for open
source adoption and that freedom and extensibility play a lesser
role. The articles do not provide empirical data on the
adoption process itself, which makes it hard to compare them
to our present finding of a Bass diffusion adoption process.
6. CONCLUSIONS AND FUTURE WORK
The growth of the number of packages depending on a
package can be modelled as a Bass diffusion process.
Overall, the Bass diffusion model gave a very good fit for
most OSS projects. Using only four parameters, it was able
to describe the growth curves from the empirical data. Full
statistical rigour would require a more involved analysis using
the methods from, for example, Escanciano (2006)
[4], but given the amount and quality of evidence found
we can conclude that most OSS projects do follow the Bass
diffusion model.
As can be seen in table 1, the Bass parameters $p$ and $q$
are difficult to interpret and compare. A high $p$ does not
automatically mean an innovator driven growth: if the $q$
value is also high then the result is simply a lot of growth.
For the same reason it is also difficult to compare $p$ and
$q$ between packages. Mahajan et al. (1995) [9] suggest using
$q/p$ and $q + p$, which represent an imitator/innovator ratio
and the total adoption rate, respectively.
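To illustrate the suggested reparametrisation, the sketch below computes $q + p$ and $q/p$ from the rounded mean values in table 1. The confidence intervals are ignored here, so the resulting ordering is only indicative. The handling of $p = 0$ (reported as an infinite ratio) and the helper name are assumptions made for this illustration.

```python
# Rounded mean values (p, q) copied from table 1.
TABLE_1 = {
    "git": (0.00, 0.73),
    "libnotify": (0.05, 0.72),
    "udev": (0.01, 0.50),
    "cairo": (0.01, 0.43),
    "libmad": (0.18, 1.13),
    "libtheora": (0.11, 0.63),
    "taglib": (0.22, 0.04),
}


def adoption_summary(p, q):
    """Return (total adoption rate q + p, imitator/innovator ratio q / p)."""
    total = q + p
    ratio = q / p if p > 0 else float("inf")  # p rounds to zero for git
    return total, ratio


# Sort from most innovator driven (low q/p) to most imitator driven.
ranked = sorted(TABLE_1, key=lambda name: adoption_summary(*TABLE_1[name])[1])
for name in ranked:
    total, ratio = adoption_summary(*TABLE_1[name])
    print(f"{name:10s}  q+p={total:4.2f}  q/p={ratio:6.2f}")
```

With these rounded inputs the three multimedia libraries (taglib, libtheora, libmad) end up with the lowest $q/p$ and git with the highest, in line with the qualitative grouping of section 4.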
Analysing the package dependency graph and its changes
over time can provide new insights. Our exploratory study
provides some evidence that multimedia
libraries are adopted through an innovator driven
process, whereas low-level architectural changes happen slowly
and through imitation. Further studies could test these
hypotheses.
The package dependency graph contains empirical data
to test extensions of the Bass diffusion model, such as an
extension with discarders. The Bass model and the present analysis are
formulated in terms of the absolute number of users, but in most
applications only sales figures are available. The amount
of sales is the first derivative of the Bass model, hence the
model is usually applied in its derivative form [9]. As a
consequence the model only considers adopters, but does
not consider discarders. In the xulrunner example, we see
the package being discarded from 2011 onwards, providing
insight into the discarding mechanism. The next step would
be to collect more examples of packages being discarded, look
at their patterns of demise and develop a model of discarding
to supplement the Bass model of adoption. One model could,
for example, be the inverse of a Bass curve; this makes sense
when the market share of the original package is taken over
by a new package. The unique ability of dependency graph
analysis to give absolute user numbers facilitates this. Such
an analysis can also help in predicting the behaviour of
certain projects, hence forewarning the particular project
stakeholders.
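One way to read the "inverse of a Bass curve" idea is to subtract a second Bass process, started later, from the adoption curve. The sketch below is only a speculative illustration of that reading and was not fitted or tested in this study: the function names, the choice to reuse the same market size for discarders, the clamping at zero and all parameter values are assumptions.

```python
import math


def bass(t, p, q, m):
    """Closed-form Bass curve of eq. (2); zero before t = 0, requires p > 0."""
    if t <= 0.0:
        return 0.0
    decay = math.exp(-(p + q) * t)
    return m * (1.0 - decay) / (1.0 + (q / p) * decay)


def net_users(t, m, adopt, discard, t_adopt, t_discard):
    """Adopters minus discarders, each modelled as a Bass process.

    adopt and discard are (p, q) pairs; discarding starts at t_discard.
    """
    gained = bass(t - t_adopt, adopt[0], adopt[1], m)
    lost = bass(t - t_discard, discard[0], discard[1], m)
    return max(gained - lost, 0.0)  # assumption: user count cannot go negative


# Hypothetical life cycle: adoption from year 0, discarding from year 5.
for year in range(0, 13, 2):
    users = net_users(year, m=40.0, adopt=(0.1, 0.9), discard=(0.05, 1.2),
                      t_adopt=0.0, t_discard=5.0)
    print(f"year {year:2d}: {users:5.1f} users")
```

Before the discard process starts, the curve coincides with the ordinary Bass model, so the fit of the growth phase would be unaffected. Whether such a form describes real packages remains to be tested on the dataset.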
The scale and complexity of the dependency graphs and
open source innovation require some care. Three notable
issues became apparent in this study. First, in the open
source community there is a lot of forking. It is not always
clear whether a forked project constitutes the continuation
of the original project or a separate new project. A more
thorough study on the nature of forking could provide the
insights to resolve this. Second, due to the public nature
of open source development, many immature or abandoned
projects are visible in the larger datasets. This is good from
a scientific perspective: it allows one to research projects
from their early beginning and look at projects that failed
to grow or became obsolete. But it clouds the 'big picture'
with many projects that do not significantly contribute to
the overall innovation. In large datasets one would have
to devise a relevance metric to select the relevant projects.
Such metrics could be the number of developers, the number
of users or the number of dependees. Third, the sheer scale of
the available OSS databases provides challenges for analysis.
Specialist tooling is required to transform the raw data into
more manageable formats.
7. REFERENCES
[1] F. M. Bass. A new product growth for model consumer
durables. Management Science, 15(5):215–227, January
1969.
[2] S. Chen. An economic model of open source software
adoption. The Journal of Portfolio Management, 2006.
[3] K. Crowston, K. Wei, J. Howison, and A. Wiggins.
Free/libre open-source software development: What we
know and what we do not know. ACM Comput. Surv.,
44(2):7:1–7:35, Mar. 2008.
[4] J. C. Escanciano. Goodness-of-fit tests for linear and
nonlinear time series models. Journal of the American
Statistical Association, 101(474):531–541, 2006.
[5] S. Haefliger, G. von Krogh, and S. Spaeth. Code reuse
in open source software. Management Science,
54(1):180–193, 2008.
[6] J. Dedrick and J. West. An exploratory study into open
source platform adoption. In Hawaii International
Conference on System Sciences, 2004.
[7] MSR: Mining Source Repositories. The Gentoo Portage
package dependency graph, 2014. Available at
http://2\pi.com/.
[8] V. K. Narayanan. Managing Technology and Innovation
for Competitive Advantage. Prentice Hall, Englewood
Cliffs, New Jersey, 2001.
[9] V. Mahajan, E. Muller, and F. M. Bass. Diffusion of
new products: Empirical generalizations and
managerial uses. Marketing Science, 14:G79–G88, 1995.
[10] X. Zheng, D. Zeng, H. Li, and F. Wang. Analyzing
open-source software systems as complex networks.
Physica A: Statistical Mechanics and its Applications,
387(24):6190–6200, 2008.