# Innovation Diffusion in Open Source Software

## Abstract

In this paper we make the case that software dependencies are a form of innovation adoption. We then test this on the time-evolution of the Gentoo package dependency graph. We find that the Bass model of innovation diffusion fits the growth of the number of packages depending on a given library. Interestingly, we also find that low-level packages have a primarily imitation driven adoption and multimedia libraries have primarily innovation driven growth.
Preliminary analysis of dependency changes in the Gentoo Portage package database

Remco Bloemen, Chintan Amrit, Stefan Kuhlmann, Gonzalo Ordóñez–Matamoros
University of Twente
PO Box 217, 7500 AE
Enschede, The Netherlands
<remco@coblue.eu>, <c.amrit@utwente.nl>, <s.kuhlmann@utwente.nl>, <h.g.ordonezmatamoros@utwente.nl>
Keywords
innovation, dependencies, graph, Gentoo
1. INTRODUCTION

Diffusion is the process of market uptake of an innovation; the users of a particular innovation are called adopters [8]. Taking open source projects as a 'market', these concepts can be applied to libraries and dependencies. For example, consider the projects with a graphical user interface (GUI). These have a demand for a GUI toolkit, and there are several competing implementations available (Qt, GTK, Wx, etc.). The introduction and uptake of a new GUI toolkit is a process of innovation diffusion, and the projects that use a particular toolkit can be considered adopters of that toolkit. When talking about software projects, such a relation is often called a dependency.

In the next section we describe the Bass [1] diffusion model. We then describe the results of fitting the Bass diffusion model to the Gentoo Portage package dependency graph. This is followed by a discussion of the results, and finally we end the paper with our conclusions and some discussion of possible future work.
2. THE BASS DIFFUSION MODEL

To model the process of innovation diffusion, Bass [1] introduces two processes that propagate an innovation. The first process involves individuals who decide to use an innovation based on their own perception of its merits. The second process involves the word-of-mouth or bandwagon effect: individuals adopt the innovation because they hear of the experiences of previous adopters. In reality everyone will be somewhere in between these two extreme types, but for the sake of modelling it suffices to consider the relative contribution of both types.

It should be noted that, for historical reasons, Bass (1969) [1] and all later authors use the following terms: the first type are called "innovators", not to be confused with those actually inventing the innovation, and the second type are called "imitators", not to be confused with those developing imitating offerings.

Taking $M$ to stand for the total market size and $A$ for the total number of adopters, Bass diffusion can be modelled with two simultaneous processes:

Innovators: Market participants not yet using the innovation might decide to adopt it. The rate at which this happens is $p$, the coefficient of innovation. The number of users that do not use the innovation is $M - A$, so the inflow of adopters is $p(M - A)$.

Imitators: Users of the innovation can express their fondness for it to market participants who do not yet use it. This can influence them to adopt the innovation at a rate $q$, the coefficient of imitation. The number of users that do not use the innovation is again $M - A$, and the chance of meeting someone who does use the innovation is proportional to $A/M$, so the inflow of imitators can be modelled as $q \frac{A}{M}(M - A)$.

When these two effects are combined, the net inflow of adopters, represented by the time derivative of $A$, can be modelled as

$$\frac{dA}{dt} = p(M - A) + q\frac{A}{M}(M - A). \tag{1}$$

This first-order nonlinear ordinary differential equation can be solved for $A(t)$ to give

$$A(t) = M\,\frac{1 - e^{-(p+q)t}}{1 + \frac{q}{p}e^{-(p+q)t}}. \tag{2}$$

The solution assumes the innovation is introduced at time zero; to account for a later introduction the substitution $t \to t - t_0$ is made. This results in four parameters, $t_0$, $M$, $p$ and $q$, that can be fit to the empirical data.

3. THE GENTOO PORTAGE DEPENDENCY GRAPH

The empirical data used is the time evolution of the Gentoo Portage package dependency graph [7]. This dataset contains the full dependency graph for every month since the project was initiated in 2000. The resulting graphs have a combined total of 1.3 million packages and 6.9 million dependency relations, with the largest graph having 15 thousand packages and 80 thousand dependency relations.

Special tools were developed to extract the time series of the number of adopters $A_t$ for a given package. This time series was then fit to eq. (2) using Mathematica's NonlinearModelFit. The goodness-of-fit was analysed using an ANOVA table and quantified using the adjusted coefficient of determination $\bar{R}^2$. The parameters were extracted from the fit, and confidence intervals were calculated under assumptions of normality. In table 1 the relevant parameters are presented with their mean value and a 95% confidence interval. Since normality was assumed, the confidence intervals ignore the $p \ge 0$ and $q \ge 0$ constraints.

Table 1: Results of fitting the Bass model.

| Package | p | q | M |
|---|---|---|---|
| git | 0.00 ± 0.01 | 0.73 ± 0.13 | 746 ± 394 |
| libnotify | 0.05 ± 0.08 | 0.72 ± 0.30 | 103 ± 9 |
| udev | 0.01 ± 0.01 | 0.50 ± 0.12 | 200 ± 65 |
| cairo | 0.01 ± 0.01 | 0.43 ± 0.09 | 249 ± 44 |
| libmad | 0.18 ± 0.14 | 1.13 ± 0.3 | 55 ± 1 |
| libtheora | 0.11 ± 0.09 | 0.63 ± 0.21 | 32 ± 1 |
| taglib | 0.22 ± 0.03 | 0.04 ± 0.06 | 62 ± 12 |

The plots were drawn using a thick red line for the model and shades of red for the prediction bands. The thick red line was drawn according to eq. (2), with the mean values used as parameters.
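As a rough illustration of the model and of the least-squares objective behind the fit, the following Python sketch implements eq. (2) with the time shift $t_0$ and checks numerically that it satisfies eq. (1). It is not the authors' tooling: the paper used Mathematica's NonlinearModelFit, and the function names and parameter values here are hypothetical, chosen only for illustration.

```python
import math


def bass_adopters(t, p, q, M, t0=0.0):
    """Cumulative adopters A(t) from eq. (2), after the substitution t -> t - t0.

    Requires p > 0, because eq. (2) divides by p.
    """
    tau = t - t0
    if tau <= 0:
        return 0.0  # nothing is adopted before the introduction time t0
    decay = math.exp(-(p + q) * tau)
    return M * (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_inflow(A, p, q, M):
    """Right-hand side of eq. (1): innovator inflow plus imitator inflow."""
    return p * (M - A) + q * (A / M) * (M - A)


def sum_squared_error(params, times, counts):
    """Least-squares objective that a nonlinear fitting routine would minimise."""
    p, q, M, t0 = params
    return sum((bass_adopters(t, p, q, M, t0) - a) ** 2 for t, a in zip(times, counts))


# Hypothetical parameter values, chosen for illustration only (not from table 1).
p, q, M, t0 = 0.03, 0.4, 100.0, 1.0

# Central finite differences confirm that eq. (2) satisfies eq. (1).
h = 1e-5
for t in (2.0, 5.0, 10.0):
    slope = (bass_adopters(t + h, p, q, M, t0) - bass_adopters(t - h, p, q, M, t0)) / (2 * h)
    assert abs(slope - bass_inflow(bass_adopters(t, p, q, M, t0), p, q, M)) < 1e-6

# Monthly synthetic "observations" generated from the model itself.
times = [t0 + month / 12 for month in range(1, 121)]
counts = [bass_adopters(t, p, q, M, t0) for t in times]
print(sum_squared_error((p, q, M, t0), times, counts))
```

With real data, `counts` would hold the monthly number of dependers of one package, and an optimiser would search for the four parameters that minimise `sum_squared_error`.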
Then single-value prediction bands were calculated for 90%, 95%, 99% and 99.9% confidence and were drawn in progressively darker shades of pink. They represent a prediction for where a single additional value would likely fall. According to the model and the uncertainty introduced by the fit, there is a 90% chance that it will fall in the innermost band, a 95% chance that it will fall within the second band, etc. If the data fits the model properly, one would expect to see 90% of the points in the inner band. Finally, the empirical data points were plotted as black dots, each connected to the model curve by a thin vertical line representing the residual error of the model.

4. RESULTS

From the entire list of packages, a few well-known (at least according to the author) packages were selected. The selection criteria were that the package must not have existed before (around) 2004, because the Gentoo Portage database was still too immature then, and that the package must have gained a considerable number of dependers since its introduction. In table 1 we list some selected packages along with their parameters. In all cases the adjusted coefficient of determination $\bar{R}^2$ was more than 99%.

4.1 Imitator driven growth

Figure 1: The imitation driven adoption of git.

The first package we consider is git, a modern revision control system that shows an imitator driven adoption. Its growth can be seen in fig. 1; the corresponding statistics are in table 1. The package first appeared just before 2005; it had around ten packages depending on it in 2006, twenty in 2008, and is currently used by almost three hundred packages. According to the Bass model, it will continue to grow to approximately 750 users. The innovator inflow is only 0.2% of the potential market per year, so one would expect 0.002 · (750 − 300) ≈ 1 user to adopt git out of sheer innovation. Taking the analogy with persons, if someone from the 450 current non-users were to meet a random person from the entire market of 750, there is a 300/750 = 40% chance of meeting a user who can convince him or her to start using git. The rate at which such a meeting leads to adoption is the coefficient of imitation, q = 0.73. Therefore, the total number of users git can expect to gain from imitation this year is 450 · 40% · 0.73 ≈ 131. Very much imitator driven!

The relative slowness of git's growth and its dependence on imitators can be explained. Open source projects, and software projects in general, consist of numerous large textual files containing source code. Changes made in one place can hugely and unpredictably affect other places. To complicate matters further, usually more than one developer works on the source code at the same time. A revision control system keeps track of these changes. There are competing systems such as cvs, subversion, mercurial, etc., but the basic functionality of maintaining versions is provided by all of them. Thus two explanations can be derived for git's growth: first, the revision control system is not a part that affects the products delivered by the open source project, and second, there is little incentive to switch unless the new revision control system is proved to be superior.

libnotify is a library for notifications. In modern desktop environments applications may want to notify the user of certain events, for example a battery that is about to run empty, a new email or an incoming phone call. The adoption is relatively slow, despite its usefulness. A possible explanation is that the target applications all have their own custom solutions, which the developers are keen to keep.

udev is a device manager. Its task is to communicate closely with the hardware drivers in the Linux kernel to monitor any changes in the hardware configuration. It represents an architectural change in a very low-level component, which might explain its slow imitator driven growth.
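The git arithmetic in section 4.1 follows directly from the two terms of eq. (1), and can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and the inputs are the rounded figures quoted in the text (M ≈ 750, A ≈ 300, p ≈ 0.002, q = 0.73).

```python
def innovator_inflow(p, M, A):
    """First term of eq. (1): non-users adopting on their own initiative."""
    return p * (M - A)


def imitator_inflow(q, M, A):
    """Second term of eq. (1): non-users persuaded by existing users."""
    return q * (A / M) * (M - A)


# Rounded values quoted in the text for git.
M, A = 750, 300
p, q = 0.002, 0.73

print(round(innovator_inflow(p, M, A)))  # about 1 new user per year
print(round(imitator_inflow(q, M, A)))   # about 131 new users per year
```

The two inflows differ by roughly two orders of magnitude, which is what the text means by "very much imitator driven".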
Figure 2: The innovation driven adoption of libmad.

Figure 3: The rise and decline of xulrunner.

cairo is a graphics library. It provides facilities for drawing lines, circles, text and other graphics primitives and is used by graphics-heavy projects such as user interface libraries. Much like udev it is an architectural change at a low level, which might explain its similar growth pattern.

4.2 Innovator driven growth

A typical example of innovator driven growth is given by libmad. Fitting the model results in fig. 2. Again, the data is neatly explained by a Bass diffusion process, in particular the rapid, steep growth and the stable user base afterwards. The name is an acronym for "library for MPEG Audio Decoding", and the package provides a high quality mp3 decoder for use in multimedia applications. This might also explain the rapid growth of its adoption: multimedia applications can benefit a lot from good quality mp3 support.

libtheora is a library for the Theora video codec. It implements a multimedia standard for use by video players. Just as with libmad there is a strong innovator driven growth.

taglib is a library that processes metadata from multimedia files. The package allows media players to read and store information such as artist and title from multimedia files. Again, like the other multimedia packages, we observe rapid innovator driven growth.

4.3 Growth and demise

The previous examples are all about projects that start and undergo a growth phase that can be explained by a Bass diffusion process. So far, the Bass diffusion model has appeared to give a very accurate explanation of the adoption of an open source software library.

A Bass diffusion increases monotonically and never declines. However, not all packages follow this behaviour. The libmad package (see fig. 2) is a good example of this contrary behaviour. The package has an innovator driven growth that brings it close to its maximum in about two years. After that, the package's usage remains almost flat for years, and will do so indefinitely if it is a perfect Bass diffusion process. This is called the "maturity stage" in product life-cycle parlance. A real product life cycle will also include a "decline stage" where the product begins to become obsolete. The Bass innovation diffusion model does not account for this. In a deep sense it would not have to: once ideas spread, they become part of our collective knowledge and will continue to be used by the new products being developed. But the Bass model was not developed for the spreading of ideas; it was developed in the context of marketing to model the adoption of products. Extending the Bass model to include obsolescence would be an interesting direction for future research.

The package xulrunner in the dataset is a nice example of a short but complete life cycle, see fig. 3. When the Bass model is applied naively and a least-squares best fit is made, the result is a poor fit. If one looks at the dependency growth of the package, the cause is clear: the package becomes obsolete, which the Bass model as presented in eq. (2) does not represent. The decline of the package from approximately 2011 onwards can be seen as blue dots in the figure. Excluding the blue dots from the data results in the Bass model fit shown in fig. 3. The goodness of fit increases to $\bar{R}^2 = 99.54\%$ and the parameters have tighter and reasonable confidence intervals. This is strong evidence that the initial adoption of the package is a Bass diffusion process. To explain the last part, the model should be extended with an obsolescence term.

5. RELATED WORK

In 2008 Crowston et al. [3] published a comprehensive overview of academic research on open source software development.
Of the 184 articles they cite, the vast majority are case studies or surveys, with 4% describing the development of empirical instruments and/or measurements. Also, most articles look at the level of a particular group or project, with 4% looking at the societal level of interacting projects. Crowston et al. find no articles that develop instruments or measurements at the level of OSS packages, which this paper aims to do.

In their 2008 article Zheng et al. [10] analyse the dependency graph of the Gentoo Portage package database as it was in February 2007. The global structure of the graph is analysed in graph theoretic terms of sparsity, clustering coefficient, degree distribution and degree growth rate. The authors conclude that the graph cannot be explained well by existing models of network growth and they propose two new models instead. Our study differs in two ways. First, we analyse the changes to the dependency graph over time, whereas Zheng et al. look at a particular instance in 2007. Second, we analyse and model the development of a given software package in the dependency network, instead of the overall development of the network. It would be interesting to study how the Bass diffusion model for the development of individual nodes corresponds to Zheng et al.'s model for the development of the whole graph.

Haefliger et al. (2008) [5] study code re-use in six open source projects and conclude that there is extensive code re-use in open source software. The article proceeds by characterising the process of code re-use, such as the drivers for re-using and the tools used to find relevant code. The study by Haefliger et al. looks at a given project and how it re-uses existing components, whereas this paper looks at a given project and how it is being re-used by other projects. Additionally, we performed our analysis using a large set of automatically collected and processed empirical data.

Dedrick and West (2004) [6] and Chen (2006) [2] study the adoption process of open source software by (commercial) end users. Their focus is on the competitive economic strengths of open source software versus commercial software. The conclusion is that cost is the most important driver for open source adoption and that freedom and extensibility play a lesser role. The articles do not provide empirical data on the adoption process itself, which makes it hard to compare them to our present finding of a Bass diffusion adoption process.

6. CONCLUSIONS AND FUTURE WORK

The growth of the number of packages depending on a package can be modelled as a Bass diffusion process. Overall the Bass diffusion model gave a very good fit for most OSS projects. Using only four parameters, it was able to describe the growth curves from the empirical data. Full statistical rigour would require a more involved analysis using the methods from, for example, Escanciano (2006) [4], but given the amount and quality of evidence found we can conclude that most OSS projects do follow the Bass diffusion model.

As can be seen in table 1, the Bass parameters $p$ and $q$ are difficult to interpret and compare. A high $p$ does not automatically mean an innovator driven growth: if the $q$ value is also high then the result is simply a lot of growth. For the same reason it is also difficult to compare $p$ and $q$ between packages. Mahajan et al. (1995) [9] suggest using $q/p$ and $q + p$, which represent the imitator/innovator ratio and the total adoption rate, respectively.

Analysing the package dependency graph and its changes over time can provide new insights. Our exploratory study provides some evidence that multimedia libraries are adopted through an innovator driven process, whereas low-level architectural changes happen slowly and through imitation. Further studies could test these hypotheses.
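To make the suggestion of Mahajan et al. concrete, the following Python sketch computes $q + p$ and $q/p$ from the rounded mean values in table 1. It is illustrative arithmetic only: the confidence intervals are ignored, so the ratios are rough, and for git the tabulated $p$ rounds to 0.00, which leaves the ratio undefined at this precision. The names `TABLE_1` and `derived_measures` are hypothetical.

```python
# Rounded mean values (p, q) copied from table 1; confidence intervals are ignored.
TABLE_1 = {
    "git": (0.00, 0.73),
    "libnotify": (0.05, 0.72),
    "udev": (0.01, 0.50),
    "cairo": (0.01, 0.43),
    "libmad": (0.18, 1.13),
    "libtheora": (0.11, 0.63),
    "taglib": (0.22, 0.04),
}


def derived_measures(p, q):
    """Return (q + p, q / p); the ratio is None when p rounds to zero."""
    total = p + q
    ratio = q / p if p > 0 else None
    return total, ratio


for name, (p, q) in TABLE_1.items():
    total, ratio = derived_measures(p, q)
    ratio_text = "undefined" if ratio is None else f"{ratio:.1f}"
    print(f"{name:10s} q+p = {total:.2f}  q/p = {ratio_text}")
```

Because several of the tabulated $p$ values are comparable in size to their confidence intervals, the resulting ratios should be read as indicative at best.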
The package dependency graph contains empirical data to test extensions of the Bass diffusion model, for example a model extended with discarders. The Bass model and the present analysis are formulated in terms of the absolute number of users, but in most applications only sales figures are available. The amount of sales is the first derivative of the Bass model, hence the model is usually applied in its derivative form [9]. As a consequence the model only considers adopters and does not consider discarders. In the xulrunner example, we see the package being discarded from 2011 onwards, providing insights into the discarding mechanism. The next step would be to collect more examples of packages being discarded, look at their patterns of demise and develop a model of discarding to supplement the Bass model of adoption. One model could, for example, be the inverse of a Bass curve; this makes sense when the market share of the original package is taken over by a new package. The unique ability of dependency graph analysis to give absolute user numbers facilitates this. Such an analysis can also help in predicting the behaviour of certain projects, and hence in forewarning the stakeholders of those projects.

The scale and complexity of the dependency graphs and of open source innovation require some care. Three notable issues became apparent in this study.

First, in the open source community there is a lot of forking. It is not always clear whether a forked project constitutes the continuation of the original project or a separate new project. A more thorough study on the nature of forking could provide the insights to resolve this.

Second, due to the public nature of open source development, many immature or abandoned projects are visible in the larger datasets. This is good from a scientific perspective: it allows one to research projects from their early beginning and to look at projects that failed to grow or became obsolete. But it clouds the 'big picture' with many projects that do not significantly contribute to the overall innovation. In large datasets one would have to devise a relevance metric to select the relevant projects. Such metrics could be the number of developers, the number of users or the number of dependees.

Third, the sheer scale of the available OSS databases provides challenges for analysis. Specialist tooling is required to transform the raw data into more manageable formats.

7. REFERENCES

[1] F. M. Bass. A new product growth for model consumer durables. Management Science, 15(5):215–227, January 1969.
[2] S. Chen. An economic model of open source software adoption. The Journal of Portfolio Management, 2006.
[3] K. Crowston, K. Wei, J. Howison, and A. Wiggins. Free/libre open-source software development: What we know and what we do not know. ACM Comput. Surv., 44(2):7:1–7:35, Mar. 2008.
[4] J. C. Escanciano. Goodness-of-fit tests for linear and nonlinear time series models. Journal of the American Statistical Association, 101(474):531–541, 2006.
[5] S. Haefliger, G. von Krogh, and S. Spaeth. Code reuse in open source software. Management Science, 54(1):180–193, 2008.
[6] J. Dedrick and J. West. An exploratory study into open source platform adoption. In Hawaii International Conference on System Sciences, 2004.
[7] MSR: Mining Source Repositories. The Gentoo Portage package dependency graph, 2014. Available at http://2\pi.com/.
[8] V. K. Narayanan. Managing Technology and Innovation for Competitive Advantage. Prentice Hall, Englewood Cliffs, New Jersey, 2001.
[9] V. Mahajan, E. Muller, and F. M. Bass. Diffusion of new products: Empirical generalizations and managerial uses. Marketing Science, 14:G79–G88, 1995.
[10] X. Zheng, D. Zeng, H. Li, and F. Wang. Analyzing open-source software systems as complex networks. Physica A: Statistical Mechanics and its Applications, 387(24):6190–6200, 2008.