A Generative Model of Software Dependency
Graphs to Better Understand Software Evolution
Vincenzo Musco∗, Martin Monperrus†, Philippe Preux‡
∗University of Lille and INRIA
Email: email@example.com †University of Lille and INRIA
Email: firstname.lastname@example.org ‡University of Lille, LIFL, INRIA
Abstract—Software systems are composed of many interacting
elements. A natural way to abstract over software systems is
to model them as graphs. In this paper we consider software
dependency graphs of object-oriented software and we study
one topological property: the degree distribution. Based on the
analysis of ten software systems written in Java, we show that
there exist completely different systems that have the same
degree distribution. Then, we propose a generative model of
software dependency graphs which synthesizes graphs whose
degree distribution is close to the empirical ones observed in
real software systems. This model gives us novel insights on the
potential fundamental rules of software evolution.
Software systems are composed of many elements interacting
with each other. For instance, there are hundreds of
thousands of interconnected functions in a Linux kernel.
A natural way to abstract over software systems is to model
them as graphs.
In this paper we consider software dependency graphs of
object-oriented software, where each node represents a class
and each edge corresponds to a compilation dependency.
We study the topology of those graphs and we speciﬁcally
concentrate on their degree distributions (the distribution of the
number of edges connected to a node). Based on the analysis
of ten software systems written in Java, we show that there
exist software systems that are completely different yet have
the same degree distribution.
This is a surprising result: despite being developed by
different persons with different processes in different domains,
a common degree distribution emerges. This leads us to hypothesize
that there are common rules guiding the construction
of software systems over time.
In network science, a generative model deﬁnes a set of rules
used to synthesize artificial networks in a given domain. For
example, there exist generative models that aim at producing
graphs that are similar to the World Wide Web. An
equivalent model in software engineering would be a model
generating graphs which look like real software graphs. If such
a generative model exists, it may encode common evolution
rules driving the emergence of the graph structure of software.
In this paper, our goal is to propose a generative model of
software dependency graphs. If this model ﬁts the empirical
data, it means that it approximates certain fundamental rules
of software evolution.
Our experimental methodology is as follows. We propose
a generative model of software graphs, and then we evaluate
its capacity to create graphs whose degree distribution is close
to the empirical ones observed in real software systems. We
compare its fit to data with that of the only comparable model in the
literature. Our experimental results show that our generative
model of software dependency graphs, GD-GNC, both fits
the empirical data and outperforms that model, proposed by Baxter and Frean.
In short, our generative model is the first to produce
software dependency graphs that look like real ones.
To sum up, our contributions are:
• empirical evidence of the common asymmetric topology
of dependency graphs in object-oriented software systems;
• a generative model of software dependency graphs;
• the validation of the model on its ability to fit ten graphs
of real software systems totaling 10619 nodes and 52855
edges;
• a speculative explanation of the fundamental evolution
rules of software.
The rest of this paper is structured as follows. Section II defines
the main concepts used in this paper. Section III presents
our goals and experimental methodology. In Section IV, we
look closer at real software dependency graphs and highlight
their common topology. In Section V, we introduce a new
generative model for software dependency graphs and we
analyze its ﬁtness. In Section VI, we discuss our discoveries
from a software engineering perspective.
II. DEFINITIONS
In this section, we provide background knowledge about the
concepts used in this paper.
A graph is a mathematical object for modeling connections
among concepts or entities of a specific domain: entities are
represented by nodes and the links between them are edges.
For example, network devices such as routers or
switches and their connections can be modeled using graphs:
each node is a device and each edge represents a physical
wire between two devices, see Fig. 1(a).
arXiv:1410.7921v1 [cs.SE] 29 Oct 2014
In some situations, links between entities are one-way only,
i.e., an edge is meaningful from one node to another one, but
not the opposite. Thus, edges need to be directed in order to
accurately model a specific domain. In this case, we talk about
directed graphs or digraphs. The food web is an example of a
digraph where each species is a node and each edge between
two species indicates a "feeding on" relation. In this kind of
graph, the direction of the edge is important as it indicates
what eats what, which cannot be represented in an undirected
graph. Fig. 1(b) exemplifies a set of prey-predator relations.
Fig. 1: Basic examples of two graphs. (a) An undirected graph
illustrating a computer network on which an edge represents a
physical connection between two entities. (b) A directed graph
representing a trophic network on which a node is a species
and an edge represents which one is eaten by which one. Here
the directed edge is needed as a predator eats its prey but not the
opposite (e.g. snakes feed on mice, but the opposite is false).
B. Degrees and Distributions
The number of edges connected to a speciﬁc node is named
the degree of this node. On digraphs, two degrees are deﬁned:
the in-degree and the out-degree which are respectively the
number of in-coming edges and the number of out-going
edges.
The degree distribution is a function representing the frequency
of node degrees in a graph (i.e. the number of times
a specific degree value is encountered in the graph). Graphs
considered in this paper have, like many other graphs
modeling real concepts, noisy and right-skewed distributions
(i.e. asymmetric distributions with many values larger than
the mean located in the right tail of the distribution). In order
to ease the study of such distributions, cumulative distributions
are generally preferred.
The cumulative degree distribution gives the proportion of
nodes whose degree is smaller than or equal to a given value.
Symmetrically, the inverse cumulative degree distribution gives
the maximal degree of a given proportion of nodes. Both
the cumulative degree distribution and the inverse cumulative
degree distribution are monotonic: the former is an increasing
function, the latter a decreasing one. We always consider
inverse cumulative distributions when studying distributions
in this paper.
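As an illustration of these definitions, the following minimal Python sketch computes the in- and out-degrees of a small digraph, its degree distribution and both cumulative variants. It is not the tooling used in this study: the function names and the toy graph are hypothetical, and the inverse cumulative distribution is assumed here to be the proportion of nodes whose degree is greater than or equal to a given value, following the description given later for Fig. 4.

```python
from collections import Counter


def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg


def degree_distribution(values):
    """Map each degree value to the number of nodes that have it."""
    return dict(sorted(Counter(values).items()))


def cumulative_distribution(values):
    """Map each degree d to the proportion of nodes with degree <= d."""
    result, running = {}, 0
    for d, count in degree_distribution(values).items():
        running += count
        result[d] = running / len(values)
    return result


def inverse_cumulative_distribution(values):
    """Map each degree d to the proportion of nodes with degree >= d (assumed definition)."""
    result, remaining = {}, len(values)
    for d, count in degree_distribution(values).items():
        result[d] = remaining / len(values)
        remaining -= count
    return result


# Hypothetical toy digraph: a -> b, a -> c, b -> c, d -> c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_deg, out_deg = degrees(nodes, edges)
in_values = list(in_deg.values())
print(degree_distribution(in_values))
print(cumulative_distribution(in_values))
print(inverse_cumulative_distribution(in_values))
```

The cumulative variant is increasing and the inverse one is decreasing, as stated above.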
Node degrees and their distributions are basic properties
which are directly influenced by many properties of a graph:
changing the graph topology, and thereby some graph
properties (diameter, motifs, graph spectrum, etc.), will impact
the degree distribution. That is the reason why, in this
paper, we consider them as a recognized proxy for the graph
topology (i.e. the manner in which edges connect nodes to each other).
C. Software Dependency Graphs
A large variety of graphs can be derived from software,
each one focusing on particular characteristics. Hence, nodes
and edges can have various meanings. An example of software
graph is the dependency graph in which nodes are modules
(e.g. packages, classes, etc. depending on the chosen granu-
larity) and edges are added when an element accesses another
one (e.g. function call, inheritance, ﬁeld access, etc.).
Dependency graphs are directed graphs as dependencies are
oriented: indeed, a dependency shows which element depends
on another one, but the opposite is not necessarily true. As
an example, a Person class depends on a File class in order to
persist, but the File class should not depend at all on the
Person class.
Nodes composing a dependency graph can be of two
different types. First, there are application nodes (a.k.a. app
nodes), which belong to the core software itself.
Second, there are library nodes (a.k.a. lib nodes), which
belong to an external library (i.e. a module which
is used by the core software; an example of a library in Java
is the java.util package for classes related to collections).
Consequently, there are two types of edges on a software de-
pendency graph: (i) application nodes to application nodes (i.e.
app-app edges a.k.a. endo-dependencies) which express that
a core element depends on another one; (ii) application nodes
to library nodes (i.e. app-lib edges a.k.a. exo-dependencies)
which express that a core element depends on an external
element. Fig. 2 illustrates these notions.
Fig. 2: Example of considered nodes and edges if we consider
only endo-dependencies (left) or exo-dependencies (right).
Whether or not exo-dependencies are considered has an impact on
the observed degree distributions. Fig. 3 shows a line chart
plotting the inverse cumulative distributions for endo- (App-
App), exo- (App-Lib) and both dependencies for ant 1.9.2.
Two distributions have to be considered as those graphs are
directed: one for in-degree (3a) and one for out-degree (3b).
Plots are on a logarithmic scale. We see that endo- and exo-
dependencies have close yet different distributions. For in-
degree, the slope is different. If we consider only app-lib
dependencies i.e. we exclude app-app links (straight thin line),
the number of zero in-degree nodes strongly increases because
lib nodes never connect to an app node. In this paper, we focus
on endo-dependencies and all the data we present excludes
exo-dependencies.
Fig. 3: Node in- and out-degree distributions for software
package “ant”. Three kinds of dependency are considered: app-app,
app-lib and both. The x-axes are degrees and the y-axes are cumulative
degree frequencies. Both axes are on a logarithmic scale.
It is important to differentiate endo- and exo-dependencies
because their topology is different.
D. Generative Models
A generative model for graphs is an algorithm that generates
artiﬁcial graphs. A generative model takes a set of parameters
as input (such as the number of nodes and parameters that
inﬂuence the degree distribution). A generative model may
be deterministic or stochastic. Given a set of parameters, a
deterministic model always generates the exact same graph
whereas a stochastic model generates a new graph each time
it is run.
In our case, we generate software dependency graphs and
we intend these graphs to look like empirical software depen-
dency graphs. We consider two types of graphs: those resulting
from an analysis of software systems, and those created by a
model. The former are qualiﬁed as “empirical” or “true”, the
latter are qualiﬁed as “synthetic” or “artiﬁcial”.
III. EXPERIMENTAL DESIGN
We now present the experimental protocol we use to analyze
software dependency graphs.
In this paper, we have two goals: 1) we want to study
the topology of software graphs; 2) we aim at inventing a
generative model of software graphs that ﬁts real data.
First, to study recurring topologies, we compare the degree
distributions obtained from our dataset. For instance, we may
observe that our set of empirical graphs can be partitioned into
a set of prototypical topologies, each one being shared by a
subset of software systems.
Second, we want to invent a model that generates software
dependency graphs that are similar to empirical graphs. We
optimize its parameters so that the degree distribution of the
generated graphs is as close as possible to those of empirical
graphs. Such a model would implicitly capture certain software
evolution rules (such rules are presented in Section VI-A).
We define a dataset¹ of 10 Java software applications listed
in Table I. This table contains the software name, the version
considered, the year of the first version of the software and
the year of the considered version, the number of nodes and
the number of edges contained in the extracted graph. The
considered software applications are between 10 and 14 years
old. The last column is the connectance γ of each software
application; the connectance is computed using formula (1),
with |N| the total number of nodes contained in the graph and
|E| the total number of edges. In words, the connectance
expresses the proportion of pairs of nodes being connected in
the graph.
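Formula (1) is not reproduced in this text. As a hedged sketch, we assume here that the connectance is γ = |E| / |N|², i.e. the number of edges divided by the number of ordered node pairs; this assumed form reproduces the rounded γ values reported in Table I. The node and edge counts below are copied from Table I; the function name is hypothetical.

```python
def connectance(num_nodes, num_edges):
    """Assumed form of formula (1): edges divided by the number of ordered node pairs."""
    return num_edges / (num_nodes ** 2)


# (nodes, edges, reported connectance) for each program, copied from Table I.
TABLE_I = {
    "ant": (1252, 5763, 0.004),
    "jfreechart": (858, 4783, 0.006),
    "jftp": (173, 736, 0.025),
    "jtds": (90, 328, 0.040),
    "maven": (1515, 6933, 0.003),
    "hsqldb": (602, 4976, 0.014),
    "log4j": (895, 4136, 0.005),
    "squirrelsql": (2288, 10141, 0.002),
    "argouml": (2664, 13445, 0.002),
    "mvnforum": (282, 1614, 0.020),
}

for name, (n, e, reported) in TABLE_I.items():
    print(f"{name:12s} computed={connectance(n, e):.3f} reported={reported:.3f}")
```

The alternative normalization by |N|(|N| − 1) gives nearly identical values for the larger graphs; the choice made here is an assumption, not a statement about the original formula.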
TABLE I: The 10 Java programs considered in our experiments:
their version, their first and current release year, the
number of nodes and edges and the connectance γ.
Software     Version  Years      Nodes  Edges  γ
ant          1.9.2    2000-2013  1252   5763   0.004
jfreechart   1.0.16   2000-2013  858    4783   0.006
jftp         1.57     2002-2013  173    736    0.025
jtds         1.3.1    2001-2013  90     328    0.040
maven        3.1.1    2002-2013  1515   6933   0.003
hsqldb       2.3.1    2001-2013  602    4976   0.014
log4j        2.0b9    2004-2013  895    4136   0.005
squirrelsql  3.5.0    2001-2013  2288   10141  0.002
argouml      0.34     2002-2011  2664   13445  0.002
mvnforum     1.3      2003-2010  282    1614   0.020
In order to have a diversified set of software, we
chose software applications which differ in size (number
of nodes/edges). The graph size ranges from 90 to 2664
nodes and from 328 to 13445 edges, with a connectance γ
ranging from 0.002 to 0.04. As these software applications are all
developed by different teams, we reduce the risk of having
results biased towards specific development processes. Since
our graph extraction tool chain handles Java software, we
consider software that is open-source and written in Java.
¹ The dataset can be downloaded at http://www.vmusco.com//pages/dataset
We now present the key points of our methodology.
1) Dependency Graph Extraction: We now present our
method to extract dependency graphs from our dataset. We
focus on the class granularity (i.e. one node represents one
class), as this is the most important modularity unit in object-
oriented software. Also, we only consider endo-dependencies,
that is, edges connecting internal nodes of the project to
each other and not those connecting to external libraries (cf.
Section II-C), because we aim at understanding the topology
of core software graphs, regardless of the number of libraries
that are included and the frequency of the library usage.
The graph extraction phase produces a set of nodes and
edges forming a graph which can then be analyzed. This
extraction phase is done using Dependency Finder2. This tool
takes as input Java byte code and outputs all the depen-
dencies being found. Graph metrics are computed using the
2) Degree Distribution Comparison: In this paper, our main
analysis tool is the comparison of degree distributions. The
degree distribution is a numerical property that captures certain
topological properties. For instance, the edge density and the
clustering coefﬁcient are directly correlated to it. To compare
two degree distributions, we use the Kolmogorov-Smirnov
statistic K (also called distance), which measures a distance
between two distributions. The statistic is given in Formula (2),
where sup is the supremum of a set, F1 and F2 are the two
distributions to compare and x ranges over degree values.
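Formula (2) is not reproduced in this text; the standard definition of the Kolmogorov-Smirnov statistic, which matches the description above, is K = sup_x |F1(x) − F2(x)|. A minimal Python sketch of this computation on two samples of degree values follows. The function names and the sample values are hypothetical, and F1 and F2 are taken to be empirical cumulative distribution functions.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    ordered = sorted(sample)
    size = len(ordered)
    return lambda x: bisect_right(ordered, x) / size


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    f1, f2 = ecdf(sample_a), ecdf(sample_b)
    # Both CDFs are right-continuous step functions, so the supremum is
    # reached at one of the observed values.
    points = set(sample_a) | set(sample_b)
    return max(abs(f1(x) - f2(x)) for x in points)


# Hypothetical degree samples of two small graphs.
degrees_a = [0, 0, 1, 3]
degrees_b = [0, 1, 1, 2]
print(ks_statistic(degrees_a, degrees_b))
```

K lies between 0 (identical empirical distributions) and 1 (samples with no overlap in their ranges).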
K is a numerical value that indicates how close two
distributions are: the lower K, the closer the distributions. For
one experiment presented in this paper, we perform a Mann-Whitney
U test on K to compare two generative models.
Also, we use the K statistic in the context of the
Kolmogorov-Smirnov test. This statistical test checks whether
a sample follows a reference probability distribution (one-sample
test) or another sample distribution (two-sample test).
3) Evaluation of Generative Models: To evaluate whether
a generative model is good, we use the Kolmogorov-Smirnov
statistic to compare artiﬁcial degree distributions with real ones
(recall that the degree distribution is a proxy for the graph
topology, see Section II-B). If the degree distributions of the
generated dependency graphs match the empirical ones,
the basic operations of the generative model are candidates for
the fundamental evolution rules we are looking for.
4) Fine Tuning of Generative Models: Generative models
frequently require parameters. For instance, the small-world
model requires one probability parameter. In order to
determine the parameter values that generate graphs
as close as possible to real ones (according to their degree
distribution), we perform a grid search and generate graphs
for each point of the grid. This is done as follows: we iterate
over the range of each parameter. For instance, if a probability
parameter ranges from 0.0 to 1.0, we evaluate the model
ten times, for 0.1, . . . , 1. For stochastic models, the evaluation
or parameter optimization is repeated 10 times and the median
value is selected as the final result.
Then, we use the K statistic to assess the distance between
the true graph and the generated one. The graph with the
smallest statistic value is the one that looks most like real data.
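The grid search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: `grid_search`, `toy_generate` and `toy_distance` are hypothetical names, and the toy model simply returns noisy copies of its parameters so that the example runs on its own. In practice, `generate` would be a generative model of graphs and `distance` would compare the generated graph with the true one using the K statistic.

```python
import random
from itertools import product
from statistics import median


def grid_search(generate, distance, grid, repeats=10, seed=0):
    """Return the grid point with the lowest median distance over `repeats` runs."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for params in grid:
        # A stochastic model is evaluated several times; the median is kept.
        scores = [distance(generate(params, rng)) for _ in range(repeats)]
        score = median(scores)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Two probability parameters, each taking the values 0.1, 0.2, ..., 1.0.
steps = [i / 10 for i in range(1, 11)]
grid = list(product(steps, steps))


def toy_generate(params, rng):
    """Hypothetical stand-in for a stochastic generative model."""
    p, q = params
    return (p + rng.uniform(-0.01, 0.01), q + rng.uniform(-0.01, 0.01))


def toy_distance(generated):
    """Hypothetical stand-in for the distance to the true graph."""
    return abs(generated[0] - 0.3) + abs(generated[1] - 0.7)


best_params, best_score = grid_search(toy_generate, toy_distance, grid)
print(best_params, best_score)
```

Taking the median rather than the minimum makes the selected grid point less sensitive to a single lucky generation.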
IV. STUDY OF SOFTWARE GRAPH TOPOLOGY
We want to determine whether there exist topologies shared
by different software applications. As the production of those
software packages is influenced by different factors (teams,
development techniques, etc.), it is a priori expected that there
is no common topology. Conversely, finding common
topologies would be an interesting fact: it would mean that
there exist common software development rules. Hence, our
first research question reads:
Research Question 1 Are there common topologies of soft-
ware dependency graphs at the class granularity?
Fig. 4: Inverse cumulative in- and out-degree distribution
for all 10 software applications of our dataset (axes on a
logarithmic scale). The in- and out- distributions are different.
To answer this question, we ﬁrst look at the graphical
resemblance between the cumulative degree distributions of
the set of applications we consider. The software compilation
graphs we consider are directed (see Section II-C), so we have
to consider two distributions: in-degree and out-degree. Then,
we determine the signiﬁcance of our graphical observations,
using the two-sample Kolmogorov-Smirnov test.
Figure 4 shows the plot of the inverse cumulative in-degree
(4a) and out-degree (4b) distributions of our dataset on a log-log
scale. There is a different curve and color for
each considered software application. We plot the inverse
cumulative frequency against degrees, i.e. the number of nodes
in the graph which have a degree greater than or equal to the
given degree.
We observe: (i) the position and the shape of the curves are
similar, which indicates there are common topologies across
software; (ii) in- and out-degree distributions are not the
same: the in-degree distribution is a straight line but the out-degree
distribution is more curved; (iii) each plot of the in-degree
distribution is an almost straight line on a log-log scale,
which means that the in-degree distribution generally follows
a power law⁴; (iv) the out-degree distribution is not a straight
line: out-degrees do not follow a power law. Observations (ii)
and (iii) have already been made.
C. Statistical Signiﬁcance
In order to assess our observations in a statistical manner,
we now set our null and alternate hypotheses:
H0: Samples from the software in-degree distributions
(resp. out) are drawn from the same distribution.
H1: Samples from the software in-degree distributions
(resp. out) are not drawn from the same distribution.
Using the two-sample Kolmogorov-Smirnov test on each pair
of software in the dataset, we can determine statistically the
common topology across software in our dataset. The test
returns a p-value which is used to reject H0 or not. If H0 is not
rejected, we gain confidence about the common topology for
those two software applications. On the other hand, if H0 is rejected, the
test outcome cannot be used to conclude about the common
topology (which does not necessarily mean that the samples are not
drawn from the same distribution).
Table II gives the results of running 90 two-sided Kolmogorov-Smirnov
tests with a confidence level α of 0.01⁵ (one on each
pair of software applications in our dataset, for each of the two
distributions). The rows give respectively the
results for the in-, out- and both distributions. The second and third
columns present the number and the ratio of tests for which
the two-sided Kolmogorov-Smirnov test has rejected H0. The
fourth and fifth columns give the opposite data (i.e. the
test has not rejected H0).
⁴ Plotting a straight line on a log-log scale is the standard way to visually
identify power laws in data.
⁵ We need to test each pair of software, hence $C_{10}^{2} = 45$ tests, which is
doubled since we test in-degree and out-degree.
TABLE II: Number of times the H0 hypothesis is rejected
(or not) for the in-, out- and both cumulative degree distributions
according to the two-sided Kolmogorov-Smirnov test with a
confidence level α of 0.01.
         H0 rejected      H0 not rejected
         Count   Ratio    Count   Ratio
In       19/45   42%      26/45   58%
Out      9/45    20%      36/45   80%
Total    28/90   31%      62/90   69%
As we can see, for 69% of the tested pairs the common
distribution hypothesis cannot be rejected. However, this
does not necessarily imply that there is a unique
distribution shared by all those software applications. On the other hand,
for the remaining 31% of tested pairs, for which H0 is rejected,
no conclusion can be drawn at this level of confidence.
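The pairwise testing protocol can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually run for Table II: the p-value uses the classical asymptotic Kolmogorov series, which is only approximate for discrete data with ties such as degree values; the function names are hypothetical; and the ten samples are synthetic placeholders, not the degree samples of our dataset.

```python
import random
from bisect import bisect_right
from itertools import combinations
from math import exp, sqrt


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = set(a) | set(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)


def kolmogorov_pvalue(lam, terms=100):
    """Asymptotic two-sided p-value: 2 * sum_k (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    if lam < 0.3:
        # The series converges poorly near zero, where the p-value is
        # within about 1e-5 of 1.
        return 1.0
    total = sum((-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def pairwise_ks_tests(samples, alpha=0.01):
    """Test every unordered pair; return (name_a, name_b, K, p, rejected) tuples."""
    results = []
    for (name_a, a), (name_b, b) in combinations(samples.items(), 2):
        k = ks_statistic(a, b)
        lam = sqrt(len(a) * len(b) / (len(a) + len(b))) * k
        p = kolmogorov_pvalue(lam)
        results.append((name_a, name_b, k, p, p < alpha))
    return results


# Ten synthetic, heavy-tailed placeholder samples (not real software data).
rng = random.Random(0)
samples = {f"app{i}": [int(rng.paretovariate(1.5)) for _ in range(200)]
           for i in range(10)}
results = pairwise_ks_tests(samples)
rejected = sum(1 for r in results if r[4])
print(len(results), "pairs tested,", rejected, "rejections of H0")
```

Ten samples yield 45 unordered pairs; running the same procedure on in-degree and out-degree samples gives the 90 tests counted in Table II.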
To sum up, we answer positively to our first research question:
our experiment indicates that, according to the degree distri-
bution, there are common shapes across software dependency
graphs. We consider it as an emergence phenomenon and we
hypothesize that there is a common evolution process which
eventually yields those common degree distributions.
V. A GENERATIVE MODEL FOR SOFTWARE DEPENDENCY GRAPHS
In this section, we present a new generative model of
software dependency graphs. This stochastic model generates
an arbitrary number of artiﬁcial dependency graphs. It is
parametrized by three values: the expected number of nodes
and two probabilistic parameters.
A. Generative Models of the Literature
We discuss here three related generative models of graphs:
Erdős & Rényi's is a prototypical one; the relation between
GNC and software graphs has been observed once; Baxter
and Frean's model is the only one explicitly targeting the
generation of software graphs.
The Erdős & Rényi model, proposed in 1959, is one of the oldest
generative models. This model connects pairs of nodes
according to a fixed probability p. The connectance of the
resulting graph is hence γ = p in expectation. We will later use this model
as a point of comparison.
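A minimal Python sketch of such a random digraph generator is given below. It is a hedged illustration: the function name is hypothetical, and self-loops are excluded by assumption, so the expected connectance is slightly below p for small graphs.

```python
import random


def erdos_renyi_digraph(n, p, seed=None):
    """Create a directed edge for each ordered pair of distinct nodes with probability p."""
    rng = random.Random(seed)
    nodes = list(range(n))
    edges = [(u, v) for u in nodes for v in nodes
             if u != v and rng.random() < p]
    return nodes, edges


# Example with the size and connectance reported for jtds in Table I.
nodes, edges = erdos_renyi_digraph(90, 0.040, seed=1)
print(len(edges), len(edges) / len(nodes) ** 2)
```

Each ordered pair is decided independently, so no node is favored: the resulting degree distributions are concentrated around the mean rather than right-skewed.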
In 2005, Krapivsky and Redner proposed the GNC model
(GNC stands for “Growing Network model with Copying”).
This model requires one parameter: the number of nodes
of the resulting graph. The GNC model is an iterative algorithm
in which, at each iteration, a new node is added to the
graph and connected at random to a set of already existing
nodes as follows: an existing node is selected according to a
uniform distribution and directed edges are created from the
new node to this node along with all its successors. We call
this the “GNC-Attach process”; it is illustrated in Figure 5.
Fig. 5: Illustration of GNC-Attach, the GNC primitive opera-
tion. The grey node is a new node added to the graph using the
GNC primitive. The central node is randomly selected and a
directed edge is added from the new node to it (dashed edge).
Then, a directed edge is also added to all destination nodes
the randomly selected one is already connected to as a source
Algorithm 1 shows the core primitive for attaching nodes using
GNC. The full GNC model executes this function n times
to create a graph with n nodes. The striking fact about this
generative model is its high ability to fit in-degree distributions
of software graphs, as observed by Valverde and Solé.
Algorithm 1: GNC-Attach Algorithm
Input: n_i, the current node being inserted; G_{N,E}, the
digraph to which we add the node (composed of
two sets: nodes (N) and directed edges (E))
Function GNC_Attach(G_{N,E}, n_i) is
    Randomly select a node n_j in the graph, different from n_i
    Add an edge from n_i to n_j
    for all edges (n_j, n_d) in the out-edge set of n_j do
        Add an edge from n_i to n_d
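Algorithm 1 and the surrounding GNC loop can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the graph is stored as a hypothetical dictionary mapping each node to the set of its successors, and the function names are our own.

```python
import random


def gnc_attach(successors, new_node, rng):
    """Connect new_node to a random existing node and to all of its successors."""
    candidates = [node for node in successors if node != new_node]
    if not candidates:
        return  # the very first node has nothing to attach to
    target = rng.choice(candidates)
    successors[new_node].add(target)
    # Copy the out-edges of the selected node.
    successors[new_node].update(successors[target])


def gnc(n, seed=None):
    """Grow a digraph with n nodes by calling GNC-Attach once per new node."""
    rng = random.Random(seed)
    successors = {}
    for node in range(n):
        successors[node] = set()
        gnc_attach(successors, node, rng)
    return successors


graph = gnc(50, seed=1)
in_degree = {node: 0 for node in graph}
for node, targets in graph.items():
    for target in targets:
        in_degree[target] += 1
print(sorted(in_degree.values(), reverse=True)[:5])
```

Because every new node only points to older nodes, the generated digraph is acyclic, and the oldest nodes tend to accumulate the highest in-degrees.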
In 2008, Baxter and Frean proposed a generative model
of dependency graphs. This model has an explicit, hard-coded
preferential attachment based on the out-degree of nodes. Its
logic is based on edge creation/transfer between nodes of the
graph. We consider this model as our baseline for two reasons. First, this model
is also intended to generate software graphs, and second, it has
acceptable fits on in- and out-degree distributions.
Many other generative models of directed graphs have been
considered (cf. section VII), in many application domains (e.g.
WWW, proteins ...). We have explored whether they generate
likely dependency graphs, and, expectedly, they do not.
B. Generalized Double GNC (GD-GNC)
We now present our generative model of software depen-
dency graphs, called “Generalized Double GNC” (GD-GNC
for short). It is a generalization of the GNC model and is based
on the GNC-Attach primitive.
Our model consists of a main loop in which, for each
iteration: (i) a unique node n_i is added to the graph; (ii) an
existing node n_j is drawn uniformly at random.
The process of creating edges is as follows: (i) with probability
p, n_i is connected to n_j in the same way as in the GNC-Attach
algorithm (i.e. a directed edge is created from n_i to
n_j but also from n_i to each node to which n_j is connected),
and with probability q we repeat this GNC-based attachment
a second time (if the same random node is drawn twice, the
second one is ignored); (ii) with probability 1 − p, n_j is
connected to n_i (which we refer to as the attachment alternative).
Pseudo-code is shown in Algorithm 2. GD-GNC is a
generalization of GNC: the GNC algorithm is a special case
where p = 1 and q = 0.
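The process described above can be sketched in Python as follows. This is one reading of the textual description, given as a hedged illustration: we assume that the second GNC attachment is only attempted inside the GNC branch and draws a fresh random node, and the dictionary-of-successors representation and function names are hypothetical.

```python
import random


def gnc_attach(successors, new_node, target):
    """Add edges from new_node to target and to all successors of target."""
    successors[new_node].add(target)
    successors[new_node].update(successors[target])


def gd_gnc(n, p, q, seed=None):
    """Grow a digraph with n nodes following the GD-GNC description (assumed reading)."""
    rng = random.Random(seed)
    successors = {}
    for node in range(n):
        existing = list(successors)  # nodes created before this iteration
        successors[node] = set()
        if not existing:
            continue
        if rng.random() < p:
            first = rng.choice(existing)
            gnc_attach(successors, node, first)
            if rng.random() < q:
                second = rng.choice(existing)
                if second != first:  # the same node drawn twice is ignored
                    gnc_attach(successors, node, second)
        else:
            # Attachment alternative: an existing node points to the new one.
            source = rng.choice(existing)
            successors[source].add(node)
    return successors


graph = gd_gnc(200, p=0.6, q=0.4, seed=7)
edge_count = sum(len(targets) for targets in graph.values())
print(len(graph), "nodes,", edge_count, "edges")
```

With p = 1 and q = 0 the else branch and the second attachment are never taken, so the sketch reduces to plain GNC, consistent with the special case stated above.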
We note that this model never modiﬁes existing edges: at
each loop iteration, it only adds a single node and a set of
edges. No explicit preferential attachment is hard coded in the
algorithm, but an implicit one is still present. In our model,
attaching using the GNC algorithm implies the node does not
only connect to a node, but also to all children of this node.
A node with in-degree k is a child of k nodes, so it receives a new
edge whenever any of those k nodes, or the node itself, is selected.
As a consequence, the higher the in-degree value of a node
is, the higher the probability of being attached is. So, nodes
with a high in-degree value are more likely to be pointed to
by new nodes, and their in-degree increases accordingly.
Two parameters are required by our model and inﬂuence
the generation. The ﬁrst one, p, determines whether the node
must be added using GNC or the attachment procedure. As this
probability changes, the quantity of nodes without outgoing
edges varies. The second one, q, determines whether the GNC
algorithm should be executed once or twice for the inserting
node. Increasing the number of GNC executions for a node
impacts the inverse cumulative degree distributions. Regarding
the in-degree, the coefficient of the power law (i.e. the line
slope) is affected: the line decreases more slowly when the
number of GNC iterations increases (higher q). Regarding the
out-degree, the convexity of the distribution increases as q
increases.
C. Evaluation of GD-GNC
We now want to determine whether the Generalized Double
GNC can generate graphs that are more realistic than the
graphs generated with Baxter & Frean's model. We formulate
this research question as:
Research Question 2 Do class
dependency graphs generated using GD-GNC better fit real
software data than those of Baxter & Frean's model (according to the
cumulative degree distribution)?
1) Protocol: To answer this question, we first run a parameter
optimization (as presented in Section III-C4) for each
model (GD-GNC and Baxter & Frean) on all programs of our
dataset (see Table I). Then, we generate 30 synthetic graphs
with each model, using the best parameters found for each
(model, program) pair. Finally, we compute the inverse cumulative
degree distribution of each graph and we calculate the median
fitness value according to its δ value defined by Equation (3).
In addition to this comparison, we also compare against the
Erdős & Rényi model, a purely random and simple model.
With the Erdős & Rényi model, generating graphs with the same
number of nodes and edges as real software graphs requires
no parameter optimization: we can simply use the connectance
of the real graph.
2) Results: Figure 6 shows the in- (6a) and the out- (6b)
inverse cumulative degree distributions of graphs generated
using different models. Each small plot represents a different
software application; the meaning of the axes is the same
for all plots: x-axes are degrees and y-axes are the inverse
cumulative degree frequencies. Both axes are on a logarithmic
scale. The thick continuous line corresponds to the distribution
of real data, the thin continuous line is for graphs generated
using the GD-GNC model and the dotted line is for graphs
generated using Baxter & Frean's.
Algorithm 2: Iterative algorithm for the "Generalized Double GNC" generative model
Input: N, the number of iterations to execute/nodes to add; p, the probability to do a GNC; q, the probability to do a second GNC attachment
Output: A digraph G_{N,E} which is composed of two sets: nodes (N) and directed edges (E)
while |N| < N do
    Create a node n_i and add it to the graph
    if random() < p then
        GNC_Attach(G_{N,E}, n_i)
        if random() < q then
            GNC_Attach(G_{N,E}, n_i)
    else
        Randomly select a node n_j in the graph, different from n_i
        Add an edge from n_j to n_i
Graphically, we observe that GD-GNC in- and out-degree
distributions are almost always better than Baxter & Frean’s.
In other words, the GD-GNC algorithm generally produces
synthetic software graphs whose inverse cumulative in- and
out-degree distributions better fit those of real software
dependency graphs than Baxter & Frean's.
3) Statistical Significance: To determine statistically which
model generates the closest graph to the true one, we compare
the Kolmogorov-Smirnov statistic or distance K (as presented
in Section III-C3) for the in- and out- cumulative degree
distributions between the generated graph G and the real graph
R. For this purpose we define the δ function, as shown
in Equation (3), which is the maximum of the two
Kolmogorov-Smirnov distances: first, the distance between the
in-cumulative degree distribution of the generated graph G_in
and that of the real one R_in, and second, the distance between
the out-cumulative degree distribution of the generated graph
G_out and that of the real one R_out.
$$\delta_{R,G} = \max\left(K_{R_{in},G_{in}},\; K_{R_{out},G_{out}}\right) \qquad (3)$$
Indeed, this function is required as we must consider both
in- and out- distances at the same time, since both distributions are
intimately related to each other: considering only the in- or out-
distribution would be meaningless as a good in-distribution
does not necessarily imply a good out-distribution and vice versa.
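A minimal Python sketch of the δ function of Equation (3) is given below, under the assumption that K is computed on the empirical cumulative distributions of the in- and out-degree samples. The function names and the two toy graphs are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = set(a) | set(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)


def degree_samples(nodes, edges):
    """Return the lists of in-degrees and out-degrees of a digraph."""
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return list(in_deg.values()), list(out_deg.values())


def delta(real, generated):
    """Maximum of the in-degree and out-degree Kolmogorov-Smirnov distances."""
    real_in, real_out = degree_samples(*real)
    gen_in, gen_out = degree_samples(*generated)
    return max(ks_statistic(real_in, gen_in), ks_statistic(real_out, gen_out))


# Hypothetical toy graphs: a star (all nodes point to node 0) and a chain.
star = ([0, 1, 2, 3], [(1, 0), (2, 0), (3, 0)])
chain = ([0, 1, 2, 3], [(1, 0), (2, 1), (3, 2)])
print(delta(star, chain))
```

In this toy case the out-degree samples are identical while the in-degree samples differ, which is exactly the situation the maximum is meant to catch.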
As we want graphs which are similar to real ones according
to their degree distribution, and as the δ value represents the
largest distance between a pair of distributions, considering in-
and out-degree distributions, the model which
produces the smallest δ value is the best. To check statistically that
the δ values obtained for the two models are drawn from different
distributions, we use the Mann-Whitney U test. In terms
of null hypothesis, this test allows us to reject or not the null
hypothesis:
H0: the δ values obtained for GD-GNC and the ones
obtained from Baxter & Frean's model belong to an identical
distribution.
H1: the δ values obtained for GD-GNC and the ones
obtained from Baxter & Frean's model belong to different
distributions.
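For illustration, the Mann-Whitney U test can be sketched in pure Python as follows. This is a simplified version under stated assumptions: the p-value uses the normal approximation without tie or continuity correction, which is reasonable for two samples of 30 values but is not necessarily the exact procedure used in our experiments. The function name and the two lists of δ values are hypothetical placeholders, not measured results.

```python
from math import erfc, sqrt


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic of sample_a, two-sided p-value by normal approximation)."""
    n, m = len(sample_a), len(sample_b)
    # U counts the pairs in which the value from sample_a is the larger one;
    # ties count for one half.
    u_a = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in sample_a for b in sample_b)
    mean = n * m / 2.0
    std = sqrt(n * m * (n + m + 1) / 12.0)
    z = (u_a - mean) / std
    return u_a, erfc(abs(z) / sqrt(2.0))


# Hypothetical delta values for 30 generations of two models.
model_a = [0.10 + 0.001 * i for i in range(30)]
model_b = [0.20 + 0.001 * i for i in range(30)]
u, p_value = mann_whitney_u(model_a, model_b)
print(u, p_value, "reject H0" if p_value < 0.05 else "do not reject H0")
```

The test only uses the ranks of the values, so it does not assume that δ values are normally distributed.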
Table III sums up various values of δ obtained from
30 generated graphs with each model for each software application.
Each row gives the values for one software application: columns report
respectively the name of the software, the minimal, median and
maximal δ values for the Erdős & Rényi model, then for Baxter &
Frean's and finally for GD-GNC. The last column is the p-value
obtained using the Mann-Whitney test between GD-GNC and
Baxter & Frean.
Comparing graphs generated by Erdős & Rényi (columns 2–4)
on the one hand to GD-GNC (columns 8–10) and Baxter &
Frean (columns 5–7) on the other hand, it is clear that the
latter two models generate graphs more similar to real ones
with regard to their δ value than Erdős & Rényi.
Furthermore, comparing GD-GNC (columns 8–10) and Baxter &
Frean (columns 5–7) shows that graphs generated using
the former are almost always closer to the real graph than the
ones generated using the latter. However, for some topologies
(maven, hsqldb and log4j if considering only the median
value), Baxter & Frean seems to generate better graphs. The
Mann-Whitney p-values show that those results are reliable, as
they are lower than 0.05, except for log4j.
To sum up, according to our experiments on the degree
distributions, our GD-GNC model is able to reproduce software
topologies more accurately than the Baxter & Frean model.
Fig. 6: Plot of the cumulative in- and out- degree distribution for 1) the real software graph (thick solid line); 2) the best
generated match using the Generalized Double GNC model (thin solid line); 3) the best match generated using Baxter & Frean's
model (dotted line). Generalized Double GNC outperforms Baxter & Frean's model.

TABLE III: Minimum, median and maximum δ values for the Erdős & Rényi, Baxter & Frean and GD-GNC models for 30 random
generations of each model (cf. Section V-C for more information on the δ value). The last column is the p-value determined
using the Mann-Whitney test, which assesses statistically whether the values of GD-GNC are smaller than the Baxter & Frean ones.

              Erdős & Rényi        Baxter & Frean       GD-GNC
Software      Min   Med   Max      Min   Med   Max      Min   Med   Max      p-value
ant           0.77  0.91  0.97     0.32  0.51  0.82     0.29  0.45  0.63     ≤ 10^-3
jfreechart    0.66  0.81  0.91     0.17  0.39  0.58     0.13  0.20  0.39     ≤ 10^-8
jftp          0.52  0.70  0.82     0.22  0.44  0.82     0.08  0.27  0.58     ≤ 10^-6
jtds          0.47  0.64  0.81     0.24  0.40  0.72     0.17  0.31  0.37     ≤ 10^-5
maven         0.63  0.76  0.81     0.09  0.22  0.46     0.24  0.46  0.51     ≤ 10^-10
hsqldb        0.58  0.71  0.83     0.14  0.30  0.46     0.51  0.62  0.65     ≤ 10^-11
log4j         0.69  0.83  0.92     0.15  0.41  0.72     0.23  0.40  0.50     0.381
squirrelsql   0.65  0.85  0.94     0.26  0.47  0.78     0.19  0.29  0.43     ≤ 10^-9
argouml       0.80  0.92  0.97     0.27  0.56  0.73     0.20  0.38  0.61     ≤ 10^-4
mvnforum      0.57  0.69  0.78     0.12  0.30  0.53     0.19  0.39  0.45     ≤ 10^-3

We now put aside technical considerations and discuss the
meaning and validity of our empirical results.
A. GD-GNC from a Software Engineering Perspective
We now have a generative model which fits the degree
distributions of empirical software graphs. This model is only
expressed in terms of primitive graph operations on nodes
and edges, without any specific rules coming from software
engineering. Our initial intuition is that such a model implicitly
captures certain software evolution rules. We now try to
express those rules. In other words, we now speculatively
explain the model from a software engineering perspective.
The GD-GNC model is made up of two basic operations
(the top-level if/then/else of Algorithm 2).
The first basic operation of the model is a node creation
followed by an attachment to existing nodes using a GNC
primitive. To us, it represents the creation of a new class
implementing a new feature. This new feature depends upon
existing classes. Attaching the new node to all the classes
that a given class depends on (as the GNC primitive does) reflects
the fact that those classes already collaborate. If class X
depends on A, B and C, it means that A, B and C interact
together in a way that is defined by X. When a new node α is
connected to X with the GNC primitive, it is also connected
to A, B and C. In other words, the new class α creates a novel
interaction between A, B and C.
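The GNC primitive described above can be sketched in a few lines. This is an illustrative sketch only: the adjacency representation (a dictionary mapping each class to the set of classes it depends on) and the name `gnc_attach` are assumptions made for the example, not part of Algorithm 2.

```python
def gnc_attach(successors, new_node, target):
    """Connect new_node to target and to every class target depends on.

    successors maps each node to the set of nodes it depends on."""
    deps = successors.setdefault(new_node, set())
    deps.add(target)
    deps.update(successors.get(target, ()))
    return successors


# X depends on A, B and C; the new class alpha is attached to X.
graph = {"X": {"A", "B", "C"}, "A": set(), "B": set(), "C": set()}
gnc_attach(graph, "alpha", "X")
print(sorted(graph["alpha"]))
```

Calling `gnc_attach` a second time with another target adds a second group of classes to the dependencies of the same new node.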
When the GNC primitive is executed twice, it may be
explained by the fact that the new class mixes two existing
groups of classes. In the model, there are never more than
two groups of classes being linked from a new node (a new
feature). According to our experiments, mixing more than two
groups of classes never significantly increases the fit to real
data. One possible explanation is that it is already quite
hard to meaningfully and correctly mix two groups
of classes, and that remixing more than two groups happens
very rarely.
The second basic operation of GD-GNC (the top-level else
condition) is a reverse attachment from an existing node to
a newly created node. For us, it may be explained as a
refactoring, where a piece of logic is extracted from an
existing class, in order to ease reuse and to simplify the code.
Once the refactoring is performed, the newly created class is
ready to be reused. This is what can happen in subsequent
iterations of the algorithm with the GNC primitive.
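The interplay between the two operations can be sketched as a growth loop. This sketch is speculative and does not reproduce Algorithm 2: the probabilities `p_new_feature` and `p_second_gnc`, the uniform choice of target nodes, the single-node starting graph and the function name `grow` are all assumptions made for the illustration.

```python
import random


def grow(steps, p_new_feature=0.7, p_second_gnc=0.3, seed=0):
    """Grow a dependency graph one class at a time.

    Returns a dict mapping each node to the set of nodes it depends on.
    All parameter names and default values are hypothetical."""
    rng = random.Random(seed)
    successors = {0: set()}  # assumed starting point: a single class
    for new_node in range(1, steps + 1):
        existing = list(successors)
        successors[new_node] = set()
        if rng.random() < p_new_feature:
            # New feature: apply the GNC primitive once, sometimes twice.
            rounds = 2 if rng.random() < p_second_gnc else 1
            for _ in range(rounds):
                target = rng.choice(existing)
                successors[new_node].add(target)
                successors[new_node] |= successors[target]
        else:
            # Refactoring: an existing class now depends on the new class.
            source = rng.choice(existing)
            successors[source].add(new_node)
    return successors
```

A class created by the refactoring branch starts without dependencies and can later be picked as a GNC target, which corresponds to the reuse described above.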
To us, this is the most likely explanation of why our
algorithm fits real software graphs. It is to be noted that we
have performed experiments with many different models. They
embedded mechanisms corresponding to various assumptions
on what a new feature or a refactoring may be (according to the
common software engineering sense of our own programming
experience). They all led to a poor fit in terms of degree
distribution.
To sum up, the two core operations of GD-GNC can be
explained as: 1) the creation of a new feature by remix;
2) refactoring.
B. Threats to Validity
Let us now discuss the threats to the validity of our ﬁndings.
First, we have optimized our model with respect to the fit to in-
and out-degree distributions. Even if the degree distributions
capture many topological properties of graphs, they are only one
facet of the topology. One threat to the construct validity of our
experiments is that the other important topological properties
of software dependency graphs are completely orthogonal to
the degree distributions.
Second, our experiments are done on a dataset of 10
Java software systems. Although we think that our results
are to some extent independent of the programming language, our
findings may only hold for object-oriented code, for Java software
or, even worse, for our dataset only. For us, a sign of hope is
that the degree distributions of other programming languages
and systems reported in previous work qualitatively
look the same [?].
Third, our evolution model is completely expressed in
abstract graph terms. We have reformulated the algorithm from a
software engineering perspective in Section VI-A. It may be
the case that we have correctly extracted the core topological
phenomena but that, at the same time, we have misinterpreted
their meaning. We look forward to more work in this area and to
discussions with the community, in order to see the emergence of
a consensus on the core software evolution mechanisms.
VII. RELATED WORK
Several authors have proposed models for generating directed
graphs in various domains. Kumar et al. and Bollobás et al.
have proposed models intended to generate graphs resembling
the World Wide Web graph. Grindrod has proposed a model
related to protein identification in bioinformatics. Many other
models are generic, for example those of Erdős & Rényi,
Dorogovtsev et al. and Vazquez. The R-MAT model proposed by
Chakrabarti et al. is also generic, but its generation process
is based on matrix operations. According to our experiments,
those models are not able to reproduce the cumulative degree
distributions of software dependency graphs.
Baxter et al. studied a large number of metrics, including graph
metrics; Louridas et al. studied the "pervasive" presence
of power-law distributions in software dependency graphs
at the class and feature levels for a large range of software
written in various languages. Myers also studied graph
metrics on software. Nevertheless, none of them showed
the common topology across software degree distributions that
we have presented in this paper. Our results on the asymmetry
between the in- and out-degree distributions confirm previous
findings by Myers, Valverde and Solé, and Baxter and Frean.
The tendency of software graphs to follow a growth mechanism
similar to the GNC one has been reported by Valverde and
Solé, but they made no proposal of a concrete generative
model. Baxter and Frean proposed a generative model of
software graphs, based on a preferential attachment which
depends on the node degree distributions. We have shown that
our model better fits real data.
Maddison and Tarlow proposed a generative model of
source code. Our motivations are similar but the considered
software artifacts are completely different (abstract syntax trees
versus dependency graphs).
Other authors have studied other topological characteristics
on different kinds of graphs. Harman et al. focus on
dependence clusters and demonstrate their widespread existence
in software source code; they show a common trait of
software using a different approach. Mitchell and
Mancoridis use clustering techniques to infer an aggregate
view of a software system, but their goal is to understand a
specific system in order to improve debugging and refactoring;
they do not focus on generalities about software.
Furthermore, none of those studies proposes a generative model
of any kind.
In this paper, we have studied the evolution rules of soft-
ware. Motivated by the fact that there is a common topology
across many software dependency graphs, we have devised an
experimental protocol to understand the evolution principles
that result in such a common topology. Our experimental
approach is to encode those rules in a generative model of
software dependency graphs and to compare the topology of
synthesized graphs against real software graphs.
Our new evolution model generates graphs whose degree
distribution tends to be the same as that of real ones. The operations
of the model tell us something about the way software evolves.
According to our experiments, new features are based on the
perpetual remix of existing interacting classes, and refactoring
mostly consists of extracting a reusable class from an existing
class.
Now that we have shown that meaningful generative models
exist, future work can go beyond degree distributions. Graph
motifs are patterns consisting of a small number of nodes
connected to each other in a certain way. Graph motifs
may turn out to be valuable to determine the topological closeness
of generated graphs. We hypothesize that motifs also emerge
from the core evolution rules. Hence, if generated graphs share
similar motifs with real software graphs, there is a good chance
that the core evolution primitives of the model are close to the
real ones.
 G. J. Baxter and M. R. Frean. Software Graphs and Programmer
 P. Bhattacharya, M. Iliofotou, I. Neamtiu, and M. Faloutsos. Graph-
based Analysis and Prediction for Software Evolution. In Proceedings
of the 34th International Conference on Software Engineering, ICSE
’12, pages 419–429. IEEE Press.
 B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed Scale-
free Graphs. In Proceedings of the Fourteenth Annual ACM-SIAM
Symposium on Discrete Algorithms, SODA ’03, pages 132–139. Society
for Industrial and Applied Mathematics.
 I. T. Bowman, R. C. Holt, and N. V. Brewster. Linux As a Case
Study: Its Extracted Software Architecture. In Proceedings of the 21st
International Conference on Software Engineering, ICSE ’99, pages
 D. Chakrabarti and C. Faloutsos. Graph Mining: Laws, Generators, and
 G. Concas, M. Marchesi, S. Pinna, and N. Serra. On the Suitability
of Yule Process to Stochastically Model Some Properties of Object-
oriented Systems. 370(2):817–831.
 S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Structure of
Growing Networks with Preferential Linking. 85(21):4633–4636.
 P. Erdős and A. Rényi. On Random Graphs. 6:290–297.
 J. D. Gibbons and S. Chakraborti. Nonparametric Statistical Inference,
Fourth Edition: Revised and Expanded. CRC Press.
 P. Grindrod. Range-dependent Random Graphs and Their Application
to Modeling Large Small-world Proteome Datasets. 66(6):066702.
 M. Harman, D. Binkley, K. Gallagher, N. Gold, and J. Krinke. Depen-
dence Clusters in Source Code. 32(1):1:1–1:33.
 J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S.
Tomkins. The Web as a Graph: Measurements, Models, and Methods.
In T. Asano, H. Imai, D. T. Lee, S.-i. Nakano, and T. Tokuyama,
editors, Computing and Combinatorics, number 1627 in Lecture Notes
in Computer Science, pages 1–17. Springer Berlin Heidelberg.
 P. L. Krapivsky and S. Redner. Network Growth by Copying.
 R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins,
and E. Upfal. Stochastic Models for the Web Graph. In 41st Annual
Symposium on Foundations of Computer Science, 2000. Proceedings,
 P. Louridas, D. Spinellis, and V. Vlachos. Power Laws in Software.
 C. J. Maddison and D. Tarlow. Structured Generative Models of Natural
 R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and
U. Alon. Network Motifs: Simple Building Blocks of Complex Net-
 B. Mitchell and S. Mancoridis. On the Automatic Modularization of
Software Systems Using the Bunch Tool. 32(3):193–208.
 C. R. Myers. Software Systems as Complex Networks: Structure, func-
tion, and evolvability of software collaboration graphs. 68(4):046116.
 M. Newman. Networks: An Introduction. Oxford University Press.
 M. Newman. The Structure and Function of Complex Networks.
 S. Valverde and R. V. Solé. Logarithmic Growth Dynamics in Software
 A. Vazquez. Knowing a Network by Walking on it: Emergence of
 D. J. Watts and S. H. Strogatz. Collective Dynamics of ‘Small-World’