A Generative Model of Software Dependency
Graphs to Better Understand Software Evolution
Vincenzo Musco, Martin Monperrus, Philippe Preux
University of Lille, LIFL and INRIA
Email: vincenzo.musco@inria.fr, martin.monperrus@univ-lille1.fr, philippe.preux@univ-lille3.fr
Abstract—Software systems are composed of many interacting
elements. A natural way to abstract over software systems is
to model them as graphs. In this paper we consider software
dependency graphs of object-oriented software and we study
one topological property: the degree distribution. Based on the
analysis of ten software systems written in Java, we show that
there exist completely different systems that have the same
degree distribution. Then, we propose a generative model of
software dependency graphs which synthesizes graphs whose
degree distribution is close to the empirical ones observed in
real software systems. This model gives us novel insights on the
potential fundamental rules of software evolution.
I. INTRODUCTION
Software systems are composed of many elements interacting
with each other. For instance there are hundreds of
thousands of interconnected functions in a Linux kernel [4].
A natural way to abstract over software systems is to model
them as graphs [1], [2], [6], [15], [22].
In this paper we consider software dependency graphs of
object-oriented software, where each node represents a class
and each edge corresponds to a compilation dependency.
We study the topology of those graphs and we specifically
concentrate on their degree distributions (the distribution of the
number of edges connected to a node). Based on the analysis
of ten software systems written in Java, we show that there
exist software systems that are completely different yet have
the same degree distribution.
This is a surprising result: despite being developed by
different persons with different processes in different domains,
a common degree distribution emerges. This leads us to
hypothesize that there are common rules guiding the construction
of software systems over time.
In network science, a generative model defines a set of rules
used to synthesize artificial networks in a given domain. For
example, there exist generative models that aim at producing
graphs that are similar to the World Wide Web [12]. One
equivalent model in software engineering would be a model
generating graphs which look like real software graphs. If such
a generative model exists, it may encode common evolution
rules driving the emergence of the graph structure of software
systems.
In this paper, our goal is to propose a generative model of
software dependency graphs. If this model fits the empirical
data, it means that it approximates certain fundamental rules
of software evolution.
Our experimental methodology is as follows. We propose
a generative model of software graphs, and then we evaluate
its capacity to create graphs whose degree distribution is close
to the empirical ones observed in real software systems. We
compare its fit-to-data to the only comparable model of the
literature [1]. Our experimental results show that our genera-
tive model of software dependency graphs, GD-GNC, both fits
the empirical data and outperforms the model proposed in [1].
To put it shortly, our generative model is the first to produce
software dependency graphs that look like real ones.
To sum up, our contributions are:
• empirical evidence of the common asymmetric topology of
dependency graphs in object-oriented software systems;
• a generative model of software dependency graphs;
• the validation of the model on its ability to fit ten graphs
of real software systems totaling 10619 nodes and 52855
edges;
• a speculative explanation of the fundamental evolution
rules of software.
The rest of this paper is structured as follows. Section II
defines the main concepts used in this paper. Section III presents
our goals and experimental methodology. In Section IV, we
look closer at real software dependency graphs and highlight
their common topology. In Section V, we introduce a new
generative model for software dependency graphs and we
analyze its fitness. In Section VI, we discuss our discoveries
from a software engineering perspective.
II. DEFINITIONS
In this section, we provide background knowledge about the
concepts used in this paper.
A. Graphs
A graph is a mathematical object for modeling connections
among concepts or entities of a specific domain: entities are
represented by nodes and the links between them are edges.
For example, network peripherals such as routers or
switches and their connections can be modeled using graphs:
each node is a peripheral and each edge represents a physical
wire between them, see Fig. 1(a).
arXiv:1410.7921v1 [cs.SE] 29 Oct 2014
In some situations, links between entities are one-way only,
i.e., an edge is meaningful from one node to another one, but
not the opposite. Thus, edges need to be directed in order to
accurately model a specific domain. In this case, we talk about
directed graphs or digraphs. The food web is an example of a
digraph where each species is a node and each edge between
two species indicates a "feeding on" relation. In this kind of
graph, the direction of the edge is important as it indicates
what eats what, which cannot be represented in an undirected
graph. Fig. 1(b) exemplifies a set of prey-predator relations.
Fig. 1: Basic examples of two graphs. (a) An undirected graph
illustrating a computer network on which an edge represents a
physical connection between two entities. (b) A directed graph
representing a trophic network on which a node is a species
and an edge represents which one is eaten by which one. Here
the directed edge is needed as a predator eats a prey but not the
opposite (e.g. snakes feed on mice, but the opposite is false).
B. Degrees and Distributions
The number of edges connected to a specific node is named
the degree of this node. On digraphs, two degrees are defined:
the in-degree and the out-degree which are respectively the
number of in-coming edges and the number of out-going
edges.
The degree distribution is a function representing the frequency
of node degrees in a graph (i.e. the number of times
a specific degree value is encountered in the graph). The graphs
considered in this paper have, like many other graphs
materializing real-world concepts, noisy and right-skewed
distributions (i.e. asymmetric distributions with many values
larger than the mean located in the right tail of the distribution).
In order to ease the study of such distributions, cumulative
distributions are generally preferred [21].
The cumulative degree distribution gives the proportion of
nodes whose degree is smaller than or equal to a given value.
Symmetrically, the inverse cumulative degree distribution gives
the maximal degree of a given proportion of nodes. Both
the cumulative degree distribution and the inverse cumulative
degree distribution are monotonic: the former is an increasing
function, the latter a decreasing one. We always consider
inverse cumulative distributions when studying distributions
in this paper.
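As an illustration, the inverse cumulative degree distribution can be computed from a list of node degrees along the following lines (a minimal sketch; the function name is ours, and the paper's own metrics are computed with the NetworkX library):

```python
from collections import Counter

def inverse_cumulative_distribution(degrees):
    """For each observed degree d, return the proportion of nodes
    whose degree is greater than or equal to d."""
    n = len(degrees)
    counts = Counter(degrees)
    dist = {}
    running = 0
    # sweep degrees from largest to smallest so the running total
    # counts every node with degree >= d
    for d in sorted(counts, reverse=True):
        running += counts[d]
        dist[d] = running / n
    return dist

# toy example: six nodes with degrees 1, 1, 2, 3, 3, 5
dist = inverse_cumulative_distribution([1, 1, 2, 3, 3, 5])
# dist[1] == 1.0 (every node has degree >= 1), and the
# proportions decrease as the degree grows
```

The resulting mapping is what is plotted, on log-log axes, throughout the paper.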
Node degrees and their distributions are basic properties
which are directly influenced by many properties of a graph:
changing the graph topology, and thereby some graph
properties (diameter, motifs, graph spectrum, etc.), will impact
the degree distribution. That is the reason why, in this
paper, we consider them as a recognized proxy for the graph
topology (i.e. the manner in which edges connect nodes to
each other) [20, sec.8.3].
C. Software Dependency Graphs
A large variety of graphs can be derived from software,
each one focusing on particular characteristics. Hence, nodes
and edges can have various meanings. An example of software
graph is the dependency graph in which nodes are modules
(e.g. packages, classes, etc. depending on the chosen granu-
larity) and edges are added when an element accesses another
one (e.g. function call, inheritance, field access, etc.).
Dependency graphs are directed graphs as dependencies are
oriented: indeed, a dependency shows which element depends
on another one, but the opposite is not necessarily true. As
an example, a Person class depends on a File in order to
persist, but the class File should not depend at all on the
Person class.
Nodes composing a dependency graph can be of two
different types. First, there are application nodes (a.k.a. app
nodes) that are nodes which belong to the core software itself.
Second, there are library nodes (a.k.a. lib nodes), which
are nodes belonging to an external library (i.e. a module which
is used by the core software; an example of a library in Java
is the java.util package, which contains collection-related
classes).
Consequently, there are two types of edges in a software
dependency graph: (i) application nodes to application nodes (i.e.
app-app edges, a.k.a. endo-dependencies), which express that
a core element depends on another one; (ii) application nodes
to library nodes (i.e. app-lib edges, a.k.a. exo-dependencies),
which express that a core element depends on an external
element. Fig. 2 illustrates these notions.
Fig. 2: Example of considered nodes and edges if we consider
only endo-dependencies (left) or exo-dependencies (right).
Whether or not exo-dependencies are considered has an impact
on the observed degree distributions. Fig. 3 shows a line chart
plotting the inverse cumulative distributions for endo- (App-
App), exo- (App-Lib) and both dependencies for ant 1.9.2.
Two distributions have to be considered as those graphs are
directed: one for in-degree (3a) and one for out-degree (3b).
Plots are on a logarithmic scale. We see that endo- and exo-
dependencies have close yet different distributions. For in-
degree, the slope is different. If we consider only app-lib
dependencies, i.e., if we exclude app-app links (straight thin line),
the number of zero in-degree nodes strongly increases because
lib nodes never connect to an app node.

Fig. 3: Node in- and out-degree distributions for the software
package "ant". Three kinds of dependency are considered: app-
app, app-lib and both. Panel (a) shows in-degrees and panel (b)
shows out-degrees. X-axes are degrees and Y-axes are cumulative
degree frequencies; both axes are on a logarithmic scale. It is
important to differentiate endo- and exo-dependencies because
their topologies are different.
on endo-dependencies and all the data we present excludes
exo-dependencies.
D. Generative Models
A generative model for graphs is an algorithm that generates
artificial graphs. A generative model takes a set of parameters
as input (such as the number of nodes and parameters that
influence the degree distribution). A generative model may
be deterministic or stochastic. Given a set of parameters, a
deterministic model always generates the exact same graph,
whereas a stochastic model generates a new graph each time
it is run.
In our case, we generate software dependency graphs and
we intend these graphs to look like empirical software depen-
dency graphs. We consider two types of graphs: those resulting
from an analysis of software systems, and those created by a
model. The former are qualified as “empirical” or “true”, the
latter are qualified as “synthetic” or “artificial”.
III. EXPERIMENTAL DESIGN
We now present the experimental protocol we use to analyze
software dependency graphs.
A. Goals
In this paper, we have two goals: 1) we want to study
the topology of software graphs; 2) we aim at inventing a
generative model of software graphs that fits real data.
First, to study recurring topologies, we compare the degree
distributions obtained from our dataset. For instance, we may
observe that our set of empirical graphs can be partitioned into
a set of prototypical topologies, each one being shared by a
subset of software systems.
Second, we want to invent a model that generates software
dependency graphs that are similar to empirical graphs. We
optimize its parameters so that the degree distribution of the
generated graphs is as close as possible to those of empirical
graphs. Such a model would implicitly capture certain software
evolution rules (such rules are presented in Section VI-A).
B. Dataset
We define a dataset¹ of 10 Java software applications, listed
in Table I. This table contains the software name, the considered
version of the software, the year of its first release and the
year of the considered version, and the number of nodes and
the number of edges contained in the extracted graphs. The
considered software applications are between 10 and 14 years
old. The last column is the connectance γ for each software;
the connectance is computed using formula (1), with |N| the
total number of nodes contained in the graph and |E| the total
number of edges. In words, the connectance expresses the
proportion of pairs of nodes that are connected in the graph.
TABLE I: The 10 Java programs considered in our experi-
ments: their version, their first and current release year, the
number of nodes and edges and the connectance γ.
Software Version Years Nodes Edges γ
ant 1.9.2 2000-2013 1252 5763 0.004
jfreechart 1.0.16 2000-2013 858 4783 0.006
jftp 1.57 2002-2013 173 736 0.025
jtds 1.3.1 2001-2013 90 328 0.040
maven 3.1.1 2002-2013 1515 6933 0.003
hsqldb 2.3.1 2001-2013 602 4976 0.014
log4j 2.0b9 2004-2013 895 4136 0.005
squirrelsql 3.5.0 2001-2013 2288 10141 0.002
argouml 0.34 2002-2011 2664 13445 0.002
mvnforum 1.3 2003-2010 282 1614 0.020
γ = |E| / |N|²   (1)
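As a sanity check, the connectance values reported in Table I can be recomputed directly from formula (1); for instance (function name ours):

```python
def connectance(num_nodes, num_edges):
    """Connectance: gamma = |E| / |N|^2, the proportion of
    (ordered) node pairs that are connected by an edge."""
    return num_edges / num_nodes ** 2

# ant 1.9.2 from Table I: 1252 nodes, 5763 edges
gamma_ant = connectance(1252, 5763)   # ~0.0037, i.e. 0.004 rounded
# jtds 1.3.1: 90 nodes, 328 edges
gamma_jtds = connectance(90, 328)     # ~0.040
```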
In order to have a diversified set of software, we
chose software applications which differ in size (number
of nodes/edges). The graph size ranges from 90 to 2664
nodes and from 328 to 13445 edges, with a connectance γ
value ranging from 0.002 to 0.04. As these software systems
are all developed by different teams, we reduce the risk of
obtaining results biased towards specific development processes.
Since our graph extraction tool chain handles Java software, we
consider software that is open-source and written in Java.

¹ The dataset can be downloaded at http://www.vmusco.com//pages/dataset generative model.html
C. Methodology
We now present the key points of our methodology.
1) Dependency Graph Extraction: We now present our
method to extract dependency graphs from our dataset. We
focus on the class granularity (i.e. one node represents one
class), as this is the most important modularity unit in object-
oriented software. Also, we only consider endo-dependencies,
that is, edges connecting internal nodes of the project to
each other, and not those connecting to external libraries (cf.
Section II-C), because we aim at understanding the topology
of core software graphs, regardless of the number of libraries
that are included and the frequency of the library usage.
The graph extraction phase produces a set of nodes and
edges forming a graph which can then be analyzed. This
extraction phase is done using Dependency Finder². This tool
takes as input Java byte code and outputs all the dependencies
that are found. Graph metrics are computed using the
NetworkX³ library.
2) Degree Distribution Comparison: In this paper, our main
analysis tool is the comparison of degree distributions. The
degree distribution is a numerical property that captures certain
topological properties. For instance, the edge density and the
clustering coefficient are directly correlated to it. To compare
two degree distributions, we use the Kolmogorov-Smirnov
statistic K (also called distance), which measures a distance
between two distributions. The statistic is given in Formula (2),
where sup denotes the supremum of a set, F1 and F2 are the two
distributions to compare and x ranges over degree values.

K_{F1,F2} = sup_x |F1(x) − F2(x)|   (2)
K is a numerical value that indicates how close two
distributions are: the lower K, the closer the distributions. For
one experiment presented in this paper, we perform a Mann-
Whitney U test [9] on K to compare two generative models.
Also, we use the K statistic in the context of the
Kolmogorov-Smirnov test. This statistical test checks whether
a sample follows a reference probability distribution (one-
sample test) or another sample distribution (two-sample test)
[9].
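The two-sample statistic can be computed directly from the empirical cumulative distribution functions; the following is a minimal sketch of ours (in practice, library routines such as scipy.stats.ks_2samp compute the same quantity):

```python
def ks_distance(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov statistic:
    K = sup over x of |F1(x) - F2(x)|, where F1 and F2 are the
    empirical cumulative distribution functions of the samples."""
    xs = sorted(set(sample1) | set(sample2))
    n1, n2 = len(sample1), len(sample2)
    k = 0.0
    for x in xs:
        f1 = sum(1 for v in sample1 if v <= x) / n1
        f2 = sum(1 for v in sample2 if v <= x) / n2
        k = max(k, abs(f1 - f2))
    return k

# identical samples are at distance 0; the lower K, the closer
assert ks_distance([1, 2, 3], [1, 2, 3]) == 0.0
```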
3) Evaluation of Generative Models: To evaluate whether
a generative model is good, we use the Kolmogorov-Smirnov
statistic to compare artificial degree distributions with real ones
(recall that the degree distribution is a proxy for the graph
topology, see Section II-B). If the degree distributions of the
generated dependency graphs match the empirical ones,
the basic operations of the generative model are candidates to
be the fundamental evolution rules we are looking for.
² http://depfind.sourceforge.net/
³ http://networkx.lanl.gov/
4) Fine Tuning of Generative Models: Generative models
frequently require parameters. For instance, the small-world
model [24] requires one probability parameter. In order to
determine the best parameter values needed to generate graphs
as close as possible to real ones (according to their degree
distribution), we perform a grid search and generate graphs
for each point of the grid. This is done as follows: we iterate
over the range of each parameter. For instance, if a probability
parameter ranges from 0.0 to 1.0, we evaluate the model
ten times, for 0.1, 0.2, . . . , 1.0. For stochastic models, the
evaluation at each grid point is repeated 10 times and the
median value is selected as the final result.
Then, we use the K statistic to assess the distance between
the true graph and the generated one. The graph with the
smallest statistic value is the one that most closely resembles
real data.
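The grid search described above can be sketched as follows (the model and evaluate callables are placeholders of ours for a generative model and a K-statistic computation):

```python
import statistics

def grid_search(model, evaluate, grid_p, grid_q, repeats=10):
    """Exhaustive grid search over two parameters of a stochastic
    generative model. `model(p, q)` generates a graph and
    `evaluate(graph)` returns its K distance to the real graph;
    the median over `repeats` runs scores each grid point."""
    best = None
    for p in grid_p:
        for q in grid_q:
            scores = [evaluate(model(p, q)) for _ in range(repeats)]
            median = statistics.median(scores)
            if best is None or median < best[0]:
                best = (median, p, q)
    return best  # (median K, best p, best q)

# usage with placeholder callables: the "graph" is just the pair
# (p, q), and the score is its distance to a fictitious optimum
grid = [i / 10 for i in range(1, 11)]
best = grid_search(lambda p, q: (p, q),
                   lambda g: abs(g[0] - 0.3) + abs(g[1] - 0.7),
                   grid, grid)
```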
IV. STUDY OF SOFTWARE GRAPH TOPOLOGY
We want to determine whether there exist topologies shared
by different software applications. As the production of those
software packages is influenced by different factors (teams,
development techniques, ...), it is a priori expected that there
is no common topology. On the contrary, finding common
topologies would be an interesting fact: it would mean that
there exist common software development rules. Hence, our
first research question reads:
Research Question 1 Are there common topologies of soft-
ware dependency graphs at the class granularity?
Fig. 4: Inverse cumulative in-degree (a) and out-degree (b)
distributions for all 10 software applications of our dataset
(axes on a logarithmic scale). The in- and out- distributions
are different.
A. Protocol
To answer this question, we first look at the graphical
resemblance between the cumulative degree distributions of
the set of applications we consider. The software compilation
graphs we consider are directed (see Section II-C), so we have
to consider two distributions: in-degree and out-degree. Then,
we determine the significance of our graphical observations,
using the two-sample Kolmogorov-Smirnov test.
B. Results
Figure 4 shows the plots of the inverse cumulative in-
degree (4a) and out-degree (4b) distributions on a log-log
scale for our dataset. There is a different curve and color for
each considered software application. We plot the inverse
cumulative frequency against degrees, i.e. the number of nodes
in the graph which have a degree greater than or equal to the
degree x.
We observe: (i) the positions and shapes of the curves are
similar, which indicates there are common topologies across
software; (ii) the in- and out-degree distributions are not the
same: the in-degree distribution is a straight line but the
out-degree distribution is more curved; (iii) each plot of the
in-degree distribution is an almost straight line on a log-log
scale, which means that the in-degree distribution generally
follows a power law⁴; (iv) the out-degree distribution is not a
straight line: out-degrees do not follow a power law. Observations
(ii) and (iii) have already been made [22].
C. Statistical Significance
In order to assess our observations in a statistical manner,
we now set our null and alternate hypotheses:
H0: Samples from the software in-degree distributions
(resp. out) are drawn from the same distribution.
H1: Samples from the software in-degree distributions
(resp. out) are not drawn from the same distribution.
Using the two-sample Kolmogorov-Smirnov test on each pair
of software in the dataset, we can determine statistically the
common topology across software in our dataset. The test
returns a p-value which is used to reject H0 or not. If H0 is
not rejected, we gain confidence about the common topology
of those two software systems. On the other hand, if H0 is
rejected, the test outcome cannot be used to conclude about
the common topology (which does not necessarily mean the
samples are not drawn from the same distribution).
Table II gives the results of running 90 two-sided Kolmogorov-
Smirnov tests with a confidence level α of 0.01⁵ (one for each
pair of software in our dataset). The rows give respectively the
results for the in-, out- and both distributions. The second and
third columns present the number and the ratio of tests for
which the two-sided Kolmogorov-Smirnov test rejected H0.
The fourth and fifth columns present the opposite (i.e. the test
did not reject H0).

⁴ Plotting a straight line on a log-log scale is the standard way to visually
identify power laws in data [5].
⁵ We need to test each pair of software, hence C(10,2) = 45 tests, which is
doubled since we test both in-degree and out-degree.
TABLE II: Number of times the H0 hypothesis is rejected
(or not) for the in-, out- and both cumulative degree distributions,
according to the two-sided Kolmogorov-Smirnov test with a
confidence level α of 0.01.

        H0 Rejected       H0 Not rejected
        Count   Ratio     Count   Ratio
In      19/45   42%       26/45   58%
Out      9/45   20%       36/45   80%
Total   28/90   31%       62/90   69%
As we can see, for 69% of the tested pairs the common
distribution hypothesis cannot be rejected. However, this
does not necessarily imply that there is a unique distribution
shared by all those software systems. On the other hand,
for the remaining 31% of tested pairs, which rejected H0,
no conclusion can be drawn at this level of confidence.
D. Summary
To sum up, we reply positively to our first research question:
our experiment indicates that, according to the degree distri-
bution, there are common shapes across software dependency
graphs. We consider this an emergent phenomenon, and we
hypothesize that there is a common evolution process which
eventually yields those common degree distributions.
V. A GENERATIVE MODEL FOR SOFTWARE DEPENDENCY GRAPHS
In this section, we present a new generative model of
software dependency graphs. This stochastic model generates
an arbitrary number of artificial dependency graphs. It is
parametrized by three values: the expected number of nodes
and two probabilistic parameters.
A. Generative Models of the Literature
We discuss here three related generative models of graphs:
Erdős & Rényi's [8] is a prototypical one; the relation between
GNC and software graphs has been observed once [13]; Baxter
and Frean's model [1] is the only one explicitly targeting the
generation of software graphs.

The Erdős & Rényi model, proposed in 1959, is one of the
oldest generative models [8]. This model connects pairs of
nodes according to a fixed probability p. The connectance of
the resulting graph is hence γ = p. We will later use this model
as a point of comparison.
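A directed Erdős–Rényi generator can be sketched in a few lines (our own illustration; the seed and naming are arbitrary):

```python
import random

def erdos_renyi_digraph(n, p, seed=0):
    """Directed Erdos-Renyi graph: each ordered pair of distinct
    nodes is connected with fixed probability p."""
    rng = random.Random(seed)
    nodes = set(range(n))
    edges = {(i, j) for i in range(n) for j in range(n)
             if i != j and rng.random() < p}
    return nodes, edges

nodes, edges = erdos_renyi_digraph(100, 0.05)
# the connectance |E| / |N|^2 of the result is close to p
```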
In 2005, Krapivsky and Redner proposed the GNC model
(GNC stands for "Growing Network model with Copying")
[13]. This model requires one parameter: the number of nodes
of the resulting graph. The GNC model is an iterative algorithm
in which, at each iteration, a new node is added to the
graph and connected at random to a set of already existing
nodes as follows: an existing node is selected according to a
uniform distribution and directed edges are created from the
new node to this node along with all its successors. We call
this the "GNC-Attach process"; it is illustrated in Figure 5.
Fig. 5: Illustration of GNC-Attach, the GNC primitive opera-
tion. The grey node is a new node added to the graph using the
GNC primitive. The central node is randomly selected and a
directed edge is added from the new node to it (dashed edge).
Then, a directed edge is also added from the new node to each
destination node to which the randomly selected node is already
connected as a source (dotted edges).
Algorithm 1 shows the core primitive for attaching nodes using
GNC. The full GNC model executes this function n times
to create a graph with n nodes. The striking fact about this
generative model is its high ability to fit the in-degree
distributions of software graphs, as observed by Valverde and
Solé [22].
Algorithm 1: GNC-Attach Algorithm
Input: n_i the current node being inserted; G_{N,E} the
digraph to which we add the node (composed of
two sets: nodes (N) and directed edges (E))
Function GNC_Attach(G_{N,E}, n_i) is
    Randomly select a node n_j in the graph, different from n_i
    Add an edge from n_i to n_j
    for all edges (n_j, n_d) in the out-edge set of n_j do
        Add an edge from n_i to n_d
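A runnable sketch of the GNC-Attach primitive and the full GNC model (our own Python rendering of Algorithm 1; the graph representation as a dictionary of out-edge sets is our choice):

```python
import random

def gnc_attach(out_edges, ni, rng):
    """GNC primitive: connect the new node ni to a uniformly
    chosen existing node nj and to all of nj's successors."""
    nj = rng.choice([n for n in out_edges if n != ni])
    out_edges[ni].add(nj)
    # copy edges: ni also points to every successor of nj
    out_edges[ni] |= out_edges[nj]

def gnc(n, seed=0):
    """Full GNC model: start from a single node and apply the
    attach primitive for each of the n - 1 remaining nodes."""
    rng = random.Random(seed)
    out_edges = {0: set()}
    for ni in range(1, n):
        out_edges[ni] = set()
        gnc_attach(out_edges, ni, rng)
    return out_edges
```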
In 2008, Baxter and Frean [1] proposed a generative model
of dependency graphs. This model has an explicit, hard-coded
preferential attachment based on the out-degree of nodes. Its
logic is based on edge creation/transfer between nodes of the
graph. We consider this model as our baseline: first, this model
is also intended to generate software graphs, and second, it has
acceptable fits on the in- and out-degree distributions.
Many other generative models of directed graphs have been
considered (cf. Section VII), in many application domains (e.g.
WWW, proteins, ...). We have explored whether they generate
likely dependency graphs, and, as expected, they do not.
B. Generalized Double GNC (GD-GNC)
We now present our generative model of software dependency
graphs, called "Generalized Double GNC" (GD-GNC for
short). It is a generalization of the GNC model and is based
on the GNC-Attach primitive.
Our model consists of a main loop in which, at each loop
iteration: (i) a unique node n_i is added to the graph; (ii) an
existing node n_j is drawn uniformly at random.
The process of creating edges is as follows: (i) with probability
p, n_i is connected to n_j in the same way as in the GNC-
Attach algorithm (i.e. a directed edge is created from n_i to
n_j but also from n_i to each node to which n_j is connected),
and with probability q we repeat this GNC-based attachment
twice (if the random node to attach is twice the same, the
second one is ignored); (ii) with probability 1−p, n_j is
connected to n_i (which we refer to as the attachment alternative).
A pseudo-code is shown in Algorithm 2. GD-GNC is a
generalization of GNC: the GNC algorithm is a special case
where p = 1 and q = 0.
We note that this model never modifies existing edges: at
each loop iteration, it only adds a single node and a set of
edges. No explicit preferential attachment is hard-coded in the
algorithm, but an implicit one is still present: in our model,
attaching using the GNC algorithm implies that the new node
connects not only to a node, but also to all children of this
node. As a consequence, the higher the in-degree of a node,
the higher its probability of being attached to. So, nodes
with a high in-degree are more likely to be pointed to
by new nodes, and their in-degree increases accordingly.
Two parameters are required by our model and influence
the generation. The first one, p, determines whether the node
must be added using GNC or the attachment alternative. As
this probability changes, the number of nodes without outgoing
edges varies. The second one, q, determines whether the GNC
algorithm should be executed once or twice for the inserted
node. Increasing the number of GNC executions for a node
impacts the inverse cumulative degree distributions. Regarding
the in-degree, the coefficient of the power law (i.e. the line
slope) is affected: the line decreases more slowly when the
number of GNC iterations increases (higher q). Regarding the
out-degree, the convexity of the distribution increases as q
increases.
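Putting the two probabilistic parameters together, the whole model can be sketched as follows (a runnable approximation of GD-GNC; the representation and naming are ours):

```python
import random

def gd_gnc(n, p, q, seed=0):
    """GD-GNC sketch: at each step add a node ni; with probability
    p attach it GNC-style to a random existing node nj (twice with
    probability q), otherwise add the edge nj -> ni."""
    rng = random.Random(seed)
    out_edges = {0: set()}
    for ni in range(1, n):
        out_edges[ni] = set()
        if rng.random() < p:
            repeats = 2 if rng.random() < q else 1
            for _ in range(repeats):
                nj = rng.randrange(ni)  # uniform over existing nodes
                # GNC attachment: point to nj and to all its
                # successors; the set silently ignores a duplicated
                # choice of nj
                out_edges[ni].add(nj)
                out_edges[ni] |= out_edges[nj]
        else:
            # attachment alternative: the existing node points to ni
            nj = rng.randrange(ni)
            out_edges[nj].add(ni)
    return out_edges

# p = 1, q = 0 degenerates to the plain GNC model
graph = gd_gnc(100, p=1.0, q=0.0)
```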
C. Evaluation of GD-GNC
We now want to determine whether the Generalized Double
GNC can generate graphs that are more realistic than the
graphs generated with Baxter & Frean’s model. We formulate
this research question as:

Research Question 2 Do class dependency graphs generated
using GD-GNC better fit real software data than Baxter &
Frean's model (according to the cumulative degree distribution)?
1) Protocol: To answer this question, we first run a parameter
optimization (as presented in Section III-C4) for each
model (GD-GNC and Baxter & Frean's) on all programs of our
dataset (see Table I). Then, we generate 30 synthetic graphs
with each model, using the best parameters found for each
(model, program) pair. Finally, we compute the inverse
cumulative degree distribution of each graph and calculate the
median fitness value according to the δ value defined by
Equation (3).
In addition, we also compare against the Erdős & Rényi
model, a purely random and simple model. With the Erdős &
Rényi model, generating graphs with the same number of nodes
and edges as real software graphs requires no parameter
optimization: we can simply use the connectance of the real
graph.
Algorithm 2: Iterative algorithm for the "Generalized Double GNC" generative model
Input: N the number of iterations to execute/nodes to add, p the probability to do a GNC and q the probability to do a
double GNC.
Output: A digraph G_{N,E} which is composed of two sets: nodes (N) and directed edges (E)
begin
    while |N| < N do
        Create a node n_i and add it to the graph
        with probability p do
            GNC_Attach(G_{N,E}, n_i)
            with probability q do
                GNC_Attach(G_{N,E}, n_i)
        otherwise
            Randomly select a node n_j in the graph, different from n_i
            Add an edge from n_j to n_i

2) Results: Figure 6 shows the in- (6a) and the out- (6b)
inverse cumulative degree distributions of graphs generated
using the different models. Each small plot represents a different
software application; the meaning of the axes is the same
for all plots: x-axes are degrees and y-axes are the inverse
cumulative degree frequencies. Both axes are on a logarithmic
scale. The thick continuous line corresponds to the distribution
of real data, the thin continuous line is for graphs generated
using the GD-GNC model and the dotted line is for graphs
generated using Baxter & Frean's.
Graphically, we observe that the GD-GNC in- and out-degree
distributions are almost always better than Baxter & Frean's.
In other words, the GD-GNC algorithm generally produces
synthetic software graphs whose inverse cumulative in- and
out-degree distributions fit those of real software dependency
graphs better than Baxter & Frean's does.
3) Statistical Significance: To determine statistically which
model generates the graphs closest to the true one, we compare
the Kolmogorov-Smirnov statistic or distance K (as presented
in Section III-C3) for the in- and out- cumulative degree
distributions between the generated graph G and the real graph
R. For this purpose we define the δ function, shown in
Equation (3), which is the maximum of the two Kolmogorov-
Smirnov distances: first the distance between the in-cumulative
degree distribution of the generated graph G_in and the real
one R_in, and second between the out-cumulative degree
distribution of the generated graph G_out and the real one R_out.

δ_{R,G} = max(K_{R_in,G_in}, K_{R_out,G_out})   (3)
Indeed, this function is required as we must consider both
in- and out-distances at the same time, since the two distributions
are intimately related to each other: considering only the in- or
out-distribution would be meaningless, as a good in-distribution
fit does not necessarily imply a good out-distribution fit and vice
versa.
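As an illustration, the Kolmogorov-Smirnov distance and the δ function of Equation (3) can be computed from in- and out-degree samples as follows (a minimal Python sketch; function names are ours):

```python
def ks_distance(sample_a, sample_b):
    """Kolmogorov-Smirnov statistic between two empirical degree
    distributions: the largest gap between their cumulative shares."""
    points = sorted(set(sample_a) | set(sample_b))
    na, nb = len(sample_a), len(sample_b)
    return max(abs(sum(1 for x in sample_a if x <= d) / na
                   - sum(1 for x in sample_b if x <= d) / nb)
               for d in points)

def delta(real_in, real_out, gen_in, gen_out):
    """Equation (3): the worse of the in- and out-degree KS distances
    between the real graph R and the generated graph G."""
    return max(ks_distance(real_in, gen_in),
               ks_distance(real_out, gen_out))
```

A δ of 0 means both marginal distributions match exactly; a δ of 1 means at least one of them is maximally far from its real counterpart.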
As we want graphs which are similar to real ones according
to their degree distributions, and as the δ value represents the
largest distance between a pair of distributions, considering both
in- and out-degree distributions, the model which
produces the smallest δ values is the best. To statistically assess
whether the δ values obtained for the two models are drawn from
different distributions, we use the Mann-Whitney U test [9]. This
test allows us to reject, or not, the null hypothesis:
H0: the δ values obtained for GD-GNC and the ones
obtained from the Baxter & Frean model belong to an identical
population.
H1: the δ values obtained for GD-GNC and the ones
obtained from the Baxter & Frean model belong to different
populations.
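For illustration, the Mann-Whitney U statistic and a normal-approximation p-value can be sketched as follows (pure Python, one-sided, without tie correction; a library routine such as scipy.stats.mannwhitneyu would normally be used instead; names are ours):

```python
import math

def mann_whitney_u(xs, ys):
    """One-sided Mann-Whitney U test under the normal approximation:
    small p-values support H1 that xs tend to be smaller than ys."""
    nx, ny = len(xs), len(ys)
    # U counts pairs (x, y) with x < y; ties contribute 1/2.
    u = sum(1.0 if x < y else 0.5 if x == y else 0.0
            for x in xs for y in ys)
    mean = nx * ny / 2.0
    sd = math.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    z = (u - mean) / sd
    # p = P(Z >= z) under H0: identical populations.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))
```

With the 30 δ values collected per model for one program, mann_whitney_u(delta_gdgnc, delta_baxter) approximates the p-value reported in the last column of Table III.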
Table III sums up the δ values obtained from
30 generated graphs with each model for each software.
Each row gives the values for one software: the columns report
respectively the name of the software, then the minimal, median and
maximal δ values for the Erdős & Rényi model, for Baxter &
Frean, and finally for GD-GNC. The last column is the p-value
obtained using the Mann-Whitney test between GD-GNC and
Baxter & Frean.
Comparing graphs generated by Erdős & Rényi's model
(columns 2–4) on the one hand to GD-GNC (columns 8–10)
and Baxter & Frean's (columns 5–7) on the other hand, it is
clear that both models generate graphs more similar to real
ones, with regard to their δ values, than Erdős & Rényi's model.
Furthermore, comparing GD-GNC (columns 8–10) and Bax-
ter & Frean (columns 5–7) shows that graphs generated using
the former are almost always closer to the real graph than those
generated using the latter. However, for some topologies
(maven, hsqldb, and log4j if considering only the median
value), Baxter & Frean seems to generate better graphs. The
Mann-Whitney test shows those results are reliable, as the
p-value is lower than 0.05, except for log4j.
To sum up, according to our experiments on the degree dis-
tributions, our GD-GNC model reproduces software
topologies more accurately than Baxter & Frean's.
VI. DISCUSSION
We now put aside technical considerations and discuss the
meaning and validity of our empirical results.
A. GD-GNC from a Software Engineering Perspective
We now have a generative model which fits degree dis-
tributions of empirical software graphs. This model is only
Fig. 6: Plot of the cumulative in- and out-degree distributions for 1) the real software graph (thick solid line); 2) the best
generated match using the Generalized Double GNC model (thin solid line); 3) the best match generated using Baxter & Frean's
model (dotted line). Generalized Double GNC outperforms Baxter & Frean's model.
TABLE III: δ min, median and max values for the GD-GNC, Baxter & Frean and Erdős & Rényi models over 30 random generations
of each model (cf. Section V-C for more information on the δ value). The last column is the p-value determined using the Mann-
Whitney test; it assesses whether the GD-GNC values are smaller than the Baxter & Frean ones in a statistically significant manner.

Program      |  Erdős & Rényi   |  Baxter & Frean  |      GD-GNC      | p-value
             |  Min  Med  Max   |  Min  Med  Max   |  Min  Med  Max   |
ant          | 0.77 0.91 0.97   | 0.32 0.51 0.82   | 0.29 0.45 0.63   | 10^-3
jfreechart   | 0.66 0.81 0.91   | 0.17 0.39 0.58   | 0.13 0.20 0.39   | 10^-8
jftp         | 0.52 0.70 0.82   | 0.22 0.44 0.82   | 0.08 0.27 0.58   | 10^-6
jtds         | 0.47 0.64 0.81   | 0.24 0.40 0.72   | 0.17 0.31 0.37   | 10^-5
maven        | 0.63 0.76 0.81   | 0.09 0.22 0.46   | 0.24 0.46 0.51   | 10^-10
hsqldb       | 0.58 0.71 0.83   | 0.14 0.30 0.46   | 0.51 0.62 0.65   | 10^-11
log4j        | 0.69 0.83 0.92   | 0.15 0.41 0.72   | 0.23 0.40 0.50   | 0.381
squirrelsql  | 0.65 0.85 0.94   | 0.26 0.47 0.78   | 0.19 0.29 0.43   | 10^-9
argouml      | 0.80 0.92 0.97   | 0.27 0.56 0.73   | 0.20 0.38 0.61   | 10^-4
mvnforum     | 0.57 0.69 0.78   | 0.12 0.30 0.53   | 0.19 0.39 0.45   | 10^-3
expressed in terms of primitive graph operations on nodes
and edges, without any specific rules coming from software
engineering. Our initial intuition is that such a model implicitly
captures certain software evolution rules. We now try to
express those rules. In other words, we now speculatively
explain the model from a software engineering perspective.
The GD-GNC model is made up of two basic operations
(the top-level if/then/else of Algorithm 2).
The first basic operation of the model is a node creation
followed by an attachment to existing nodes using a GNC
primitive. To us, it represents the creation of a new class
implementing a new feature. This new feature depends upon
existing classes. Being attached to all the classes that a class
depends on (as the GNC primitive does) means that those
classes already collaborate together. If class X
depends on A, B and C, it means that A, B and C interact
together in a way that is defined by X. When a new node α is
connected to X with the GNC primitive, it is also connected
to A, B and C. In other words, the new class α creates a novel
interaction between A, B and C.
When the GNC primitive is executed twice, it may be
explained by the fact that the new class mixes two existing
groups of classes. In the model, there is never more than
two groups of classes being linked from a new node (a new
feature). According to our experiments, mixing more than two
groups of classes never significantly increases the fit to real
data. One possible explanation is that it is already quite a
hard operation to meaningfully and correctly mix two groups
of classes, and it happens very rarely to remix more than two
groups.
The second basic operation of GD-GNC (the top-level else
branch) is a reverse attachment from an existing node to
a newly created node. For us, it may be explained as a
refactoring, where a piece of logic is extracted from an
existing class in order to ease reuse and to simplify the code.
Once the refactoring is performed, the newly created class is
ready to be reused. This is what can happen in subsequent
iterations of the algorithm with the GNC primitive.
To us, this is the most likely explanation of why our
algorithm fits real software graphs. It is to be noted that we
have performed experiments with many different models. They
embedded mechanisms corresponding to various assumptions
on what a new feature or a refactoring may be (according to the
common software engineering sense of our own programming
experience). They all led to a poor fit in terms of degree
distribution.
To sum up, the two core operations of GD-GNC can be
explained as: 1) the creation of a new feature by remixing
existing classes; 2) refactoring.
B. Threats to Validity
Let us now discuss the threats to the validity of our findings.
First, we have optimized our model with respect to the fit to in-
and out-degree distributions. Even if the degree distributions
capture many topological properties of graphs, it is only one
facet of the topology. One threat to the construct validity of our
experiments is that the other important topological properties
of software dependency graphs are completely orthogonal to
degree distributions.
Second, our experiments are done on a dataset of 10
Java software systems. Although we think that our results
are somehow independent of the programming language, our
findings may only hold for object-oriented code, Java software
or even worse, to our dataset only. For us, a sign of hope is
that the degree distributions on other programming languages
and systems that are reported in previous work qualitatively
look the same [?].
Third, our evolution model is completely expressed in
abstract graph terms. We have reformulated the algorithm in a
software engineering perspective in Section VI-A. It may be
the case that we have correctly extracted the core topological
phenomena but that, at the same time, we have misinterpreted
their meaning. We look forward to more work in this area, to
discuss with the community in order to see the emergence of
a consensus on the core software evolution mechanisms.
VII. RELATED WORK
Several authors have proposed models for generating di-
rected graphs in various domains. Kumar et al. [14] and
Bollobás et al. [3] have proposed models intended to generate
graphs looking like the World Wide Web graph. Grindrod
[10] has proposed a model related to protein identification in
bioinformatics. Many other models are generic: Erdős & Rényi's
[8], Dorogovtsev et al.'s [7] and Vazquez's [23] are
examples of generic models. The R-MAT model proposed by
Chakrabarti et al. [5] is also generic, but its generation process
is based on matrix operations. According to our experiments,
those models are not able to reproduce the cumulative degree
distributions of software dependency graphs.
Baxter et al. studied a large number of metrics, incl. graph
metrics [1]; Louridas et al. studied the "pervasive" presence
of power-law distributions in software dependency graphs
at the class and feature levels for a large range of software
written in various languages [15]. Myers also studied graph
metrics on software [19]. Nevertheless, none of them showed
the common topology across software degree distributions as
we have presented in this paper. Our results on the asymmetry
between the in- and out-degree distributions confirm previous
findings by Myers [19], Valverde and Solé [22], and Baxter
and Frean [1].
The tendency of software graphs to follow a growth mech-
anism similar to the GNC one has been reported by Valverde and
Solé [22], but they made no proposal of a concrete generative
model. Baxter and Frean proposed a generative model of
software graphs [1], based on a preferential attachment which
depends on the node degree distributions. We have shown that
our model better fits real data.
Maddison and Tarlow [16] proposed a generative model of
source code. Our motivations are similar but the considered
software artifacts are completely different (abstract syntax tree
versus dependency graphs).
Other authors have studied other topological characteristics
on different kinds of graphs. Harman et al. [11] focus on
dependency clusters to demonstrate the widespread existence
of clusters in software source code; they show common
traits of software using a different approach. Mitchell and
Mancoridis [18] use clustering techniques to infer an aggregate
view of a software system, but they aim to gain understanding
of a specific system in order to improve debugging and refac-
toring, not to establish generalities about software.
Furthermore, none of those studies proposes a generative model
of any kind.
VIII. CONCLUSION
In this paper, we have studied the evolution rules of soft-
ware. Motivated by the fact that there is a common topology
across many software dependency graphs, we have devised an
experimental protocol to understand the evolution principles
that result in such a common topology. Our experimental
approach is to encode those rules in a generative model of
software dependency graphs and to compare the topology of
synthesized graphs against real software graphs.
Our new evolution model generates graphs whose degree
distributions tend to be the same as those of real ones. The operations
of the model tell us something about the way software evolves.
According to our experiments, new features are based on the
perpetual remix of existing interacting classes and refactoring
mostly consists of extracting a reusable class from an existing
class.
Now that we have shown that meaningful generative models
exist, future work can go beyond degree distributions. Graph
motifs are patterns consisting of a small number of nodes
connected to each other in a certain way. Graph motifs [17]
may turn out to be valuable to determine the topological closeness
of generated graphs. We hypothesize that motifs also emerge
from the core evolution rules. Hence, if generated graphs share
similar motifs with real software graphs, there is a good chance
that the core evolution primitives of the model are close to the
real ones.
REFERENCES
[1] G. J. Baxter and M. R. Frean. Software Graphs and Programmer
Awareness.
[2] P. Bhattacharya, M. Iliofotou, I. Neamtiu, and M. Faloutsos. Graph-
based Analysis and Prediction for Software Evolution. In Proceedings
of the 34th International Conference on Software Engineering, ICSE
’12, pages 419–429. IEEE Press.
[3] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed Scale-
free Graphs. In Proceedings of the Fourteenth Annual ACM-SIAM
Symposium on Discrete Algorithms, SODA ’03, pages 132–139. Society
for Industrial and Applied Mathematics.
[4] I. T. Bowman, R. C. Holt, and N. V. Brewster. Linux As a Case
Study: Its Extracted Software Architecture. In Proceedings of the 21st
International Conference on Software Engineering, ICSE ’99, pages
555–563. ACM.
[5] D. Chakrabarti and C. Faloutsos. Graph Mining: Laws, Generators, and
Algorithms. 38(1).
[6] G. Concas, M. Marchesi, S. Pinna, and N. Serra. On the Suitability
of Yule Process to Stochastically Model Some Properties of Object-
oriented Systems. 370(2):817–831.
[7] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Structure of
Growing Networks with Preferential Linking. 85(21):4633–4636.
[8] P. Erdős and A. Rényi. On Random Graphs. 6:290–297.
[9] J. D. Gibbons and S. Chakraborti. Nonparametric Statistical Inference,
Fourth Edition: Revised and Expanded. CRC Press.
[10] P. Grindrod. Range-dependent Random Graphs and Their Application
to Modeling Large Small-world Proteome Datasets. 66(6):066702.
[11] M. Harman, D. Binkley, K. Gallagher, N. Gold, and J. Krinke. Depen-
dence Clusters in Source Code. 32(1):1:1–1:33.
[12] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S.
Tomkins. The Web as a Graph: Measurements, Models, and Methods.
In T. Asano, H. Imai, D. T. Lee, S.-i. Nakano, and T. Tokuyama,
editors, Computing and Combinatorics, number 1627 in Lecture Notes
in Computer Science, pages 1–17. Springer Berlin Heidelberg.
[13] P. L. Krapivsky and S. Redner. Network Growth by Copying.
71(3):036118.
[14] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins,
and E. Upfal. Stochastic Models for the Web Graph. In 41st Annual
Symposium on Foundations of Computer Science, 2000. Proceedings,
pages 57–65.
[15] P. Louridas, D. Spinellis, and V. Vlachos. Power Laws in Software.
18(1):2:1–2:26.
[16] C. J. Maddison and D. Tarlow. Structured Generative Models of Natural
Source Code.
[17] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and
U. Alon. Network Motifs: Simple Building Blocks of Complex Net-
works. 298(5594):824–827.
[18] B. Mitchell and S. Mancoridis. On the Automatic Modularization of
Software Systems Using the Bunch Tool. 32(3):193–208.
[19] C. R. Myers. Software Systems as Complex Networks: Structure, func-
tion, and evolvability of software collaboration graphs. 68(4):046116.
[20] M. Newman. Networks: An Introduction. Oxford University Press.
[21] M. Newman. The Structure and Function of Complex Networks.
45(2):167–256.
[22] S. Valverde and R. V. Solé. Logarithmic Growth Dynamics in Software
Networks. 72(5):858.
[23] A. Vazquez. Knowing a Network by Walking on it: Emergence of
Scaling.
[24] D. J. Watts and S. H. Strogatz. Collective Dynamics of ‘Small-World’
Networks. 393(6684):440–442.