Procedia Computer Science 159 (2019) 107–116
23rd International Conference on Knowledge-Based and Intelligent Information & Engineering Systems
Structural Analysis of Sparse Neural Networks
Julian Stier, Michael Granitzer
University of Passau, Innstrasse 42, 94032 Passau, Germany
Corresponding author e-mail: julian.stier@uni-passau.de
Abstract
Sparse Neural Networks regained attention due to their potential of mathematical and computational advantages. We give motivation to study Artificial Neural Networks (ANNs) from a network science perspective, provide a technique to embed arbitrary Directed Acyclic Graphs into ANNs and report study results on predicting the performance of image classifiers based on the structural properties of the networks' underlying graph. Results could further progress neuroevolution and add explanations for the success of distinct architectures from a structural perspective.
© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of KES International. doi: 10.1016/j.procs.2019.09.165
Keywords: artificial neural networks; sparse network structures; small-world neural networks; scale-free; architecture performance estimation
1. Introduction
Artificial Neural Networks (ANNs) are highly successful machine learning models and achieve human performance in various domains such as image processing and speech synthesis. The choice of architecture is crucial for their success – but ANN architectures, seen as network structures, have not yet been extensively studied from a network science perspective. The motivation for studying ANNs from a network science perspective is manifold:
First of all, ANNs are inspired by biological neural networks (notably the human brain), which have scale-free properties and “are shown to be small-world networks” [25]. This stands in contrast to most successful models in various machine learning problem domains. However, with respect to their required neurons, those successful models have been shown to be redundant by an order of magnitude. For example, Han et al. claim that “on the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9×” and that “VGG-16 can be reduced by 13×” [8] through pruning. Thus, biology and structural searches such as pruning suggest developing sparse neural network structures [24].
Secondly, a look at the history of ANNs shows that, after major breakthroughs in the early 90s, the research community already tried to find characteristic network structures through, e.g., constructive and destructive approaches. However, network structures had by then only been studied in graph theory, and the major remaining architectures
from this research period are notably Convolutional Neural Networks (CNNs) [16] and Long Short-Term Memory Networks (LSTMs) [12]. New breakthroughs in the early 21st century led to recent trends of new architectures such as Highway Networks [23] and Residual Networks [9]. This suggests that further insights could be gained from revisiting and analysing the topology of sparse neural networks, particularly from a network science perspective as done by Mocanu et al. [19], who also “argue that ANNs, too, should not have fully-connected layers”.
Last but not least, many researchers have reported various advantages of sparse structures over unjustified densely stacked layers. Glorot et al. argued that sparsity is one of the factors for the success of Rectified Linear Units (ReLU) as activation functions, because “using a rectifying non-linearity gives rise to real zeros of activations and thus truly sparse representations” [7]. The optimization process with ReLUs might not only find good representations but also better structures through sparser connections. Sparser connections can also decrease the number of computations and thus time and energy consumption. Hence, not only is there biological motivation for sparse neural network structures, but mathematical and computational advantages have been reported as well.
Looking at the structural design and training of ANNs as a search problem, one can argue for applying structural regularisation by postulating distinct structural properties. CNNs exploit sparsely connected neurons due to spatial relationships of the input features, which, in conjunction with weight-sharing, leads to very successful models. LSTMs overcome analytical issues in training by postulating structures with slightly changed training behaviour. We argue that new insights can be gained from studying ANNs from a network science perspective.
Sparse Neural Networks (SNNs), which we define as networks that are not fully connected between layers, form another important, yet not well-understood structural regularisation. We mention three major approaches to study the influence of structural regularisation on the properties of SNNs: 1) studying fixed sparse structures with empirical success or selected analytical advantages, 2) studying sparse structures obtained in search heuristics and 3) studying sparse structures with common graph-theoretical properties.
The first approach studies selected models with fixed structure analytically or within empirical works. For example, LSTMs arose from analytical work [11] on issues in the domain of time-dependent problems and have also been studied empirically since then. In larger network structures, LSTMs and CNNs provide fixed components which have been proven to be successful due to analytical research and much experience in empirical work. These insights provide foundations to construct larger networks in a bottom-up fashion.
On a larger scale, sparse network structures can be approached by automatic construction, pruning, evolutionary techniques or even as a result of training regularisation or the choice of activation function. Despite the common motivation to find explanations for the resulting sparse structures, many of those approaches indeed obtain sparse structures but fail to explain them.
Studying real-world networks has been addressed by the field of Network Science. However, as of now, Network Science has hardly been used to study the structural properties of ANNs and to understand the regularisation effects introduced by sparse network structures.
This article reports results on embedding sparse graph structures with characteristic properties into feed-forward networks and gives first insights into our study on network properties of sparse ANNs. Our contributions comprise:
- a technique to embed Directed Acyclic Graphs into Artificial Neural Networks (ANNs),
- a comparison of Random Graph Generators as generators for structures of ANNs,
- a performance estimator for ANNs based on structural properties of the networks' underlying graph.
Related Work
A lot of related work in fields such as network science, graph theory, and neuroevolution gives motivation to study Sparse Neural Networks (SNNs). From a network science perspective, SNNs are Directed Acyclic Graphs (DAGs) in the case of non-recurrent networks and Directed Graphs in the case of Recurrent Neural Networks. Directed acyclic and cyclic graphs are built, studied and characterized in graph theory and network science. Notably, Erdős already “aimed to show [..] that the evolution of a random graph shows very clear-cut features” [6]. Graphs with distinct degree distributions have been characterised as small-world and scale-free networks with the works of Watts & Strogatz [25] and Barabási & Albert [1]. Both phenomena are “not merely a curiosity of social networks nor an artefact of
an idealized model – it is probably generic for many large, sparse networks found in nature” [25]. However, from a
biological perspective, the question of how the human brain is organised still remains open: according to Hilgetag et al., “a reasonable guess is that the large-scale neuronal networks of the brain are arranged as globally sparse hierarchical
modular networks” [10].
Concerning the combination of ANNs and Network Science, Mocanu et al. claim that “ANNs perform perfectly well with sparsely-connected layers” and introduce a training procedure, SET, which induces sparsity through magnitude-based pruning and the random addition of connections during the training phase [19]. While their method is driven by the same motivation and inspiration, it is conceptually fundamentally different from our idea of finding characteristic properties of SNNs.
The work of Bourely et al. can be considered very close to our idea of creating SNNs before the training phase.
They “propose Sparse Neural Network architectures that are based on random or structured bipartite graph topologies”
[4] but do not seem to base their construction on existing work in Network Science when transforming their randomly
generated structures into ANNs.
There exists a vast number of articles on automatic methods which yield sparsity. Besides well-established regularisation methods (e.g. L1-regularisation), new structural regularisation methods can be found: Srinivas et al. “introduce additional gate variables to perform parameter selection” [22] and Louizos et al. “propose a practical method for L0 norm regularisation” in which they “prune the network during training by encouraging weights to become exactly zero” [18]. Besides regularisation, one can also achieve SNNs through pruning, construction and evolutionary strategies. All of those domains achieve successes to some extent but seem to fail in providing explanations for why certain architectures succeed and others do not.
The idea of predicting the model performance can already be found in a similar way in the works of Klein et al., Domhan et al. and Baker et al. Klein et al. exploit “information in automatic hyperparameter optimization by means of a probabilistic model of learning curves across hyperparameter settings” [15]. They compare the idea with a human expert assessing the course of the learning curve of a model. Domhan et al. also “mimic the early termination of bad runs using a probabilistic model that extrapolates the performance from the first part of a learning curve” [5]. Both are concerned with predicting the performance based on learning curves, not on structural network properties.
Baker et al. are closest to our method by “predicting the final performance of partially trained model configurations using features based on network architectures, hyperparameters, and time-series validation performance data” [2]. They report good R² values (e.g. 0.969 for Cifar10 with MetaQNN CNNs) for predicting the performance and did so on slightly more complex datasets such as Cifar10, TinyImageNet and Penn Treebank. However, our motivation is to focus only on architecture parameters and to include more characteristic properties of the network graph than only “including total number of weights and number of layers”; this is independent of other hyperparameters and saves conducting an expensive training phase.
In the following sections we give an introduction to Network Science, illustrate how Directed Acyclic Graphs are embedded into ANNs, provide details on the generated graphs and their properties, and visualize results from predicting the performance of the built ANNs only with features based on structural properties of the underlying graph.
2. A Network Science Perspective: Random Graph Generators
A graph $G = (V, E)$ is defined by its naturally ordered vertices $V$ and edges $E \subseteq V \times V$. The number of edges containing a vertex $v \in V$ determines the degree of $v$. The “distribution function $P(k)$ [..] gives the probability that a randomly selected node has exactly $k$ edges” [1]. It can be used as a first characteristic to differentiate between certain types of graphs. For a “random graph [it] is a Poisson distribution with a peak at $P(\langle k \rangle)$”, with $\langle k \rangle$ being “the average degree of the network” [1]. Graphs with degree distributions following a power-law tail $P(k) \sim k^{-\gamma}$ are called scale-free graphs. Graphs with “relatively small characteristic path lengths” are called small-world graphs [25].
Graphs can be generated by Random Graph Generators (RGGs). Various RGGs can yield very distinct statistical properties. The most prominent model to generate a random graph is the Erdős-Rényi/Gilbert model (ERG model), which connects a given number of vertices randomly. While this ERG model yields a Poisson degree distribution, other models have been developed to close the gap of generating large graphs with distinct characteristics found in nature. Notably, the first RGGs producing such graphs were the Watts-Strogatz model [25] and the Barabási-Albert model [1].

Figure 1. A random (possibly undirected) graph from a Random Graph Generator (a), a directed transformation of it into a Directed Acyclic Graph (b), and the final embedding of the sparse structure between the input and output layers of an Artificial Neural Network classifier (c). In this example the classifier would be used for a problem with four input and two output neurons.
3. Embedding Arbitrary Structures into Sparse Neural Networks
After obtaining a graph with desired properties from a Random Graph Generator, this graph is embedded into a Sparse Neural Network with the following steps:
1. Make the graph directed (if it is not directed already),
2. compute a layer indexing for all vertices,
3. embed the layered vertices between an input and output layer meeting the requirements of the dataset.
The first step conducts a transformation into a Directed Acyclic Graph following an approach of Barak and Erdős [3], who “consider the class $\mathcal{A}$ of random acyclic directed graphs which are obtained from random graphs by directing all the edges from higher to lower indexed vertices”. Given a graph and a natural ordering of its vertices, a DAG is obtained by directing all edges from higher to lower ordered vertices. This can easily be computed by setting the upper (or lower) triangle of the adjacency matrix to zero.
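A minimal sketch of this transformation, assuming networkx and numpy; the function name is ours, not from the paper.

```python
import networkx as nx
import numpy as np

def to_dag(graph: nx.Graph) -> nx.DiGraph:
    """Direct every edge from its higher- to its lower-indexed endpoint.

    Keeping only the lower triangle of the adjacency matrix makes the
    result acyclic by construction (cf. Barak and Erdős [3]).
    """
    order = sorted(graph.nodes())
    adjacency = nx.to_numpy_array(graph, nodelist=order)
    lower = np.tril(adjacency, k=-1)  # zero out the diagonal and the upper triangle
    dag = nx.from_numpy_array(lower, create_using=nx.DiGraph)
    assert nx.is_directed_acyclic_graph(dag)
    return dag

dag = to_dag(nx.watts_strogatz_graph(20, k=4, p=0.2, seed=0))
```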
Next, a layer indexing is computed to assign vertices into dierent layers within a feed-forward network. Vertices in
a common layer can be represented in a unified layer vector. The indexing function indl(v):VNis recursively
defined by v→ max({indl(s)|(s,v)Ein
v} ∪ {−1})+1 which then defines a set of layers Land a family of neurons
indexed by Ilfor each layer l∈L.
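A sketch of this layer indexing in Python, assuming the DAG from the previous step; iterating in topological order makes the recursion explicit without recursive calls.

```python
import networkx as nx

def layer_indexing(dag: nx.DiGraph) -> dict:
    """Assign ind_l(v) = max(ind_l of predecessors) + 1; source vertices get index 0."""
    indices = {}
    for v in nx.topological_sort(dag):
        indices[v] = max((indices[s] for s in dag.predecessors(v)), default=-1) + 1
    return indices

def layers(dag: nx.DiGraph) -> dict:
    """Group vertices by their layer index, yielding the family of layers L."""
    grouped = {}
    for v, l in layer_indexing(dag).items():
        grouped.setdefault(l, []).append(v)
    return grouped
```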
Finally, the layered DAG can be embedded into an ANN for a given task, e.g. a classification task such as MNIST with 784 input and 10 output neurons. All neurons of the input layer are connected to the neurons representing the vertices of the DAG with in-degree equal to zero (layer index zero). Subsequently, each vertex is represented as a neuron and connected according to the DAG. The build process finishes by connecting all neurons of the last layer of the DAG with the neurons of the output layer. Following this approach, each resulting ANN has at least two fully connected layers, whose sizes depend on the number of DAG vertices with in-degree and out-degree equal to zero.
Figure 1 visualizes the steps from a randomly generated graph to a DAG and its embedding into an ANN classifier.
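The following PyTorch sketch illustrates how such an embedding could look for an MNIST-sized classifier, building on the DAG from the previous sketches. It is a hypothetical implementation under our own assumptions, not the authors' released code: each DAG vertex becomes one ReLU neuron, only DAG edges carry hidden weights, source vertices are fed by a fully connected input block and sink vertices feed a fully connected output block.

```python
import networkx as nx
import torch
import torch.nn as nn

class SparseDAGClassifier(nn.Module):
    """Hypothetical sketch of the embedding described above (not the authors' code)."""

    def __init__(self, dag: nx.DiGraph, n_in: int = 784, n_out: int = 10):
        super().__init__()
        self.dag = dag
        self.order = list(nx.topological_sort(dag))
        self.index = {v: i for i, v in enumerate(self.order)}
        self.sources = [v for v in self.order if dag.in_degree(v) == 0]
        self.sinks = [v for v in self.order if dag.out_degree(v) == 0]

        # Fully connected block from the inputs into the source neurons (layer index 0).
        self.input_layer = nn.Linear(n_in, len(self.sources))
        # One weight per ordered vertex pair; only entries belonging to DAG edges are
        # ever used in forward(), so the hidden part is effectively sparse.
        n = len(self.order)
        self.hidden_weight = nn.Parameter(0.01 * torch.randn(n, n))
        self.hidden_bias = nn.Parameter(torch.zeros(n))
        # Fully connected block from the sink neurons into the output neurons.
        self.output_layer = nn.Linear(len(self.sinks), n_out)

    def forward(self, x):
        activations = {}
        source_out = torch.relu(self.input_layer(x))
        for j, v in enumerate(self.sources):
            activations[v] = source_out[:, j]
        # Topological order guarantees every predecessor is computed before its successor.
        for v in self.order:
            if v in activations:
                continue
            i = self.index[v]
            z = sum(activations[s] * self.hidden_weight[i, self.index[s]]
                    for s in self.dag.predecessors(v)) + self.hidden_bias[i]
            activations[v] = torch.relu(z)
        sink_out = torch.stack([activations[v] for v in self.sinks], dim=1)
        return self.output_layer(sink_out)

model = SparseDAGClassifier(dag)        # dag obtained from the previous steps
logits = model(torch.randn(32, 784))    # e.g. a batch of flattened MNIST images
```

Processing vertices in topological order guarantees that all predecessor activations are available; an efficient implementation would vectorize per layer instead of looping over vertices.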
4. Performance Estimation Through Structural Properties
In order to analyse the impact of the different structural elements, we conducted a supervised experiment predicting network performances from the structural properties only. We then analysed the most important features involved in the decision along with the prediction quality.
For this experiment we created an artificial dataset graphs10k based on two Random Graph Generators (RGGs). Each graph in the dataset is transformed into a Sparse Neural Network (SNN), implemented in PyTorch [20]. The resulting model is then trained and evaluated on MNIST [17]. Based on the obtained evaluation measures, three estimator models – Ordinary Linear Regression, Support Vector Machine and Random Forest – are trained by splitting graphs10k into a training and test set. The estimator models give the opportunity to discuss the influence of structural properties on the SNN performances.
4.1. The graphs10k dataset
To investigate which structural properties influence the performance of an Artificial Neural Network most, we created a dataset of graphs generated by Watts-Strogatz and Barabási-Albert models. The dataset comprises 10,000 graphs, randomly generated with between 50 and 500 vertices. An exemplary property frequency distribution of the number of edges is shown in Figure 2a. The non-uniform distribution visualizes the difficulty of sampling uniformly in graph space¹.
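The exact sampling parameters of graphs10k are not given here; the sketch below only illustrates the kind of sampling loop that produces such a dataset, with generator parameter ranges chosen by us. The connected Watts-Strogatz variant is used so that distance-based properties such as eccentricity are defined.

```python
import random

import networkx as nx

def sample_graph(rng: random.Random) -> nx.Graph:
    """Sample one graph with 50 to 500 vertices from either generator (illustrative parameters)."""
    n = rng.randint(50, 500)
    if rng.random() < 0.5:
        # Watts-Strogatz: k ring neighbours per vertex, rewiring probability p.
        k, p = rng.randint(2, 16), rng.uniform(0.1, 0.9)
        return nx.connected_watts_strogatz_graph(n, k, p, seed=rng.randint(0, 2**31))
    # Barabási-Albert: m edges attached preferentially for every new vertex.
    return nx.barabasi_albert_graph(n, m=rng.randint(1, 8), seed=rng.randint(0, 2**31))

rng = random.Random(42)
graphs10k = [sample_graph(rng) for _ in range(10_000)]
```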
The dataset contains 5018 Barabási-Albert graphs and 4982 Watts-Strogatz graphs. The graphs have between 97 and 4,365 edges and on average 1,399.17 edges with a standard deviation of 954.12. In estimating the model performances, the numbers of source and sink vertices are very prominent. Source vertices are those with no incoming edges, sink vertices those with no outgoing edges. On average the graphs have 79.75 source vertices with a standard deviation of 80.69 and 9.56 sink vertices with a standard deviation of 17.53.
Distributions within the graphs are reduced to four properties, namely minimum, arithmetic mean, maximum and standard deviation or variance. For the degree distribution, a vertex has on average 10.36 incident edges, and this arithmetic mean has a standard deviation of 4.47 across the dataset. The variance of the degree distributions is on average 174.46 with a standard deviation of 259.08.
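A sketch of how such distribution summaries can be computed with networkx; the feature names follow Table 1, but the code is our illustration rather than the released experiment code.

```python
import networkx as nx
import numpy as np

def distribution_features(name, values):
    """Reduce a property distribution to min, mean, max, variance and standard deviation."""
    arr = np.asarray(list(values), dtype=float)
    return {f"{name} min": arr.min(), f"{name} mean": arr.mean(), f"{name} max": arr.max(),
            f"{name} var": arr.var(), f"{name} std": arr.std()}

def structural_features(graph: nx.Graph, dag: nx.DiGraph) -> dict:
    features = {
        "number vertices": graph.number_of_nodes(),
        "number edges": graph.number_of_edges(),
        "number source vertices": sum(1 for v in dag if dag.in_degree(v) == 0),
        "number sink vertices": sum(1 for v in dag if dag.out_degree(v) == 0),
        "density": nx.density(graph),
        "diameter": nx.diameter(graph),  # requires a connected graph
    }
    features.update(distribution_features("degree", dict(graph.degree()).values()))
    features.update(distribution_features("eccentricity", nx.eccentricity(graph).values()))
    features.update(distribution_features("closeness", nx.closeness_centrality(graph).values()))
    features.update(distribution_features("edge betweenness",
                                          nx.edge_betweenness_centrality(graph).values()))
    path_lengths = [length for _, targets in nx.shortest_path_length(graph)
                    for length in targets.values() if length > 0]
    features.update(distribution_features("path length", path_lengths))
    return features
```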
After each graph was embedded into an SNN, its accuracy on the test and validation sets was obtained and added to the dataset. More detailed statistics on other properties are given in Table 2.
With the presented embedding technique it could be found that graphs based on the Barabási-Albert model could not achieve performances comparable to graphs based on the Watts-Strogatz model. Only 2618 of the Barabási-Albert models achieved an accuracy above 0.3 (less than 50%). Excluding the Barabási-Albert model did not change the estimation results in subsection 4.2; therefore, statistics are reported for all graphs combined. In the future, the dataset will be enhanced to comprise more RGGs.
The test accuracy has three major peaks at around 0.15, 0.3 and 0.94, as can be seen in the frequency distribution on the x-axis of Figure 2b, which depicts a joint plot of the test accuracy and the variance of a graph's eccentricity² distribution.
4.2. Estimators on graphs10k
Ordinary Linear Regression (OLS), Support Vector Machine (SVM) and Random Forest (RF) are used to estimate model performance values given the structural properties of the models' underlying graphs.
The dataset was split into a train and test set with a train set size ratio of 0.7. When considering the whole dataset of 10,000 graphs, the estimator models are trained on 7,000 data points. The train-test split was repeated 20 times with different constants used as initialization seeds for the random state.
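A sketch of this estimator setup with scikit-learn. The feature matrix X of structural properties and the target vector y of test accuracies are placeholders here, and the hyperparameters shown are library defaults, not necessarily those of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Placeholder data: in the study, X holds the structural properties of each graph
# (e.g. the output of structural_features above) and y the measured test accuracies.
X = np.random.rand(10_000, 25)
y = np.random.rand(10_000)

estimators = {"OLS": LinearRegression(),
              "SVM (RBF)": SVR(kernel="rbf"),
              "RF": RandomForestRegressor(n_estimators=100)}

scores = {name: [] for name in estimators}
for seed in range(20):  # 20 repetitions with different random splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, random_state=seed)
    for name, estimator in estimators.items():
        estimator.fit(X_train, y_train)
        scores[name].append(r2_score(y_test, estimator.predict(X_test)))

for name, values in scores.items():
    print(f"{name}: R2 = {np.mean(values):.4f} +/- {np.std(values):.4f}")
```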
Different feature sets are considered to assess the influence of single features.
¹ Note that, according to Karrer et al., “we do not at present know of any way to sample uniformly from the unordered ensemble” in the context of sampling from possible orderings of random acyclic graphs [14]. “Bayesian networks are composed of directed acyclic graphs, and it is very hard to represent the space of such graphs. Consequently, it is not easy to guarantee that a given method actually produces a uniform distribution in that space.” [13]
² The eccentricity of a vertex in a connected graph is the maximum distance to any other vertex.
Figure 2. (a) An exemplary frequency distribution of the number of edges in graphs10k. With a uniformly distributed number of vertices, very different frequency distributions occur for other structural properties. (b) Joint plot of eccentricity variance vs. test accuracy. The axes contain the distributions of each feature. The test accuracy distribution (on the top axis) shows that only few graphs exist between 0.4 and 0.85, but three peaks appear at around 0.2, 0.35 and 0.95. For high accuracies an increasing variance in eccentricity can be observed. A low eccentricity variance, however, does not explain good performance on its own.
The set Ω denotes all possible features as listed in Table 1; Ω_np contains all features except for the number of vertices, number of edges, number of source vertices and number of sink vertices. These four features directly indicate the number of trainable parameters in the model (np stands for features with no direct indication of the number of parameters); Ω_op contains only those four properties.
Features with variance information only are considered in Ω_var, namely the variances of the degree, eccentricity, neighborhood, path length, closeness and edge betweenness distributions. Compared to other properties of the distributions, such as the average, minimum or maximum, features with variance information have shown in experiments to have some influence on the network performance.
A manually selected set Ω_small contains a reduced set of selected features, namely number source vertices, number sink vertices, degree distribution var, density, neighborhood var, path length var, closeness std, edge betweenness std and eccentricity var. Table 1 provides an overview of the used feature sets and the resulting feature importances for the RF estimator.
OLS achieved an R² score of 0.8631 on average for feature set Ω over 20 repetitions, with a standard deviation of 0.0042. Pearson's correlation for OLS is on average ρ = 0.9291. The SVM with RBF kernel achieved on average 0.9356 with a standard deviation of 0.0018. RF achieved an R² score of 0.9714 on average with a standard deviation of 0.0009. None of the other considered estimators reaches the RF estimator in predicting the model performances.
The feature importances of the RF estimator are calculated with sklearn [21], where “each feature importance is computed as the (normalized) total reduction of the criterion brought by that feature”³. They are listed in Table 1 for each used feature within the scope of the used feature set. The number of sink vertices clearly has the most influence on the performance of an MNIST classifier.
³ Taken from http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html, which references Breiman, Friedman, Classification and Regression Trees, 1984.
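The reported importances correspond to scikit-learn's impurity-based feature importances; a minimal sketch, assuming the fitted Random Forest from the previous snippet and a feature_names list matching the columns of X:

```python
rf = estimators["RF"]  # fitted RandomForestRegressor from the previous sketch
# feature_names is assumed to list the structural properties in the column order of X.
ranked = sorted(zip(feature_names, rf.feature_importances_), key=lambda pair: -pair[1])
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.4f}")  # the number of sink vertices dominates in Table 1
```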
Model              Ω                 Ω_np              Ω_op              Ω_var             Ω_small           Ω_min
Random Forest      0.9714 ± 0.0009   0.9314 ± 0.0031   0.9664 ± 0.0010   0.9283 ± 0.0032   0.9710 ± 0.0011   0.9706 ± 0.0009
OLS                0.8631 ± 0.0042   0.8476 ± 0.0053   0.6522 ± 0.0089   0.5968 ± 0.0103   0.7211 ± 0.0072   0.6907 ± 0.0077
SVM (linear)       0.8620 ± 0.0045   0.8463 ± 0.0054   0.6432 ± 0.0114   0.5551 ± 0.0164   0.6943 ± 0.0101   0.6676 ± 0.0111
SVM (RBF)          0.9356 ± 0.0018   0.9235 ± 0.0028   0.8561 ± 0.0041   0.7827 ± 0.0092   0.8781 ± 0.0045   0.8604 ± 0.0033
SVM (polynomial)   0.9174 ± 0.0025   0.8998 ± 0.0032   0.5668 ± 0.0065   0.6839 ± 0.0117   0.8421 ± 0.0053   0.7543 ± 0.0083
(R² scores, mean ± standard deviation over 20 repetitions)

Property                    RF feature importance (first value: Ω; further values: the feature sets containing the property, in the column order above)
number vertices             0.0009   0.0045 ± 0.0002
number edges                0.0013   0.0071 ± 0.0002
number source vertices      0.0636   0.0720 ± 0.0018   0.0643 ± 0.0019   0.0659 ± 0.0019
number sink vertices        0.9124   0.9164 ± 0.0019   0.9137 ± 0.0019   0.9139 ± 0.0020
degree distribution mean    0.0004   0.0083 ± 0.0054
degree distribution var     0.0009   0.4582 ± 0.0973   0.3685 ± 0.0866   0.0024 ± 0.0002   0.0069 ± 0.0002
diameter                    0.0003   0.0008 ± 0.0001
density                     0.0007   0.0030 ± 0.0007   0.0072 ± 0.0007   0.0023 ± 0.0002
eccentricity mean           0.0015   0.0047 ± 0.0006
eccentricity var            0.0022   0.3025 ± 0.1006   0.3401 ± 0.0994
eccentricity max            0.0003   0.0009 ± 0.0002
neighborhood mean           0.0004   0.0050 ± 0.0046
neighborhood var            0.0011   0.1417 ± 0.0646   0.2434 ± 0.0883   0.0025 ± 0.0001
neighborhood min            0.0005   0.0272 ± 0.0104
neighborhood max            0.0017   0.0071 ± 0.0035
path length mean            0.0011   0.0045 ± 0.0013
path length var             0.0013   0.0052 ± 0.0009   0.0133 ± 0.0018   0.0034 ± 0.0001   0.0067 ± 0.0002
closeness min               0.0014   0.0067 ± 0.0028
closeness mean              0.0010   0.0030 ± 0.0003
closeness max               0.0013   0.0034 ± 0.0003
closeness std               0.0021   0.0064 ± 0.0012   0.0166 ± 0.0017   0.0039 ± 0.0002   0.0066 ± 0.0002
edge betweenness min        0.0001   0.0008 ± 0.0003
edge betweenness mean       0.0011   0.0041 ± 0.0005
edge betweenness max        0.0015   0.0038 ± 0.0004
edge betweenness std        0.0011   0.0027 ± 0.0003   0.0109 ± 0.0011   0.0036 ± 0.0001
Table 1. Structural properties for the SNNs' underlying graphs from the graphs10k dataset and their influence as features for the Random Forest model under different feature sets.
Features which do not directly indicate the number of parameters can be seen in the Ω_np column. The three most important of these features are the variances of the degree, eccentricity and neighborhood distributions of the graph. Considering six variance features together with the number of source and sink vertices leads to an R² value for RF of 0.9710 ± 0.0011, deviating by only 0.0004 from the best average R² value of 0.9714 ± 0.0009. Variance features alone already achieve an average R² of 0.9283 ± 0.0032 (see Ω_var). This gives an indication that, e.g. in the case of eccentricity, a higher diversity leads to better performance. Higher variances in those distributions imply different path lengths through the network, as can be found in architectures such as Residual Networks.
5. Conclusion & Future Work
This work presented motivations and approaches to study Artificial Neural Networks (ANNs) from a network science perspective. Directed Acyclic Graphs with characteristic properties from network science are obtained by Random Graph Generators (RGGs) and embedded into ANNs. A dataset of 10,000 graphs with Watts-Strogatz and Barabási-Albert models as RGGs was created as the basis for the experiments. ANN models were built and trained using the presented structural embedding technique, and performance values such as the accuracy on a validation set were obtained.
With both structural properties and resulting performance values, three estimator models are built, namely an Ordinary Linear Regression, a Support Vector Machine and a Random Forest (RF) model. The Random Forest model is most successful in predicting ANN model performances based on the structural properties. The influence of the structural properties (features of the RF) on predicting ANN model performances is measured.
Clearly, the most important features are the numbers of vertices and edges, which directly determine the number of trainable parameters. There is, however, indication that the variances of the distributions of properties such as the eccentricity, vertex degrees and path lengths of networks have an influence on highly performant models. This insight goes along with successful models such as Residual Networks and Highway Networks, which introduce more variance and thus lead to a more ensemble-like behaviour.
More such characteristic graph properties will help explain differences between various architectures. Understanding the interaction of graph properties could also be exploited in the form of structural regularization – designing architectures or searching for them (e.g. in evolutionary approaches) could be driven by network-scientific knowledge.
In future work, insights into the influence of structural properties on model performance will be hardened on more complex problem domains, which presumably have more potential of revealing stronger differences. The approach with structural properties will be extended to recurrent architectures to reduce the gap between the DAGs used in this work and directed cyclic networks, as they are found in nature. The performance prediction will also be integrated into neuroevolutionary methods, most likely leading to a performance boost through early stopping and improved regularized search.
References
[1] Réka Albert and Albert-László Barabási. 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 1 (2002), 47.
[2] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. 2018. Accelerating neural architecture search using performance prediction. (2018).
[3] Amnon B Barak and Paul Erdős. 1984. On the maximal number of strongly independent vertices in a random acyclic directed graph. SIAM Journal on Algebraic Discrete Methods 5, 4 (1984), 508–514.
[4] Alfred Bourely, John Patrick Boueri, and Krzysztof Choromonski. 2017. Sparse Neural Networks Topologies. arXiv preprint arXiv:1706.05683 (2017).
[5] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI, Vol. 15. 3460–8.
[6] Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5 (1960), 17–61.
[7] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 315–323.
[8] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135–1143.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Figure 3. (a) Correlation between test accuracy and number of vertices of the underlying graph. (b) Correlation between test accuracy and number of edges of the underlying graph.
[10] Claus C Hilgetag and Alexandros Goulas. 2016. Is the brain really a small-world network? Brain Structure and Function 221, 4 (2016), 2361–2366.
[11] Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München 91 (1991), 1.
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[13] Jaime S Ide and Fabio G Cozman. 2002. Random generation of Bayesian networks. In Brazilian Symposium on Artificial Intelligence. Springer, 366–376.
[14] Brian Karrer and Mark EJ Newman. 2009. Random graph models for directed acyclic networks. Physical Review E 80, 4 (2009), 046110.
[15] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. 2016. Learning curve prediction with Bayesian neural networks. (2016).
[16] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. 1990. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems. 396–404.
[17] Yann LeCun, Corinna Cortes, and Christopher JC Burges. 1998. The MNIST database of handwritten digits. (1998).
[18] Christos Louizos, Max Welling, and Diederik P Kingma. 2017. Learning Sparse Neural Networks through L0 Regularization. arXiv preprint arXiv:1712.01312 (2017).
[19] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. 2017. Evolutionary training of sparse artificial neural networks: a network science perspective. arXiv preprint arXiv:1707.04780 (2017).
[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[22] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. 2017. Training sparse neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 455–462.
[23] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387 (2015).
[24] Julian Stier, Gabriele Gianini, Michael Granitzer, and Konstantin Ziegler. 2018. Analysing Neural Network Topologies: a Game Theoretic Approach. Procedia Computer Science 126 (2018), 234–243.
[25] Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 6684 (1998), 440.
Figure 4. (a) Correlation between test accuracy and density of the underlying graph. (b) Correlation between test accuracy and path length variance of the underlying graph.
Property Min Mean Max Std
number vertices 50 269.66 490 129.56
number edges 97 1399.17 4365 954.12
number source vertices 1 79.7501 306 80.6862
number sink vertices 1 9.5550 125 17.5298
diameter 3 8.72 54 4.86
density 0.0041 0.0277 0.1653 0.0256
degree distribution mean 3.88 10.36 17.82 4.47
degree distribution var 0.4638 174.46 1274.97 259.08
eccentricity mean 1.7800 4.4104 29.2400 2.4913
eccentricity var 0.2784 4.6160 146.8862 9.6251
eccentricity max 3 8.7211 54 4.8585
neighborhood mean 4.8800 11.3557 18.8163 4.4670
neighborhood var 0.4571 173.7798 1272.3148 258.3377
neighborhood min 1 5.7161 14 2.8536
neighborhood max 7 75.8607 312 72.7306
path length mean 1.3008 2.7941 14.2195 1.4005
path length var 0.2229 1.9988 53.2144 3.8514
closeness min 0.0020 0.3229 0.5698 0.1275
closeness mean 0.0476 0.4007 0.6150 0.1132
closeness max 0.0514 0.5324 0.9672 0.1824
closeness std 0.0090 0.0306 0.1137 0.0151
edge betweenness min 1 1.0034 3 0.0467
edge betweenness mean 1.8272 45.2005 1531.7422 101.3437
edge betweenness max 8.3333 441.7764 12259.0184 782.9232
edge betweenness std 1.0408 51.3070 1781.8689 117.5626
Table 2. Considered properties and their statistical distributions across the graphs10k dataset.