The Internet Topology Zoo
ABSTRACT The study of network topology has attracted a great deal of attention in the last decade, but has been hampered by a lack of accurate data. Existing methods for measuring topology have flaws, and arguments about the importance of these have overshadowed the more interesting questions about network structure. The Internet Topology Zoo is a store of network data created from the information that network operators make public. As such it is the most accurate large-scale collection of network topologies available, and includes meta-data that couldn't have been measured. With this data we can answer questions about network structure with more certainty than ever before - we illustrate its power through a preliminary analysis of the PoP-level topology of over 140 networks. We find a wide range of network designs not conforming as a whole to any obvious model.
Simon Knight, Hung X. Nguyen, Nickolas Falkner, Rhys Bowden, Matthew Roughan
University of Adelaide, Australia
What is a zoo? The term zoo is a common abbreviation
for “zoological garden”, a place where animals are
kept and exhibited to the public, the earliest of which was
that of the London Zoological Society, established 1828 in
Regent’s Park. There are many much older collections of
animals, but they were described by terms such as menagerie
and, apart from the name change, modern zoos differ from
a typical menagerie in their guiding principles. The modern
zoo is not just concerned with entertainment, but also with
conservation, education and scientific research.
The collection described in this paper is a set of network
topologies, not animals, so perhaps the term zoo (from the
Greek zōon or “animal”) is inappropriate. However, our goals
are those of a modern zoo.
Our Zoo presently consists of over two hundred network
topologies from network providers. It is distinguished from
other such datasets by the method of collection: we do not
use an automated procedure such as a traceroute survey as in
previous studies (for example, see –). The problems with
such surveys are well documented , . Here we base our
collection on promotional data: maps and other information
self published by the owner or manager of the network in
question. The results add to the evidence that traceroute is a
difficult tool to use for determining topology. For instance, we
found one case  where the authors declared that a network
(Cogent) had 35 PoPs when the network operator themselves
advertises 184 PoPs1.
Natural questions arise concerning the accuracy of our
data as well, and we discuss this issue later, but our central
argument is that although a published network map may not
reflect the exact nature of the underlying network at the current
time, it certainly does show the network that the company
intended. Such a map reveals something that no measurement
can see: what was in the mind of the network engineer when
the network was designed, rather than what was built to meet
the realities of day-to-day operation.
The ability to see what a network engineer, manager, or
operator believed was important about their network provides
insights that traceroute studies lack. Of course, for some
purposes a precise map of exactly what currently exists is
more useful, and so we see collections of data such as this as
complementary to good measurements. In fact, we hope that
the Zoo can actually help improve the standard of measured
topologies by providing a dataset against which to compare
results. However, the network maps we use often provide
meta-data about a network that is otherwise unavailable, or
at best, subject to large inference errors. For instance we
can often see link capacities, node locations, node roles,
interconnect locations and so on.
It may be surprising to some that we can collect so many
network topologies in this way, but the 200+ networks that
we have at present do not even cover all of those that we know
(conversion is a time consuming process which is ongoing),
and there are no doubt many others that we have not yet found.
It seems to be quite common for network operators to publish
some form of information about their network. Some even
describe their network in legally binding documents such as
company reports (e.g., –). Moreover, as networks evolve
we expect that companies will update network maps, so we
will see the development of these networks. For these reasons,
we see the Zoo as an ongoing project. The web page allows
for contributions, so in addition to our ongoing efforts we hope
to recruit others to add networks to the Zoo.
As with other zoos, there are three primary goals for
this collection, starting with scientific research. There is an
extensive debate under way on the nature of network topology.
On the one hand lie the random graph models (starting with
Erdős-Rényi and Gilbert  and going forward through
Waxman , and more recently power-law graphs –
). On the other hand lie “designed networks” such as the
structured networks of GT-ITM ,  or HOT (Highly
Optimized Tolerance) graphs . Proponents of power-law
and HOT graphs seem convincing, but both are hampered by
lack of accurate data. In the few cases where a commercial
network has been used the data have not been published. This
lack of accurate, public data has been a severe constraint,
preventing performance of repeatable scientific research. One
may reasonably argue that scientific progress cannot be made
in this area without an accurate, public set of data.
In addition to scientific interest in the very nature of these
networks, the provision of network datasets in an easily
accessed format may prove useful to the broader network
community: network topologies are necessary for testing many
networking algorithms. At present, very few are available apart
from those based on traceroute studies.
The Zoo’s second goal is education, though this is obviously
related to the research value of the data. We will learn
lessons from this research, and those lessons will help educate
researchers2. The Zoo will also provide a set of network
examples to help educate the next generation of network
engineers. Most current examples are contrived. Real examples
are more compelling.
The third goal of the Zoo is conservation, though not
in the ecological sense. The maps and other promotional
materials that we use here are ephemeral. Once their use-by
date has passed they are removed from public view and,
more than that, many companies do not make any attempt
to store historical archives of their network designs. Apart
from scientific uses, such data may even be useful to those
companies at some point in the future to understand the
development of their own network. The Zoo currently includes
a number of historical network maps, going back to those of
the ARPANET, published by Cerf and Kahn . The Zoo
contains several other examples for which we have multiple
views of that network’s development over time.
This paper is more than a description and classification of
the Zoo itself. We also present, in Section IV, some prelim-
inary analysis of these networks. We focus in this paper on
maps at the Point of Presence (PoP) level, which are interesting
because this level relates to the network design problem. It
is also the level which concerns peers and customers as it
determines where they can connect to the network, and it’s
also the level at which reliability and redundancy are often
considered. What we see in the 141 networks studied is that
there is no “one true network model”. There is a very wide
range of networks, from hub-and-spoke networks, to
trees, to more highly connected graphs. However, we do
observe some trends. In particular we see more hierarchy with
increasing network size. We also make one new observation
that the neighborhood of a PoP seems to be limited to about 20
or so other PoPs. There is no physical or technical constraint
that enforces this at the PoP level, so it will be interesting in
the future to explore the reason for the presence of this limit.
2In fact, some of what we learn from this data is already accepted by
network engineers, and so part of the value of this data is in educating
researchers. However, it is dangerous to be too trusting of received wisdom.
We should maintain skepticism, and verify even that which is “well-known”.
II. THE DATA
Before we begin discussing the details of our topological
data, let us first define our terminology precisely, as
topology is a woefully abused term. We define topology to
mean an undirected graph G = (N,E), which abstracts
the connectivity of a data communications network. In fact,
we really mean a multigraph, as multiple edges are allowed
between a single pair of nodes (formally, E is a multiset).
Care must be taken to define the nature of the nodes
and edges of the graph. Internet topologies have been given
for each of the seven OSI layers; for instance, edges may
refer to physical cables, virtual network layer connections, or
even the HTML links between WWW pages. Other types of
topology are also possible, such as those reflecting hierarchical
approximations, say by combining some groups of routers
into Autonomous Systems (ASs) or Points-of-Presence (PoPs).
The Zoo contains topologies of various levels of detail, from
physical fiber, through to virtual/logical connectivity between
ASs. We admit most types of networks to the Zoo, but ensure
that in the data the types of nodes and links are precisely
defined.
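To make the multigraph definition at the start of this section concrete, the following is a minimal sketch in Python that stores E as a multiset of unordered node pairs. The node names and the helper names (make_edge, degree) are illustrative assumptions and are not part of the Zoo data or its tools.

```python
from collections import Counter


def make_edge(u, v):
    """Return an order-independent key for an undirected edge."""
    return tuple(sorted((u, v)))


# N is the node set; E is a multiset of undirected edges.
# A Counter maps each unordered pair to its multiplicity.
nodes = {"Adelaide", "Melbourne", "Sydney"}  # hypothetical PoP names
edges = Counter()
edges[make_edge("Adelaide", "Melbourne")] += 1
edges[make_edge("Melbourne", "Adelaide")] += 1  # parallel link, same pair
edges[make_edge("Melbourne", "Sydney")] += 1


def degree(node):
    """Count the edges incident to `node`, including parallel edges."""
    return sum(mult for (u, v), mult in edges.items() if node in (u, v))


print(edges[make_edge("Adelaide", "Melbourne")])
print(degree("Melbourne"))
```

Because the edge key is sorted, the two links between the same pair of nodes accumulate in one entry regardless of the order in which the endpoints are given.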
There are various strategies available for measuring network
topology. The most direct way is to ask the network itself.
IP routers are managed through configuration files describing
the current operation of the router, and which can be used to
measure a network 3. Precisely because of the quality of
information contained in these files they are considered highly
sensitive and are rarely allowed outside an organization. Such
data may be used to construct the type of map we use here,
but is otherwise rarely available to researchers.
The second class of techniques involves IP-level hacks that
ideally return the path between two points. The IP header
option field “record route” ,  returns the route of a
packet as it traverses the network, but is often not enabled
due to security and performance concerns. The more common
approach is traceroute , . Despite being com-
monly used, traceroute has many well-known deficiencies
(summarized in , ). There are nevertheless many studies
of network topology using traceroutes (for examples see –
, ), but the resulting network topologies are not very
accurate , . Moreover, verification of these topologies is
made difficult by lack of ground-truth. One of the potential
uses of the Zoo data is to establish ground-truth data to use
in improving measurement-based approaches, which can in
principle survey a wide range of networks.
We performed comparisons between our dataset and one
of the most recent and advanced traceroute-based methods
and found large differences, for instance in the case of
Cogent’s network in . There are several possible causes for
this discrepancy, the most likely being traceroute measurement
errors, differing definitions of a PoP, and differing network
boundaries. However, we maintain that a network operator is
in a better position to define details such as the edge of its
network, and so their view should take primacy.
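One simple way to quantify such a comparison is to treat the operator's map as the reference and score a measured edge set against it. The sketch below is an illustration only, not the procedure used in this paper. It assumes that node names in the two datasets have already been reconciled, which in practice is the hard part given differing definitions of a PoP. The function names and the toy edge lists are hypothetical.

```python
def undirected(edges):
    """Normalise an iterable of (u, v) pairs into a set of unordered pairs."""
    return {frozenset(edge) for edge in edges}


def compare_topologies(reference_edges, measured_edges):
    """Return (precision, recall) of measured edges against a reference."""
    ref = undirected(reference_edges)
    meas = undirected(measured_edges)
    common = ref & meas
    precision = len(common) / len(meas) if meas else 0.0
    recall = len(common) / len(ref) if ref else 0.0
    return precision, recall


# Toy data: four reference links, three measured links, two in common.
reference = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
measured = [("B", "A"), ("B", "C"), ("B", "D")]
precision, recall = compare_topologies(reference, measured)
print(precision, recall)
```

Precision falls when the measurement reports links the operator does not show, and recall falls when it misses links the operator does show, so the two numbers separate the two kinds of error.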
The third group of strategies for topology inference is
based on network tomography. The statistical nature of these
approaches again leads to errors.
Instead of the existing automated methods we adopt here a
simple, manual approach. Many companies present public ma-
terial about their network, primarily for promotional purposes.
They wish to sell their network. We capture this information,
and manually transcribe it into a common data format – in the
following section we describe in detail our process of capture
and analysis for these datasets.
3A related approach is to use a routing monitor (e.g., ), which observes
routing protocols and uses this information to construct a network topology,
but this also requires privileged access to the network in question.
B. Data Collection and Formatting
Some network operators provide a piece by piece descrip-
tion of their network, but the most common form of published
information available to us is a network map. Such maps often
show PoPs and their interconnects, but may provide much
more detail. Some care goes into such maps because they are
a form of advertisement and therefore have legal requirements
for accuracy; they are highly visible to potential customers;
and finally, network engineers are often proud of their work,
and many would very much like to display it at its best.
Often these maps are simple images, but in some cases
they are interactive maps (for example using Flash). In the
case of images, we manually download the map and then
transcribe it into an annotated graph. Dynamic maps are more
difficult, and often require several passes to zoom in on details
and transcribe. Supplemental data in the form of equipment
registers or other descriptions of the network are used where
available to label links and nodes. We have collected over 200
such maps and associated data, and make no claim that we
have an exhaustive list.
Maps are converted using yEd , a freeware tool for use
in graph-drawing. It allows us to trace the network elements
such as routers and PoPs directly overlaid on the map itself,
with annotations such as node names and edge capacities. A
graphical diagram editor speeds up the tracing and annotation
steps, and reduces errors by allowing a visual comparison of
the original source image and the transcribed network.
Once we are satisfied with the transcribed network, we
export the topology into GML (Graph Modelling Language)
format. yEd supports a number of graph formats (for instance
GML and GraphML), but GML meets our immediate needs
for a flexible, easily readable format. We wanted a format
that was easily computer readable, but also human readable. A
graph can be most easily understood pictorially, but it enhances
our ability to double-check data if we can read it without the
intervention of third-party software.
We also needed a graph format that was extensible. Dif-
ferent network operators provide different information in their
network descriptions. Some may provide PoP-level or router-
level maps, or detailed information about the physical media
used for a link, while others may show links that are planned
for the future. We don’t know all of the data that we might
need to store, and so we need a format that can be extended.
Binary file formats are compact, but are difficult to read and
to extend. Adjacency matrices capture the graph’s structure but
have limited scope for storing attributes such as node names.
GML  is a simple, text-based format. It simply lists
nodes and edges, with extensions to allow node and edge
attributes to be stored. Attribute information is represented
inside square brackets as key-value pairs. GML is also sup-
ported by a number of tools, and easily ported to other formats
(we provide simple scripts to do so). Hence we use GML as
our core file format. We provide a simple network example in
Figure 1, to illustrate the data format.
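To show how little machinery the key-value structure described above requires, the following is a minimal sketch of a GML reader using only the Python standard library. It is not the Zoo's own tooling and it ignores parts of the format such as comments and escaped characters. The sample graph, its node labels and the function names are assumptions made for illustration; in practice an established library would normally be used instead.

```python
import re

# One token per match: quoted string, "[", "]", or a bare key/number.
TOKEN = re.compile(r'"([^"]*)"|(\[)|(\])|([^\s\[\]"]+)')


def to_number(text):
    """Convert a bare GML value to int, falling back to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_gml(text):
    """Parse GML text into a nested list of (key, value) pairs."""
    tokens = []
    for match in TOKEN.finditer(text):
        quoted, opening, closing, atom = match.groups()
        if quoted is not None:
            tokens.append(("str", quoted))
        elif opening:
            tokens.append(("open", opening))
        elif closing:
            tokens.append(("close", closing))
        else:
            tokens.append(("atom", atom))

    pos = 0

    def parse_pairs():
        nonlocal pos
        pairs = []
        while pos < len(tokens) and tokens[pos][0] != "close":
            key = tokens[pos][1]
            kind, value = tokens[pos + 1]
            pos += 2
            if kind == "open":
                value = parse_pairs()
                pos += 1  # step over the matching "]"
            elif kind == "atom":
                value = to_number(value)
            pairs.append((key, value))
        return pairs

    return parse_pairs()


# Hypothetical sample: two nodes joined by one annotated edge.
SAMPLE = '''
graph [
  label "example"
  node [ id 0 label "Adelaide" ]
  node [ id 1 label "Melbourne" ]
  edge [ source 0 target 1 Label "2.5 Gbps, Ethernet" ]
]
'''

graph = dict(parse_gml(SAMPLE))["graph"]
nodes = {dict(v)["id"]: dict(v) for k, v in graph if k == "node"}
edges = [dict(v) for k, v in graph if k == "edge"]
print(nodes[0]["label"], edges[0]["Label"])
```

Keeping the parsed form as a list of pairs rather than a dictionary preserves repeated keys such as node and edge, which is what allows the same structure to hold arbitrary additional attributes.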
However, we understand that other users of this data may
find other formats more convenient. XML-based languages
such as GraphML  are easily parsed by machines —
XML processing libraries exist for most popular programming
languages, making it simple to work with data from the
Zoo. XML is also extensible by design, allowing it to handle
arbitrary attributes. We use GML as our core file format, but
provide the data in GraphML format as well. We also provide
scripts to read and convert the graph data into other formats
such as a simplified adjacency matrix representation.
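As an illustration of such a conversion, the sketch below reads a GraphML document with Python's standard XML library and reduces it to an adjacency matrix. It is not one of the scripts distributed with the Zoo. The sample document and the function name are assumptions, and the matrix deliberately discards all node and edge attributes.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical three-node example document.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key attr.name="Label" attr.type="string" for="edge" id="d22" />
  <graph edgedefault="undirected" id="">
    <node id="0" />
    <node id="1" />
    <node id="2" />
    <edge source="0" target="1">
      <data key="d22">2.5 Gbps, Ethernet</data>
    </edge>
    <edge source="1" target="2" />
  </graph>
</graphml>
"""


def adjacency_matrix(graphml_bytes):
    """Return (node ids, symmetric matrix of edge counts) for a GraphML graph."""
    root = ET.fromstring(graphml_bytes)
    graph = root.find("g:graph", NS)
    ids = [node.get("id") for node in graph.findall("g:node", NS)]
    index = {node_id: i for i, node_id in enumerate(ids)}
    matrix = [[0] * len(ids) for _ in ids]
    for edge in graph.findall("g:edge", NS):
        i, j = index[edge.get("source")], index[edge.get("target")]
        matrix[i][j] += 1  # counts allow for parallel edges
        if i != j:
            matrix[j][i] += 1
    return ids, matrix


ids, matrix = adjacency_matrix(SAMPLE.encode("utf-8"))
for row in matrix:
    print(row)
```

Storing edge counts rather than 0/1 entries keeps parallel links visible, which matters because the topologies are multigraphs.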
One of the major advantages of GML is that it can be read
by NetworkX , an open-source graph library for the Python
programming language. NetworkX is fast, well supported and
includes many graph analysis algorithms. It is these we use to
perform much of the statistical analysis presented later.
One of the chief advantages of our approach is that many
network maps contain additional data. We include such meta-
data in the records, for instance:
• link types or speeds;
• longitudes and latitudes of nodes obtained through
geocoding of PoP locations;
• a URL (Uniform Resource Locator) showing where the
data was obtained;
• the date-of-record, i.e., the date that the map was repre-
sentative of the network (in cases where the network map
was dynamically generated we record that);
• we also record the date we obtained the network map;
• a classification of the type of network (this point
requires much more discussion, which we give in
Section III);
• a link to other related networks.
How accurate is the Zoo’s data? The maps are created by
network companies themselves, so they are directly based on
ground truth. However, some network operators clearly pro-
duce these maps manually, potentially leading to inaccuracies
in their depiction of their own network. There are two reasons
that these errors are less significant than those of prior studies.
• The network maps we use are all public documents, and
so must satisfy standard due diligence requirements for
an advertisement or official corporate publication. That
is not to say that all corporations are perfect – it is easy
to make mistakes in drawing the map – but a network
operator is unlikely to publish a worse map than the one
they use in their own network operations.
• Some network maps may idealize the network. However,
we argue that in these cases, we are seeing what was in
the mind of the network engineer when the network was
designed. In this sense, the idealized view of the network
may be more interesting than its implementation (though
for some purposes it may be preferable to see exactly
what was implemented).
Fig. 1: Example of GML and GraphML file formats. Many of the tags will be explained in the following section.
(a) Network aus simple (map; longitude axis from 117°E to 153°E).
(b) GML (excerpt):
Network "aus simple"
Classification "Backbone, Transit"
Creator "Topology Zoo Toolset"
label "aus simple"
Label "2.5 Gbps, Ethernet"
(c) GraphML (excerpt):
<?xml version="1.0" encoding="utf-8"?>
<key attr.name="key" attr.type="int" for="edge" id="d23" />
<key attr.name="Label" attr.type="string" for="edge" id="d22" />
<key attr.name="Speed" attr.type="string" for="edge" id="d21" />
<graph edgedefault="undirected" id="">
<data key="d0" />
<data key="d3">aus simple</data>
<data key="d9" />
<data key="d12">aus simple</data>
<edge source="0" target="1">
<data key="d22">2.5 Gbps, Ethernet</data>
On the other hand, network operators do perform simplifications
in some cases; most notably, many of the maps report
PoP-level, not router-level topologies. The datasets include the
level at which they are defined, and it is important to be aware
of this issue when using this data for research. For a very
simple instance, consider a network reliability study. A single
PoP may consist of a number of redundant routers, so the
likelihood of the whole PoP failing is much smaller than for
a node in a router-level graph.
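The reliability example can be put into numbers under a simplifying assumption: if the routers in a PoP fail independently, each with the same probability, the PoP is lost only when all of them are down together. The sketch below encodes this. The failure probability is an arbitrary illustrative value, and the independence assumption is optimistic, since routers in one PoP often share power and premises.

```python
def pop_failure_probability(router_failure_prob, redundant_routers):
    """Probability that every router in a PoP is down at the same time,
    assuming independent failures with identical probability."""
    return router_failure_prob ** redundant_routers


single = pop_failure_probability(0.01, 1)     # node in a router-level graph
redundant = pop_failure_probability(0.01, 2)  # PoP with two redundant routers
print(single, redundant)
```

Under these assumptions a second router reduces the failure probability of the node by a factor of one hundred, which is why a PoP-level graph should not be fed unchanged into a router-level reliability model.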
Most of the maps in the Zoo come directly from the network
operators, but some have been derived from secondary sources.
We don’t wish to exclude any interesting data from the Zoo;
however, in these cases, the data is potentially less reliable.
Hence, we include in the data a “provenance” field taking the
form: primary (meaning it comes from the operator itself),
secondary (from a reputable secondary source, for instance the
scientific literature) or unknown. For studies requiring accurate
maps we suggest that only data with primary provenance are
used.
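Selecting networks by provenance is then a simple filter. The following sketch assumes each network's meta-data has been loaded into a dictionary with a "provenance" key. The key name, the record layout and the network names are hypothetical and may differ from the field names in the distributed files.

```python
# Hypothetical meta-data records, one per network.
networks = [
    {"name": "net-a", "provenance": "primary"},
    {"name": "net-b", "provenance": "secondary"},
    {"name": "net-c", "provenance": "unknown"},
    {"name": "net-d"},  # field missing: treated as unknown
]


def with_provenance(records, allowed=("primary",)):
    """Keep only records whose provenance is in `allowed`."""
    return [r for r in records if r.get("provenance", "unknown") in allowed]


accurate = with_provenance(networks)
print([r["name"] for r in accurate])
```

Treating a missing field as unknown is the conservative choice, since it keeps unlabelled data out of studies that need accurate maps.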
A second question of accuracy is “How accurate are our
transcriptions?” We have transcribed a large number of maps
so it is inevitable that some errors occur. However, we have
tried to minimize errors by (i) using a graphical tool so that the
transcription process is closely matched to the maps, and (ii)
making sure that each network is transcribed by one person,
and then checked by at least one other person.
Despite this care, there are still difficulties in interpreting
some maps. The most pernicious problem is links that join
without a node. There are two possible explanations for such
joins: (i) that there is really a three way link, and that the
correct graph representation is to join three nodes (with three
links), or (ii) that there is a y-junction, and one node is
connected to the two others, but there is no third link. However,
we do not know which is the reality, and so we introduce a
“blank” node at the join. This is the biggest source of potential
inaccuracy in the Zoo, but these nodes are flagged in the data,
and so it is possible to take the appropriate care to eliminate
problems caused by such ambiguities.
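A user of the data who needs a graph without blank nodes can generate both interpretations and check whether a result is sensitive to the choice. The sketch below is one possible way to do so and is not part of the Zoo tools. It represents edges as pairs of node names, and the function name, mode names and example nodes are assumptions.

```python
from itertools import combinations


def resolve_blank(edges, blank, mode, hub=None):
    """Replace a flagged blank node by one interpretation of the join.

    mode="mesh": link every pair of the blank node's neighbours (case i).
    mode="y":    link `hub` to each other neighbour, nothing else (case ii).
    """
    neighbours = sorted(
        {v if u == blank else u for u, v in edges if blank in (u, v)}
    )
    kept = [edge for edge in edges if blank not in edge]
    if mode == "mesh":
        new = list(combinations(neighbours, 2))
    elif mode == "y":
        if hub not in neighbours:
            raise ValueError("hub must be a neighbour of the blank node")
        new = [(hub, n) for n in neighbours if n != hub]
    else:
        raise ValueError("mode must be 'mesh' or 'y'")
    return kept + new


# Toy example: links from A, B and C meet at the blank node X.
edges = [("A", "X"), ("B", "X"), ("C", "X"), ("C", "D")]
mesh = resolve_blank(edges, "X", "mesh")
y_junction = resolve_blank(edges, "X", "y", hub="A")
print(mesh)
print(y_junction)
```

In the y-junction case the map alone does not say which node is the hub, so a careful analysis would repeat the computation for each candidate.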
Another potential source of error is a network where it’s
too complicated to follow the tangle of links, or where it’s
unclear whether some nodes are real or logical. In such cases,
we exclude that network from the Zoo.
Ongoing quality control is an important part of this project,
and the web page also has links to a discussion forum, to
allow ongoing contributions to the accuracy of the dataset.
The forum provides a way for users of the data to point out
flaws, either in transcription or interpretation of the data. The
ongoing improvement of the quality of the Zoo’s data is as
important as care in the initial collection.
E. Visiting the Zoo
The data is stored at www.topology-zoo.org. It is viewable
through a table containing meta-data about the networks, or
as a single archive file. Scripts are provided for easy access
and translation of the data. An additional goal of the web page
is to allow contributions to the Zoo from third parties.
The data at present consists only of data provided by operators;
however, we see no reason in principle why we could
not include crawled topologies, for instance, social network
topologies. Obviously such data would need to be classified
and tagged appropriately, and details of the data’s limitations
published. Further, we require that additional datasets conform
to the same data format, though writing translation scripts is
not usually difficult (the tools we currently provide include
a translation script for Rocketfuel data ). GML is flexible
enough to allow for such extension.
Finally, because the dataset represents a growing collection,
we use a version control system to keep track of the state.
Archives of the dataset at particular time points will be kept
for comparisons with past studies. We ask of any researchers
who make use of this data that, apart from taking care to first
understand the limitations of the data as documented above,
they cite the exact version of the dataset they use.
Having collected a number of inhabitants for our Zoo,
we are left with the question of how to classify each
of our new inhabitants. Classification of species or entities is
important for several reasons. First, we wish to describe what
the Zoo contains. Second, we suspect that different types of
networks will have different qualities, and we wish to test
that hypothesis. Third, we can now identify when we have
discovered something new and hence extend our classification.
There have been a number of prior works on classifying
networks, in particular, Autonomous Systems (ASs) , .
The focus of these has been machine learning techniques
applicable to classification of all the tens of thousands of ASs
in the Internet. Here our focus on a smaller subset allows us to
classify the networks manually. We also have a more detailed
data source in the information a network operator provides.
Past classification efforts ,  have tried to cluster
networks into disjoint categories. We could easily extend this
into a tree-like classification resembling the Linnaean taxonomy of
living organisms . However, the tree-like classification
popularized by Linnaeus in his Systema Naturae  is hard
to apply. Even in biology, where the tree-based taxonomy
was later ratified by the theory of evolution, there are many
cases where purely descriptive taxonomy fails to identify the
evolutionary tree. For instance, compare anteaters in Australia
and South America: both have similar adaptations for their
exclusive diet of ants and termites, but the Australian Anteater
or Echidna is not even a placental mammal (it lays eggs).
Linnaean taxonomy also fails completely at corner cases such
as the Duck-billed Platypus (this and the Echidna are the only
Monotremes). We see similar problems in the existing work on
network classifications, and here there is, as yet, no underlying
tree-structure to justify a tree-like classification system.
We must remember that Linnaeus’ system was not originally
proposed as a true categorization of natural groups, but was
to provide clear identification. The initial point of such
classification is to simply list features of organisms, and it is this
approach we adopt. We create a set of binary classifications
describing whether a network has a particular feature, but these
features are not disjoint. The advantage is that we can easily
handle corner cases that would be problematic for an exclusive
classification scheme, and do so with far fewer classes.
Classification tags have the additional advantage that it
is easy to add new types without changing the existing
classifications, something that would be impractical with a
disjoint class model. This has proved useful as we learnt of
new classifications that could be usefully added to the data.
The classifications we have added so far are described below.
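The non-disjoint tagging scheme maps naturally onto sets of tags per network. The sketch below is only an illustration of that idea: the network names are hypothetical, "Backbone" and "Transit" are taken from the example in Fig. 1, and "Customer" and "Testbed" are invented tags used to show that a new tag can be added without touching existing ones.

```python
# Each network carries a set of tags; tags are not mutually exclusive.
tags = {
    "net-a": {"Backbone", "Transit"},
    "net-b": {"Backbone"},
    "net-c": {"Transit", "Customer"},
}


def has_tag(name, tag):
    """Binary classification: does this network have the feature?"""
    return tag in tags.get(name, set())


def networks_with(*wanted):
    """Names of networks carrying every tag in `wanted`."""
    return sorted(n for n, t in tags.items() if set(wanted) <= t)


# Adding a new kind of tag leaves all existing classifications unchanged.
tags["net-b"].add("Testbed")

print(networks_with("Backbone"))
print(networks_with("Backbone", "Transit"))
```

A disjoint scheme would need a separate class for every combination of features, whereas here a network with several features simply appears in several queries.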