Empirical Models of TCP and UDP End–User Network Traffic from
NETI@home Data Analysis
Charles R. Simpson, Jr., Dheeraj Reddy, George F. Riley
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332–0250
{rsimpson,dheeraj,riley}@ece.gatech.edu
Abstract
The simulation of computer networks requires accurate
models of user behavior. To this end, we present empirical
models of end–user network traffic derived from the anal-
ysis of NETI@home data. There are two forms of mod-
els presented. The first models traffic for a specific TCP
or UDP port. The second models all TCP or UDP traffic
for an end–user. These models are meant to be network–
independent and contain aspects such as bytes sent, bytes
received, and user think time. The empirical models de-
rived in this study can then be used to enable more realistic
simulations of computer networks.
1. Introduction
The simulation of computer networks has become a popular
method to evaluate characteristics of these networks across
a wide range of topics, including protocol analysis, routing
stability, and topological dependencies, to name a few.
However, for these simulations to yield meaningful results,
they must incorporate accurate models of their simulated
components.
One such component is end–user traffic generation. This
component should be network–independent so that it can be
used in a wide variety of simulation configurations without
dependency on the simulated environment. These traffic
models should be updated frequently, using recent
measurements, to accurately reflect the changing nature and
uses of the Internet. Further, such measurements should
represent the heterogeneous connection methods and diverse
locations of Internet users. To this aim, we have developed
network–independent traffic models for network users based
on data gathered by the NETI@home infrastructure.
The remainder of this paper is organized as follows. Section
2 presents work related to this study. Next, Section 3
describes the dataset used for this study and the
methodology used to create our models. Section 4 discusses
the experimental results of our study and Section 5
describes the simulation used to demonstrate and validate
our models. Finally, Section 6 discusses several areas of
future work and we conclude in Section 7.
2. Background and related work
Portions of this work are based on work presented in [13]
and [17] and we have chosen to adopt much of their
nomenclature. However, we have attempted to expand upon
their work in several ways. First, the work in [13] is based
on packet traces collected from a campus network. In an
attempt to represent more typical end–users, we use data
collected by the NETI@home project. Also, the studies
conducted in [13, 17] were specific to TCP connections on
port 80. In this study, we model any given TCP or UDP port,
as well as all TCP or UDP traffic aggregated.
NETI@home [16] (Network Intelligence at home) is an
open–source software package named after the popular
SETI@home [1] software. The NETI@home client is available
on the NETI@home website [15] and is designed to be run by
any client machine connected to the Internet. When run on a
client machine, the NETI@home software reports end–to–end
flow summary statistics to a server at the Georgia Institute
of Technology. The statistics collected and the
functionality of the software are discussed in [16]. Since
NETI@home is designed to run on end–user systems, it
provides a unique perspective into the behavior of both
end–users and their systems.
Previously, NETI@home data analysis has focused on aspects
relating to security [9]. In this paper, we utilize the
measurements made by NETI@home to generate traffic models
based on end–user behavior. NETI@home users represent a
heterogeneous mixture of network users from various
networks and geographical locations.
The need for accurate simulation models was discussed in
[8]. Several other studies have discussed modeling of either
application–specific [3, 4, 5, 6, 17] or general
[2, 10, 11, 18] end–user network traffic. Also, several
studies have used network traffic models in simulation
environments, including [7, 12, 19, 20].
3. Methodology
The models developed for this work are intended to be
network–independent. To this aim, we define several char-
acteristics of TCP and UDP flows that reflect this design
choice and attempt to wholly represent network client be-
havior.
There are two categories of models created in this study.
The first is specific to a TCP or UDP port; that is, we
create a model of client behavior for a given TCP or UDP
port. Throughout most of this paper, we use the model
created for TCP port 80, the most common port used by World
Wide Web servers, as an example. The second category of
model created is an aggregate of all port–specific models.
This model can be likened to a TCP or UDP client model. Such
a model may prove useful for studies that are more generic
and are not attempting to study a particular type of network
traffic. All of these models incorporate empirical
distributions derived directly from the NETI@home dataset.
The dataset used in this study consists of NETI@home data
collected over a one–year period from October 1, 2004 to
September 30, 2005. This dataset includes over 36 million
TCP flows and 93 million UDP flows, which form the basis of
this work, as well as various other flow types and
information about their corresponding hosts. Although an
exact calculation is not possible due to privacy settings
and dynamically assigned IP addresses, we estimate that
this data was collected by approximately 1700 users. These
users represent a heterogeneous sampling of Internet users
running some 8 different operating systems and reporting
from approximately 28 nations and 43 US ZIP Codes.
The first two aspects we model are empirical distribu-
tions of bytes sent and bytes received. These values are
based only on the payload of the packets and thus do not
represent the sizes of the TCP or UDP headers and their un-
derlying headers or TCP’s flow control and congestion con-
trol algorithms, merely transferred application information.
This allows our models to be used in simulations where
variations of TCP or UDP are employed.
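To make the payload-only accounting concrete, the following minimal sketch computes the application bytes carried by a single IPv4/TCP packet from its header length fields and sums them over a flow. The function names and the tuple layout are hypothetical and are not part of NETI@home; only the arithmetic follows the standard IPv4 and TCP header layouts.

```python
def tcp_payload_bytes(ip_total_length, ip_ihl_words, tcp_data_offset_words):
    """Application bytes in one IPv4/TCP packet (headers excluded).

    ip_total_length       -- IPv4 Total Length field, in bytes
    ip_ihl_words          -- IPv4 header length (IHL), in 32-bit words
    tcp_data_offset_words -- TCP Data Offset, in 32-bit words
    """
    ip_header = ip_ihl_words * 4
    tcp_header = tcp_data_offset_words * 4
    payload = ip_total_length - ip_header - tcp_header
    if payload < 0:
        raise ValueError("inconsistent header lengths")
    return payload


def flow_bytes_sent(packets):
    """Sum payload bytes over an iterable of (total_length, ihl, data_offset)."""
    return sum(tcp_payload_bytes(*p) for p in packets)


# Hypothetical flow: a SYN with options, a bare ACK, and one data packet.
example_flow = [(60, 5, 10), (40, 5, 5), (540, 5, 5)]
total = flow_bytes_sent(example_flow)
```

Under this accounting, handshake and acknowledgment packets contribute nothing, so only the data packet counts towards bytes sent.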
The next aspect modeled is user think time. User think time
is the term we use for the amount of time a client waits
before initiating another flow. For this aspect, we
developed two empirical distributions. One distribution
describes the user think time when consecutively accessing
a specific destination and the other describes the user
think time when contacting a new destination.
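One way the two think-time samples could be extracted from a client's flow records is sketched below. The `Flow` record and its fields are hypothetical (the actual NETI@home report format is described in [16]), and the sketch assumes that think time is the gap between the end of one flow and the start of the next, and that a "new" destination is any destination other than the previous one.

```python
from collections import namedtuple

# Hypothetical flow record: start and end times in seconds, destination IP.
Flow = namedtuple("Flow", ["start", "end", "dst"])


def think_times(flows):
    """Return (same_destination_gaps, new_destination_gaps) for one client."""
    ordered = sorted(flows, key=lambda f: f.start)
    same, new = [], []
    for prev, cur in zip(ordered, ordered[1:]):
        # Assumption: overlapping flows are treated as a zero-second wait.
        gap = max(0.0, cur.start - prev.end)
        if cur.dst == prev.dst:
            same.append(gap)
        else:
            new.append(gap)
    return same, new


# Hypothetical records for a single client.
records = [
    Flow(0.0, 1.0, "a"),
    Flow(1.5, 2.0, "a"),
    Flow(5.0, 6.0, "b"),
    Flow(5.5, 7.0, "b"),
]
same_gaps, new_gaps = think_times(records)
```

Each of the two lists would then feed its own empirical distribution.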
Another aspect modeled is consecutive contacts. Consecutive
contacts is the term we use for the probability that a
client will choose to initiate another flow with the last
destination contacted, rather than initiate a flow with a
new destination. For this aspect, we developed a single
empirical distribution.
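As an illustration of how consecutive contacts might be measured, the sketch below collapses a client's time-ordered sequence of destinations into run lengths, that is, the number of flows sent to the same destination in a row. The function names and the sample sequence are hypothetical and only meant to show the idea.

```python
from itertools import groupby


def consecutive_run_lengths(destinations):
    """Lengths of maximal runs of identical consecutive destinations."""
    return [sum(1 for _ in group) for _, group in groupby(destinations)]


def fraction_at_most(run_lengths, k):
    """Fraction of runs whose length is at most k (one point of the CDF)."""
    if not run_lengths:
        raise ValueError("need at least one run")
    return sum(1 for r in run_lengths if r <= k) / len(run_lengths)


# Hypothetical destination sequence for one client, in time order.
sequence = ["a", "a", "b", "a", "c", "c", "c", "d"]
runs = consecutive_run_lengths(sequence)
single_visit_share = fraction_at_most(runs, 1)
```

Here `runs` holds one entry per visit to a destination, and evaluating `fraction_at_most` over all observed run lengths yields the empirical distribution.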
Finally, the last aspect modeled is contact selection.
Contact selection is the term we use for the frequency
distribution of contacting specific destinations. This
distribution can be thought of as modeling the popularity
of a destination. For this aspect, we developed a single
empirical distribution.
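A possible way to tabulate contact selection is sketched below: count the visits each destination receives, then form the cumulative proportion of destinations with at most a given number of visits. The function name and the sample data are hypothetical, and the sketch treats each IP address as a distinct destination.

```python
from collections import Counter


def visit_count_cdf(destinations):
    """Return [(visits, proportion of destinations with <= visits), ...]."""
    visits = Counter(destinations)          # destination -> number of visits
    histogram = Counter(visits.values())    # number of visits -> destinations
    n_destinations = len(visits)
    cdf, cumulative = [], 0
    for count in sorted(histogram):
        cumulative += histogram[count]
        cdf.append((count, cumulative / n_destinations))
    return cdf


# Hypothetical destinations contacted across all monitored flows.
contacts = ["a", "a", "a", "b", "c", "c", "d"]
popularity_cdf = visit_count_cdf(contacts)
```

In this toy input, half of the destinations are visited once, and the single most popular destination accounts for the top of the distribution.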
One other aspect that we believe to be worth modeling is
related to idle time. For applications such as World Wide
Web transfers, this aspect has little meaning, as web pages
are simply requested and served. However, for interactive
applications such as SSH or telnet, there are periods of
time, during the flow, when no data is transferred. Using
the NETI@home data, however, it is difficult to
differentiate between network–dependent flow time and
network–independent flow time. We are aware of work
[10, 11] that attempts to capture this behavior and are
considering implementing a similar technique in the
NETI@home client software so that future models can
incorporate this aspect of user behavior.
4. Experimental results
From the analysis of the NETI@home dataset described
previously, we were able to generate a set of empirical
distributions for each component of our models. To download
the complete set of distributions and for any updates to
these distributions, please visit
http://neti.gatech.edu/research/user.html.
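For readers unfamiliar with empirical distributions, the sketch below shows one common way to build an empirical cumulative distribution function from raw samples. It is an illustrative assumption and not the analysis program used in this study; the sample values are invented.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F, where F(x) is the fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")

    def F(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical bytes-sent samples for ten flows.
bytes_sent = [0, 0, 0, 0, 300, 450, 800, 1200, 2500, 9000]
F = empirical_cdf(bytes_sent)
share_with_no_data = F(0)
```

Evaluating `F` on a grid of byte counts gives the step function that a CDF plot displays.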
4.1. Bytes sent
The number of bytes sent depends on the port modeled.
However, upon investigation of each modeled port, our
findings seem intuitive.
Figure 1 depicts the cumulative distribution function of
bytes sent for TCP port 80. Compared with previous studies
[13], these results contain many more flows with zero bytes
sent. However, upon investigation it does not appear that
these results are due to a single NETI@home user or are
anomalous. This difference in results is most likely due to
the fact that [13] was based on data collected from a campus
network, whereas NETI@home data contains users with less
reliable network connections. The zero bytes sent flows
typically represent flows in which the connection failed
during the TCP three–way handshake. Although these flows do
not generate much network traffic (usually no more than
three packets), they are significant in terms of numbers of
flows and most likely influence a user’s behavior.
Figure 1. CDF of bytes sent for TCP port 80 (x-axis: Bytes; y-axis: Proportion)
As can be seen in Figure 1, approximately 40 percent of
flows to TCP port 80 send little or no data. There are
several possible causes for the large number of flows
sending little or no data. First, many of these flows are
failed connection attempts. Many NETI@home users are
utilizing less reliable network connections such as dial–up
or wireless. Also, some of these flows may be to blocked
sites. Many browsers and third–party software block
advertisements and some organizations restrict the viewing
of certain websites. Finally, a handful of NETI@home users
periodically scan hosts on the Internet [9]. Considering
that these users know that their network connections are
monitored, it is unlikely that this scanning is intentional;
it may instead be the result of a virus or worm. While these
results could be considered anomalous, we believe that they
do indeed represent typical end–user behavior as seen on
the Internet. Almost all remaining flows send no more than
10 KB of data to the server.
4.2. Bytes received
The amount of bytes received by the client is also de-
pendent on the port modeled. Figure 2 depicts the cumu-
lative distribution function of bytes received for TCP port
80. Compared with [13], we also find that there are many
more flows with zero bytes received. As with our findings
for bytes sent, this is most likely due to failed connection
attempts.
The distribution for bytes received has a much longer tail
than that for bytes sent. Approximately 40 percent of flows
with a remote TCP port of 80 receive little or no data.
However, more than 10 percent of flows to this port receive
more than 10 KB of data.
Figure 2. CDF of bytes received for TCP port 80 (x-axis: Bytes; y-axis: Proportion)
4.3. User think time
The cumulative distribution function for user think time
to the same destination is given in Figure 3 and to
differing destinations is given in Figure 4, for TCP ports
23 and 80. These findings show a tendency towards shorter
user think times than was found in [13] for TCP port 80. We
can think of several reasons for this shortened user think
time. First, the World Wide Web has become much more popular
since the time of [13]’s publication. Also, it is likely
that NETI@home captures more data from active users than
from inactive users, as many users would simply turn off
their machines while not using them, thus disabling
NETI@home’s monitoring. This would bias our results towards
users who appear to be more active.
We chose to model the user think time to the same
destination separately from the user think time to a
different destination, even though Figures 3(a) and 4(a)
appear to be similar. We believe that it is still
appropriate to model these think times separately, as these
distributions can differ greatly for other TCP or UDP
ports, as is shown in Figures 3(b) and 4(b). These figures
show the distributions of think times for TCP port 23, the
port commonly used for telnet.
For connections to TCP port 80, the majority of user
think times tends to be less than 1 second. However, for
connections to TCP port 23 (telnet), the user think times
have a much heavier tail, with only approximately 40 per-
cent of flows having think times less than 100 seconds.
Figure 3. CDF of user think time to same IPs. Panels: (a) TCP port 80, (b) TCP port 23 (x-axis: Time; y-axis: Proportion)
Figure 4. CDF of user think time to differing IPs. Panels: (a) TCP port 80, (b) TCP port 23 (x-axis: Time; y-axis: Proportion)
4.4. Consecutive contacts
In Figure 5, we present the cumulative distribution function
for consecutive contacts for TCP port 80. These results also
show a tendency towards a lower number of consecutive
contacts than was found in [13]. However, this is intuitive
considering the number of “failed” connection attempts
observed previously.
Approximately 80 percent of the flows to TCP port 80 are
not consecutive; that is, the destination is contacted only
once in a row. Further, over 99 percent of visits to a
specific destination on TCP port 80 lasted for 10 or fewer
flows in a row. Therefore, it appears that users tend to
switch web destinations fairly often, as was noted in [13].
Figure 5. CDF of number of times an IP is contacted consecutively for TCP port 80 (x-axis: Number of Contacts; y-axis: Proportion)
4.5. Contact selection
Unlike [13], which used a Zipf distribution, we were able
to construct a cumulative distribution function for contact
selection due to the wide sampling offered by the NETI@home
dataset. Figure 6 presents this CDF for TCP port 80. One
possible source of inaccuracy for this aspect is the fact
that we are unable to determine if a specific destination
uses multiple IP addresses, thus reducing the frequency of
selection a given contact may appear to have.
As can be seen in Figure 6, for TCP port 80 servers the
distribution of the overall number of visits by NETI@home
users is quite varied and has a heavy tail. Many servers are
visited only a handful of times; however, many other servers
tend to be contacted quite often, with some servers
receiving millions of visits over the year studied.
Figure 6. CDF of relative frequency of server visits for TCP port 80 over a one year period (x-axis: Number of Visits; y-axis: Proportion)
5. Simulation results
To judge the usefulness of our models, we have incorporated
the above derived TCP traffic models into the GTNetS
environment [14]. The GTNetS environment already has some
HTTP traffic models as described in [13]; we added the
models derived from the analysis of the NETI@home datasets.
We consider this approach to be a better one for traffic
generation in network simulations, because NETI@home
datasets are more current and continue to be so [16]. An
analysis program generates these models automatically from
the NETI@home datasets. The traffic distribution models can
then be easily used by the application layer models which
drive a network simulation. In our simulation experiments,
we have concentrated on the World Wide Web traffic and the
HTTP models. Our implementation samples the empirical
distributions to determine the particular values used at a
given time. This seems a logical choice, since no single
distribution seems to fit the complete dataset verifiably.
We model the behavior of a web browser in GTNetS, which
sends an HTTP request to a designated webserver asking it to
send a certain length of data that constitutes the response.
When the simulation starts, the browser application chooses
a server randomly from a list of target servers. It then
chooses the response size that it wants to obtain from the
webserver from the CDF that describes the received bytes.
The size of the HTTP request packet is chosen from the sent
bytes CDF. It may request one or more objects within the
same TCP connection. Once the web browser application has
received the appropriate response, it proceeds to select a
different server or the same server for its next request
and waits for an amount of time. This amount of time, which
is obtained from the CDF that describes the user think time,
depends on whether the same server or a different server is
chosen.
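The browser behavior described above can be sketched outside of any simulator as follows. This is not GTNetS code: the class, the function, and all numeric values are hypothetical. The sketch assumes that each distribution is supplied as sorted (value, cumulative probability) points, that values are drawn by inverse–transform sampling, that the choice between the same and a different server is reduced to a single probability `p_repeat`, and that transfer time is ignored.

```python
import bisect
import random


class EmpiricalDistribution:
    """A CDF given as sorted (value, cumulative probability) points."""

    def __init__(self, points):
        self.values = [v for v, _ in points]
        self.cum = [p for _, p in points]
        if not self.cum or abs(self.cum[-1] - 1.0) > 1e-9:
            raise ValueError("CDF must end at 1.0")

    def quantile(self, u):
        """Smallest tabulated value whose cumulative probability exceeds u."""
        i = bisect.bisect_right(self.cum, u)
        return self.values[min(i, len(self.values) - 1)]

    def sample(self, rng):
        return self.quantile(rng.random())  # rng.random() lies in [0, 1)


def browse(servers, sent, received, think_same, think_diff,
           p_repeat, n_requests, seed=0):
    """Return a list of (time, server, request_bytes, response_bytes)."""
    rng = random.Random(seed)
    clock = 0.0
    server = rng.choice(servers)  # initial server chosen at random
    events = []
    for _ in range(n_requests):
        events.append((clock, server, sent.sample(rng), received.sample(rng)))
        if rng.random() < p_repeat:
            clock += think_same.sample(rng)   # same server next
        else:
            others = [s for s in servers if s != server] or servers
            server = rng.choice(others)       # different server next
            clock += think_diff.sample(rng)
    return events


# Hypothetical distributions; real ones would come from the dataset.
sent_cdf = EmpiricalDistribution([(0, 0.4), (500, 0.7), (10000, 1.0)])
recv_cdf = EmpiricalDistribution([(0, 0.4), (4000, 0.9), (50000, 1.0)])
think_cdf = EmpiricalDistribution([(0.5, 0.6), (30.0, 1.0)])
trace = browse(["s1", "s2", "s3"], sent_cdf, recv_cdf,
               think_cdf, think_cdf, p_repeat=0.2, n_requests=5)
```

Sampling the tabulated points directly, rather than fitting a parametric curve, mirrors the choice described above of drawing values from the empirical distributions.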
The network topology for simulations is obtained from [7].
It consists of a large set of web browsers connected via a
series of three routers to a webserver as shown in Figure 7.
We have chosen this to be our baseline topology because we
have earlier simulation experiments conducted using the
models and datasets proposed in [7].
The simulation experiment is run using two HTTP traf-
fic models. One of the traffic models is obtained from the
datasets suggested in [13] and [7]. The other traffic model