Effects of Shared Bandwidth on Anonymity of the I2P Network Users

Khalid Shahbar and A. Nur Zincir-Heywood
Faculty of Computer Science
Dalhousie University
Halifax, Canada
Email: {Shahbar, Zincir}@cs.dal.ca
Abstract—In this paper, we explored what could be achieved
by a potential attacker on the I2P network in terms of application
and user profiling. In both cases, the effect of bandwidth
sharing and participation has been analyzed. To explore this,
we used a machine learning based approach to analyze the
flows extracted from the traffic generated by the applications
and the users. Our results show that profiling the users and
applications on the I2P network is possible. The amount of
shared bandwidth has an effect on the accuracy of profiling
the users and the applications. Furthermore, applications that
do not use the shared clients tunnels increase the possibility
of profiling the behavior of the flows.
Keywords-I2P network; Traffic Flow; Anonymity; Data Analytics
I. INTRODUCTION
The available anonymity systems on the Internet work on
the concept of separating the user’s identity and his/her final
destination to provide anonymity. This separation is achieved
by indirectly connecting the user to the final destination
through multiple stations. The number of stations that are
employed for this separation varies based on the anonymity
system used. For each station the user connects, another
layer of encryption is added to the user’s information.
Therefore, the information that could potentially link the
user to the final destination (e.g. the website that the user
browses) is not known by any of the intermediate stations
(nodes) that carry the user’s data. The stations on the path to
the final destination can only see the necessary part (network
header information) to carry the data to the next station.
Tor [1], JonDonym [2], and I2P [3] are examples of such
networks that use this mechanism to separate the user from
his/her final destination.
There are many differences between the anonymity net-
works in terms of their designs and the applications they
support. For example, the I2P network differs from Tor in
that it is structured as a private network. The websites on the
I2P network, called "Eepsites", are hosted within the network
itself and have .i2p names based on the I2P naming and
addressing scheme [4]. Even though I2P supports
browsing websites outside of I2P through
outproxies, I2P is designed to work more anonymously when
accessing the resources within the I2P network. The services
(applications) that I2P supports are not limited to browsing,
but also include file sharing, Internet Relay Chat, E-mail etc.
I2P is a decentralized network; there is no central server
managing the network. The network database is stored in
the "netDb" [5], which contains "routerInfo" and "leaseSet" records.
A routerInfo contains the information required to contact
a router. A leaseSet contains the information required
to reach a destination. The user builds his/her knowledge
about the network by using the information from
the netDb. Sending and receiving data on the I2P network,
and building knowledge about the network, are done by
building "Inbound and Outbound Tunnels" [6]. The tunnels
are unidirectional: the inbound tunnels are employed by
the users to receive messages and the outbound tunnels are
employed to send messages. The default configuration of the
users' agents (clients) enables bandwidth participation,
which means that in addition to building his/her own tunnels,
the user can also participate in building other users' tunnels.
The tunnels consist of two or more routers, depending on the
client configuration and the tunnel type. Therefore, when
the user participates in building tunnels, his/her role could
be the first, the last, or a middle router in
the tunnel. At the same time, the user could continue to
send/receive his/her own messages (if any). This aims to enhance
anonymity because it is then harder to separate a
specific user's tunnels from the other participating tunnels.
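The tunnel roles described above can be sketched as a small data structure. This is only an illustration: the class and field names (`Tunnel`, `hops`, `role_of`) and the router identifiers are hypothetical and are not taken from the I2P implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tunnel:
    """Hypothetical model of a unidirectional I2P tunnel."""
    direction: str      # "inbound" (receive) or "outbound" (send)
    hops: List[str]     # ordered router identifiers, two or more

    def __post_init__(self) -> None:
        if self.direction not in ("inbound", "outbound"):
            raise ValueError("direction must be 'inbound' or 'outbound'")
        if len(self.hops) < 2:
            raise ValueError("a tunnel consists of two or more routers")

    def role_of(self, router: str) -> str:
        """Return the position of a router in the tunnel: first, middle, or last."""
        if router not in self.hops:
            return "not in tunnel"
        index = self.hops.index(router)
        if index == 0:
            return "first"
        if index == len(self.hops) - 1:
            return "last"
        return "middle"


# A participating router can hold a different role in each tunnel it helps build.
tunnel_a = Tunnel("outbound", ["router_x", "our_router", "router_y"])
tunnel_b = Tunnel("inbound", ["our_router", "router_z"])
roles = [tunnel_a.role_of("our_router"), tunnel_b.role_of("our_router")]
print(roles)
```

An observer who sees flows at `our_router` cannot tell from the position alone whether the router is relaying another user's tunnel or carrying its own messages, which is the mixing effect the design relies on.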
The tunnels are used to send and receive messages, to
communicate with the netDb, and to manage the tunnels.
Therefore, the messages that travel through the user's tunnels
do not always represent only the messages exchanged
between users. Given that the tunnels carry this type
of control traffic, that the user messages are mixed together, and that the
incoming and outgoing tunnels are separated, we aim to
study the following research questions: What is the effect of
such a design in terms of anonymizing the netflow behavior
of a user's activities? Can a user's activities be completely
anonymized by this design? Or does the anonymity depend on the amount
of other users' traffic that shares the bandwidth?
The rest of the paper is organized as follows: The related
work is summarized in Section II. Data collection is introduced
in Section III. Section IV presents the experiments
and results, while Section V presents our observations on the
I2P network. Finally, conclusions are drawn and future
work is discussed in Section VI.
II. RELATED WORK
Timpanaro et al. [7] proposed a monitoring architecture
for the I2P network to describe how it is used. The proposed
system analyzed what types of applications are used
on the I2P network. The applications that the monitoring
architecture can identify are limited to web browsing and
I2PSnark. The results showed that the proposed monitoring
architecture could identify 32% of all running applications.
The experiments relied on operating a router
on the I2P network as a FloodFill router. After
collecting a number of leaseSets from the network, each leaseSet
was tested to determine whether it belonged to a web server or
I2PSnark. Their results showed that the classification of the
leaseSet does not link the type of the application with the
user.
Egger et al. [8] presented several attacks that could be
implemented against the I2P network. The authors claimed that
their attacks could reveal the services
that an I2P user accesses, the time of access, and the
time spent using the service. The attacks first take control of most
of the nodes that host the decentralized database (netDb) on
the I2P network. Then, they monitor the network activities to
link the related ones. Denial of Service (DoS) attacks could
be used to disable the nodes hosting the netDb and speed up
the takeover process.
Liu et al. [9] presented four methods to discover I2P
routers. They discovered around 95% of all the I2P routers
in their two-week experiment. The first method
was to run an I2P router and
monitor its communications with other I2P routers to collect
information about them. The second method was to run an I2P
FloodFill router (FloodFill is the method used to distribute the netDb)
to monitor and collect information about the routers that
communicate with it. The third method
was "crawling reseed URL",
which used the reseed option (the initial set of I2P nodes
needed for bootstrapping) in the I2P network to collect
I2P router information. The fourth method was "exploiting
NetDB", where the I2P router query and
response mechanism was used to collect routers' information.
Herrmann and Grothoff [10] presented an attack that determines
the identity of the HTTP hosting peers (routers) on
the I2P network. The attack required three types of
routers. The first type is used to provide information about
the tunnel operations to the attacker. The second type is
used to direct the user to select the attacker's routers by
performing a DoS attack. The third type is used to send
requests to the Eepsite. The combination of the three
types of routers was then used to identify the hosting router
on the I2P network.
On the other hand, AlSabah et al. [11] employed machine
learning algorithms to identify the type of application a Tor
user runs in the Tor network. The applications studied were
web browsing, video streaming, and BitTorrent. They used
circuit and cell level information to extract features
that could be used to classify the type of application the
user is running. The results showed 91% accuracy for offline
classification and 97.8% accuracy for online classification.
Shahbar et al. [12] built and evaluated two approaches to
classify the type of application used by the user on the Tor
network: flow level and circuit level [11] classification. The
circuit level classification employed a different set of features
related to the circuits that the user creates when using the
Tor network. The flow level classification employed the
traffic flows between the user and the first node on the
Tor network. The results showed up to 100% accuracy in
both approaches, demonstrating the strength of flow analysis
under such circumstances.
In this research, we investigate the effects of the band-
width sharing on the I2P network and its potential to be used
by an attacker to identify both the user and the application
on the I2P network.
III. DATA COLLECTION AND SET UP
We used three machines (computers) to collect data on
the I2P network [13]. The version of the I2P software used
on these machines was 0.9.16. The applications we
study in this work (on the I2P network) are browsing, chat,
and file downloading. We chose these
applications because they are the most used applications. On
each machine, we ran only one application at a time while
collecting the data, to ensure the ground truth of
the data. All the application traffic and user traffic
is our own and does not include any other users'
traffic. For the part where we participate in other users'
tunnels, the users' privacy is preserved: the encryption used
on I2P keeps the users' data private. In addition, before
analyzing the traffic, all the IP addresses and payloads were
removed.
A. Browsing
To collect the browsing data, we prepared a list of the
Eepsites available on I2P by default. This list includes the
built-in (bookmarked) Eepsites in the I2P software, such as
i2p-projekt.i2p. Moreover, we added other Eepsites
to the list by using Eepsites that provide a "search" service
on the I2P network. Once the list was ready, we used iMacros
[14] to automate the browsing. To this end, we wrote a
script that browses the first address on the list. It then waits
for a random period of time before navigating through the
Eepsite by clicking on a random link. After
moving (traversing) from one link to another multiple times
in this way, the script picks the second address in
the list, and so on. We collected data using this setup for
seven days.
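The control flow of such a browsing script can be sketched as follows. This is not the authors' iMacros script: it is a minimal Python illustration that walks a hypothetical in-memory link map instead of loading real pages, and the function name `browse_session`, the pause bounds, and the site names are all assumptions.

```python
import random
from typing import Callable, Dict, List


def browse_session(
    addresses: List[str],
    links: Dict[str, List[str]],
    clicks_per_site: int,
    rng: random.Random,
    wait: Callable[[float], None],
) -> List[str]:
    """Visit each address in order, then follow random links from it."""
    visited: List[str] = []
    for address in addresses:
        page = address
        visited.append(page)
        for _ in range(clicks_per_site):
            wait(rng.uniform(1.0, 5.0))       # random pause (bounds are assumptions)
            candidates = links.get(page, [])
            if not candidates:                # dead end: move on to the next address
                break
            page = rng.choice(candidates)     # click a random link on the page
            visited.append(page)
    return visited


# Hypothetical link map standing in for real Eepsite pages.
demo_links = {
    "site-a.i2p": ["site-a.i2p/docs", "site-a.i2p/faq"],
    "site-a.i2p/docs": ["site-a.i2p"],
    "site-b.i2p": [],
}
pauses: List[float] = []
trail = browse_session(
    ["site-a.i2p", "site-b.i2p"], demo_links, 3, random.Random(7), pauses.append
)
print(trail)
```

Passing `wait` as a parameter keeps the sketch testable: the demo records the pauses in a list, while a real run could pass `time.sleep`.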
B. Internet Relay Chat (IRC)
For IRC, each machine in this research was also set up to
work independently from the others. Again, only one type of
application was running while collecting the data. For
this process, we chose the jIRCii [15] plugin and installed it
on the three machines. Each machine then connected to
the Irc2P network (the Internet Relay Chat network for I2P)
by using the Irc2P tunnel and used one of these servers:
irc.dg.i2p, irc.postman.i2p, or irc.echelon.i2p. The machine
stayed connected to the Irc2P network around the clock and joined
multiple channels, such as #i2p, #i2pchat, and #i2people,
for five days.
C. Downloading Files Using Torrent (I2PSnark)
To download files on the I2P network, we used I2PSnark
[16] on all machines. It is one of the built-in applications
of the I2P software. The downloaded files included
videos, documents, music, movies, etc., and their sizes
varied from small to large. We obtained the torrent files from
the Eepsites diftracker.i2p and tracker.postman.i2p. The torrent
data include both the uplink and the downlink of
the files. We collected data using this service for seven days.
IV. EXPERIMENTS AND RESULTS
There are many machine learning algorithms used for
the purpose of classification. In our previous work [12],
we employed different supervised learning algorithms and
approaches to identify applications used on the Tor net-
work. The evaluated algorithms were C4.5, Random Forest,
Naive Bayes, and Bayes Net. Among these algorithms,
C4.5 Decision Tree was the best performing algorithm to
classify Tor traffic flows. Therefore, in this research, we used
Tranalyzer [17] to export the flows and the C4.5 Decision
Tree classifier (by the open source data mining tool, Weka
[18]) to construct our traffic analysis system.
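C4.5 chooses its splits using the gain ratio criterion. The snippet below is a minimal sketch of that criterion for a single threshold split on one numeric flow feature. It is not the Weka implementation used in this work, and the feature values (mean packet lengths) and labels are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gain_ratio(values: Sequence[float], labels: Sequence[str], threshold: float) -> float:
    """Gain ratio of the binary split `value <= threshold` on one numeric feature."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    if not left or not right:
        return 0.0                      # degenerate split carries no information
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (
        weights[0] * entropy(left) + weights[1] * entropy(right)
    )
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical mean packet lengths (bytes) for four flows.
mean_lengths = [60.0, 72.0, 900.0, 1400.0]
apps = ["jIRCii", "jIRCii", "I2PSnark", "I2PSnark"]
clean_split = gain_ratio(mean_lengths, apps, threshold=500.0)
poor_split = gain_ratio(mean_lengths, apps, threshold=65.0)
print(clean_split, poor_split)
```

In this toy data the threshold of 500 separates the two applications perfectly, so it scores higher than the threshold of 65, which isolates only one flow.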
Tranalyzer exports 91 features. These include flow
direction, duration, counts related to the packets in a
flow such as the number of packets sent and the number
of packets received, IP header information such as TOS
and TTL, TCP header information such as window size and
sequence number, packet length statistics such as the mean
and the minimum packet length, and inter-arrival time statistics
such as the median and the quartiles. Tranalyzer also includes
features related to ICMP, VLAN, and MAC addresses, which we
removed from the data because they are not relevant to our
experiments. The complete list of Tranalyzer features can be
found in [17]. It should be noted here that we did not use
the IP addresses and the port numbers in the analysis of the
collected data, so as not to bias the learning algorithm. Given that
the data set is not large and only three machines were used in
the collection of the data, the C4.5 learning algorithm could
easily link the applications to port numbers or IP addresses
if they were used as features in the analysis.
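The feature filtering step can be sketched as below. The column names are hypothetical placeholders, since the actual Tranalyzer feature names are listed in [17] and may differ; the sketch only illustrates dropping identifying and irrelevant fields before the flows reach the classifier.

```python
from typing import Dict, Iterable, List

# Hypothetical column names; the real Tranalyzer feature names may differ.
EXCLUDED_FEATURES = {
    "src_ip", "dst_ip",          # would let the classifier memorize machines
    "src_port", "dst_port",      # would let the classifier memorize applications
    "src_mac", "dst_mac", "vlan_id", "icmp_type",  # not relevant to the experiments
}


def strip_excluded_features(flows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return copies of the flow records without the excluded features."""
    return [
        {name: value for name, value in flow.items() if name not in EXCLUDED_FEATURES}
        for flow in flows
    ]


raw_flows = [
    {"src_ip": "10.0.0.1", "dst_port": 4444, "duration": 1.5, "mean_pkt_len": 512.0},
]
clean_flows = strip_excluded_features(raw_flows)
print(clean_flows)
```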
Table I: Binary Classifier on the Tunnels
Class                 TP Rate  FP Rate  TN Rate  FN Rate
Applications Tunnels  0.875    0.288    0.712    0.125
Others                0.712    0.125    0.875    0.288
Accuracy: 82.04%
A. Tunnel Based Data Analysis
In this case, we focused on differentiating Application
tunnels from Exploratory and Participating tunnels. Exploratory
tunnels are used for management (administration/control
traffic of the I2P network) and also for testing
purposes. Participating tunnels are the tunnels that the
users employ to relay other users' traffic. To train a decision
tree model to differentiate the Application tunnels
from Exploratory and Participating tunnels, we labeled the
I2PSnark, Irc2P, and shared clients tunnels as the "Applications
tunnels" class. We also labeled the Exploratory tunnels
and the Participating tunnels (when the bandwidth is set to
the default value of 80% participation) as one class, called
"Others". The purpose is to investigate the ability
to distinguish the application traffic from the management or
other users' traffic. This gives a binary classification
problem in which one class represents the "applications" and the
other class represents the "others" shared traffic. Our analysis shows
that we can differentiate these two classes of traffic in I2P
tunnels with up to 82% accuracy. Table I shows the performance
of our classifier on the test data, which was unseen by the
classifier during training.
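The binary labeling described above amounts to a fixed mapping from tunnel type to class. The following sketch shows one way to express it; the dictionary keys mirror the tunnel names used in the text, while the function name and the record layout are assumptions.

```python
from typing import Dict, List

# Tunnel types named in the text, mapped to the two classes of the binary problem.
TUNNEL_TO_CLASS: Dict[str, str] = {
    "I2PSnark": "Applications",
    "Irc2P": "Applications",
    "shared clients": "Applications",
    "Exploratory": "Others",
    "Participating": "Others",
}


def binary_label(tunnel_type: str) -> str:
    """Return the binary class for a tunnel type, failing loudly on unknown types."""
    if tunnel_type not in TUNNEL_TO_CLASS:
        raise KeyError(f"unknown tunnel type: {tunnel_type}")
    return TUNNEL_TO_CLASS[tunnel_type]


observed: List[str] = ["Irc2P", "Participating", "shared clients", "Exploratory"]
binary_labels = [binary_label(t) for t in observed]
print(binary_labels)
```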
The results are calculated using the following performance
measurements. "Accuracy" is defined as the sum
of the True Positive (TP) and True Negative (TN) counts
divided by the total number of instances (N). For example,
when measuring the accuracy of the classification for the
"applications tunnels" traffic, TP is the number of
instances correctly classified as "applications tunnels" and TN is
the number of instances correctly classified as "Others".
If an "Others" instance is classified as an "applications tunnels"
instance, it is counted as a False Positive (FP).
Conversely, if an "applications tunnels" instance is classified as
an "Others" instance, it is counted as a False Negative (FN).
The TPR, FPR,
TNR, and FNR are calculated using the following equations:
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \]
\[ \mathrm{TNR} = \frac{TN}{FP + TN} \]
\[ \mathrm{FNR} = \frac{FN}{FN + TP} \]
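As a worked illustration of these definitions, the sketch below computes the four rates and the accuracy from raw counts. The counts in the example are hypothetical and were chosen only to make the arithmetic easy to follow; they are not the counts behind Table I.

```python
from typing import Dict


def binary_rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute TPR, FPR, TNR, FNR, and accuracy from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
        "FNR": fn / (fn + tp),
        "accuracy": (tp + tn) / total,
    }


# Hypothetical counts: 100 positive instances and 100 negative instances.
rates = binary_rates(tp=80, fp=30, tn=70, fn=20)
print(rates)
```

By construction, TPR + FNR = 1 and FPR + TNR = 1, which is the same relation that holds in each row of the result tables (for example, 0.875 + 0.125 and 0.288 + 0.712).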
Table II: Classification Results for the Tunnel Based Traffic Analysis
Class                                TP Rate  FP Rate  TN Rate  FN Rate
I2Psnark                             0.661    0.033    0.967    0.339
jIRCii                               0.778    0.084    0.916    0.222
Eepsites                             0.531    0.143    0.857    0.469
Exploratory & Participating Tunnels  0.755    0.152    0.848    0.245
Accuracy: 70.30%
We also aimed to analyze the purpose for which a tunnel might
be used. In this case, if we were running an application, for
example I2PSnark, we extracted the tunnels related to
I2PSnark and labeled them as I2PSnark. We did the same
for jIRCii and Eepsites. The Eepsites tunnels, which are the
client tunnels, might be used for another application on the
I2P network. They also stay alive for as long as the user
is online. On the other hand, the I2PSnark (Irc2P) tunnels
stay alive only as long as the user uses the application. The shared
client tunnels could be used for I2PSnark if the user changes
the setting, but the default setting is to use the Irc2P tunnels.
The Exploratory and Participating tunnels stay alive as they
are. Determining the purpose for which a tunnel
is used is a very challenging problem. However, we
could still achieve 70% accuracy (on the unseen test data)
in predicting the potential purpose of a tunnel on the I2P
network by analyzing only the flow features. Table II presents
the results of this analysis.
B. Applications and User Based Data Analysis
In our experiments, we examined the effect of the band-
width participation on the I2P network based on two sce-
narios: the first one is the ability to identify the application
type the user is running (Traffic Profiling); and the second
one is the ability to profile the users under the effect of the
amount of shared bandwidth (User Profiling).
For Traffic Profiling, we labeled our data as Eepsites,
I2PSnark, and jIRCii. This way, the traffic of one application
includes the behavior of the traffic of multiple users using the
same application. The important difference in this scenario is
that when we run an application, for example I2PSnark, we
intentionally label all the tunnels (exploratory, shared client,
and participant if any) as I2PSnark. This enables us to test if
the overhead of the exploratory tunnels and the participant
tunnels would affect the ability to distinguish the application
type.
For User Profiling, we labeled our data as Machine 1,
Machine 2, and Machine 3, since each machine was used
by only one user. In this case, the Machine 1 traffic includes
the I2PSnark, jIRCii, and Eepsites traffic generated from Machine
1. The same applies to Machine 2 and Machine 3. The
purpose of combining different traffic from each machine
into one class is to mimic the behavior of a user running multiple
applications, and subsequently to measure the ability to analyze
the I2P users' behaviors.
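The difference between the two labeling schemes can be made concrete with a small sketch. The flow records and field names below are hypothetical; the point is that the same flows receive the running application as their label for Traffic Profiling (regardless of tunnel type) and the capturing machine as their label for User Profiling.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical flow records: capturing machine, application running at capture
# time, and the tunnel type the flow belongs to.
flows: List[Dict[str, str]] = [
    {"machine": "Machine 1", "running_app": "I2PSnark", "tunnel": "Exploratory"},
    {"machine": "Machine 1", "running_app": "jIRCii", "tunnel": "Irc2P"},
    {"machine": "Machine 2", "running_app": "Eepsites", "tunnel": "shared clients"},
    {"machine": "Machine 3", "running_app": "I2PSnark", "tunnel": "Participating"},
]


def traffic_profiling_label(flow: Dict[str, str]) -> str:
    """Every tunnel seen while an application runs is labeled as that application."""
    return flow["running_app"]


def user_profiling_label(flow: Dict[str, str]) -> str:
    """All traffic from one machine forms one class, whatever the application."""
    return flow["machine"]


traffic_labels = [traffic_profiling_label(f) for f in flows]
user_labels = [user_profiling_label(f) for f in flows]
print(Counter(traffic_labels), Counter(user_labels))
```

Note that the first and last records are exploratory and participating tunnels, yet both are labeled I2PSnark for Traffic Profiling, which is how the overhead of those tunnels enters the application classes.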
Table III: Summary of Traffic and User Profiling Performance
(80% Bandwidth Participation)
Scenario                    No. of flows  Accuracy (%)
Traffic Profiling           190,000       47.4
Traffic Profiling TCP Only  61,453        61.7
Traffic Profiling UDP Only  128,547       56.3
User Profiling              189,906       81.8
User Profiling TCP Only     62,882        86
User Profiling UDP Only     127,024       79.8
On the I2P network, the traffic could be in the form of
TCP or UDP traffic. Therefore, we also separated
the traffic by protocol in both scenarios
and in both bandwidth cases. The following summarizes
the results of both scenarios, in addition to the effect of the
protocol separation, on the test data.
1) With Bandwidth Participation: Table III shows the
accuracy of Traffic and User Profiling when the
amount of shared bandwidth is 80%. This is the default case
on the I2P network. The accuracy measures the percentage
of correctly classified instances out of all instances. It
should be noted here that even though we do not use any
IP addresses or port numbers in our analysis, we can
achieve roughly 80% to 86% accuracy in differentiating one user
from another. However, differentiating traffic
behavior in terms of protocols appears to be much more challenging.
We hypothesize that this may be due to two main reasons:
(i) many different application behaviors are bundled
together in each of the TCP and UDP traffic tunnels; and (ii) the I2P
garlic routing approach is better at anonymizing the protocol
behaviors. Further analysis is necessary to study
the effect of each component.
2) Without Bandwidth Participation: The configuration
we used in our experiments in Section III activated
the default bandwidth configuration (300 KBps In, 60 KBps
Out) of an I2P client. Under this setting, the bandwidth
participation is 80%, which equals 48 KBps. To observe
and study the effect of this amount of participation on
anonymity, we set the bandwidth participation parameter
on the I2P client to 0%. In both cases, FloodFill was
disabled. Table IV presents the results of our analysis for
traffic and user profiling when the bandwidth participation
is set to 0%, effectively not allowing any bandwidth
sharing. In this case, while the user profiling accuracy drops by
about 15 percentage points (from 81.8% to 66.7%), the traffic
profiling accuracy increases by more than 20 percentage points
(from 47.4% to 73.7%). Intuitively, this
was expected because when no traffic is shared,
patterns in the tunnels are more likely to be found.
However, under the same conditions, differentiating users /
machines without using IP addresses and port numbers is
more challenging.
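The figures in this comparison can be checked with a few lines of arithmetic. The sketch assumes that the 80% participation is applied to the 60 KBps outbound limit, which is the reading that reproduces the stated 48 KBps; the accuracy values are copied from Tables III and IV.

```python
# Values below are copied from the text and from Tables III and IV.
outbound_kbps = 60            # default outbound bandwidth (KBps)
participation_percent = 80    # default bandwidth participation
shared_kbps = outbound_kbps * participation_percent // 100

accuracy_with_sharing = {"traffic": 47.4, "user": 81.8}      # Table III (80%)
accuracy_without_sharing = {"traffic": 73.7, "user": 66.7}   # Table IV (0%)

# Change in accuracy (percentage points) when participation goes from 80% to 0%.
change = {
    name: round(accuracy_without_sharing[name] - accuracy_with_sharing[name], 1)
    for name in accuracy_with_sharing
}
print(shared_kbps, change)
```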
Table IV: Summary of Traffic and Users Profiling Performance without
Bandwidth Sharing (0% Bandwidth Participation)
Scenario                    No. of flows  Accuracy (%)
Traffic Profiling           195,081       73.7
Traffic Profiling TCP Only  40,075        65.6
Traffic Profiling UDP Only  155,006       75.7
User Profiling              195,081       66.7
User Profiling TCP Only     40,075        81.7
User Profiling UDP Only     155,006       63.2
C. Clustering Tunnels Using SOM
Based on our analysis in Sections IV-A and IV-B, the
classification of tunnels seems to be more challenging than
the classification of users. Also, the confusion matrices of our
classifiers show that there is an overlap between the classes
of the tunnels. Therefore, we employed an artificial neural
network based unsupervised learning algorithm, namely the
Self-Organizing Map (SOM) [19], to cluster and visualize
the different patterns (if any) that may exist in the tunnel
data captured in this research. For this purpose,
we used the Matlab SOM Toolbox [20]. Fig. 1 presents
the visualization of the SOM clusters (groupings) on our data,
which consist of four classes: I2PSnark, jIRCii, Eepsites, and
Exploratory & Participating Tunnels. In this figure, one can
see the four clusters in four different colors. SOM is an
unsupervised learning technique; therefore, no labeled data is
used during the training phase. However, we used the labels
after training to analyze the performance of this clustering
algorithm on our data sets. Fig. 2 shows the hits of the four
classes, after training, on the SOM introduced in Fig.
1. That is, we projected the instances of the labeled
data onto Fig. 1 to obtain Fig. 2. The ideal case is when
each class is represented by a separate cluster on the map,
which implies that the map represents the
data well. In Fig. 2, one cluster, the yellow hexagons,
represents the I2PSnark tunnels. Another cluster, the magenta
hexagons, represents the Exploratory & Participating
Tunnels. The third cluster, shown in red, represents the hits
of the jIRCii tunnels on the map. The green hexagons represent
the Eepsites hits on the map. Based on how these clusters
are distributed on the SOM, the Eepsites data flows (green) seem
to overlap with the Exploratory & Participating Tunnels
data flows (magenta). Thus, based on the SOM
output, the Eepsites and the Exploratory & Participating
tunnels seem to be grouped together.
This matches how the I2P tunnels are used:
I2PSnark and jIRCii both use separate tunnels
(the I2PSnark and Irc2P tunnels), whereas the client tunnels are the
tunnels used for the Eepsites. Therefore, in Fig. 3 we
grouped the Eepsites with the Exploratory & Participating
tunnels to form one class.
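The train-then-project procedure described above can be sketched in a few dozen lines. This is not the Matlab SOM Toolbox used in this work: it is a minimal pure-Python SOM on a rectangular (not hexagonal) grid, and the grid size, learning schedule, and synthetic two-class data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def best_matching_unit(weights: Dict[Node, List[float]], sample: Sequence[float]) -> Node:
    """Return the grid node whose weight vector is closest to the sample."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], sample)),
    )


def train_som(data: Sequence[Sequence[float]], rows: int, cols: int,
              epochs: int, rng: random.Random) -> Dict[Node, List[float]]:
    """Train a small rectangular SOM without using any class labels."""
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for sample in rng.sample(list(data), len(data)):   # shuffled pass
            progress = step / steps
            rate = 0.5 * (1.0 - progress)                  # decaying learning rate
            radius = max(rows, cols) / 2.0 * (1.0 - progress) + 0.5
            bmu = best_matching_unit(weights, sample)
            for node, vector in weights.items():
                grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                for i in range(dim):
                    vector[i] += rate * influence * (sample[i] - vector[i])
            step += 1
    return weights


def hit_map(weights: Dict[Node, List[float]], data: Sequence[Sequence[float]],
            labels: Sequence[str]) -> Dict[Node, Counter]:
    """Project labeled instances onto the trained map and count hits per node."""
    hits: Dict[Node, Counter] = defaultdict(Counter)
    for sample, label in zip(data, labels):
        hits[best_matching_unit(weights, sample)][label] += 1
    return hits


# Synthetic two-class data standing in for normalized flow features.
rng = random.Random(0)
data = [[0.1 + rng.uniform(-0.05, 0.05), 0.1 + rng.uniform(-0.05, 0.05)] for _ in range(20)]
data += [[0.9 + rng.uniform(-0.05, 0.05), 0.9 + rng.uniform(-0.05, 0.05)] for _ in range(20)]
labels = ["class A"] * 20 + ["class B"] * 20

som = train_som(data, rows=3, cols=3, epochs=20, rng=rng)
hits = hit_map(som, data, labels)
for node in sorted(hits):
    print(node, dict(hits[node]))
```

Labels are used only in `hit_map`, after training, which mirrors the separation between the unsupervised training phase and the later projection of labeled instances. Nodes whose counters contain more than one label correspond to the kind of class overlap discussed above.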
Figure 1: Tunnels on the SOM Map, "sheet" shape.
Figure 2: Hits on the SOM Map for all classes.
Figure 3: Hits for the merged Eepsites and Exploratory &
Participating tunnels, "cyl" shape.
V. DISCUSSION
When we collected the data, we used the information
of the client tunnels to label the data for a better level
of accuracy. For example, suppose we are running jIRCii,
we connect to a participant in one of our inbound
or outbound tunnels, and we label that participant as IRC
traffic. This does not mean that the participant will not be part
of any other tunnel, e.g. one of our client tunnels for shared
clients. Indeed, this adds a challenge to the data analysis
problem we undertake in this research. In our analysis, we
do not use the IP addresses, port numbers, or protocol
features. Therefore, when we combine both transport layer
protocols (TCP and UDP) in the data set, the accuracy of
our analysis drops. This is expected because, in network
traffic analysis, transport layer protocol filters have been shown
to be useful. This means that if, in real life, the
protocol feature is used in the analysis, the accuracy will
increase. However, our data collection network consists of
only three machines / users. In short, any classifier using the
protocol, port number, and IP address features could reach
100% accuracy, but it would be very specific to our network
and would not generalize well to larger networks
where more machines and protocols exist.
In our experiments, resource sharing (bandwidth participation)
increases the anonymity level when profiling
the applications. The shared client tunnels are used for
the Eepsites application; they could also be configured for
other applications. The default is to use client tunnels for
Eepsites, while separate tunnels are used for I2PSnark and
Irc2P. Furthermore, when the application tunnels are grouped
as one class and the Exploratory tunnels as another class,
the accuracy of profiling the applications increases.
Thus, we think that forcing all the applications to use the
client tunnels will improve the users' anonymity on the
I2P network. On the other hand, based on our experiments,
increasing the bandwidth participation improves the ability
to profile the users. When the user allocates more resources
to participate in the network, more traffic flows
on the network belong to the user, which seems to enhance the
profiling of the users. Therefore, we think that decreasing
the bandwidth participation (but still keeping it above
50%) will improve the users' anonymity on the I2P network.
However, more analysis on larger data sets is necessary to
better understand such user behaviors.
Moreover, the unsupervised learning algorithm SOM
shows that the Eepsites tunnels tend to behave similarly
to the exploratory and participating tunnels. When the
Eepsites tunnels are merged with the exploratory and participating
tunnels to find the hits on the SOM, more consistent
behavior is observed. This reinforces that the separation
between the tunnels for different applications seems to
make application profiling easier. This implies that changing
the default setting on the I2P client to force applications
such as IRC to use the "shared clients" tunnels makes
application profiling harder and improves the anonymity level.
VI. CONCLUSION AND FUTURE WORK
The I2P network works differently from other anonymity
networks such as Tor [21] and JonDonym [2] in terms of its
design, which is based on the private network approach. The
connection of the users to the I2P network is not hidden, but
the users' activities within the network (type of applications)
are supposed to be anonymous. Based on our analysis of the
I2P data, the resource sharing (bandwidth participation) of
the users on the I2P network improves the anonymity level
of the users. On the other hand, the default setting, under which
not all applications use the shared client tunnels, seems
to reduce the anonymity level and makes application
profiling possible for a potential attacker. For future work, we
will expand our research to study the effects of bandwidth sharing
on the I2P network on a larger scale. This will include more
types of applications (or plug-ins) and more users.
ACKNOWLEDGMENT
The authors thank the anonymous reviewers and the I2P
team for their feedback. This research is partially supported
by the Natural Science and Engineering Research Council
of Canada (NSERC) grant, and is conducted as part of the
Dalhousie NIMS Lab at http://projects.cs.dal.ca/projectx/.
The first author would like to thank the Ministry of Higher
Education in Saudi Arabia for his scholarship.
REFERENCES
[1] R. Dingledine, N. Mathewson, and P. Syverson, “Tor: The
second-generation onion router,” in Proceedings of the 13th
Conference on USENIX Security Symposium - Volume 13, ser.
SSYM’04. Berkeley, CA, USA: USENIX Association, 2004,
pp. 21–21.
[2] Project: An.on anonymity. [Online]. Available: http:
//anon.inf.tu-dresden.de/index
[3] The invisible internet project (i2p). [Online]. Available:
https://geti2p.net/en/
[4] I2p: Naming and addressbook. [Online]. Available: https:
//geti2p.net/en/docs/naming
[5] I2p: The network database. [Online]. Available: https:
//geti2p.net/en/docs/how/network-database
[6] Tunnel implementation. [Online]. Available: https://geti2p.
net/en/docs/naming
[7] J. P. Timpanaro, I. Chrisment, and O. Festor, “Monitoring the
I2P network,” Preprint, October 2011. [Online]. Available:
http://hal.inria.fr/inria-00632259
[8] C. Egger, J. Schlumberger, C. Kruegel, and G. Vigna, “Prac-
tical attacks against the i2p network,” in Proceedings of
the 16th International Symposium on Research in Attacks,
Intrusions and Defenses (RAID 2013), October 2013.
[9] P. L. et al, “Empirical measurement and analysis of i2p
routers,” Journal of Networks, vol. 9, pp. 2269–2278, Septem-
ber 2014.
[10] M. Herrmann and C. Grothoff, “Privacy implications of
performance-based peer selection by onion routers: A real-
world case study using i2p,” in Proceedings of the 11th
Privacy Enhancing Technologies Symposium (PETS 2011),
July 2011.
[11] M. AlSabah, K. Bauer, and I. Goldberg, “Enhancing tor’s
performance using real-time traffic classification,” in Pro-
ceedings of the 2012 ACM Conference on Computer and
Communications Security, ser. CCS ’12. New York, NY,
USA: ACM, 2012, pp. 73–84.
[12] K. Shahbar and A. N. Zincir-Heywood, “Benchmarking two
techniques for tor classification: Flow level and circuit level
classification,” in 2014 IEEE Symposium on Computational
Intelligence in Cyber Security (CICS), Dec 2014, pp. 1–8.
[13] K. Shahbar and A. N. Zincir-Heywood, “Anon17: Network
traffic dataset of anonymity services,” Dalhousie University
of Halifax, Tech. Rep. CS-2017-03, Feb. 2017.
[14] imacros. [Online]. Available: http://imacros.net/overview
[15] jircii: The ultimate irc client. [Online]. Available: http:
//www.oldschoolirc.com/
[16] I2psnark. [Online]. Available: https://geti2p.net/en/docs/how/
tech-intro#app.i2psnark
[17] Tranalyzer2. [Online]. Available: http://tranalyzer.com/
[18] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann,
and I. H. Witten, “The weka data mining software: An
update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18,
Nov. 2009.
[19] T. Kohonen, Self-organizing maps. Berlin, Germany:
Springer Berlin Heidelberg, 2001.
[20] Som toolbox. [Online]. Available: http://www.cis.hut.fi/
somtoolbox/
[21] K. Shahbar and A. N. Zincir-Heywood, “Traffic flow analysis
of Tor pluggable transports,” in 2015 11th International Conference
on Network and Service Management (CNSM), Nov
2015, pp. 178–181.