ProxyTorrent: Untangling the Free HTTP(S) Proxy Ecosystem
Diego Perino
Telefónica Research
diego.perino@telefonica.com
Matteo Varvello
AT&T Labs - Research
varvello@research.att.com
Claudio Soriente
NEC Labs Europe
claudio.soriente@emea.nec.com
ABSTRACT
Free web proxies promise anonymity and censorship circumvention
at no cost. Several websites publish lists of free proxies organized
by country, anonymity level, and performance. These lists index
hundreds of thousands of hosts discovered via automated tools and
crowd-sourcing. A complex free proxy ecosystem has been forming
over the years, of which very little is known. In this paper we shed
light on this ecosystem via ProxyTorrent, a distributed measurement
platform that leverages both active and passive measurements. Ac-
tive measurements discover free proxies, assess their performance,
and detect potential malicious activities. Passive measurements
relate to proxy performance and usage in the wild, and are col-
lected by free proxies users via a Chrome plugin we developed.
ProxyTorrent has been running since January 2017, monitoring up
to 180,000 free proxies and totaling more than 1,500 users over a 10-
month period. Our analysis shows that less than 2% of the proxies
announced on the Web indeed proxy traffic on behalf of users; fur-
ther, only half of these proxies have decent performance and can be
used reliably. Around 10% of the working proxies exhibit malicious
behaviors, e.g., ads injection and TLS interception, and these prox-
ies are also the ones providing the best performance. Through the
analysis of more than 2 Terabytes of proxied traffic, we show that
web browsing is the primary user activity. Geo-blocking avoidance
is not a prominent use-case, with the exception of proxies located
in countries hosting popular geo-blocked content.
ACM Reference Format:
Diego Perino, Matteo Varvello, and Claudio Soriente. 2018. ProxyTorrent:
Untangling the Free HTTP(S) Proxy Ecosystem. In Proceedings of The Web
Conference 2018 (WWW 2018). ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3178876.3186086
1 INTRODUCTION
Web proxies are intermediary boxes enabling HTTP (sometimes
also HTTPS) connections between a client and a server. They are
widely used for security, privacy, performance optimization or pol-
icy enforcement, to cite a few use cases. Many web proxies are
free of charge and publicly available. Such proxies can be used, for
example, for private web surfing and to access content that would
be blocked otherwise (e.g., due to geographical restrictions).
Specialized forums, websites, and even VPN service providers¹
compile daily lists of free web proxies. When tested, most of these
Work done while at Telefónica Research.
¹For example, https://hide.me
This paper is published under the Creative Commons Attribution 4.0 International
(CC BY 4.0) license. Authors reserve their rights to disseminate the work on their
personal and corporate Web sites with the appropriate attribution.
WWW 2018, April 23–27, 2018, Lyon, France
©
2018 IW3C2 (International World Wide Web Conference Committee), published
under Creative Commons CC BY 4.0 License.
ACM ISBN 978-1-4503-5639-8/18/04.
https://doi.org/10.1145/3178876.3186086
proxies are slow, unreachable or not even real proxies. Furthermore,
it is folklore that free web proxies perform malicious activities, e.g.,
injection of advertisements and user fingerprinting. It is fair to say
that free proxies form a massive and complex ecosystem of which
very little is known. For example, what is the magnitude of the
ecosystem and how many proxies are safe to use? How and for
what are these proxies used? Answering these questions is hard
because of the scale of the ecosystem and because it involves two
players out of reach: free proxies and their users.
In this work we tackle the above challenge by building Prox-
yTorrent, a distributed measurement platform of the free proxy
ecosystem. ProxyTorrent leverages our premises to actively dis-
cover and assess the performance of the core of the ecosystem that
can be used safely. Usage statistics are instead passively (and anony-
mously) collected from free proxy users in exchange for the high-quality
proxy lists compiled by ProxyTorrent.
ProxyTorrent is fed daily with tens of thousands of potential prox-
ies obtained by crawling the most popular free proxy aggregator
websites. Potential proxies are tested in order to discard the ones
that are unreachable, do not proxy traffic, or perform malicious ac-
tivities. This is done by loading a “bait” webpage we have crafted as
well as a few popular webpages, and comparing the content received
via the proxy with the one received when no proxy was set. The
same approach is used to detect issues with X.509 certificates in
case of TLS connections. These operations run daily at our premises
and generate a few thousand trusted proxies. Next, we test the per-
formance of trusted proxies from 30 network locations (Planetlab
nodes [17]) while fetching the landing pages of popular websites
via HTTP/HTTPS. Collected data is finally used to populate a list
of good proxies, i.e., working and trustworthy free proxies.
This list of good proxies is then offered to a Chrome plugin
(Ciao [3, 18]) we developed to help users interact with the free
proxy ecosystem. Ciao users select a target anonymity level and
country, and the plugin automatically identifies the best free proxy
for the task, if any. As the user browses the Internet through the
proxy, we collect anonymous statistics on free proxy performance
and how they are used in the wild.
We use data collected by ProxyTorrent to provide a unique
overview of the free proxy ecosystem. In this paper we present
ten months worth of data spanning up to 180,000 free web proxies
and more than 1,500 users. The analysis of this data-set reveals the
following key findings:
The free proxy ecosystem is large and ever-growing, but only
a small fraction of the announced proxies actually works.
While thousands of new free proxies are announced daily, over-
all, less than 2% of them are reachable and correctly proxy traffic.
Further, half of these proxies stop working after a few days. Many rea-
sons are behind such ephemeral behavior: host misconfigurations,
dynamic addressing, and even bait proxies from VPN providers
aiming at attracting more customers.
A non-negligible percentage of working proxies are suspi-
cious, but provide better performance than safe proxies. Every
day, around 10% of the working proxies announced on the Web
exhibit suspicious behavior, from injection of advertisements to
TLS man-in-the-middle attempts. On average, these proxies are
twice as fast as non-malicious ones. Fast connectivity is likely used
to attract potential “victims”, supporting the general belief that free
proxies are “free for a reason”.
The geographical distribution of proxies is fairly skewed.
Half of the working free proxies reside in a handful of countries,
with the US, France, and China at the top. While the US and China are at
the top due to their sizes, the presence of large cloud providers in
France is the reason behind such a large number of proxies.
Geo-blocking avoidance is not a prominent use-case for free
web proxies.
By analyzing 2 TBytes of traffic generated by 1,500
Ciao users over 7 months, we conclude that web browsing is the
most prominent activity. Proxies are rarely selected in the same
location where a visited website resides, which suggests that cir-
cumvention of potential geo-blocking rules is not a primary user
concern. Some countries like the US are an exception though, likely
because they host a lot of popular geo-blocked content.
2 BACKGROUND AND RELATED WORK
Background – A web proxy is a device/application that acts as
an intermediary for HTTP(S) requests, such as GET and CONNECT,
issued by clients seeking resources on servers. Web proxies are com-
monly classified as transparent, anonymous, and elite, depending
on the degree of anonymity they provide.
Transparent proxies reveal the IP address of the client to the
origin server, e.g., by adding the X-FORWARDED-FOR header, which
specifies the address of the client. Anonymous proxies block head-
ers that may allow the origin server to detect the identity of the
client, but still announce themselves as proxies, e.g., by adding the
HTTP_VIA header. Elite proxies do not send any of the above head-
ers and look like regular clients to the origin server. Yet, the origin
server may detect that a proxy is used by probing the IP address
extracted from the received traffic to check if it acts as a proxy.
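As a concrete illustration, a proxy's anonymity level can be derived from the request headers observed at a server we control. The minimal Python sketch below assumes the classification is driven by X-Forwarded-For and proxy-announcing headers such as Via; the marker set, function name, and record format are illustrative rather than the exact ProxyTorrent implementation:

# Minimal sketch of header-based anonymity classification (names are illustrative).
PROXY_MARKERS = {"via", "x-proxy-id", "proxy-connection"}  # headers that reveal a proxy

def classify_anonymity(server_headers, client_ip):
    """Classify a proxy from the request headers observed at the origin server.

    server_headers: dict of header name -> value as received by our server.
    client_ip: the real IP address of the measurement client.
    """
    headers = {k.lower(): v for k, v in server_headers.items()}
    if client_ip in headers.get("x-forwarded-for", ""):
        return "transparent"      # our address leaks to the origin server
    if PROXY_MARKERS & headers.keys():
        return "anonymous"        # client hidden, but the proxy announces itself
    return "elite"                # looks like a regular client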
Related Work – We briefly overview relevant results from related
work and highlight differences with ProxyTorrent.
Free Web Proxies. Scott et al. [20] also study free web proxies, but
both their goal and methodology differ from ours. While our goal
is a complete view of the free proxy ecosystem, they mostly focus
on how and for what free proxies are used. They do so by scanning
the IPv4 address space at popular proxy ports (e.g., 3128, 8080, and
8123) looking for open management interfaces (i.e., proxy interfaces
with no authentication required) from which they can “steal” usage
statistics. This approach is intended to run infrequently due to the
cost associated with IPv4 scanning. Further, it raises some ethical
concerns related to exposing hosts found via scanning as well as
intruding into their management interfaces. ProxyTorrent was instead
designed with both scalability and user privacy in mind. Note that
we once ran a full scan of the IPv4 address space at popular proxy
ports to compare our methodology with the one in [20] (see Table 2).
ProxyTorrent shares some similarities with proxycheck [7], a
tool that can check the behavior of a proxy by using it to down-
load a few distinct objects hosted on a private webserver. Next, it
labels a proxy as untrusted if the retrieved objects differ from the
original ones even by a single bit, potentially generating a large
number of false positives. Proxies are tested one at a time, which
only allows testing about 10,000 proxies a day. Despite some similar-
ities, our approach is fundamentally different since we designed
a funnel-shaped methodology (see Figure 1) aiming to minimize
false positives while maximizing performance, e.g., scaling up to
hundreds of thousands of proxies per day.
In a parallel research work, Tsirantonakis et al. [21] analyze
about 66,000 open proxies over a two-month period. We share a
similar methodology for proxy discovery and content manipulation
detection, and our results are aligned. However, we present a larger
observation period (i.e., 10 months, 180,000 proxies), and also con-
sider TLS certificate manipulations and proxy performance. Further,
we augment this similar methodology with passive experiments to
understand proxy performance and usage in the wild.
In-path Manipulations. A number of papers study in-path web con-
tent manipulation by leveraging bait content served from a con-
trolled host. Reis et al. [19] focus on middleboxes and serve a page
with an embedded JavaScript that detects and reports modifica-
tions. Their data-set contains 50,000 unique visits to their website,
totaling 650 instances of content manipulation. Chung et al. [2]
use the paid version of Hola [8] (a peer-to-peer proxy network) to
detect end-to-end violations in DNS, HTTP, and HTTPS traffic. They
witness DNS hijacking, HTTP manipulations, image transcoding,
and a few cases of TLS man-in-the-middle attempts. Many of the
violations reported in [2] are attributed to ISPs and to (malicious)
software running at Hola proxying peers. Tyson et al. [22] use the
same approach to investigate HTTP header manipulations. They
leverage Hola to gather 143k vantage points in 3,818 Autonomous
Systems (ASes) and detect header manipulation in about 25% of the
ASes. Weaver et al. [24] detect, using Netalyzr [11], that 14% of
HTTP connections are manipulated by in-network middleboxes,
i.e., devices that intercept traffic without informing the user.
Dierently from all of the above, we look at performance and
content manipulations of free Web proxies that users explicitly
insert in their trac to provide privacy, censorship circumvention,
etc. Further, we use bait content served from a controlled host, as
well as real websites. Our measurement platform also leverages
real users by means of a plugin that provides easy proxy usage in
exchange of anonymous statistics of proxy usage in the wild.
Virtual Private Networks. Perta et al. [15] study privacy leaks in
commercial VPN systems. Despite a VPN tunnel, they discover the
following traffic leakages. First, IPv6 traffic is usually not tunneled.
Second, poor management of the DNS configuration at the client
may result in an adversary hijacking DNS requests and learning
which websites a user visits. Similar issues are also reported by
Ikram et al. [10], who analyze 283 Android VPN apps. The authors
of [10] also detect VPN apps with embedded tracking libraries
and malware. Differently from these works, we focus on free web
proxies, which are a valid alternative to commercial VPNs in use cases
such as accessing geo-blocked content. Apart from their behavior,
we further assess their performance.
                | Phase I                | Phase II                   | Phase III.A                | Phase III.B                | Phase IV
# clients       | 1                      | 1                          | 1                          | 30                         | up to 1,500
tools           | BeautifulSoup          | curl                       | PhantomJS/curl/OpenSSL     | curl                       | Chrome plugin
main task       | web-crawling           | fetch 1KB synthetic object | fetch synthetic webpage    | fetch real webpages        | interface with free proxies
main goal       | find potential proxies | find working proxies       | test behavior              | test performance           | monitor perf. and usage
frequency       | daily                  | daily, on-demand           | daily                      | every 5 minutes            | user-controlled
classification  | potential              | working/unresponsive/unreachable/other | trusted/suspicious/unrated | trusted/suspicious/unrated |
Table 1: Key aspects of each phase in ProxyTorrent.
3 PROXYTORRENT
This section describes ProxyTorrent, a distributed measurement
platform built to monitor the free proxy ecosystem (see Figure 1).
Due to the scale of the proxy ecosystem — potentially millions of
machines [20] — we use a funnel-shaped methodology with several
phases (see Figure 1). Proxies are fed into the funnel and, at each
phase, go through a series of tests of increasing complexity. Only
proxies that pass a given phase are admitted to the next one. Since
each phase decreases the number of proxies under test, we can
progressively increase test complexity. The last phase takes place
at real proxy users, allowing us to complement results of controlled
experiments with measurements in the wild. Table 1 lists the key
aspects of each phase. In the remainder of this section, we describe
all phases in detail.
Phase I discovers free proxies (<ip, port> pairs) on the Internet
by crawling several aggregator websites which regularly publish
free proxy lists. Daily crawling runs from a single machine at our
premises. The hosts discovered are used to populate a list of “po-
tential proxies” sorted by the last day when each proxy appeared
on any of the websites we crawl.
Phase II tests the potential proxies populated by Phase I for prox-
ying capability. We use curl [4], instrumented for full statistics and
headers collection, to fetch a 1KB object—served via nginx [14]
from a server hosted by Amazon Ireland—via each potential proxy.
Curl’s user agent (UA) is set to a recent Chrome UA in order to
appear as a standard browser. Phase II runs daily from a single
machine. It traverses the potential proxies list in order, and runs
for up to 24 hours until either all proxies have been tested or time
is over. This strategy rules out the least recently crawled potential
proxies, in case the list becomes too big to be processed in a day.

Figure 1: ProxyTorrent system overview (funnel from potential proxies
through working, trusted/suspicious/unrated, to usage statistics).
Each proxy is associated with a similarity score computed as
the ratio of common content between the webpage retrieved with
and without the proxy. Accordingly, a similarity score of 1 means
that the content fetched through the proxy is identical to the one
fetched without a proxy.
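The exact notion of “ratio of common content” is not spelled out; the sketch below shows one plausible realization that fetches the bait object with and without a proxy using curl and compares the bodies with a sequence-similarity ratio. The Chrome user-agent string and helper names are assumptions, not the paper's code:

import difflib
import subprocess

# An example Chrome UA string (assumption; any recent Chrome UA would do).
CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

def fetch(url, proxy=None, timeout=30):
    """Fetch `url` with curl (optionally through `proxy`) and return the body bytes."""
    cmd = ["curl", "-sS", "--user-agent", CHROME_UA,
           "--connect-timeout", "3", "--max-time", str(timeout)]
    if proxy:
        cmd += ["--proxy", proxy]
    result = subprocess.run(cmd + [url], capture_output=True)
    result.check_returncode()
    return result.stdout

def similarity_score(url, proxy):
    """Ratio of common content between the object fetched with and without the proxy.
    A score of 1.0 means the proxied content is identical to the direct fetch."""
    direct = fetch(url)
    proxied = fetch(url, proxy=proxy)
    return difflib.SequenceMatcher(None, direct, proxied).ratio()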
Phase II categorizes hosts as follows. Unresponsive: hosts for
which either a connection or max duration timeout was triggered.²
Unreachable: hosts that either closed the TCP connection with a
reset message or sent ICMP messages declaring the network or
the requested host as unreachable. Working: hosts with a similarity
score ≥ 0.5, i.e., that have correctly proxied at least 50% of our
synthetic 1KB object. This threshold was chosen to discard proxies
returning errors or login pages, for which we empirically measured
similarity scores lower than 0.3 (on average). Note that proxies
that largely alter a webpage might be caught by this rule as well.
This is fine as far as finding safe working proxies is concerned, but it prevents
the full behavioral analysis of Phase III, thus generating false
negatives (see Section 4.2). We further classify working proxies
as transparent, anonymous, or elite (see Section 2) using HTTP
headers collected both at the client and at the server. HTTP headers
of all proxies are also analyzed to identify header manipulations
that can be potentially malicious. Finally, Maxmind [13] is used to
obtain country/AS information of each working proxy. Other: all
remaining hosts that relay content substantially different from the
expected one, e.g., all the hosts returning a login page (private or
paid proxies) or an error page (misconfigured hosts).
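A rough sketch of this categorization step is shown below. The mapping from curl exit codes to the unresponsive and unreachable categories is an illustrative assumption on our side; the paper only defines the categories themselves:

# Curl exit codes used here: 28 = timeout, 7 = failed to connect, 6 = host not resolved.
# This mapping is an assumption for illustration purposes.
CURL_TIMEOUT = 28
CURL_CONNECT_FAILED = 7
CURL_RESOLVE_FAILED = 6

def categorize(curl_exit_code, similarity):
    """Map a Phase II test outcome to one of the four host categories."""
    if curl_exit_code == CURL_TIMEOUT:
        return "unresponsive"
    if curl_exit_code in (CURL_CONNECT_FAILED, CURL_RESOLVE_FAILED):
        return "unreachable"
    if curl_exit_code == 0 and similarity >= 0.5:
        return "working"      # proxied at least 50% of the synthetic 1KB object
    return "other"            # login pages, error pages, heavy alterations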
Phase III tests working proxies with respect to behavior and per-
formance. To assess a proxy’s behavior, we use the previous method-
ology of comparing proxied content with content received when
no proxy is used. Compared to Phase II, we introduce a headless
browser, real content, HTTPS testing, and clients at multiple loca-
tions. For performance, we measure both page download time (PDT)
and page load time (PLT). PDT is the time required to download
the index page of a website; PLT is the time from when a browser
starts fetching a website to the firing of the JavaScript onLoad()
event, which occurs once the page’s embedded resources have been
downloaded, but possibly before all objects loaded via scripts are
downloaded. Phase III consists of two parts (A and B) which both
operate on the set of working proxies identified by Phase II within
the last 7 days.
² We measured empirically that 3 seconds (TCP handshake) and 30 seconds (maximum
duration) are long enough for 95% of the proxies.
Phase III.A runs daily from a single machine at our premises. It
uses PhantomJS [16], a popular headless browser, to fetch a realistic
website we serve. We designed this website to include elements
that could trigger content manipulation by a proxy: a landing page
index.html (83.7KB), two javascripts (635B and 22.9KB), two
png images (1.5KB and 13.5KB), and a favicon (4.3KB). Our bait
webpage is similar to the one set up by related work that looks for
en-route content manipulation [2].
Data is collected as an HTTP Archive (HAR); for this, we have
extended PhantomJS’s HAR capturer³ to also dump the actual con-
tent downloaded. The HAR file includes detailed information about
which object was loaded and when, as well as the PLT. We stop Phan-
tomJS either one second after the onLoad() event, to allow for
potentially pending objects to be downloaded, or after a 45 second
maximum duration timeout. Compared to Phase II, we increase the
maximum duration timeout to account for an overall more com-
plex operation. As in Phase II, we set PhantomJS’s UA to a recent
Chrome UA.
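Assuming the standard HAR 1.2 layout produced by PhantomJS's capturer, the PLT and per-object information can be extracted with a few lines of Python; the helper name is ours:

import json

def har_summary(har_path):
    """Extract PLT and per-object sizes from a HAR file (standard HAR 1.2 layout).
    Returns the onLoad time in milliseconds and a list of (url, body_size) pairs."""
    with open(har_path) as f:
        har = json.load(f)["log"]
    page = har["pages"][0]
    plt_ms = page["pageTimings"].get("onLoad", -1)   # -1 if onLoad never fired
    objects = [(entry["request"]["url"], entry["response"]["bodySize"])
               for entry in har["entries"]]
    return plt_ms, objects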
Phase III.A also checks for issues with X.509 certificates. First,
we use curl to connect to our server via port 443 and compare the
X.509 certificate presented to the client with our original certificate
(provided by LetsEncrypt [12]). If curl detects any issue with the
certificate, we use OpenSSL to download the X.509 certificates from
our website as well as from two popular websites.⁴
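The sketch below illustrates one way such a check can be implemented: open a CONNECT tunnel through the proxy, retrieve the leaf certificate with verification disabled, and compare its fingerprint against the one obtained over a direct connection. The paper uses curl and OpenSSL for this; the Python version and the fingerprint comparison are our own hedged rendering, and the host names in the usage comment are placeholders:

import hashlib
import socket
import ssl

def leaf_cert_sha256(host, port=443, proxy=None, timeout=10):
    """Return the SHA-256 fingerprint of the leaf certificate served for `host`,
    optionally tunneling through an HTTP proxy ("ip:port") via CONNECT.
    Verification is disabled on purpose: we want the certificate even if invalid."""
    if proxy:
        proxy_host, proxy_port = proxy.split(":")
        sock = socket.create_connection((proxy_host, int(proxy_port)), timeout=timeout)
        request = f"CONNECT {host}:{port} HTTP/1.1\r\nHost: {host}:{port}\r\n\r\n"
        sock.sendall(request.encode())
        status_line = sock.recv(4096).split(b"\r\n", 1)[0]
        if b" 200" not in status_line:
            raise RuntimeError(f"proxy refused CONNECT: {status_line!r}")
    else:
        sock = socket.create_connection((host, port), timeout=timeout)
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()

# A proxy is suspicious if the fingerprint seen through it differs from the direct one:
# leaf_cert_sha256("example.org", proxy="203.0.113.1:8080") != leaf_cert_sha256("example.org")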
Phase III.A classifies a working proxy as trusted, suspicious, or
unrated. Trusted proxies serve the expected content with no alter-
ation and do not replace or modify X.509 certificates. Suspicious
proxies alter the relayed traffic, e.g., by adding unsolicited content
or by not relaying the expected X.509 certificates. Finally, unrated
proxies operate at such a slow speed that they are incapable of serving
the full content requested within the maximum duration allowed.
The partial content they serve was not modified, otherwise we
would mark them as suspicious. Phase III.A quantifies the performance
of trusted and suspicious proxies using the PLT of our realistic
website.
Phase III.B runs daily from 30 Planetlab nodes. Curl is used to
fetch, via each proxy in the working proxy list, the landing pages
of Alexa’s top websites. Precisely, we construct two 1,000-website
lists from Alexa with support for HTTP and HTTPS, respectively.
For each proxy and fetched page, Phase III.B reports both PDT and
similarity score. Proxies are tested mostly against HTTP websites;
only once every 10 tests is a proxy also tested for HTTPS support
by fetching a random website from the HTTPS list. We empiri-
cally measured that Phase III.B is currently capable of testing each
working proxy at least once every 5 minutes.
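As an illustration of how PDT can be collected through a proxy with curl's write-out variables, consider the following sketch; the timeout values and the one-in-ten HTTPS sampling mirror the description above, while the helper names are ours:

import random
import subprocess

def measure_pdt(url, proxy, ua, timeout=30):
    """Measure page download time (PDT) for `url` through `proxy` with curl.
    Returns (http_code, bytes_downloaded, pdt_seconds), or None on failure."""
    cmd = ["curl", "-sS", "-o", "/dev/null",
           "--proxy", proxy, "--user-agent", ua,
           "--connect-timeout", "3", "--max-time", str(timeout),
           "-w", "%{http_code} %{size_download} %{time_total}", url]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0 or not result.stdout:
        return None                       # unreachable or unresponsive proxy
    code, size, pdt = result.stdout.split()
    return int(code), int(size), float(pdt)

def pick_test_url(http_list, https_list, test_counter):
    """Roughly one test in ten exercises HTTPS, as described for Phase III.B."""
    if test_counter % 10 == 0:
        return random.choice(https_list)
    return random.choice(http_list)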
Phase IV allows us both to test free web proxies in the wild and
to learn how free proxies are used. It runs on the machines of
the users that installed Ciao,⁵ a Chrome plugin we developed to
help users find free proxies. Users pick the desired anonymity
level (transparent, anonymous, elite) and location, and Ciao auto-
matically sets up a free proxy based on input from ProxyTorrent. In
³ http://phantomjs.org/network-monitoring.html
⁴ https://www.theguardian.com and https://www.google.com
⁵ https://goo.gl/y86fOy
          Total    Unresp.   Unreach.   Other    Working
Crawling  0.16M    0.11M     0.04M      8,000    2,895
Zmap      29.1M    17M       5.66M      6.4M     2,518
  8080    13.5M    6.6M      2.2M       4.7M     376
  8081    7.2M     4.65M     1.55M      0.95M    171
  8118    4.7M     3.45M     1.15M      0.1M     1,093
  3128    3.7M     2.29M     0.76M      0.65M    878
Table 2: Crawling and scanning (Zmap) summary, June 18th
2017. Results in the last four rows refer to scanning per port.
order to minimize risk and maximize usability, we only consider
proxies that have been labeled as trusted in Phase III.A, and that
have shown the best performance in Phase III.B. At any time the
user can request a new proxy, either to reflect a new preference or
in case of failure.
Ciao reports statistics per download, which captures all the events
in a browser’s tab transitioning from one URL to another, usually
in response to directly typing a URL, refreshing or aborting the
load of a webpage, clicking a link within a page, etc. We leverage
Chrome’s webNavigation APIs to identify the beginning and end
of a download. For each download, the following statistics are
collected: timestamps associated with the beginning and end of the
download, PLT, number of requests and bytes per protocol type
(HTTP/HTTPS), and navigation errors (if any). No personal information,
such as IP address, browser/OS information, or URLs, is reported at
any time. Users are informed about the collected data and its potential
use for anonymized research publications.
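For concreteness, a per-download record could look like the sketch below. The field names are hypothetical, but the fields mirror the statistics listed above, and a simple aggregation over such records yields the traffic volumes reported in Section 4.4:

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DownloadStats:
    """Anonymous per-download record reported by Ciao (illustrative layout)."""
    start_ts: float                     # beginning of the download (epoch seconds)
    end_ts: float                       # end of the download
    plt_ms: int                         # page load time
    requests_by_proto: Dict[str, int] = field(default_factory=lambda: {"http": 0, "https": 0})
    bytes_by_proto: Dict[str, int] = field(default_factory=lambda: {"http": 0, "https": 0})
    error: Optional[str] = None         # navigation error, if any

def total_bytes(records):
    """Aggregate proxied traffic volume per protocol across all reports."""
    totals = {"http": 0, "https": 0}
    for record in records:
        for proto, count in record.bytes_by_proto.items():
            totals[proto] = totals.get(proto, 0) + count
    return totals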
4 THE FREE PROXY ECOSYSTEM
This section characterizes the free proxy ecosystem. We first quan-
tify its magnitude and evolution over time. Next, we provide data
supporting (or not) the preconception that free proxies are mostly
malicious and tend to manipulate served content. We then conclude
by assessing the ecosystem performance and by providing some
evidence on how free proxies are used in the wild. We report on 10
months worth of data (January-October, 2017) spanning more than
180,000 proxies and 1,500 users.
Limitations. We acknowledge from the outset the limitations of our
methodology. According to our findings, around 10% of the working
proxies every day exhibit malicious behavior by either injecting
content, manipulating headers, or replacing X.509 certificates.
This is a lower bound on the fraction of malicious proxies since an
exhaustive behavioral analysis by only controlling a few clients and
servers is out of reach. We stress, however, that related works using
a setup similar to ours share the same limitations [2, 19, 22, 24].
A proxy could behave maliciously only in some cases in order to
avoid detection. For example, it may decide to manipulate content
based on contextual factors, such as the client IP address, the do-
main requested, etc. Our experiments indicate that only 20% of the
malicious proxies manipulate the content of each requested page,
while many (40%) do so only for one out of ten pages requested. Fur-
thermore, there is no guarantee that a proxy that in our experiment
proxied traffic without alterations will not manipulate content
when serving other users. Perhaps the content we requested or the
IP address of our clients simply did not trigger content manipu-
lation at the proxy. Another form of malicious behavior that we
cannot fully assess is user tracking and profiling. Our experiments
reveal several attempts to inject tracking/fingerprinting code, but
we cannot rule out that even innocent-looking proxies carry out
user profiling by simply leveraging the IP address of the user and
her list of requests. We nevertheless argue that ProxyTorrent im-
proves the current situation for proxy users, who are clueless about
whether a given proxy is performing any kind of malicious activity
with the relayed traffic. Furthermore, ProxyTorrent raises the bar
for malicious proxies to avoid detection.

Figure 2: Time evolution of host classification: unreachable,
unresponsive, working, and other.
4.1 Characterization
Magnitude. Table 2 shows a snapshot of the free proxy ecosystem
(June 18th, 2017). We chose this date since, at that time, we sup-
plemented ProxyTorrent’s crawling strategy by scanning the full
IPv4 space and targeting the most popular proxy ports according
to the aggregator websites. Our goal is to understand the coverage
of the aggregator websites we crawl. IPv4 address scanning lever-
ages Zmap [5] from a number of machines we control. Because of
the ethical issues related to port-scanning, we ran the scan only
once. While we test the found proxies to categorize them, we do
not use proxies found exclusively via scanning in the following
experiments, nor do we make them available to Ciao users.
Table 2 reports proxies obtained by crawling the aggregator web-
sites (first row), and the ones found via port-scanning (second row).
The table distinguishes between four host categories: unreachable,
unresponsive, working, and other (see Section 3). Crawling yields
a higher ratio of working proxies (2,895 out of approximately 160k)
compared to port-scanning (2,518 out of more than 29M). Only
719 proxies appear in both data-sets. Regardless of the discovery
strategy, the table shows that most hosts are either unresponsive
or unreachable, and that only a few thousand hosts can actually be
labeled as working proxies. The last four rows of Table 2 show the
breakdown of the proxies discovered via scanning by port.
Figure 2 shows the evolution over time of each proxy category
as defined in Phase II. On the first day, we bootstrap ProxyTorrent
with a list of potential proxies containing 118,915 hosts (<ip, port>
pairs) collected on specialized forums. We then daily supplement
such a list via crawling. Overall, the figure shows that the working
proxy category has a different trend than the others. While the
number of hosts in each category increases over time, the number
of working proxies oscillates between 900 and 3,000.

Figure 3: Time evolution of working proxies by category: unrated,
trusted, and suspicious (either data or TLS certificate manipulation).
We now focus on the (small) core of working proxies for which
further testing was conducted. Figure 3 shows the evolution over
time of the active proxies, i.e., the set of proxies that were reachable
during Phase III.A at least once within a day. Figure 3 also shows
the evolution of the categories trusted, suspicious (split between
proxies that manipulate TLS certificates—cert. issue—and proxies
that manipulate actual content—manipulation), and unrated.⁶
According to Figure 3, every day roughly 66% of active proxies are
marked as trustworthy, while around 24% are marked as unrated.
Suspicious proxies amount to 10% of the active ones, where 100-300
proxies manipulate proxied content and only a handful of them are
caught replacing X.509 certificates. On average, 40% of the proxies
support HTTPS. The drop observed in all curves at mid-June is
caused by a partial failure of our system resources.
Takeaway:
The proxy ecosystem is characterized by a small and
volatile core of proxies surrounded by a large and increasing set of non-
proxy hosts that are erroneously announced on aggregator websites.
Geo-location. Figures 4 and 5 show, for the top 20 countries and
ASes, the total number of proxies they host and the number of
suspicious proxies. Figures are computed considering all working
proxies observed at least once during the six-month monitoring
period. USA (11%), France (9%), China (6.7%), Indonesia (6.6%), Brazil
(6.5%), and Russia (6%) host 45% of the proxies, while the remainder
is scattered across 160 countries. A similar trend is noticeable for
suspicious proxies, with the main difference being that China (130
proxies) passes the US (120) and the gap with France (70) increases.
As for the hosting ASes, about 28% of proxies are concentrated in
only six ASes, while the remaining proxies reside in 4,386 ASes. Both
ISPs and cloud service providers appear in the top 20 ASes.
⁶ The curve cert. issue starts from mid-February, when we added HTTPS support to
ProxyTorrent.
Figure 4: Number of proxies per top 20 countries.
Figure 5: Number of proxies per top 20 ASes.
(In)stability. Next, we explore the stability of the proxies located
in the (usable) core of the free proxy ecosystem. We report their
lifetime, the number of days between the first and the last time
a proxy has been active, and their uptime, the number of days a
proxy was active within its lifetime. Both metrics are derived using
a proxy’s IP address and port as an identifier; our estimates are thus
lower bounds in the presence of dynamic addressing. Figure 6 shows
the CDF of lifetime and uptime over 10 months, distinguishing
between all proxies and the suspicious ones. Proxies tend to have a
long uptime, e.g., 55% of the proxies are available for their whole
lifetime, regardless of whether they are suspicious or not. The figure also
shows that suspicious proxies have a significantly shorter lifetime
compared to the rest of the ecosystem, e.g., a median lifetime of 15
versus 35 days.
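Lifetime and uptime can be derived from a log of daily proxy observations as in the small sketch below; the data layout and the inclusive lifetime convention are our own illustrative choices:

from collections import defaultdict
from datetime import date

def stability(observations):
    """observations: iterable of (proxy_id, day) pairs, one per day a proxy was active.
    Returns {proxy_id: (lifetime_days, uptime_days)}: lifetime is the span between the
    first and last active day (inclusive), uptime the number of active days."""
    active_days = defaultdict(set)
    for proxy_id, day in observations:
        active_days[proxy_id].add(day)
    stats = {}
    for proxy_id, days in active_days.items():
        lifetime = (max(days) - min(days)).days + 1
        stats[proxy_id] = (lifetime, len(days))
    return stats

# Example: a proxy seen on Jan 1, Jan 2 and Jan 10 has lifetime 10 and uptime 3.
example = [("198.51.100.7:3128", date(2017, 1, 1)),
           ("198.51.100.7:3128", date(2017, 1, 2)),
           ("198.51.100.7:3128", date(2017, 1, 10))]
assert stability(example)["198.51.100.7:3128"] == (10, 3)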
Roughly half of the monitored proxies last up to a month. This
result suggests that free proxies are fairly unstable over time. This
can be due to dynamic addressing, for example when proxies run
on residential hosts where they get their IP assigned by a DHCP
server. Another possible reason is that some proxies serve public
traffic due to misconfigurations that are eventually discovered and
fixed by their administrators. The shorter lifetime measured for
suspicious proxies could also be intentional, i.e., frequent changes
to the IP address might be used as a means to circumvent banning
by remote servers.
Takeaway:
The core of the free proxy ecosystem is characterized by a
high level of instability which makes locating a usable proxy extremely
challenging. Half of this core resides in a handful of countries, with
the US leading the pack of trusted proxies and China the pack of
suspicious ones.
Figure 6: CDF of lifetime and uptime for all proxies and the
suspicious ones.
4.2 Behavior
Dierently from above, the following gures are aggregated statis-
tics over the 10 month monitoring period. We discovered 39,143
working proxies of which 16,700 (42%) are classied as unrated,
1,833 (4.5%) as suspicious and 20,610 (53.5%) as trusted. Exclud-
ing unrated proxies—that do not serve enough content to enable a
classication—8.2% of proxies are suspicious, and 91.8% are trusted.
This subsection focuses on suspicious proxies to comment on their
behavior in detail.
Suspicious Behavior Classification. Content manipulated by sus-
picious proxies can be summarized as follows: html (74% of all ma-
nipulated traffic), javascripts (24%), and images (2%). Unsolicited
content injection mostly consists of javascripts, though we also
spotted a few php and image injections. Overall, we witnessed 228
unique content manipulations — this implies that several proxies
manipulate traffic in the same way. Also, suspicious proxies do not
manipulate traffic at each request: only 20% of them manipulate
traffic all the time, while 40% do it less than 10% of the time.
To better understand the purpose of content manipulation, we
resort to visual inspection. To minimize the effort, we first clus-
ter manipulated content using affinity propagation clustering [6].
Specifically, we consider each piece of altered or injected content as
a string and compute the distance matrix required by the clustering
algorithm using the edit distance between each pair of strings.
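A minimal sketch of this clustering step, assuming scikit-learn's AffinityPropagation with a precomputed similarity matrix built from negative edit distances, is shown below; with only 228 unique manipulations the quadratic distance computation stays cheap. Parameters and preprocessing are illustrative:

import numpy as np
from sklearn.cluster import AffinityPropagation

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster_snippets(snippets):
    """Cluster altered/injected content snippets; returns one cluster label per snippet."""
    n = len(snippets)
    similarity = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # AffinityPropagation expects similarities, so use negative distances.
            similarity[i, j] = -edit_distance(snippets[i], snippets[j])
    return AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(similarity)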
Among the output clusters, two of them cover about 60% of
the content manipulation instances. The first cluster contains 84
instances of ad injection code, of which 50 can be linked to two
companies that provide hotspot monetization services. The second
cluster contains 47 instances of fingerprinting/tracking code, mostly
javascripts attempting to identify a user; 30 out of these 47 instances
include rum.js, a popular library to monitor user-webpage interac-
tions. Although rum.js is commonly used by CDN providers, there
is no apparent motivation for a free proxy to inject such code.
The remaining clusters include the following instances of in-
jected code. Nine instances, imputable to only two proxies, display
religious-related content. Four times we witness metadata of pyweb,
a popular proxy rewriting tool for live web content. Pyweb’s meta-
data triggered our detection, but further inspection shows no
actual content rewriting. Finally, we could not figure out the se-
mantics of the remaining 84 content manipulations either because
they were obfuscated or because they were only a few bytes in size.

Figure 7: Header manipulation: request headers (top 10 headers:
X-Forwarded-For, Via, Connection, X-Proxy-ID, Proxy-Connection,
Cache-Control, Server, Date, Content-Type, If-Modified-Since).
Takeaway:
A few content manipulation strategies exist that are shared
among many proxies, advertisement injection being the most frequent
one. Suspicious proxies do not manipulate traffic constantly; ProxyTor-
rent’s continuous monitoring is thus paramount to detect such proxies.
Invalid X.509 Certificates. HTTPS is supported by 17,350 proxies
(about 44% of the working proxies) and 0.9% of them (173 prox-
ies) were caught interfering with TLS handshakes. The most com-
mon behavior among such proxies is to replace the original certifi-
cate with a self-signed one showing vague CommonName attributes
such as “https” or “US”. Three proxies provide certificates with
CommonName matching the original domain but signed by “Zecu-
rion Zgate Web”, a company offering corporate gateways to miti-
gate information exfiltration, and “Olofeo.com”, a French company
that offers managed security services. Only one proxy delivers a
certificate chain of size two, where the leaf certificate has the ex-
pected CommonName but the root certificate has CommonName set to
“STATESTATESTATESTATESTATE”. The issuer of this certificate is
wscert.com, a domain expired as of February 2017.
Takeaway:
Attempts at TLS interception are rare in the free proxy
ecosystem. Modern browsers would easily detect these potential at-
tacks and inform the user. Yet previous work has shown that users
tend to click through warnings [1].
Header Analysis. We now analyze HTTP request and response
headers with the two-fold objective of understanding the level of
anonymity provided by proxies, and whether header manipulation by free
proxies goes beyond traffic anonymization. First, we focus on the
working proxies observed at least once during six months. Then, we
extend our analysis to proxies categorized as other, i.e., proxies that
relay a webpage that differs by more than 50% from our bait webpage
(see Phase II in Section 3).
Figure 7 shows the top 10 request header modifications and injec-
tions observed; Via, X-Proxy-ID, X-Forwarded-For, and Connection
are the most frequently added headers. The first two headers are
used by proxies to announce themselves to origin servers, while
the third one specifies the client IP address to the origin server,
when the proxy acts transparently. By leveraging those headers
we classify proxies as: 1) transparent (77%), proxies that reveal the
original client IP to the server; 2) anonymous (6%), proxies that
preserve client anonymity but reveal their presence to the server;
3) elite (17%), proxies that preserve client anonymity and do not
announce themselves to the origin server.

Figure 8: Header manipulation: response headers (top 10 headers:
Connection, X-Cache, Via, Proxy-Connection, Server, ETag,
Accept-Ranges, Vary, Content-Type, Age).
Connection is another frequently injected header. Roughly 60%
of the proxies tested set it to close or keep-alive. This behavior is
not surprising as this header is reserved for point-to-point commu-
nication, i.e., between client and proxy or between server and proxy.
The Proxy-Connection header plays a similar role, and it is also
added in about 10% of cases. Cache-Control is the only request
header which is altered; about 10% of proxies modify this header
to accept cached content with a given max-age value, despite our
testing tools explicitly specifying not to serve cached content. We also
observe that less than 1% of proxies (not shown in Figure 7) modify
the user-agent by either removing it or specifying their own agents.
While the exposure of the client user-agent reduces anonymity, it
allows the server to optimize the content served based on the user
device and application.
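Detecting added and modified headers boils down to diffing the headers sent by our client against those logged at our server (and, for responses, the headers emitted by our server against those seen by the client). A minimal sketch, with an assumed record layout, follows:

def diff_headers(sent, received):
    """Return the headers added, removed, and modified in transit.
    sent, received: dicts of header name -> value (names compared case-insensitively)."""
    sent = {k.lower(): v for k, v in sent.items()}
    received = {k.lower(): v for k, v in received.items()}
    added = {k: v for k, v in received.items() if k not in sent}
    removed = {k: v for k, v in sent.items() if k not in received}
    modified = {k: (sent[k], received[k])
                for k in sent.keys() & received.keys() if sent[k] != received[k]}
    return added, removed, modified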
Figure 8 shows the top 10 response header modifications and
injections performed by working proxies. As for the request headers,
the Via header is among the most frequently injected ones; it
is used by proxies to announce themselves and their protocol
capabilities to clients. About 30% of proxies also add the X-Cache
header to specify whether the requested content was served from the
proxy’s cache or whether a previously cached response is available. The
most frequently modified header is the Connection header, which is
either removed (50% of cases) or set to close. As previously stated,
this is a common behavior as this header is connection specific and
does not need to be propagated to the client. Finally, less than 10%
of the proxies modify the Server header to reflect the software
they use, rather than that of the origin server.
We now focus on proxies categorized as other. Similar obser-
vations as above hold; in addition, we observe a non-negligible
amount of Set-Cookie (5%), Access-Control-Allow-* (1%), and
X-Adblock-Key (0.5%) headers injected in the responses to clients.
The Set-Cookie header pushes a cookie to the client that may be
used for tracking.
The Access-Control-Allow-* headers are used to grant per-
mission to clients to access resources from a different origin domain
than the one currently in use. Both headers expose clients to mali-
cious or unintended activities; however, they are also frequent for
private and enterprise proxies. Because similar headers were not ob-
served, at this scale, for the working proxies, we conclude that this
behavior is unlikely to be malicious. Conversely, the X-Adblock-Key
response header allows ads to be displayed at clients, bypassing
ad-blocker tools. Proxies injecting this header likely return a modi-
fied version of our “bait” webpage including extra advertisements,
which largely departs from the original page (similarity score < 0.5).
Proxies categorized as other were between 30,000 and 40,000; this
analysis suggests that the similarity score rule introduces about
150-200 false negatives, or about 1%.

Figure 9: CDF of average PLT per proxy distinguishing be-
tween suspicious, all, and best performing proxies.
Takeaway:
HTTP header analysis reveals that the free proxy ecosys-
tem is mostly composed of transparent proxies which announce them-
selves and/or reveal the client’s IP address to the origin server. Suspi-
cious header manipulation is rare; when present, it aims at ensuring
that injected advertisements are not filtered by ad-blockers.
4.3 Performance
We now investigate the performance of the free proxy ecosystem,
i.e., how fast free proxies can deliver content to their users. We use
page load time (PLT) as a performance metric since it accurately
quantifies end user experience [23]. However, PLT also depends on
the composition of a webpage, i.e., its overall size and complexity
in terms of number of objects. Accordingly, it has to be noted that
PLT values from experiments in Phase III refer to our synthetic
webpage—small size and only a few objects—while PLT values for
experiments in Phase IV refer to proxy usage in the wild, i.e.,
overall bigger webpages with hundreds of embedded objects.
Figure 9 shows the CDF of the average PLT measured through
each proxy, distinguishing between suspicious proxies (suspicious),
all proxies in the ecosystem (working), and the best performing
proxies ProxyTorrent offers to its users via Ciao (top). PLT values for
both all and suspicious proxies are measured in Phase III, while PLT
values for top proxies are measured in Phase IV. Failed downloads,
where no PLT was measured, are not taken into account.
Figure 9 shows that suspicious proxies are faster than other
proxies in the ecosystem, e.g., their median PLT is 2.5x lower
(7 seconds versus 18). The figure also shows that ProxyTorrent
correctly identifies the best performing proxies since their PLT
measured at the user is 15% lower than for the rest of the ecosystem.
Takeaway:
Suspicious proxies are, on average, twice as fast as safe
proxies. Faster connectivity may be used by malicious proxies as a bait
to attract more potential victims. Our findings support the popular
belief that free proxies are “free for a reason”.
4.4 Usage
We released Ciao—our Chrome plugin to facilitate discovery and
usage of free proxies—on the Chrome Web store on March 17th,
2017, and announced it via email, social media, and a few forums on
free proxies, anonymity, censorship circumvention, etc. Between
March 2017 and October 2017, Ciao has been installed by more than
1,500 users who generated about 1.3 million downloads, totaling
2 TBytes of HTTP/HTTPS traffic (1.5/0.5 TBytes, respectively).
We start by investigating user preferences in terms of both proxy
location and anonymity level. While for 70% of the queries the
users did specify a country preference, they requested a specific
anonymity level only for 16% of their queries. This indicates that
users are overall more interested in the proxy location than in its
anonymity level. According to user preferences, anonymity levels
can be ranked as follows: transparent proxies (7%), elite (5%), and
anonymous (4%). With respect to proxy locations, only 20% of the
queries are concentrated in the 10 most popular locations (see Fig-
ure 10(a)), while the remaining 80% are spread across 120 countries.
These top 10 countries are also among the ones where most proxies
are located (see Figure 4). Since Ciao shows how many proxies are
available per country, it is possible that user preferences have been
influenced by this information.
Next, we investigate where most of the traffic is proxied. Fig-
ure 10(b) shows the fraction of downloads and bytes for the top 10
countries, only considering downloads where a country preference
was set (70% of the time). In this case, the distribution is heavily
skewed towards the top 10 locations, accounting, overall, for about
60% of all downloads and bytes transferred. However, the ranking
between the two figures is fairly similar.
Figure 10(c) shows the distribution of “geo-localized” Ciao traffic,
i.e., traffic associated with a webpage hosted in the same country as
the proxy being used. For this analysis, we temporarily (one month)
extended the statistics collected by Ciao (see Section 3, Phase IV)
to infer whether the location of a requested website is the same as that of
the proxy used. Ciao users have been informed of this temporary
change in the data collection.
Figure 10(c) shows that, on average, the websites accessed via
a free proxy and the proxy itself are hosted in the same country
for 30% of downloads and 20% of bytes. Further, we observe no
geo-localized traffic for half of the countries. This result suggests
that geo-blocking avoidance is not a prominent use-case for free
web proxies. However, the figure also shows some countries with
high percentages of geo-localized traffic, e.g., 60% in the US.
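The geo-localized fraction can be computed from the temporarily extended Ciao records as in the sketch below; the field names are hypothetical stand-ins for the country information collected during that one-month extension:

def geo_localized_fraction(downloads):
    """downloads: iterable of dicts with 'proxy_country', 'site_country', and 'bytes'.
    Returns the fraction of downloads and of bytes where website and proxy share a country."""
    total_dl = total_bytes = geo_dl = geo_bytes = 0
    for d in downloads:
        total_dl += 1
        total_bytes += d["bytes"]
        if d["proxy_country"] == d["site_country"]:
            geo_dl += 1
            geo_bytes += d["bytes"]
    if total_dl == 0:
        return 0.0, 0.0
    return geo_dl / total_dl, geo_bytes / max(total_bytes, 1)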
The geo-localized traffic observed in Figure 10(c) could reflect at-
tempts to access popular geo-blocked services like Hulu or Netflix
in the US. In the absence of URL visibility, we investigate whether these
users are particularly concerned about leaking their IP/location, i.e.,
are more likely to request anonymous and elite proxies. We find
that for these downloads anonymous proxies are the most popu-
lar choice (70%)—while normally being the least popular choice—
followed by transparent (23%) and elite (7%). Even if elite proxies
provide a higher anonymity level than anonymous ones, they are
less likely to be used to access geo-blocked content. This may be due
to their name, which does not clearly convey strong anonymity to
non-expert users, differently from anonymous proxies.

Figure 10: Proxy usage and geo-location analysis. (a) Fraction of
queries for the top 10 countries based on user preferences. (b) Fraction
of downloads and bytes proxied by the top 10 countries. (c) CDF of
the percentage of geo-localized traffic.
Finally, we investigate which type of content is downloaded
when using free web proxies. Our analysis relies on the little infor-
mation Ciao collects to preserve its users’ privacy, i.e., download size
and duration. Figure 11 shows a scatterplot of the size of each down-
load (bytes) as a function of its duration (seconds). 95% of downloads
are short (< 1 minute) and contain, on average, 500 KBytes. Even
though 500 KBytes is less than the size of an average webpage
(2.9 MBytes according to httparchive [9]), these downloads relate
to regular web browsing. The smaller download size we observe
is due to: 1) httparchive derives its statistics from crawling Alexa’s
top webpages while our workload is driven by real users that may
visit a different set of websites; 2) our download size estimation is a
lower bound on the actual webpage size as Ciao is oblivious to data
retrieved from the browser’s cache. Figure 11 also shows a non-
negligible amount of downloads lasting several minutes (0.1%) and
containing a few hundred MBytes, as well as two very long downloads (up
to a couple of hours) containing a few GBytes. These large downloads
could be due to software or video downloads, live streaming, etc.
We speculate the latter since no additional browsing activity was
observed during these long sessions, i.e., the user did not perform
any other download, suggesting that she could be watching the
content being retrieved.

Figure 11: Scatterplot of download size and duration.
Takeaway:
Ciao has proven to be a valuable tool to shed some light
on how free proxies are used. By analyzing 2 TBytes of traffic generated
by 1,500 users over 7 months, we identify web browsing as the most
prominent user activity. Overall, geo-blocking avoidance is not a
prominent use-case for free web proxies, with the exception of countries
hosting a lot of geo-blocked content, like the US.
5 CONCLUSION
Fueled by an increasing need for anonymity and censorship circum-
vention, the (free) web proxy ecosystem has been growing wild in
the last decade. This ecosystem consists, potentially, of millions of
hosts, whose reachability and performance information is scat-
tered across multiple forums and websites. Studying this ecosystem
is hard because of its large scale, and because it involves two players
out of reach: free proxies and their users. The key contributions of
this work are ProxyTorrent, a distributed measurement platform
for the free proxy ecosystem, and an analysis of 10 months of data
spanning up to 180,000 free proxies and 1,500 users. ProxyTorrent
leverages a funnel-based testing methodology to actively monitor
hundreds of thousands of free proxies every day. Further, it leverages
free proxy users to understand how proxies perform and how they
are used in the wild. The latter is achieved via a Chrome plugin we
developed which simplifies the (hard) task of finding a working and
safe free proxy in exchange for anonymous proxy usage statistics.
Our analysis shows that the free proxy ecosystem consists of a very
small and volatile core, less than 2% of all announced proxies, with
a lifetime of a few days. Only half of the proxies in this core have
good enough performance to be used. However, users should be
aware that about 10% of the best working proxies are “free for a
reason”: ads injection and TLS interception are two examples of
malicious behaviors we observed from such proxies. Finally, the
analysis of more than 2 Terabytes of proxied traffic shows that free
proxies are mostly used for web browsing and that geo-blocking
avoidance is not a prominent use-case.
REFERENCES
[1] Devdatta Akhawe and Adrienne Porter Felt. 2013. Alice in Warningland: A Large-Scale Field Study of Browser Security Warning Effectiveness. In USENIX Security Symposium. 257–272.
[2] Taejoong Chung, David R. Choffnes, and Alan Mislove. 2016. Tunneling for Transparency: A Large-Scale Analysis of End-to-End Violations in the Internet. In ACM Internet Measurement Conference, IMC. 199–213.
[3] CIAO. 2017. Automated free proxies discovery/usage. https://goo.gl/NgJmLE.
[4] CURL. 2017. Command line tool and library for transferring data with URLs. https://curl.haxx.se/.
[5] Zakir Durumeric, Eric Wustrow, and J. Alex Halderman. 2013. ZMap: Fast Internet-Wide Scanning and its Security Applications. In USENIX Security Symposium. 605–620.
[6] Brendan J. Frey and Delbert Dueck. 2007. Clustering by passing messages between data points. Science 315, 5814 (2007), 972–976.
[7] Haschek Solutions. 2017. ProxyChecker. https://github.com/chrisiaut/proxycheck_script.
[8] Hola. 2017. Free VPN, Secure Browsing, Unrestricted Access. http://hola.org/.
[9] HTTP Archive. [n. d.]. The HTTP Archive tracks how the Web is built. http://httparchive.org/.
[10] Muhammad Ikram, Narseo Vallina-Rodriguez, Suranga Seneviratne, Mohamed Ali Kâafar, and Vern Paxson. 2016. An Analysis of the Privacy and Security Risks of Android VPN Permission-enabled Apps. In ACM Internet Measurement Conference, IMC. 349–364.
[11] Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson. 2010. Netalyzr: illuminating the edge network. In ACM Internet Measurement Conference, IMC. 246–259.
[12] LetsEncrypt. 2017. A free, automated, and open Certificate Authority. https://letsencrypt.org/.
[13] MAXMIND. 2017. IP Geolocation and Online Fraud Prevention. https://www.maxmind.com/.
[14] NGINX. 2017. A free, open-source, high-performance HTTP server. https://nginx.org/.
[15] Vasile Claudiu Perta, Marco Valerio Barbera, Gareth Tyson, Hamed Haddadi, and Alessandro Mei. 2015. A Glance through the VPN Looking Glass: IPv6 Leakage and DNS Hijacking in Commercial VPN clients. PoPETs 2015, 1 (2015), 77–91.
[16] PhantomJS. 2017. Headless Browser. http://phantomjs.org/.
[17] PLANETLAB. 2017. An open platform for developing, deploying, and accessing planetary-scale services. https://www.planet-lab.org/.
[18] ProxyTorrent team. 2017. Ciao code. https://github.com/ciao-dev/CIAO.
[19] Charles Reis, Steven D. Gribble, Tadayoshi Kohno, and Nicholas C. Weaver. 2008. Detecting In-Flight Page Changes with Web Tripwires. In USENIX Symposium on Networked Systems Design & Implementation, NSDI. 31–44.
[20] Will Scott, Ravi Bhoraskar, and Arvind Krishnamurthy. 2015. Understanding Open Proxies in the Wild. In Chaos Communication Camp.
[21] Georgios Tsirantonakis, Panagiotis Ilia, Sotiris Ioannidis, Elias Athanasopoulos, and Michalis Polychronakis. 2018. A Large-scale Analysis of Content Modification by Open HTTP Proxies. In Network and Distributed System Security Symposium (NDSS).
[22] Gareth Tyson, Shan Huang, Félix Cuadrado, Ignacio Castro, Vasile Claudiu Perta, Arjuna Sathiaseelan, and Steve Uhlig. 2017. Exploring HTTP Header Manipulation In-The-Wild. In International Conference on World Wide Web, WWW. 451–458.
[23] Matteo Varvello, Jeremy Blackburn, David Naylor, and Konstantina Papagiannaki. 2016. EYEORG: A Platform For Crowdsourcing Web Quality Of Experience Measurements. In CoNEXT.
[24] Nicholas Weaver, Christian Kreibich, Martin Dam, and Vern Paxson. 2014. Here Be Web Proxies. In Passive and Active Measurement, PAM. 183–192.