Beyond the Front Page:
Measuring Third Party Dynamics in the Field
Tobias Urban
urban@internet-sicherheit.de
Institute for Internet Security
Ruhr University Bochum
Martin Degeling
martin.degeling@ruhr-uni-bochum.de
Ruhr University Bochum
Thorsten Holz
thorsten.holz@ruhr-uni-bochum.de
Ruhr University Bochum
Norbert Pohlmann
pohlmann@internet-sicherheit.de
Institute for Internet Security
ABSTRACT
In the modern Web, service providers often rely heavily on third
parties to run their services. For example, they make use of ad
networks to finance their services, externally hosted libraries to
develop features quickly, and analytics providers to gain insights
into visitor behavior.
For security and privacy, website owners need to be aware of
the content they provide their users. However, in reality, they often
do not know which third parties are embedded, for example, when
these third parties request additional content as is common in
real-time ad auctions.
In this paper, we present a large-scale measurement study to
analyze the magnitude of these new challenges. To better reflect
the connectedness of third parties, we measured their relations in a
model we call third party trees, which reflects an approximation of
the loading dependencies of all third parties embedded into a given
website. Using this concept, we show that including a single third
party can lead to subsequent requests from up to eight additional
services. Furthermore, our findings indicate that the third parties
embedded on a page load are not always deterministic, as 50 %
of the branches in the third party trees change between repeated
visits. In addition, we found that 93 % of the analyzed websites
embedded third parties that are located in regions that might not be
in line with the current legal framework. Our study also replicates
previous work that mostly focused on landing pages of websites.
We show that this method is only able to measure a lower bound, as
subsites show a significant increase of privacy-invasive techniques.
For example, our results show an increase of used cookies by about
36 % when crawling websites more deeply.
KEYWORDS
third parties, cookies, privacy, web measurement
ACM Reference Format:
Tobias Urban, Martin Degeling, Thorsten Holz, and Norbert Pohlmann.
2020. Beyond the Front Page: Measuring Third Party Dynamics in the Field.
In Proceedings of The Web Conference 2020 (WWW ’20), April 20–24, 2020,
Taipei, Taiwan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3366423.3380203

This paper is published under the Creative Commons Attribution 4.0 International
(CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their
personal and corporate Web sites with the appropriate attribution.
WWW ’20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published
under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10.1145/3366423.3380203
1 INTRODUCTION
A majority of today’s online services are a combination of original
content and—to a non-negligible extent—third party resources [45].
Most notably, online advertising is embedded using external
resources that display ads to finance these services and to provide
them to users free of charge. Other third parties are included for
various means, e. g., libraries are used to develop services quickly, to
decrease loading times, and for analytical purposes. Consequently,
this leads to a highly dynamic Web with complicated dependencies
among all participants. This trend comes with the drawback that
some service providers might not be aware of which third parties
are delivered to customers in their name when users interact with
their website. Ultimately, third parties can pose risks to users, which
is obviously unintended by the service provider. For example, third
parties can create security problems (e. g., malvertising [28, 43, 44]),
might have negative privacy implications (e. g., trackers [1, 9, 10]),
or they can include content that might impact users in other
negative ways (e. g., crypto miners [26, 41]). Services themselves
reinforce these dynamics as they make use of different sets of third
parties in different sections and webpages. For example, news
websites often insert scripts to connect with social media below articles,
but not on the actual landing page. This raises the question of
whether previous studies that exclusively measured the landing
pages (e. g., [6, 9, 21, 34, 45, 50]) captured a complete and
comprehensive view of the analyzed phenomenon.
We perform a measurement study on 10,000 websites on the Web
and analyze relations between third parties. We use the notion of
third party trees (TPT) as a metric for loading dependencies of all
third parties embedded into a given website. More specifically, a
TPT contains information on all third parties (TP) observed when
visiting a given website and accounts for the loading sequence of
each TP. Consider the following example: adidas.com embeds a
script which loads content from Adobe (3rd party). The script again
loads a script from Tealium (4th party), which also loads a script
from Akamai (5th party). As a result, a TPT captures the hierarchical
structure of third parties on a given website and enables us to
study the typical characteristics and dynamic nature of the modern
Web. Furthermore, we show that embedding a single TP might
result in embedding a non-deterministic number of additional TPs,
which might pose privacy or security risks. Previous work in this
area has analyzed implications of the presence of multiple third
parties on websites. Recently, Ikram etal. [
21
] raised awareness for
the problem of implicit trust created by decency chains in website
embeddings. Earlier work focused on the extent of tracking (e. g., [
6
,
10
,
45
]), or on the used mechanisms (e. g., [
1
,
9
,
29
]), and again other
works on defense mechanisms (e. g., [
36
,
39
]), or the eectiveness
of such (e. g., [
14
,
32
,
34
]). In this work, we want to asses in more
detail by whom third parties are embedded into websites and study
the extent of control service providers have on the embedded third
parties. Most importantly, we show that previous studies did not
measure the extent of the phenomenon extensively enough and
only measured a (not necessarily generalizable) lower bound of
included TP content. Our results show a signicant increase in used
cookies (36 %) and tracking techniques (6%) on subsites.
In summary, we make the following key contributions:
(1)
We introduce the concept of third party trees (TPTs) that
reflects all third parties and dependencies when loading a
website. Utilizing TPTs, we show that some TPs load several
further partners and that those are not always deterministic
and possibly in conflict with current legislation.
(2)
We show that only measuring the trac generated by land-
ing pages of a website or only a few subsites leads to the risk
of only capturing a (potentially limited) subset of the loaded
third parties. This implies that the obtained results might be
biased and not generalizable. For example, our study indi-
cates that subsites use substantial more cookies (over 45 %)
than the site’s landing pages.
(3)
Using our data, we try to replicate previous work to test
whether it only measured an incomplete view of the studied
phenomenon, and show that most privacy-invasive technologies
occur more often on subsites.
2 BACKGROUND
Before introducing our approach, we briey describe third party
usage and outline the privacy implications of those.
2.1 Third Party Usage
Web services make use of resources hosted by third parties for
various reasons. Everyday use cases for third-party usage are
libraries used for web development, the integration of social media
content (e. g., the Facebook Like button), displaying ads on websites,
or increasing the service’s performance (e. g., using cached fonts).
Often these third parties are embedded by adding JavaScript code
or an iframe element to the website. After injection, these objects
perform the desired tasks independently and might even load
further resources. For example, an embedded ad might load additional
third-party code that is designed to counter ad fraud, to measure
the effectiveness of the ad, or to load additional fonts used by the
ad. As a result, embedding a single third party can lead to a long
tail of additionally embedded partners.
2.2 Online Tracking
Tracking users online is a widespread phenomenon on the Web [9].
It is used to re-identify users navigating the Web and is a crucial part
of the modern online advertisement ecosystem, as it allows advertisers
to provide targeted ads. Techniques to track users can be divided into
stateless and stateful approaches. Stateless approaches use specific
attributes of the users’ device to identify it [1, 9, 13, 37, 53] (often
called “device fingerprinting”). In contrast, stateful approaches use
the machine’s state to identify users. Typically, an ID is assigned
to each user and is stored in a cookie on the user’s device. The
upside of stateless approaches is that they cannot be prevented by
deleting third-party cookies. However, they are more error-prone
as device-specific attributes tend to change over time [16, 52].
3 RELATED WORK
Previous work analyzed tracking mechanisms and the eects of
privacy legislation through measurement studies.
Privacy & Tracking Measurements. Englehardt et al. introduce
OpenWPM and use it to crawl the top 1 million websites and analyze
their tracking capabilities [
9
]. They nd that many websites use
highly sophisticated ngerprinting methods (e. g., based on image
rendering) and that most companies participate in cookie syncing.
Degeling et al. analyze dierent cookie banner notications and
eects of the GDPR on privacy policies [
7
]. They nd that more
than half of websites provide a cookie consent notice, but only very
few oer users a real choice regarding cookie usage. The eects of
the GDPR have been studied extensively in the past. For example,
Utz et al. [
51
] analyzed implementations of cookie consent banners,
Urban et al. [
48
,
49
] analyzed usability of the GDPR right to ac-
cess and the eect of the GDPR on cookie syncing activities [
50
].
Dabrowski et al. test if the GDPR has an impact on cookie settings
when users access the same websites from dierent countries [
6
].
They nd that websites (around 50 %) do not set cookies when a
user from the EU visits the website while they set a cookie when
the user visits from a non-EU country. Most recently, Sørensen
et al. analyzed the eect of the GDPR regarding third parties em-
bedded into websites [
45
]. The authors measure several prominent
websites and test whether the GDPR aects their third party usage.
They conclude that the overall usage of cookies declined but that
the GDPR was not necessarily the driver for that change.
Third Party Inclusion. Closely related to our approach is the work
of Kumar et al. [28] and Ikram et al. [21]. Both works use a concept
of the implicit trust of the embedded third and further parties.
Kumar et al. show that websites heavily rely on third parties, that
almost one-third of websites embed a third party that loads further
parties, and that these dependencies are a problem if one wants to
serve a website fully via HTTPS. Ikram et al. also show that many
websites (approx. 40 %) implicitly trust parties loaded by directly
embedded third parties and see an increase in embedded malicious
or at least suspicious sites or script files in these chains.
Our work diers from previous work, as most tried to measure
eects on a horizontal scale (i. e., visiting a lot of distinct domains)
while we instead analyze websites on a vertical scale (i. e., , we
visit several subsites of the same domain). Furthermore, we focus
on privacy-invasive technologies and the determinism of third
party dependencies. By this vertical approach and dependency
identication, we can (1) analyze if subsites show dierent behavior
compared to landing pages, (2) study eects of embedding dierent
third parties to websites, and (3) understand who is responsible for
embedding specic third parties.
4 MEASUREMENT APPROACH
In this work, we conduct a large-scale measurement study of the
dynamics of the Web at the application level (i. e., in the browser) to gain
insights into the usage of third parties and to illuminate how and
why they are embedded into websites. In this section, we describe
our approach and highlight how we estimate the relations between
specific third parties that are embedded into websites. Our study
consists of a multi-stage process in which we (1) build a corpus of
websites to visit, (2) use OpenWPM [9] to crawl these websites and
gather first-party links on them, and finally (3) visit the
crawled links and log all HTTP traffic, cookie usage, embedded
iframes, and JavaScript calls of interest.
4.1 Terminology
Before describing our approach, we dene two terms we use through-
out this work. By TLD+1 we mean the last part of the hostname
following the last dot in it. For example, the URL https://tools.ietf.org
has
TLD=org
,
hostname=tools.ietf
, and
TLD+1=ietf
. In most
cases, TLD+1 is a “second-level domain”. However, some domain
name registries use a second-level hierarchy. For example, New
Zealand uses various second level domains for dierent purposes:
.co.nz
for organisations or
.school.nz
for schools. We identied
the TLDs using Python’s tldextract [
40
] package, which accurately
splits generic or country code top-level domains (ccTLD). Further-
more, we distinguish between landing pages and subsites. A website
is a subsite (SB) of a landing page (LP) if both share the same TLD+1
but have distinct URLs. Hence, rst-party links on landing pages,
the page that is usually visited rst, lead to subsites. We chose to
use the term SB rather than “webpage” to explicitly highlight the
hierarchical relation between SBs and LPs.
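As a rough sketch of this TLD+1 convention, the split can be expressed as a longest-suffix lookup. The hand-rolled suffix set below is a toy fragment for illustration only; the authors' actual pipeline uses the tldextract package against the full public suffix list.

```python
# Illustrative fragment of the public suffix list; a real implementation
# should use tldextract (as the paper does), which ships the full list.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.nz", "school.nz", "co.uk"}

def tld_plus_one(hostname: str) -> str:
    """Return the label directly preceding the public suffix (the paper's TLD+1)."""
    labels = hostname.split(".")
    # Ascending i tries the longest candidate suffix first, which handles
    # second-level registries such as .co.nz correctly.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return labels[i - 1]
    return labels[-1]
```

With this convention, google.com and google.co.uk both map to the TLD+1 "google", which is why the corpus deduplication in Section 4.2 keeps only one of them.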
4.2 Website Corpus
In our analysis, we use the Tranco top 1M list [30], which
is an aggregation of four other domain top lists. We used the list
generated on 03/26/2019 (ID: W9L9). First, we removed all websites
with the same TLD+1 and only kept the one with the higher rank.
We did so because we wanted to remove URLs of services that
offer users (almost) the same functionality. For example, if the list
contains google.com (rank 1) and google.co.uk (rank 4), we would
drop google.co.uk because both domains share the same TLD+1. In
total, we removed 607 websites in this step. From the remaining
domains, we used the top 10,000 domains and grouped them by
the category of their content and also sort them into four dierent
buckets based on their ranking.
We used the McAfee SmartFilter Internet Database service to
retrieve a list of content categories for the websites [33]. We cluster
the websites by categories because we want to check if the category
of a website has an impact on the usage of cookies and other
privacy-invasive technologies. Previous work has shown that, for
example, news websites utilize more third parties (e. g., ad services)
than other categories [45]. In total, 85 different categories are
assigned to the websites of the dataset. An overview of the 15 most
prominent categories is given in Figure 1. In the remainder, we
Figure 1: Overview of prevalent website topics in our dataset.
limit the analyzed categories to the top eight categories and
combine all remaining categories in “Other”. Additionally, we group
the websites into the following buckets based on the website’s rank
in the used list: (1) 1 ≤ rank ≤ 100, (2) 100 < rank ≤ 1,000,
(3) 1,000 < rank ≤ 10,000, and (4) 10,000 < rank ≤ 100,000. Due
to the removal of duplicate domains, bucket (4) holds these 607
domains, 6.1 % of all visited domains. We use the buckets to test
whether the popularity of websites has an impact on the usage of
specific technologies.
If not stated otherwise, we use the one-way analysis of variance
(one-way ANOVA) statistical model to find differences between the
analyzed groups. In all tests, we use a 95 % confidence interval.
4.3 Measurement Framework
To measure the dynamics of websites, we utilize the OpenWPM
platform [9]. For each visit, we use the same user agent (Mozilla/5.0
(X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0)
and desktop resolution (1366x768), allow all third party cookies,
do not set the “Do Not Track” HTTP header or use other privacy-preserving
techniques (e. g., anti-tracking extensions), and use standard
bot mitigation techniques to disguise our crawler (i. e., random
scrolling and mouse jiggling). Furthermore, the browser adopts
other properties from the operating system (Ubuntu 18.04). Aside
from our bot mitigation techniques, we do not interact with the
visited websites in any way; limitations of this approach are
discussed in Section 6. While a website might detect our crawler, it is
not detected by current mechanisms seen in the wild, as presented
by Jonker et al. [24].
OpenWPM is congured to store all third party cookies set or
accessed via JavaScript and HTTP headers. To capture these events,
we instrumented specic JavaScript functions that access the local
storage or HTTP cookies, by adjusting the
.prototype
of the respec-
tive functions and applying a wrapper to them that logs each call
and access to these functions. Furthermore, we inspect all HTTP
headers if a cookie is accessed (
Cookie
) or set (
Set-Cookie
). For
our measurement study, we disabled Flash because, on the one hand,
the technology will be deprecated by 2020 [2] and, on the other hand,
we did not find a considerable usage (< 0.01 %) of Flash cookies in
a pre-study we conducted (see Section 4.3.1). We passively log all
DNS responses to test if IP addresses are used which are associated
with countries that do not automatically oer a GDPR adequate
privacy protection level. We dene all countries that are part of
the Privacy Shield [
47
] and countries part of the European Eco-
nomic Area (EEA) [
19
] to be adequate. We use MaxMind’s GeoIP
database [31] to create this association.
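The wrap-and-log instrumentation idea described above (adjusting a function's .prototype so each call is recorded) can be illustrated by analogy in Python via monkey-patching. The CookieJar class and all names below are hypothetical stand-ins, not OpenWPM's actual instrumentation API.

```python
import functools

call_log = []  # records (function name, arguments) for each observed call

def logged(func):
    """Wrap a function so every invocation is recorded before delegating."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_log.append((func.__name__, args[1:]))  # skip the instance itself
        return func(*args, **kwargs)
    return wrapper

class CookieJar:
    """Hypothetical stand-in for a cookie-accessing interface."""
    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

# Analogous to adjusting a .prototype: patch the method on the class so
# all instances are instrumented transparently.
CookieJar.set = logged(CookieJar.set)

jar = CookieJar()
jar.set("uid", "42")  # the call is now logged before taking effect
```

The patched method behaves exactly as before; the wrapper only adds the side channel that records who called what, which is the essence of the measurement setup.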
4.3.1 Pre-Study. As the Web is highly dynamic, any attempt to
measure it is quite challenging. To get a comprehensive view of
cookie and third party usage, we conducted a pre-study to get an
approximation of which measurement parameters to use (e. g., the
number of subsites to visit) while limiting the crawling time and
generated traffic to a reasonable amount. In the following, we limit
our pre-study to TP cookies as prior work extensively analyzed
those [6, 7, 10, 15, 17, 27, 42], and we want to test whether they
might have missed cookies due to their measurement setup. However, in our
primary analysis, we also analyze various tracking mechanisms
(see Section 5.2). To find the optimal number of subsites to visit, we
randomly selected 100 websites (TLD+1) from the top 1,000 websites
and visited 25, 50, 75, 100, 250, 500, and 1,000 subsites of these
websites. The websites were visited in separate measurements but
using the same TLDs+1. We conducted these measurements once
using a browser with a profile that already had some cookies present in
the local cookie store and once with a vanilla browser to see if
active cookies influence cookie usage. We filled the local cookie
store by randomly visiting 100 websites from the top 1,000 websites
and used the resulting cookie store. In a separate measurement, we
visited the landing page of the selected websites 1,000 times and
recorded the used cookies to test if there is a difference if users visit
the landing pages or subsites.
We compared the number of TP cookies set in each measurement
of the pre-study and found that subsites of websites typically set
significantly more cookies than the respective landing page does.
In our measurement, the mean number of cookies used increased
by approx. 20 (41 %) when visiting subsites rather than only the
landing page. This shows that if one wants to perform cookie/third
party measurements, one should always include subsites in the
measurement setup rather than only measuring landing pages.
Furthermore, we measured a mean increase of 12 cookies (27 %) per
website visit if a browser is used that already has cookies in the local
cookie storage. When it comes to the change of cookie usage based
on the number of visited subsites, we found that the mean number
of accessed/set cookies stabilizes around 50 (SD: 100; median at 12)
after visiting 100 subsites (see Figure 2). In conclusion, to magnify
the number of cookies set, we use a browser profile that has cookies
set and visit 100 subsites and the landing page of each website.
4.3.2 Measurement Sequence. We used the same method to create
the browser profile for our experiment crawls that we utilized in
the pre-study. This profile is loaded before each website visit but is
not altered. Hence, each website visit uses the same profile, and the
order of visited websites does not impact the results. In total, we
conduct the measurements from three different locations (Europe
(DE), North America (US), and Asia (JP)) to account for possible
geographical differences [6]. For all measurements, we used two
computers located at a European university. For each of our regional
measurement runs, we created a new distinct browser profile. We
used a commercial VPN service (NordVPN) to obtain an IP address
from the locations outside the EU. Using a VPN service comes
Figure 2: Mean number of cookies set in our pre-study with
the corresponding standard deviation.
with the risk that it might inject content into the communication
stream [25]. However, we did not find any hints of this practice for
the service used, neither in the Terms of Service nor publicly on
the Internet.
We congured OpenWPM to visit the landing page of each web-
site and to gather all rst-party hyperlinks on that site (subsites)
one day before the rst measurement. Therefore, some of these
links might not be present on the front page anymore at the time
we perform the measurements from dierent regions or might not
exit anymore after all. We did so to increase comparability between
our measurements since we visited the same landing pages and sub-
sites in each measurement. Additionally, we collect all rst-party
hyperlinks on the subsites but only use them (in random order)
if there are not enough subsites linked on the landing page. Af-
terward, we choose 100 random subsites that we used during the
experiment crawls. In each measurement, we visited 549,715 (SD
16,851) distinct URLs on average.
4.4 Cookies
A cookie is a key-value pair set on a client by a visited website or
a third party present on that website. In this work, we count every
single key-value pair as one cookie, and not all textual data stored
on the client, because each pair can be used for different purposes.
We heuristically group cookies into different categories based on
their lifetime. As for HTTP cookies, we compute the lifetime of a
cookie based on the expires attribute and the timestamp when the
request/response was sent/received or the JavaScript command was
executed. If we cannot determine the lifetime of a cookie or if it
is negative, we consider the cookie a “Session” cookie, which is
deleted by the browser when the HTTP session ends. In total, we
use four lifetime categories: (1) “Session”, (2) “Short” (≤ 1 week),
(3) “Persistent” (≤ 1 year), and (4) “Permanent” (> 1 year). We used
the evolution of maximum cookie lifetimes in the Safari browser,
enforced through the Intelligent Tracking Prevention [3], as an
orientation to determine them.
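The lifetime bucketing can be sketched as follows; the thresholds mirror the four categories above, while the function signature itself is illustrative and not the study's actual code.

```python
from datetime import datetime, timedelta

def classify_lifetime(expires, observed_at):
    """Heuristically bucket a cookie by lifetime as described above.

    expires: datetime parsed from the cookie's expires attribute, or None
             if the lifetime cannot be determined.
    observed_at: datetime when the request/response or JS call was seen.
    """
    if expires is None:
        return "Session"
    lifetime = expires - observed_at
    if lifetime <= timedelta(0):         # negative lifetime: treat as session
        return "Session"
    if lifetime <= timedelta(weeks=1):
        return "Short"
    if lifetime <= timedelta(days=365):
        return "Persistent"
    return "Permanent"
```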
4.4.1 Cookie classification. Cookies can be used for various means.
We want to assess the specific purposes for which third parties set
cookies and which purposes are most dominant to get a better
understanding of real-world cookie usage. We use the following cookie
type classes dened by the International Chamber of Commerce
UK [
22
]: (1) “Strictly Necessary Cookies” are needed to provide ba-
sic functionality of a website, (2) “Performance Cookies” aggregate
(anonymously) user’s usage of the website, (3) “Functionality Cook-
ies” personalize the website’s usage, and (4) “Targeting/Advertising
Cookies” are used to track users or to display them personalized
ads. For our analysis, we used Cookiepedia, a platform that provides
public classications of cookie classes [
38
]. This process might be
error-prone as cookie classes are assigned by hand but are—from
our point of view—the best approximation of online cookie usage
today. In total, we can classify 45.3% of all observed cookies.
4.5 Third Party Trees
In this work, we evaluate the number of partners loaded by an
embedded third-party object. To do so, we model third party trees
(TPTs) for each visited URL (for each landing page and all subpages,
respectively), which include all third parties loaded on the visited
page. A similar concept was used by Ikram et al. [21] and Kumar
et al. [28] to analyze resource loading dependencies (termed
“inclusion chains”). We extend this concept as we visit several sites of a
single domain, which enables us to construct a more comprehensive
and realistic view of a website’s dependencies, and we do not limit
ourselves to JavaScript inclusions. We use the term tree rather than
chain as our concept describes a more complete view of a website’s
TP relations and not a single instance of TP inclusion.
We build the trees based on the analysis of JavaScript, iframes,
and Cascading Style Sheets (CSS) that can be used to load third-party
code dynamically. Other HTML objects (e. g., images) can
also be requested from third parties, but these objects cannot load
additional code dynamically and would not spawn any children in
the tree. In our analysis, we omit these objects if they are located
right below the root (depth = 0) but consider them if they occur
as leaves in longer branches. We omit them because they would
make the results harder to interpret, as one cannot decide if these
parties do not load further third parties or simply cannot do so.
However, we consider these objects in our general analysis (see
Section 5.1). A third party tree is designed to show which party
is responsible for loading another party. To account for HTTP
redirects, we substitute the respective TLD+1 with the redirect’s
TLD+1 in the trees and delete all edges that create a redirection
loop. Therefore, we add each loaded script and inserted iframe as a
child of the respective ancestor (script/frame) in the tree, if needed.
For example, if a script, which is loaded from foo.com, loads another
script from bar.com, we add bar.com as a child of foo.com in the
tree. Thus, we can measure the number of third parties loaded due
to each embedded object. Regarding iframes, we use OpenWPM’s
feature to save the nested iframe structure of a website. Based on
this structure, we insert each frame (i. e., the source TLD+1) at
the corresponding position in the tree. For JavaScript code, we
inspect the call stack of each script, test if code from another party
is executed (e. g., a function in an external library), and include
this party at the respective position in the TPT (based on the call
stack entries). To find CSS dependencies introduced through the
@import command, we analyze the content type of HTTP requests
and test if the origin and target of the request URL both load CSS.
Eventually, each TPT consists of all scripts, style sheets, and iframes
Node(adidas.com) (Visited Website)
|- Node(MediaMath [C]) depth=0; breadth=4
| |- Node(Improve Digital) depth=1; breadth=0
| |- Node(PubMatic [C])
| |- Node(OpenX [C])
| |- Node(Index Exchange)
|- Node(TrustArc)
|- Node(Adobe)
| |- Node(Tealium [C])
| |- Node(Akamai [C]) depth=2
| |- Node(Instana) depth=3
| |- Node(Adobe) depth=4
Figure 3: Example of an observed third party tree. The listed
companies represent the companies operating the observed
URLs. [C] illustrates the cookie setting parties.
loaded by a website. Each branch of a tree represents the sequence
in which dierent third parties (domains) were embedded.
If not stated otherwise, we use the TLD+1 of a third party domain
as the node identifier; otherwise, we use the companies associated
with the TLD+1. We use the WhoTracks.me database [5] to link
domains to the respective companies owning them. Thus, a branch
in the tree could consist of multiple domains operated by the same
company (e. g., foo.com → googletagmanager.com → googleapis.com
→ youtube.com). However, we collapsed requests stemming from
one company into one leaf. In the previous example, we would not
add googleapis.com even if youtube.com would load a script from
that domain. We did so because otherwise the resulting trees would
become much deeper if several resources were loaded
from the same TLD+1. For example, if foo.com were embedded and
then loaded metric.foo.com, subsequently ad.foo.com, and finally
foo.com/?ad_loaded=1, the resulting branch would be much deeper.
Overall, the maximum depth using this more lax approach would
increase by magnitudes, from eight to 52. Thus, a branch consists
of all TLD+1s/companies that could perform a task on the client.
An example of a third party tree is given in Figure 3, including
the companies’ names, not TLD+1s. The tree shows the visited
website (adidas.com), the directly embedded third parties (MediaMath,
TrustArc, and Adobe; depth = 0), the partners of the third
parties (fourth parties at depth = 1—e. g., Improve Digital), and
further embedded services, e. g., Akamai (depth = 2) or Instana
(depth = 3). The services that actively set cookies are marked with
a [C]. The example illustrates that by embedding a single service,
many other direct partners of that third party might be embedded
into a website (e. g., MediaMath embeds four partners). Furthermore,
embedding a single third party might implicitly lead to a
long branch of direct and indirect partners of the used third party
(e. g., Adobe, which creates a branch of depth = 4). Note that at depth
four, a service from Adobe is embedded. This is not a loop;
the previously loaded party simply utilizes a different service of
Adobe.
5 RESULTS
We conducted our measurements in the second quarter of 2019 and found around 93 % of the landing pages in our dataset to be accessible. The remaining websites provided services that seem not to be intended for rendering in a web browser (e.g., APIs) or did not exist anymore. In total, we visited over 1.5 million websites that embedded over 37,000 third parties, producing over 4.5 TB of data. More than 17,000 third parties access/set over 59 million cookies across all website visits in our experiment. An overview of our measurements is given in Table 1.

WWW ’20, April 20–24, 2020, Taipei, Taiwan Tobias Urban, Martin Degeling, Thorsten Holz, and Norbert Pohlmann

Table 1: General overview of our three measurement crawls. The number of visited websites and subsites with the corresponding number of observed TPs, cookie setting TPs (C TPs), and used cookies is shown.

Region           Websites  Subsites  TPs     C TPs  Cookies
Europe (EU)      9,267     561,087   12,076  5,393  20.6M
Asia (AS)        9,266     530,356   12,926  5,815  18.2M
N. America (US)  9,333     557,702   13,687  6,115  20.4M
5.1 General Overview
First, we tested how many cookies are set/accessed when visiting subsites in contrast to the respective landing pages, to assess the potential bias in previous studies that focused on landing pages only. In our measurements, as shown in Figure 4, subsites set considerably more (36 %) cookies than the respective landing pages. On average, 55 cookies were set when loading a landing page, while 78 were set when a subsite was accessed. The difference between the number of cookies used by third parties is statistically significant when comparing (1) different categories (ANOVA test p-value < 0.001) and (2) landing pages to subsites (p-value < 0.001). However, we did not find a statistically significant effect of the originating region of the visit or the rank of the website on the cookie setting behavior. Our results show that landing pages of websites exhibit a different cookie usage behavior than the respective subsites, as the latter make more use of third parties. To get a better understanding of the implications of increased cookie usage, we analyze the primary purposes for which cookies are set.
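The one-way ANOVA used for this comparison can be sketched in a few lines. The cookie counts below are made-up illustrative values (chosen to match the reported means of 55 and 78), not our measurement data:

```python
def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic over several groups of counts."""
    all_values = [x for g in groups for x in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the samples around their group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

landing_counts = [40, 60, 55, 70, 50]   # illustrative cookies per landing page
subsite_counts = [70, 85, 78, 90, 67]   # illustrative cookies per subsite
print(round(one_way_anova_f([landing_counts, subsite_counts]), 2))  # -> 12.05
```

The F statistic is then compared against the F distribution with (k-1, N-k) degrees of freedom to obtain the reported p-value.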
5.1.1 Lifetime and Cookie Types. Aside from the number of cookies set, it is interesting to analyze why they are set and how long they stay active in the browser. Overall, we could classify 45.3 % of all observed cookies in terms of distinct used keys. In absolute numbers, we could classify 74 % of all observed cookies. Most of the observed cookies are used to track website visitors or to provide targeted ads (99 %). The "type" of a cookie shows a strong correlation with the number of cookies set for this type (p-value < 0.0001). This means that specific types of cookies are set more often than others. Furthermore, the purpose of a cookie is not related to its lifetime: a χ² test does not show a correlation between "type" and "lifetime". Furthermore, third parties use similar types and lifetimes for their cookies, no matter which website they are embedded in. We did not find a correlation between the "type" or "lifetime" of a cookie and the website's category. Our results show that cookies are overwhelmingly used to track users or to provide them with targeted ads. Furthermore, cookies in all categories use various lifetimes. Given the primary purpose of cookies ("Targeting/Advertising") and the measured increased usage of cookies on subsites, we see that subsites show different behavior in that regard (see also Section 5.2). Tracking users on subsites provides a more comprehensive view of their online activities. For example, visiting the landing page of an online shop does not necessarily indicate which products a user is interested in, but this information can be extracted on subsites.

Figure 4: Mean number of cookies used by each visited landing page and the respective subsites, by category of the visited website. To increase readability, we capped the bars at 500. 1.8 % of sites had a higher number of cookies; this does not impact the computed values.
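The key-based classification can be sketched as a lookup of cookie names against a purpose table. The table below is a tiny hypothetical excerpt; the study relies on the (much larger) Cookiepedia classification:

```python
# Hypothetical excerpt of a cookie-key-to-purpose table (stand-in for
# Cookiepedia, which the study uses).
KEY_PURPOSE = {
    "_ga": "Performance",              # Google Analytics client id
    "_gid": "Performance",
    "IDE": "Targeting/Advertising",    # DoubleClick advertising cookie
    "uuid2": "Targeting/Advertising",
    "PHPSESSID": "Strictly Necessary",
}

def classify(cookie_names):
    """Return purpose counts and the share of cookies we could classify."""
    counts, classified = {}, 0
    for name in cookie_names:
        purpose = KEY_PURPOSE.get(name)
        if purpose is not None:
            classified += 1
            counts[purpose] = counts.get(purpose, 0) + 1
    return counts, classified / len(cookie_names)

counts, share = classify(["_ga", "IDE", "uuid2", "unknown_key"])
print(counts, share)  # {'Performance': 1, 'Targeting/Advertising': 2} 0.75
```

Cookies whose key does not appear in the table remain unclassified, which is why only 45.3 % of distinct keys (74 % of cookies in absolute numbers) could be labeled.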
5.1.2 Legal Compliance. With the introduction of the General Data Protection Regulation (GDPR) [46] and the California Consumer Privacy Act (CCPA) [4], service providers have to be more aware of the business partners they work with. If a business partner tracks users or uses personal information in other ways and is neither located in a GDPR-adequate member state [19] nor a member of the Privacy Shield [47], the parties need to agree on a data processing contract (Article 28 §3 GDPR) ensuring that "appropriate safeguards" (Article 46 §1 GDPR) are taken which enforce the privacy rights of EU citizens. Based on the IP addresses observed in our measurements (see Section 4.3), we analyzed if connections were established to IP addresses associated with countries that are not a member of the EEA or part of the Privacy Shield. In the remainder of the paper, we call these parties "non-adequate" or "possibly problematic" to improve the reading flow of this work. Note that every business can agree by contract that the data of EU citizens is processed according to EU legislation and, therefore, these parties might pose no problem at all (Article 28 §3 GDPR). However, the current legal debate only focuses on TPs as "joint controllers" [11, 20] and does not cover fourth or further parties. We want to highlight that a binary classification of what is compliant with legal regulation and what is not is impossible to make without looking at the specific service agreements between websites and third parties.
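The flagging step can be sketched as a country lookup per observed IP address. The country sets and the IP-to-country mapping below are illustrative stand-ins for a real geolocation database; note also that Privacy Shield membership is actually a per-company certification, so treating the US as a single adequate location is a simplification:

```python
# Illustrative excerpts only; a real crawl would use a geolocation database
# and the official adequacy/Privacy Shield lists.
EEA = {"DE", "FR", "NL", "IE", "SE"}     # excerpt, not the complete EEA
PRIVACY_SHIELD = {"US"}                  # simplification (membership is per company)
IP_TO_COUNTRY = {
    "203.0.113.7": "SG",                 # documentation-range IPs, made up
    "198.51.100.2": "US",
    "192.0.2.9": "DE",
}

def is_adequate(ip):
    """True if the IP maps to an EEA country or a Privacy Shield location."""
    country = IP_TO_COUNTRY.get(ip)
    return country in EEA or country in PRIVACY_SHIELD

requests = ["203.0.113.7", "198.51.100.2", "192.0.2.9"]
flagged = [ip for ip in requests if not is_adequate(ip)]
print(flagged)  # -> ['203.0.113.7']  (Singapore: possibly problematic)
```

As stressed above, a flagged connection is only a potential issue; a data processing contract may render it unproblematic.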
Figure 5 shows the origins and targets of all requests for which service providers need to make sure that they have taken appropriate safeguards. These numbers only refer to our EU measurement, and the results are not violations of the legislation, but provide insights into potential data flows that might conflict with the legal requirements. The origins/targets are based on the observed IP addresses in our measurements.

Measuring Third Party Dynamics in the Field WWW ’20, April 20–24, 2020, Taipei, Taiwan

Figure 5: Origins (left) and targets (right) of requests to services whose IP address is not mapped to an IP in an adequate country.

Overall, 4.7 % of all cookies were set by services outside adequate geolocations, and only 7.1 % of the visited domains (TLD+1) exclusively used TPs located at adequate geolocations. Domains using only adequate locations are located in the US (59 %), followed by Germany (7 %) and the United Kingdom (3 %). In our dataset, Singapore is the most prevalent target of non-adequate requests (26 %), followed by China (5 %) and Australia (5 %). The US is the most common origin of such requests (63 %), followed by China (6 %) and Germany (5 %). We did not find a statistically significant impact of the region on whether or not a third party from a non-adequate geolocation is used. When looking at the services located in possibly non-adequate geolocations, we found that about half only sometimes used (53 %) and the other half always used possibly non-adequate geolocations (47 %). Overall, roughly 10 % of all observed TPs used IP addresses in possibly problematic geolocations.
In the following, we analyze the services that use sometimes adequate and sometimes non-adequate geolocations. This is an interesting subset, as service providers might not be aware that these TPs change their geolocations over time. In contrast, third parties that always send data to possibly problematic geolocations are easier to identify and, therefore, the transfer of data to these non-adequate countries is likely part of the data processing contracts. Requests to TPs that only sometimes used adequate geolocations were most of the time resolved to an EU IP address, but sometimes (< 1 %) to addresses outside the EU. For example, sometimes a similar resource of a third party was requested from different locations in the same measurement: the URL csm.ad-network.foo was resolved to sgp.csm.ad-network.foo in Singapore and to nl.csm.ad-network.foo in the Netherlands. This is challenging as service providers cannot ensure that only EU endpoints of the used third party are used. In our measurement, gstatic.com (a service operated by Google), with 20 % of all inclusions of possibly non-adequate services, and upravel.com (a Russian advertising service), with 15 %, are the top services that might pose a problem to service providers. The next service only accounts for 1 % of these possibly conflicting services (i.e., there is a long tail distribution). One likely explanation is that these are effects of load balancing or similar techniques and that the servers belonging to these IP addresses are controlled by the same third party. However, service providers need to account for this behavior in the data processing contracts with the TP, and the TP must assure that GDPR-adequate data processing rules are in place no matter where their servers are located.
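Spotting these "sometimes adequate" services can be sketched by grouping the observed resolutions per third-party host and checking whether the adequacy flag is mixed. Host names, countries, and the adequacy set are illustrative:

```python
from collections import defaultdict

ADEQUATE = {"DE", "NL", "FR", "US"}  # illustrative adequacy set

# (resolved host, country of the resolved IP) per observed request; made up.
observations = [
    ("csm.ad-network.foo", "NL"),
    ("csm.ad-network.foo", "SG"),    # same resource, resolved to Singapore
    ("cdn.example.net", "DE"),
]

# Collect, per third party, whether its resolutions were adequate or not.
by_tp = defaultdict(set)
for host, country in observations:
    by_tp[host].add(country in ADEQUATE)

# A TP with both True and False flags switched between adequate and
# non-adequate geolocations within the measurement.
mixed = sorted(host for host, flags in by_tp.items() if len(flags) == 2)
print(mixed)  # -> ['csm.ad-network.foo']
```

Such mixed services are exactly the ones a provider cannot pin to EU endpoints by configuration alone.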
Summary. Our results show that measuring only landing pages of websites might only reveal a fraction of the websites' real use of third parties. Furthermore, we found that websites make extensive use of cookies, primarily to serve ads or to track users, and we observed that some embedded TPs might conflict with current legislation. To further investigate the effects of visiting subsites and not only landing pages, it is interesting to look at further areas that might be affected by our findings.
5.2 Replication and Comparison
To provide a more comprehensive overview of our measurements in comparison with previous work, we tried to replicate the main findings of previous work using our data set. We differentiate between studies we could fully replicate using our data (●, see column "Rep." in Table 2) and studies we could partly replicate (◐). Furthermore, we indicate ("Res.") if we could produce similar results (✓). To reproduce the results, we analyzed the landing pages of each website (if the paper did so) or used the same amount of subsites. If we could replicate the results, we measured them on all visited subsites to test whether these studies measured a comprehensive, generalizable view or whether, as shown in our study, subsites show a different behavior ("Scales"). We differentiate whether visiting subsites makes a measurable difference in contrast to only visiting landing pages (✓). The results are given in Table 2. Our replication studies do not aim to replicate all results of previous work; we only focus on the main takeaways and results closely related to our work. We do not claim that our replications are sound or complete, but we tried to faithfully replicate previous work as well as possible using our data set.
In contrast to Dabrowski et al. [6], and as previously stated, we could not find statistical evidence that the originating region of a request influences cookie setting practices in general. On the one hand, this could be a result of different experimental setups, as we tried to maximize the "cookie setting behavior" of each website to achieve more generalizable results. Dabrowski et al. used a headless browser that can be easily detected by websites and might therefore affect the loaded TPs (e.g., ads might not be loaded to counter ad fraud). On the other hand, we performed our experiment on a larger scale and interacted (e.g., scrolling) with the websites, which could fundamentally affect the results.
Furthermore, we found that subsites set significantly more cookies than the respective landing pages. As for the results of Sørensen et al. [45], we could verify that the GDPR has no immediate effect on third party usage. Sanchez-Rola et al. [42] show that opting out of cookies often has no measurable effect on cookie setting practices in the field. We could only partly reproduce this work as we never interacted with any cookie banners, but our results show
Table 2: Overview of previous work we tried to replicate (Rep.), the scale of the work ("LP" := landing page, "SB" := subsite), the results (Res.) of our replication, and if these experiments show different behavior in a vertical setup (Scales). ◐ := partly replicated, ○ := not replicated.

1st Author    Ref.  Year  Venue    Scale                    Main finding                                                                                   Rep.  Res.  Scales
Dabrowski     [6]   2019  PAM      LP                       Websites set 49 % less cookies if users located in the EU visit them.
Sørensen      [45]  2019  WWW      LP + 9 SB                Effects of the GDPR on third-party usage are not definite.
Sanchez-Rola  [42]  2019  AsiaCCS  LP                       Tracking is often still present even if opted out.                                             ◐
Urban         [50]  2020  AsiaCCS  LP + 3–5 SB              Cookie syncing reduced by around 40 %.                                                         ◐     ✓     ✓
Merzdovnik    [34]  2017  EuroS&P  LP + 2 SB                State-of-the-art tracking blocking tools can limit user tracking but still have blind spots.   ◐     ✓     ✓
Englehardt    [9]   2016  CCS      LP                       Websites use various fingerprinting methods.                                                   ○
Kumar         [28]  2017  WWW      LP                       Implicitly included TPs pose a challenge when upgrading to HTTPS.
Ikram         [21]  2019  WWW      LP                       Implicitly included parties might pose a security threat.                                            ✓     ✓
Iordanou      [23]  2018  IMC      user browsing behaviour  In the EU, tracking data is transferred across countries but rarely leaves the EU.                   ✓     ✓
that cookies are still widely used and that there are no regional differences, although in the EU users should opt in before cookies are used. We used data from our prior work collected before the GDPR became effective [50]. Using this data and comparing the regional data in our experiments, we could verify that cookie syncing seems to be influenced by different legislation. Scaled to our collected data, we found an increase of cookie syncing activities on subsites in contrast to landing pages. This replication cannot be seen as representative, as our measurement misses essential features, especially to identify IDs, to assess cookie syncing, since we only used one profile in each region.
To test whether our results of increased cookie usage on subsites also apply to user tracking, we use the numbers presented by Merzdovnik et al. [34] on the presence of trackers on websites as a baseline. To test if a tracker is active on a website, we use the EasyPrivacy list [8], a list combining URLs of known trackers. However, we do not test whether anti-tracking tools are useful or not. In our measurement, we found that trackers mostly occur on subsites in comparison to the respective landing pages (an increase of approx. 6 %). 2.5 % of the measured websites do not embed any trackers on the landing page but use trackers on subsites. Overall, we could show that tracking on subsites increases and that future work concerning this area should include subsites in their measurements. In terms of overall tracking occurrence, we produced results comparable to the "plain" profile used by Merzdovnik et al.
Finally, we tested the prevalence of device fingerprinting scripts in our data set, as previously studied by Englehardt et al. [9]. As the scripts identified by Englehardt et al. are probably outdated (we only found four of them in our total dataset), we used the popular "Fingerprint2" library [12] to test for the presence of such trackers. Hence, our results can be seen as a lower bound, as we only test for the presence of one script. We identified a mean increase of device fingerprinting of 25 % on subsites in contrast to the respective landing pages. In all three measurements, we found 13 domains (0.14 %) which did not use the script on the front page but on subsites. Overall, we found the tracking script on 0.15 % of the landing pages, while Englehardt et al. identified device fingerprinting on 1.8 % and the most common script on 0.45 % of the analyzed websites.
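The tracker-presence test can be sketched as matching the domains of observed requests against a filter list. The list below is a tiny hypothetical stand-in; real EasyPrivacy rules use Adblock Plus syntax with many more rule types than the plain domain match shown here:

```python
# Hypothetical stand-in for an EasyPrivacy-style list of tracker domains.
FILTER_DOMAINS = {"tracker.example", "metrics.example"}

def etld1(host):
    """Crude TLD+1 heuristic; a real crawl would use the Public Suffix List."""
    return ".".join(host.split(".")[-2:])

def has_tracker(request_hosts):
    """True if any requested host's TLD+1 matches a known tracker domain."""
    return any(etld1(h) in FILTER_DOMAINS for h in request_hosts)

landing = ["cdn.shop.example"]
subsite = ["cdn.shop.example", "px.tracker.example"]
print(has_tracker(landing), has_tracker(subsite))  # -> False True
```

A website like the hypothetical one above would count toward the 2.5 % that embed no tracker on the landing page but do so on a subsite.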
Summary. In this section, we demonstrated that only measuring landing pages hides the scale of different phenomena observable on the Web. Furthermore, the behavior of TPs differs on different subsites, which raises the question to what extent service providers are in control of the TPs embedded into their services. To tackle this challenge, one needs to understand the relations between TPs and how deterministic the set of third parties loaded into a service is.
5.3 Third Party Trees
As described above, we are interested in understanding dependencies between third parties and the challenges that may result for service providers and users. Therefore, we created third party trees (see Section 4.5) to better understand the implications of embedding a single third party into a website. Figure 6 shows the depth of the measured third party trees by category of the visited websites. Remember that each visited website (i.e., distinct URL) produced its own TPT, and the directly embedded third parties are at depth zero. The average third party branch has a depth of one (median also one), and the deepest branch of a tree we found has a depth of eight. In total, 43.0 % of the observed branches have a depth of one or more, which means that these trees include parties that are not necessarily known to the service provider. Therefore, several third parties (in terms of TLDs+1, not distinct companies) load at least one additional partner. Each node in the trees has, on average, 0.9 (SD 37) direct children (breadth), with a maximum of 361, and each branch comprises, on average, 0.9 (SD 6.4) different companies (max. 127). In total, 2,901 TPs (10 %) are embedded that never included any child. The depth of a tree is impacted by the category of a website (p-value < 0.0001). Similar to the results of previous work, "News" websites tend to use more cookies and third parties [45]. As over 40 % of all TPs load at least one additional partner, it is interesting to look at whether these use cookies, for example, to track users or to serve them targeted ads.
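The depth and breadth statistics can be sketched on a TPT represented as nested dicts (company name → children). The tree below mirrors the Figure 3 example; the representation itself is an illustrative choice, not the study's data structure:

```python
# Third party tree as nested dicts; mirrors the Figure 3 example.
TREE = {
    "MediaMath": {"Improve Digital": {}, "PubMatic": {}, "OpenX": {},
                  "Index Exchange": {}},
    "TrustArc": {},
    "Adobe": {"Tealium": {"Akamai": {"Instana": {"Adobe": {}}}}},
}

def max_depth(children, depth=0):
    """Longest chain below this node; directly embedded TPs sit at depth 0."""
    if not children:
        return depth
    return max(max_depth(sub, depth + 1) for sub in children.values())

def breadths(children):
    """Number of direct children of every node, including the website root."""
    out = [len(children)]
    for sub in children.values():
        out.extend(breadths(sub))
    return out

print(max_depth(TREE) - 1)  # deepest branch depth -> 4 (the Adobe chain)
print(max(breadths(TREE)))  # widest node -> 4 (MediaMath's partners)
```

Aggregating these two functions over all measured trees yields the depth and breadth distributions reported above.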
Figure 6: Relative distribution of the measured third party
tree depth split by the websites’ categories.
5.3.1 Cookies Set in Trees. Not every party in each TPT, more specifically in each branch, will necessarily set a cookie. Therefore, we analyzed the depth of the cookie setting parties and the overall number of cookies set in each branch. We limit ourselves to cookies but expect, based on our results presented in Section 5.2, that other privacy-invasive techniques would likely produce similar results. Starting with the depth of set cookies: on average, 1.5 parties in each branch do not set a cookie. In 48 % of all branches, no party set a cookie, and only in 125 branches (approx. 0.002 %) all parties did. The website's category and its rank both have a statistically significant effect on the number of cookies set in each branch (both p-values < 0.001). Furthermore, we found that deeper branches do not necessarily, in relative numbers, lead to more cookies being set. As for the depth at which cookies are set, we found that most cookies (72 %) are set by the fourth party (depth = 1). The main reason why most cookies are set at depth one is likely that most trees are of depth one. Hence, deeper trees occur less often and, consequently, in absolute numbers, set fewer cookies.
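The depth aggregation can be sketched by walking each branch and counting cookie-setting parties per depth. Branches are encoded as lists of (company, sets_cookie) pairs where the list index is the depth; the data is illustrative:

```python
from collections import Counter

# Illustrative branches: index in the list = depth of the party.
branches = [
    [("MediaMath", True), ("PubMatic", True)],               # depths 0 and 1
    [("Adobe", False), ("Tealium", True), ("Akamai", True)],  # depths 0, 1, 2
    [("TrustArc", False)],                                    # sets no cookie
]

cookie_depths = Counter()
for branch in branches:
    for depth, (_, sets_cookie) in enumerate(branch):
        if sets_cookie:
            cookie_depths[depth] += 1

print(dict(cookie_depths))  # -> {0: 1, 1: 2, 2: 1}
```

Normalizing such counts over all branches gives the reported shares, e.g., the 72 % of cookies set at depth one.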
Overall, slightly more than 18 % of cookies are set at a depth larger than one (fifth party or higher). If service providers want to choose services that do not use cookies, for example, because they want to protect their customers from tracking, they face the problem that often the fourth party sets a cookie. Therefore, service providers have to carefully monitor all embedded third parties for such behavior. Since roughly one-fifth of cookies are set at a depth larger than one, it is worth investigating how much control or knowledge service providers have about these parties. TPs that always include the same third parties can be seen as more predictable because the third parties do not change, and service providers know which third parties will be included in their websites. Furthermore, TPs that do not create deep branches are easier for service providers to assess since hierarchies and dependencies are easier to understand. Therefore, we analyze the determinism of branches generated by directly embedded TPs.
5.3.2 Determinism of Third Party Trees. The determinism of each branch that is generated by an embedded TP is important if service providers want to understand which TPs are loaded and who is responsible for loading them. If it is known, before loading the third party object, which other third parties might be embedded, service providers can evaluate the potential risks of a TP for their users. Therefore, we tested the fluctuation of embedded companies for each TP in the measured trees. First, we tested the fluctuation within each visited website (TLD+1) and its subsites, meaning that we test which third parties are embedded into the visited website by each observed third party on a specific subsite in a specific region. Secondly, we tested the fluctuation across all websites and all regions, meaning that we test whether a wider view of a third party provides more insight into the further loaded parties or whether they show different behavior on different websites.

Figure 7: Children included in only some of the branches (fluctuation) created by a specific TP within each visited site (grey) and across all sites (black).
Half of the branches (50.4 %) have at least one fluctuating partner in them. Figure 7 shows the measured fluctuation of a TP within (gray) and across (black) the visited sites. The x-axis shows the relative amount of fluctuating companies in all branches of an embedded third party. Zero means that branches of this TP always include the same third parties, and six means that six distinct TPs only occurred in some of the branches. These numbers exclude third parties that never had any children, because these would naturally be zero and might lead to a false conclusion about the determinism of TPs. The results show that almost two-thirds (62 %) of third parties that embed other third parties use fluctuating partners (e.g., due to real-time ad bidding) when loaded on different subsites. Across all regions, we see that there is a long tail distribution of companies that only occur in some of the branches; note the increase at more than six new children. Regarding the originating region in which the measurement was performed, we found no statistically significant impact on the fluctuation. However, the weighted mean (local) fluctuation was highest in the US (5.78) and lowest in the EU (5.49).
On a global scale, we find a different picture. We see that the global fluctuation in the EU is more distributed than in other regions. We found no statistical evidence that the region affects the local or global fluctuation of children. In conclusion, we see that measuring TPs on a global scale does not necessarily provide a generalizable view, as some TPs behave differently on different sites (e.g., due to the advertised products or partners in different regions). Our results show that the list of third parties embedded in a website is not deterministic, which makes it challenging for service providers to account for all TPs that might be present on their websites. Embedding some third parties leads to an often
Figure 8: Resulting branch depth of objects embedded by different companies (scaled for each individual company).
changing set of embedded third parties (e.g., different TPs providing ads). However, service providers have only little control over these processes, as they often depend on third parties to provide their service. As the (non-)determinism of these trees is related to the embedded TP, it is interesting to analyze the depth of trees generated by different TPs (companies).
5.3.3 Companies. Figure 8 shows the average, scaled branch depth that is created by embedding a single object of different companies. All values are scaled for each company, not overall, and include all TLDs+1 operated by the company. Thus, Figure 8 presents the resulting depth for each company and does not account for the overall occurrence of each company. Furthermore, the figure only lists the top 15 companies regarding the absolute number of embeddings of these companies. All remaining companies are combined in the category "Other". The top companies account for over 98 % of absolute third party embeddings. In general, embedding most TPs results in short trees of depth zero. However, ad-tech companies, the primary source of financing for many websites, show a more widespread resulting TPT depth (e.g., PubMatic or Rubicon Project), which reduces the options to choose partners that do not load many other partners. We found statistical significance that the embedded company impacts the depth of the generated tree (p-value 0.008). Regarding the position of companies in the trees, we found that larger companies (e.g., Google or Facebook) occur mostly at depth zero (absolute numbers), while service providers of TPs (e.g., companies that counter ad fraud) occur deeper in the trees.
Summary. Our results indicate that it is quite challenging for service providers to keep track of all third parties that might be embedded into their services. Furthermore, before loading the directly embedded TP, it is often not certain which other parties might be loaded; especially ad networks load various fluctuating partners.
6 LIMITATIONS
In the following, we discuss the limitations of our work. We use the classification of Cookiepedia, which might be wrong to some extent and is incomplete. We could only classify slightly over 45 % of all observed cookies but show that an overwhelming majority (99 %) tracks users or serves targeted advertisements. We mapped requests from different services to a single company, if possible. If we observed multiple requests to domains owned by one company (e.g., ads.foo.com and fonts.foo.com), we collapsed them into a single request if they occurred in sequence. Our measurement platform, a customized OpenWPM instance, does not interact with any cookie banners present on the visited websites. Hence, we do not capture cookies set by third parties that honor opt-in choices of (European) users. However, previous work demonstrated that cookie consent notices often do not offer choices to opt in [51], do not work at all [42], and that the used libraries often are not compliant with current legislation [7]. Therefore, our results are a lower bound since (1) we shortened the TPTs and (2) some cookies might only be used after affirmative action of the user.
7 DISCUSSION
We have shown the challenges service providers face when they rely on third-party code and try to account for which third parties are loaded when users use their service. It is the high dynamics and previously nominal regulation of the Web that now present challenges to service providers. While service providers might carefully select the directly embedded third parties (e.g., ad networks), they cannot control which third parties might get included when these third parties load their content (e.g., due to real-time ad bidding). The primary tool website providers have to solve these challenges are data processing contracts that include indirectly embedded third parties. From a research perspective, we have shown that a simple horizontal scaling of websites to visit (i.e., websites from a given toplist) is not sufficient to measure a phenomenon of interest. This means that future work should (1) scale their experiments vertically and (2) re-visit previous results of different Web measurement areas to measure the given challenges adequately. Finally, our assessment of the purposes of cookies underlines the dire need for privacy protection mechanisms to limit cookie-based tracking, which is currently promoted by several browser vendors (e.g., Firefox [35], Chrome [18], and Safari [3]).
8 CONCLUSION
In this work, we have analyzed the cookie setting practices of the top 10k websites on the Web. We found that 99 % of all cookies we could classify were set with the intention to track users or to serve them targeted ads. Furthermore, we modeled third party trees, which assemble all third parties embedded into a website and the loading dependencies among them. By analyzing the third party trees, we found that the median depth of such trees is one (max. eight), that there is a severe fluctuation of children in different branches with the same parent node (third party), that especially ad networks result in longer tree branches, and that only 7 % of all visited websites (TLD+1) never embedded a third party that might pose possible legal problems. Moreover, we have shown that studies that only measure landing pages of websites miss a substantial amount of embedded third parties and cookies set.
ACKNOWLEDGMENTS
This work was partially supported by the Ministry of Culture and
Science of the State of North Rhine-Westphalia (MKW grants 005-
1703-0021 “MEwM” and Research Training Group NERD.nrw). We
would like to thank Cybot (Cookiebot) for their support.
REFERENCES
[1]
Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank
Piessens, and Bart Preneel. 2013. FPDetective: Dusting the Web for Fingerprinters.
In Proceedings of the 2013 ACM Conference on Computer and Communications
Security (CCS ’13). ACM Press, New York, NY, USA, 1129–1140.
[2]
Adobe Inc. 2017. Flash & The Future of Interactive Content. https://theblog.
adobe.com/adobe-ash- update/ Accessed: 2019-10-05.
[3]
Apple Inc. 2019. Intelligent Tracking Prevention 2.3. https://webkit.org/blog/
9521/intelligent-tracking- prevention-2-3/ Accessed: 2019-04-24.
[4] California State Legislature. 2020. The California Consumer Privacy Act.
[5]
Cliqz. 2018. WhoTracks.me Data - Tracker database. https://whotracks.me/blog/
gdpr-what- happened.html Accessed: 2019-04-24.
[6]
Adrian Dabrowski, Georg Merzdovnik, Johanna Ullrich, Gerald Sendera, and
Edgar Weippl. 2019. Measuring Cookies and Web Privacy in a Post-GDPR World.
In Proceedings of the 2019 Conference on Passive and Active Measurement (PAM
’19). Springer-Verlag, Cham, 14.
[7]
Martin Degeling, Christine Utz, Christoper Lentzsch, Henry Hosseini, Florian
Schaub, and Thorsten Holz. 2019. We Value Your Privacy ... Now Take Some
Cookies: Measuring the GDPR’s Impact on Web Privacy. In Proceedings of the
2019 Symposium on Network and Distributed System Security (NDSS ’19). Internet
Society, San Diego, California, USA, 20.
[8]
EasyList. 2019. EasyPrivacy. https://easylist.to/easylist/easyprivacy.txt Accessed:
2019-04-24.
[9]
Steven Englehardt and Arvind Narayanan. 2016. Online tracking: A 1-million-
site measurement and analysis. In Proceedings of the 2016 ACM Conference on
Computer and Communications Security (CCS ’16). ACM Press, New York, NY,
USA, 1388–1401.
[10]
Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman,
Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. 2015. Cookies That
Give You Away: The Surveillance Implications of Web Tracking. In Proceedings
of the 24th World Wide Web Conference (WWW ’15). ACM Press, New York, New
York, USA, 289–299.
[11]
European Court of Justice. 2018. Unabhängiges Landeszentrum für Datenschutz
Schleswig-Holstein vs Wirtschaftsakademie Schleswig-Holstein GmbH - Case
C-210/16. http://curia.europa.eu/juris/document/document.jsf?text=&docid=202543&doclang=EN&d&part=1&cid=341550
[12]
Fingerprint.js. 2019. Fingerprint.js is the most advanced open-source fraud
detection JS library. https://fingerprintjs.com/ Accessed: 2019-04-24.
[13]
David Formby, Preethi Srinivasan, Andrew Leonard, Jonathan Rogers, and Ra-
heem Beyah. 2016. Who’s in Control of Your Control System? Device Finger-
printing for Cyber-Physical Systems. In Proceedings of the 2016 Symposium on
Network and Distributed System Security (NDSS ’16). Internet Society, San Diego,
California, USA, 15.
[14]
Imane Fouad, Nataliia Bielova, Arnaud Legout, and Natasa Sarajanovic-Djukic.
2020. Missed by Filter Lists: Detecting Unknown Third-Party Trackers with Invis-
ible Pixels. In Proceedings of the 20th Privacy Enhancing Technologies Symposium
(PETS ’20). Springer-Verlag, Berlin, Heidelberg. https://hal.inria.fr/hal-01943496
[15]
Gertjan Franken, Tom Van Goethem, and Wouter Joosen. 2018. Who Left Open
the Cookie Jar? A Comprehensive Evaluation of Third-party Cookie Policies. In
Proceedings of the 27th USENIX Security Symposium (SEC ’18). USENIX Association,
Berkeley, CA, USA, 151–168.
[16]
Alejandro Gómez-Boix, Pierre Laperdrix, and Benoit Baudry. 2018. Hiding in
the Crowd: An Analysis of the Effectiveness of Browser Fingerprinting at Large
Scale. In Proceedings of the 2018 World Wide Web Conference (WWW ’18). ACM
Press, New York, New York, USA, 11.
[17]
Roberto Gonzalez, Lili Jiang, Mohamed Ahmed, Miriam Marciel, Ruben Cuevas,
Hassan Metwalley, and Saverio Niccolini. 2017. The cookie recipe: Untangling the
use of cookies in the wild. In Proceedings of the 2017 Network Traffic Measurement
and Analysis Conference (TMA ’17). IEEE, Piscataway, NJ, 1–9.
[18]
Google Inc. 2020. Building a more private web: A path towards making third party
cookies obsolete. https://blog.chromium.org/2020/01/building-more-private-
web-path-towards.html Accessed: 2020-01-15.
[19]
Government Digital Service. 2019. Countries in the EU and EEA. https://www.
gov.uk/eu-eea Accessed: 2019-10-05.
[20]
Higher Regional Court, Düsseldorf, Germany. 2018. Opinion of Advocate General
Bobek on Fashion ID GmbH & Co. KG vs Verbraucherzentrale NRW eV - Case
C-40/17. https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=ecli:ECLI:EU:C:2018:1039
[21]
Muhammad Ikram, Rahat Masood, Gareth Tyson, Mohamed Ali Kaafar, Noha
Loizon, and Roya Ensafi. 2019. The Chain of Implicit Trust: An Analysis of the
Web Third-Party Resources Loading. In Proceedings of the 2019 World Wide Web
Conference (WWW ’19). ACM Press, New York, NY, USA, 2851–2857.
[22]
International Chamber of Commerce UK. 2012. Cookie guide. International
Chamber of Commerce UK.
[23] Costas Iordanou, Georgios Smaragdakis, Ingmar Poese, and Nikolaos Laoutaris.
2018. Tracing Cross Border Web Tracking. In Proceedings of the 2018 Internet
Measurement Conference (IMC ’18). ACM Press, New York, NY, USA, 329–342.
[24]
Hugo Jonker, Benjamin Krumnow, and Gabry Vlot. 2019. Fingerprint Surface-
Based Detection of Web Bot Detectors. In Proceedings of the 2019 European Sym-
posium on Research in Computer Security (ESORICS ’19). Springer-Verlag, Cham,
586–605.
[25]
Mohammad Taha Khan, Joe DeBlasio, Geoffrey M. Voelker, Alex C. Snoeren,
Chris Kanich, and Narseo Vallina-Rodriguez. 2018. An Empirical Analysis of the
Commercial VPN Ecosystem. In Proceedings of the 2018 Internet Measurement
Conference (IMC ’18). ACM Press, New York, NY, USA, 15.
[26]
Radhesh Krishnan Konoth, Emanuele Vineti, Veelasha Moonsamy, Martina
Lindorfer, Christopher Kruegel, Herbert Bos, and Giovanni Vigna. 2018.
MineSweeper: An In-depth Look into Drive-by Cryptocurrency Mining and
Its Defense. In Proceedings of the 2018 ACM Conference on Computer and Commu-
nications Security (CCS ’18). ACM Press, New York, NY, USA, 1714–1730.
[27]
David M. Kristol. 2001. HTTP Cookies: Standards, Privacy, and Politics. ACM
Trans. Internet Technol. 1, 2 (Nov. 2001), 151–198.
[28]
S. Kumar, S. S. Rautaray, and M. Pandey. 2017. Malvertising: A case study based on
analysis of possible solutions. In Proceedings of the 2017 International Conference
on Inventive Computing and Informatics (ICICI ’17). IEEE, San Francisco, United
States, 288–291.
[29]
Andreas Kurtz, Hugo Gascon, Tobias Becker, Konrad Rieck, and Felix C. Freiling.
2016. Fingerprinting Mobile Devices Using Personalized Configurations.
Proceedings of the Privacy Enhancing Technologies Symposium 2016, 1 (2016), 4–19.
[30]
Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Ko-
rczyński, and Wouter Joosen. 2019. Tranco: A Research-Oriented Top Sites
Ranking Hardened Against Manipulation. In Proceedings of the 26th Annual Net-
work and Distributed System Security Symposium (NDSS ’19). Internet Society,
San Diego, California, USA, 20.
[31]
MaxMind Inc. 2019. GeoIP Databases & Services. https://www.maxmind.com/
en/geoip2-services-and-databases Accessed: 2019-10-05.
[32]
J. R. Mayer and J. C. Mitchell. 2012. Third-Party Web Tracking: Policy and
Technology. In Proceedings of the 2012 IEEE Symposium on Security and Privacy
(S&P ’12). IEEE, San Francisco, United States, 413–427.
[33]
McAfee LLC. 2019. Customer URL Ticketing System. https://trustedsource.org/
Accessed: 2019-10-05.
[34]
G. Merzdovnik, M. Huber, D. Buhov, N. Nikiforakis, S. Neuner, M. Schmiedecker,
and E. Weippl. 2017. Block Me If You Can: A Large-Scale Study of Tracker-
Blocking Tools. In Proceedings of the 2017 IEEE European Symposium on Security
and Privacy (EuroS&P ’17). IEEE, San Francisco, United States, 319–333.
[35]
Mozilla Corporation. 2019. Today’s Firefox Blocks Third-Party Tracking Cookies
and Cryptomining by Default. https://blog.mozilla.org/blog/2019/09/03/todays-firefox-blocks-third-party-tracking-cookies-and-cryptomining-by-default/ Accessed: 2019-04-24.
[36]
Nick Nikiforakis, Wouter Joosen, and Benjamin Livshits. 2015. PriVaricator:
Deceiving Fingerprinters with Little White Lies. In Proceedings of the 24th World
Wide Web Conference (WWW ’15). ACM Press, New York, New York, USA, 820–
830.
[37]
N. Nikiforakis, A. Kapravelos, W. Joosen, C. Kruegel, F. Piessens, and G. Vigna.
2013. Cookieless Monster: Exploring the Ecosystem of Web-Based Device Fin-
gerprinting. In Proceedings of the 2013 IEEE Symposium on Security and Privacy
(S&P ’13). IEEE, San Francisco, United States, 541–555.
[38]
OneTrust LLC. 2019. Cookiepedia. https://cookiepedia.co.uk/ Accessed: 2019-
10-05.
[39]
Xiang Pan, Yinzhi Cao, and Yan Chen. 2015. I Do Not Know What You Visited Last
Summer: Protecting Users from Third-party Web Tracking with TrackingFree
Browser. In Proceedings of the 2015 Symposium on Network and Distributed System
Security (NDSS ’15). Internet Society, San Diego, California, USA, 15.
[40]
Python. 2019. tldextract 2.2.1. https://pypi.org/project/tldextract/ Accessed:
2019-04-24.
[41]
Jan Rüth, Torsten Zimmermann, Konrad Wolsing, and Oliver Hohlfeld. 2018.
Digging into Browser-based Crypto Mining. In Proceedings of the 2018 Internet
Measurement Conference (IMC ’18). ACM Press, New York, NY, USA, 70–76.
[42]
Iskander Sanchez-Rola, Matteo Dell’Amico, Platon Kotzias, Davide Balzarotti,
Leyla Bilge, Pierre-Antoine Vervier, and Igor Santos. 2019. Can I Opt Out Yet?:
GDPR and the Global Illusion of Cookie Control. In Proceedings of the 2019 ACM
Symposium on Information, Computer and Communications Security (AsiaCCS
’19). ACM Press, New York, New York, USA, 340–351. https://doi.org/10.1145/3321705.3329806
[43]
Muazzam Siddiqui, Morgan C. Wang, and Joohan Lee. 2008. Data Mining Methods
for Malware Detection Using Instruction Sequences. In Proceedings of the 26th
International Conference on Artificial Intelligence and Applications (AIA ’08). ACTA
Press, Anaheim, CA, USA, 358–363.
[44]
Aditya K Sood and Richard J Enbody. 2011. Malvertising – exploiting web
advertising. Computer Fraud & Security 2011, 4 (2011), 11 – 16.
[45]
Jannick Kirk Sørensen and Sokol Kosta. 2019. Before and After GDPR: The
Changes in Third Party Presence at Public and Private European Websites. In
Proceedings of the 2019 World Wide Web Conference (WWW ’19). ACM Press, New
York, New York, USA, 11.
[46]
The European Parliament and the Council of the European Union. 2016. Regula-
tion (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016
on the protection of natural persons with regard to the processing of personal
data and on the free movement of such data, and repealing Directive 95/46/EC
(General Data Protection Regulation). Official Journal of the European Union, L
119/1.
[47]
The International Trade Administration. 2019. The Privacy Shield. https:
//www.privacyshield.gov/ Accessed: 2019-10-05.
[48]
Tobias Urban, Martin Degeling, Thorsten Holz, and Norbert Pohlmann. 2019.
“Your Hashed IP Address: Ubuntu.”: Perspectives on Transparency Tools for Online
Advertising. In Proceedings of the 35th Annual Computer Security Applications
Conference (ACSAC ’19). ACM Press, New York, NY, USA, 702–717.
[49]
Tobias Urban, Dennis Tatang, Martin Degeling, Thorsten Holz, and Norbert
Pohlmann. 2019. A Study on Subject Data Access in Online Advertising After the
GDPR. In Data Privacy Management, Cryptocurrencies and Blockchain Technology
(DPM’19), Cristina Pérez-Solà, Guillermo Navarro-Arribas, Alex Biryukov, and
Joaquin Garcia-Alfaro (Eds.). Springer International Publishing, Cham, 61–79.
[50]
Tobias Urban, Dennis Tatang, Martin Degeling, Thorsten Holz, and Norbert
Pohlmann. 2020. Measuring the Impact of the GDPR on Data Sharing. In Proceed-
ings of the 15th ACM Symposium on Information, Computer and Communications
Security (AsiaCCS ’20). ACM Press, New York, NY, USA.
[51]
Christine Utz, Martin Degeling, Sascha Fahl, Florian Schaub, and Thorsten Holz.
2019. (Un)informed Consent: Studying GDPR Consent Notices in the Field.
In Proceedings of the 2019 ACM Conference on Computer and Communications
Security (CCS ’19). ACM Press, New York, NY, USA, 973–990.
[52]
Antoine Vastel, Pierre Laperdrix, Walter Rudametkin, and Romain Rouvoy. 2018.
FP-STALKER: Tracking Browser Fingerprint Evolutions. In Proceedings of the
39th IEEE Symposium on Security and Privacy (S&P ’18). IEEE, San Francisco,
United States, 728–741.
[53]
Q. Xu, R. Zheng, W. Saad, and Z. Han. 2016. Device Fingerprinting in Wireless
Networks: Challenges and Opportunities. IEEE Communications Surveys Tutorials
18, 1 (2016), 94–104.
... Sites and Pages. In this work, we use the term site to denote the registerable part of a given domain, often referred to as "extended Top Level Domain plus one" (eTLD+1) [6,10,23,42]. For example, for the URL https://www.bar.com/ the eTLD+1 is bar.com, and for https://foo.co.uk it is foo.co.uk. ...
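The eTLD+1 extraction described in this excerpt can be sketched as follows. In practice one would use the full Public Suffix List (e.g., via the tldextract package cited as [40]); the tiny suffix set below is a hypothetical stand-in for illustration only.

```python
from urllib.parse import urlparse

# Toy subset of the Public Suffix List; a real implementation loads the full list.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk"}

def etld_plus_one(url: str) -> str:
    """Return the registerable part ("site", eTLD+1) of a URL's hostname."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Scan from the longest candidate suffix downward; keep one extra label.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return host

print(etld_plus_one("https://www.bar.com/"))  # bar.com
print(etld_plus_one("https://foo.co.uk"))     # foo.co.uk
```

This reproduces the two examples from the text: www.bar.com collapses to bar.com, while foo.co.uk is already an eTLD+1 because co.uk is a public suffix.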
... Cookies were implemented to allow stateful communication in the otherwise stateless HTTP protocol [3]. Hence, they are often used to manage sessions, to store persistent client-side data, and frequently for user tracking [42]. Various organizations have proposed cookie categories to account for the wide range of use cases and to bring some transparency to their usage (e.g., the IAB Europe proposed 17 categories of cookies [19]). ...
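The client-side state that cookies add to HTTP can be illustrated with Python's standard library; the cookie name and values below are made up for the example.

```python
from http.cookies import SimpleCookie

# Parse a (hypothetical) Set-Cookie header as a browser would receive it.
cookie = SimpleCookie()
cookie.load("sessionid=abc123; Path=/; Max-Age=31536000; Secure")

morsel = cookie["sessionid"]
print(morsel.value)        # abc123
print(morsel["max-age"])   # 31536000
```

A Max-Age of one year, as in this sketch, is typical of persistent identifiers; session-management cookies usually expire when the browser closes.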
... Previous work has shown that subpages behave differently from the respective landing page [2,9,42]. Therefore, to build the final set of pages to visit, we open each randomly selected site's landing page and collect 15 subpages (i.e., first-party links on the page) from it. ...
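The subpage-selection step described in this excerpt can be sketched as follows, using only the standard library. Note the simplification: first-partyness is decided by exact hostname match here, whereas the study compares sites (eTLD+1); the HTML snippet and domains are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def first_party_subpages(landing_url, html, limit=15):
    """Return up to `limit` distinct first-party links found on the landing page."""
    site = urlparse(landing_url).hostname
    parser = LinkCollector()
    parser.feed(html)
    subpages = []
    for href in parser.links:
        absolute = urljoin(landing_url, href)
        if urlparse(absolute).hostname == site and absolute not in subpages:
            subpages.append(absolute)
    return subpages[:limit]

html = '<a href="/about">About</a><a href="https://ads.example.net/x">Ad</a>'
print(first_party_subpages("https://bar.com/", html))  # ['https://bar.com/about']
```

The third-party ad link is filtered out, leaving only the first-party subpage for the crawl queue.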
Article
Full-text available
Cookie notices (or cookie banners) are a popular mechanism for websites to provide (European) Internet users a tool to choose which cookies the site may set. Banner implementations range from merely providing information that a site uses cookies, through offering the choice to accept or deny all cookies, to allowing fine-grained control of cookie usage. Users frequently get annoyed by the banners' pervasiveness as they interrupt ''natural'' browsing on the Web. As a remedy, different browser extensions have been developed to automate the interaction with cookie banners. In this work, we perform a large-scale measurement study comparing the effectiveness of extensions for ''cookie banner interaction.'' We configured the extensions to express different privacy choices (e.g., accepting all cookies, accepting functional cookies, or rejecting all cookies) to understand their capabilities to execute a user's preferences. The results show statistically significant differences in which cookies are set, how many of them are set, and which types are set---even for extensions that aim to implement the same cookie choice. Extensions for ''cookie banner interaction'' can effectively reduce the number of set cookies compared to no interaction with the banners. However, all extensions increase the tracking requests significantly except when rejecting all cookies.
... Other studies [32], [35], [36] focused on chains of dependencies: a dependency of a website may itself depend on yet another service. [35] found that a single third-party dependency can lead to up to eight subsequent requests to further services. They also found that 93% of the analyzed websites embed third parties that ...
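The dependency chains discussed in these excerpts correspond to the paper's "third party tree" model: each request records which party triggered it, yielding a tree rooted at the visited site. The sketch below uses hypothetical domain names and edges to show how the depth of such a tree is computed.

```python
from collections import defaultdict

# Hypothetical loading dependencies observed during one page visit:
# (initiating party, third party it loads).
requests = [
    ("site.com", "adnetwork.com"),
    ("adnetwork.com", "bidder-a.com"),
    ("adnetwork.com", "bidder-b.com"),
    ("bidder-a.com", "cdn-tracker.net"),
]

tree = defaultdict(list)
for parent, child in requests:
    tree[parent].append(child)

def depth(node):
    """Height of the subtree rooted at `node` (a leaf has depth 1)."""
    return 1 + max((depth(c) for c in tree.get(node, [])), default=0)

# Number of third-party levels below the visited site.
print(depth("site.com") - 1)  # 3
```

In this toy tree, embedding a single third party (adnetwork.com) transitively pulls in three further services across two additional levels, which is the effect the third party tree model is designed to capture.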
Article
Full-text available
Websites using unsupported 3rd party technologies (libraries, frameworks, plugins, etc) are generally not recommended, especially due to security issues that are left unfixed. However, upgrading to supported technologies is also challenging, hence not all web maintainers upgrade their technology dependencies. Measuring the existence of unsupported technologies in the wild may contribute to the sense of urgency in keeping technologies updated. Our research proposed a method to measure the existence of unsupported technologies in international websites, using HTTP Archive as the data source. The contribution from our research is the method as well as the snapshot result from January 2023 data. The method is composed of four steps, namely: identify the list of websites, identify technologies used, group by technology names and retrieve currently supported versions, and compare versions between usage and supported versions. From the January 2023 data, we found several interesting results. One is that the higher the website rank is, the higher the number of supported technologies used. Another finding was that worldwide websites also generally use more supported versions of technologies, compared to Indonesian websites. Further research may be performed for longitudinal analysis of technology support evolution.
... Scholars who did consider different categories of websites in their results show variance in the amount of tracking within categories. An overarching finding from these efforts is that news websites enable more third-party tracking than other types of websites, as opposed to public websites that demonstrate less presence of trackers [9,19,21,23,38,42]. The difference in the amount of tracking for each category is usually associated with the incentives for publishers to include third-party trackers. ...
... • Crawling only homepages: We only crawled the homepages of selected websites to avoid bot detection. It is known that most tracking is happening on inner rather than homepages of websites [9,38], but at the same time, previous work found that crawling inner pages increases the chances of OpenWPM discovery by the crawled website [20]. • Studying clear-text ID cookie values: We are aware that cookie values can be encoded or encrypted when used by the same tracker in different websites [12]. ...
Preprint
Full-text available
The collapse of social contexts has been amplified by digital infrastructures but surprisingly received insufficient attention from Web privacy scholars. Users are persistently identified within and across distinct web contexts, in varying degrees, through and by different websites and trackers, losing the ability to maintain a fragmented identity. To systematically evaluate this structural privacy harm we operationalize the theory of Privacy as Contextual Integrity and measure persistent user identification within and between distinct Web contexts. We crawl the top-700 popular websites across the contexts of health, finance, news & media, LGBTQ, eCommerce, adult, and education websites, for 27 days, to learn how persistent browser identification via third-party cookies and JavaScript fingerprinting is diffused within and between web contexts. Past work measured Web tracking in bulk, highlighting the volume of trackers and tracking techniques. These measurements miss a crucial privacy implication of Web tracking - the collapse of online contexts. Our findings reveal how persistent browser identification varies between and within contexts, diffusing user IDs to different distances, contrasting known tracking distributions across websites, and conducted as a joint or separate effort via cookie IDs and JS fingerprinting. Our network analysis can inform the construction of browser storage containers to protect users against real-time context collapse. This is a first modest step in measuring Web privacy as contextual integrity, opening new avenues for contextual Web privacy research.
... In addition to cookies, various other techniques have been developed that allow comprehensive tracking of users for virtually all online and many offline activities (Urban et al., 2020). Whereas cookies primarily make use of the IP address of the device calling up a website to identify the user, current methods use additional identifiers. ...
Article
Full-text available
The significance of information and communication technologies (ICT) for the Paris Climate Agreement is continuously increasing because of its growing energy consumption. Here we examine the question for the smartphone and extend the investigation to more aspects of sustainability. Critical issues are identified for ten UN Sustainable Development Goals. Measurements of smartphone energy consumption show that a significant savings potential can be unlocked by reducing the data outflow and the large amount of personal data stored in data centers. Main discrepancies are also traced to the oligopolistic market structure of operating systems (OSs), messenger services, and social media apps. Technical means for a sustainable smartphone use are suggested as alternative OSs, social media channels of the Fediverse, as well as free and open‐source software. Finally, societal conditions are emphasized to make the market for OSs and apps more diverse so that a sustainable smartphone use can generally prevail.
... The prevalence and impact of third-party dependencies have been analyzed by Kashaf et al. [18] and Urban et al. [26], focusing on vulnerabilities and the concentration of dependencies on third-party service providers. The vulnerability of government domains has been investigated in [16]. ...
Chapter
Full-text available
This paper investigates the relationship between the digital divide, Internet transparency, and DNS dependencies. The term “digital divide” refers to a gap between how different population groups can access and use digital technology, with disadvantaged groups generally having less access than others. Internet transparency refers to efforts that reveal and understand critical dependencies on the Internet. DNS is a vital service in the Internet infrastructure. It has become common for network and website operators to outsource the operation of their DNS services to a (limited) number of specialized DNS providers. Depending on the choice of provider, a network or site may achieve better or worse availability, especially under adversarial conditions (power outages, attacks, etc.). This work-in-progress paper analyzes DNS provisioning and dependencies for Australian government websites to identify a possible digital divide. More specifically, we investigate setups with respect to potential drawbacks in terms of availability or domestic control over the setup. We choose sites whose audience is primarily the indigenous population and sites that target the broader, general population. We can indeed identify differences between the DNS dependencies, in particular with respect to the use of hyperscalers, domestic vs. international providers, and dedicated government infrastructure. The implications for availability and control are more subtle and require further investigation. However, our results show that Internet measurement can detect signals of possible digital divides, and we believe this aspect should be added to the Internet transparency agenda.
... The ethical perspective must also be considered when dealing with private information used in personalisation. In the two-sided personal data market (Fig 1), the data owner's role is usually weak compared to the other parties (Schmidt et al. 2021), and if an ICT service has third-party add-ons on its site collecting personal data, the service provider might not be aware which other third parties are included in the known add-ons (Urban, Degeling, Holz & Pohlmann 2020). So, interactions with personal data might not always be transparent. ...
Research
Full-text available
Personalized marketing opportunities and risks in SMEs business context
Conference Paper
Full-text available
Mining is the foundation of blockchain-based cryptocurrencies such as Bitcoin rewarding the miner for finding blocks for new transactions. The Monero currency enables mining with standard hardware in contrast to special hardware (ASICs) as often used in Bitcoin, paving the way for in-browser mining as a new revenue model for website operators. In this work, we study the prevalence of this new phenomenon. We identify and classify mining websites in 138M domains and present a new fingerprinting method which finds up to a factor of 5.7 more miners than publicly available block lists. Our work identifies and dissects Coinhive as the major browser-mining stakeholder. Further, we present a new method to associate mined blocks in the Monero blockchain to mining pools and uncover that Coinhive currently contributes 1.18% of mined blocks having turned over 1293 Moneros in June 2018. CCS CONCEPTS • Security and privacy → Malware and its mitigation; • Networks → Network measurement;
Article
Full-text available
Web tracking has been extensively studied over the last decade. To detect tracking, previous studies and user tools rely on filter lists. However, it has been shown that filter lists miss trackers. In this paper, we propose an alternative method to detect trackers inspired by analyzing behavior of invisible pixels. By crawling 84,658 webpages from 8,744 domains, we detect that third-party invisible pixels are widely deployed: they are present on more than 94.51% of domains and constitute 35.66% of all third-party images. We propose a fine-grained behavioral classification of tracking based on the analysis of invisible pixels. We use this classification to detect new categories of tracking and uncover new collaborations between domains on the full dataset of 4, 216, 454 third-party requests. We demonstrate that two popular methods to detect tracking, based on EasyList&EasyPrivacy and on Disconnect lists respectively miss 25.22% and 30.34% of the trackers that we detect. Moreover, we find that if we combine all three lists, 379, 245 requests originated from 8,744 domains still track users on 68.70% of websites.
Conference Paper
Full-text available
The European General Data Protection Regulation (GDPR), which went into effect in May 2018, brought new rules for the processing of personal data that affect many business models, including online advertising. The regulation's definition of personal data applies to every company that collects data from European Internet users. This includes tracking services that, until then, argued that they were collecting anonymous information and data protection requirements would not apply to their businesses. Previous studies have analyzed the impact of the GDPR on the prevalence of online tracking, with mixed results. In this paper, we go beyond the analysis of the number of third parties and focus on the underlying information sharing networks between online advertising companies in terms of client-side cookie syncing. Using graph analysis, our measurement shows that the number of ID syncing connections decreased by around 40 % around the time the GDPR went into effect, but a long-term analysis shows a slight rebound since then. While we can show a decrease in information sharing between third parties, which is likely related to the legislation , the data also shows that the amount of tracking, as well as the general structure of cooperation, was not affected. Consolidation in the ecosystem led to a more centralized infrastructure that might actually have negative effects on user privacy, as fewer companies perform tracking on more sites.
Conference Paper
Full-text available
Ad personalization has been criticized in the past for invading privacy, lack of transparency, and improper controls offered to users. Recently, companies started to provide web portals and other means for users to access data collected about them. In this paper, we study these new transparency tools from multiple perspectives using a mixed-methods approach. Still practices of data sharing barely changed until recently when new legislation required all companies to grant individual access to personal data stored about them. Using a mixed-methods approach we study the benefits of the new rights for users. First, we analyze transparency tools provided by 22 companies and check whether they follow previous recommendations for usability and user expectations. Based on these insights, we conduct a survey with 490 participants to evaluate three common approaches to disclose data. To complement this user-centric view, we shed light on the design decisions and complexities of transparency in online advertising using an online survey (n = 24) and in-person interviews (n = 8) with experts from the industry. We find that newly created transparency tools present a variety of information to users, from detailed technical logs to high-level interest segment information. Our results indicate that users do not (yet) know what to learn from the data and mistrust the accuracy of the information shown to them. At the same time, new transparency requirements pose several challenges to an industry that excessively shares data that even they sometimes cannot relate to an individual.
Conference Paper
Full-text available
After the adoption of the General Data Protection Regulation (GDPR) in May 2018, more than 60 % of popular websites in Europe were found to display a cookie consent notice. This has quickly led to users becoming fatigued with privacy notifications and contributed to the rise of both browser extensions that block these banners and demands for a solution that bundles consent across multiple websites or in the browser. In this work, we identify common properties of the graphical user interface of consent notices and conduct three studies with more than 80,000 unique users on a German website to investigate their influence on consent. We find that users are more likely to interact with a notice shown in the lower (left) part of the screen. Given a binary choice, more users are willing to accept tracking compared to mechanisms that require them to allow cookie use for each category or company individually. We also show that the practice of nudging is widely used and has a large effect on the choices users make. Our studies have implications for future regulations and the design of consent notices that encourage users to actively make an informed choice.
Conference Paper
Full-text available
Online tracking has mostly been studied by passively measuring the presence of tracking services on websites (i) without knowing what data these services collect, (ii) the reasons for which specific purposes it is collected, (iii) or if the used practices are disclosed in privacy policies. The European General Data Protection Regulation (GDPR) came into effect on May 25, 2018 and introduced new rights for users to access data collected about them. In this paper, we evaluate how companies respond to subject access requests and portability to learn more about the data collected by tracking services. More specifically, we exercised our right to access with 38 companies that had tracked us online. We observe stark differences between the way requests are handled and what data is disclosed: Only 21 out of 38 companies we inquired (55 %) disclosed information within the required time and only 13 (34 %) companies were able to send us a copy of the data in time. Our work has implications regarding the implementation of privacy law as well as what online tracking companies should do to be more compliant with the new regulation.
Conference Paper
Full-text available
The European Union's (EU) General Data Protection Regulation (GDPR), in effect since May 2018, enforces strict limitations on handling users' personal data, hence impacting their activity tracking on the Web. In this study, we perform an evaluation of the tracking performed in 2,000 high-traffic websites, hosted both inside and outside of the EU. We evaluate both the information presented to users and the actual tracking implemented through cookies; we find that the GDPR has impacted website behavior in a truly global way, both directly and indirectly: USA-based websites behave similarly to EU-based ones, while third-party opt-out services reduce the amount of tracking even for websites which do not put any effort in respecting the new law. On the other hand, we find that tracking remains ubiquitous. In particular, we found cookies that can identify users when visiting more than 90% of the websites in our dataset - and we also encountered a large number of websites that present deceiving information, making it it very difficult, if at all possible, for users to avoid being tracked.
Chapter
Web bots are used to automate client interactions with websites, which facilitates large-scale web measurements. However, websites may employ web bot detection. When they do, their response to a bot may differ from responses to regular browsers. The discrimination can result in deviating content, restriction of resources or even the exclusion of a bot from a website. This places strict restrictions upon studies: the more bot detection takes place, the more results must be manually verified to confirm the bot’s findings.