
What Happens After You Are Pwnd: Understanding The Use Of Leaked Webmail Credentials In The Wild


Jeremiah Onaolapo, Enrico Mariconti, and Gianluca Stringhini
University College London
{j.onaolapo, e.mariconti, g.stringhini}
Abstract
Cybercriminals steal access credentials to webmail ac-
counts and then misuse them for their own profit, re-
lease them publicly, or sell them on the underground
market. Despite the importance of this problem, the
research community still lacks a comprehensive under-
standing of what these stolen accounts are used for. In
this paper, we aim to shed light on the modus operandi
of miscreants accessing stolen Gmail accounts. We de-
veloped an infrastructure that is able to monitor the ac-
tivity performed by users on Gmail accounts, and leaked
credentials to 100 accounts under our control through
various means, such as having information-stealing mal-
ware capture them, leaking them on public paste sites,
and posting them on underground forums. We then
monitored the activity recorded on these accounts over
a period of 7 months. Our observations allowed us to
devise a taxonomy of malicious activity performed on
stolen Gmail accounts, to identify differences in the be-
havior of cybercriminals that get access to stolen ac-
counts through different means, and to identify system-
atic attempts to evade the protection systems in place
at Gmail and blend in with the legitimate user activity.
This paper gives the research community a better un-
derstanding of a so far understudied, yet critical aspect
of the cybercrime economy.
Categories and Subject Descriptors
J.4 [Computer Applications]: Social and Behavioral
Sciences; K.6.5 [Security and Protection]: Unautho-
rized Access
IMC 2016, November 14-16, 2016, Santa Monica, CA, USA
© 2016 ACM. ISBN 978-1-4503-4526-2/16/11...$15.00
Keywords
Cybercrime, Webmail, Underground Economy, Malware
1 Introduction
The wealth of information that users store in webmail
accounts on services such as Gmail and Yahoo! Mail,
as well as the possibility of misusing them for illicit
activities, has attracted cybercriminals, who actively
engage in compromising such accounts.
Miscreants obtain the credentials to victims’ online ac-
counts by performing phishing scams [17], by infect-
ing users with information-stealing malware [29], or by
compromising large password databases, leveraging the
fact that people often use the same password across
multiple services [16]. Such credentials can be used by
the cybercriminal privately, or can then be sold on the
black market to other cybercriminals who wish to use
the stolen accounts for profit. This ecosystem has be-
come a very sophisticated market in which only vetted
sellers are allowed to join [30].
Cybercriminals can use compromised accounts in mul-
tiple ways. First, they can use them to send spam [18].
This practice is particularly effective because of the
established reputation of such accounts: the already-
established contacts of the account are likely to trust
its owner, and are therefore more likely to open the
messages that they receive from her [20]. Similarly, the
stolen account is likely to have a history of good be-
havior with the online service, and the malicious mes-
sages sent by it are therefore less likely to be detected
as spam, especially if the recipients are within the same
service (e.g., a Gmail account used to send spam to
other Gmail accounts) [33]. Alternatively, cybercrim-
inals can use the stolen accounts to collect sensitive
information about the victim. Such information can
include financial credentials (credit card numbers, bank
account numbers), login information to other online ser-
vices, and personal communications of the victim [13].
Despite the importance of stolen accounts for the
underground economy, there is surprisingly little work
on the topic. Bursztein et al. [13] studied the modus
operandi of cybercriminals collecting Gmail account cre-
dentials through phishing scams. Their paper shows
that criminals access these accounts to steal financial
information from their victims, or use these accounts to
send fraudulent emails. Since their work only focused
on one possible way used by criminals to steal user login
credentials, it leaves questions unanswered on how gen-
eral their observations are compared to credentials ac-
quired through other means. Most importantly, [13] re-
lies on proprietary information from Google, and there-
fore it is not possible for other researchers to replicate
their results or build on top of their work.
Other researchers have not attempted to study the
activity of criminals on compromised online accounts
because it is usually difficult to monitor what happens
to them without being a large online service. The rare
exceptions are studies that look at information that is
publicly observable, such as the messages posted on
Twitter by compromised accounts [18,19].
To close this gap, in this paper we present a system
that is able to monitor the activity performed by at-
tackers on Gmail accounts. To this end, we instrument
the accounts using Google Apps Script [1]; by doing so,
we were able to monitor any time an email was opened,
favorited, sent, or a new draft was created. We also
monitor the accesses that the accounts receive, with
particular attention to their system configuration and
their origin. We call such accounts honey accounts.
We set up 100 honey accounts, each resembling the
Gmail account of an employee of a fictitious company.
To understand how criminals use these accounts af-
ter they get compromised, we leaked the credentials to
such accounts on multiple outlets, modeling the differ-
ent ways in which cybercriminals share and get access to
such credentials. First, we leaked credentials on paste
sites, such as pastebin [5]. Paste sites are commonly
used by cybercriminals to post account credentials after
data breaches [2]. We also leaked them to underground
forums, which have been shown to be the place where
cybercriminals gather to trade stolen commodities such
as account credentials [30]. Finally, we logged in to
our honey accounts on virtual machines that were pre-
viously infected with information stealing malware. By
doing this, the credentials will be sent to the cybercrim-
inal behind the malware’s command and control infras-
tructure, and will then be used directly by her or placed
on the black market for sale [29]. We know that there
are other outlets that attackers use, for instance, phish-
ing and data breaches, but we decided to focus on paste
sites, underground forums, and malware in this paper.
We worked in close collaboration with the Google anti-
abuse team, to make sure that any unwanted activity by
the compromised accounts would be promptly blocked.
The accounts were configured to send any email to a
mail server under our control, to prevent them from
successfully delivering spam.
After leaking our credentials, we recorded any inter-
action with our honey accounts for a period of 7 months.
Our analysis allowed us to draw a taxonomy of the dif-
ferent actions performed by criminals on stolen Gmail
accounts, and provided us interesting insights on the
keywords that criminals typically search for when look-
ing for valuable information on these accounts. We
also show that criminals who obtain access to stolen ac-
counts through certain outlets appear more skilled than
others, and make additional efforts to avoid detection
from Gmail. For instance, criminals who steal account
credentials via malware make more efforts to hide their
identity, by connecting from the Tor network and dis-
guising their browser user agent. Criminals who obtain
access to stolen credentials through paste sites, on the
other hand, tend to connect to the accounts from lo-
cations that are closer to the typical location used by
the owner of the account, if this information is shared
with them. At the lowest level of sophistication are
criminals who browse free underground forums looking
for free samples of stolen accounts: these individuals do
not take significant measures to avoid detection, and
are therefore easier to detect and block. Our findings
complement what was reported by previous work in the
case of manual account hijacking [13], and show that
the modus operandi of miscreants varies considerably
depending on how they obtain the credentials to stolen
accounts.
In summary, this paper makes the following
contributions:
• We developed a system to monitor the activity of
  Gmail accounts. We publicly release the source
  code of our system, to allow other researchers to
  deploy their own Gmail honey accounts and further
  the understanding that the security community
  has of malicious activity on online services. To
  the best of our knowledge, this is the first publicly
  available Gmail honeypot infrastructure.
• We deployed 100 honey accounts on Gmail, and
  leaked credentials through three different outlets:
  underground forums, public paste sites, and virtual
  machines infected with information-stealing
  malware.
• We provide detailed measurements of the activity
  logged by our honey accounts over a period of
  7 months. We show that certain outlets on which
  credentials are leaked appear to be used by more
  skilled criminals, who act stealthily and actively
  attempt to evade detection systems.
2 Background
Gmail accounts. In this paper we focus on Gmail
accounts, with particular attention to the actions per-
formed by cybercriminals once they obtain access to
someone else’s account. We made this choice over other
webmail platforms because Gmail allows users to set up
scripts that augment the functionality of their accounts,
and it was therefore the ideal platform for developing
webmail–based honeypots. To ease the understanding
of the rest of the paper, we briefly summarize the capa-
bilities offered by webmail accounts in general, and by
Gmail in particular.
In Gmail, after logging in, users are presented with a
view of their Inbox. The inbox contains all the emails
that the user received, and highlights the ones that have
not been read yet by displaying them in boldface font.
Users have the option to mark emails that are important
to them and that need particular attention by starring
them. Users are also given a search functionality, which
allows them to find emails of interest by typing related
keywords. They are also given the possibility to orga-
nize their email by placing related messages in folders,
or assigning them descriptive labels. Such operations
can be automated by creating rules that automatically
process received emails. When writing emails, content
is saved in a Drafts folder until the user decides to send
it. Once this happens, sent emails can be found in a
dedicated folder, and they can be searched similarly to
what happens for received emails.
Threat model. Cybercriminals can get access to ac-
count credentials in many ways. First, they can per-
form social engineering-based scams, such as setting up
phishing web pages that resemble the login page of pop-
ular online services [17] or sending spearphishing emails
pretending to be members of customer support teams at
such online services [32]. As a second way of obtaining
user credentials, cybercriminals can install malware on
victim computers and configure it to report back any
account credentials issued by the user to the command
and control server of the botnet [29]. As a third way of
obtaining access to user credentials, cybercriminals can
exploit vulnerabilities in the databases used by online
services to store them [6].
User credentials can also be obtained illegitimately
through targeted online password guessing techniques [36],
often aided by the problem of password reuse across
various online services [16]. Finally, cybercriminals can
steal user credentials and access tokens by running net-
work sniffers [14] or mounting Man-in-the-Middle [11]
attacks against victims.
After stealing account credentials, a cybercriminal
can either use them privately for their own profit, re-
lease them publicly, or sell them on the underground
market. Previous work studied the modus operandi of
cybercriminals stealing user accounts through phishing
and using them privately [13]. In this work, we study
a broader threat model in which we mimic cybercrimi-
nals leaking credentials on paste sites [5] as well as mis-
creants advertising them for sale on underground fo-
rums [30]. In particular, previous research showed that
cybercriminals often offer a small number of account
credentials for free to test their “quality” [30]. We
followed a similar approach, pretending to have more
accounts for sale, but never following up on any further
inquiries. In addition, we simulate infected victim
machines in which malware steals the user’s credentials and
sends them to the cybercriminal. We describe our setup
and how we leaked account credentials on each outlet
in detail in Section 3.2.
3 Methodology
Our overall goal was to gain a better understanding
of malicious activity in compromised webmail accounts.
To achieve this goal, we developed a system able to
monitor accesses and activity on Gmail accounts. We
set up accounts and leaked them through different out-
lets. In the following sections, we describe our system
architecture and experiment setup in detail.
3.1 System overview
Our system comprises two components, namely, honey
accounts and a monitor infrastructure.
Honey accounts. Our honey accounts are webmail ac-
counts instrumented with Google Apps Script to mon-
itor activity in them. Google Apps Script is a cloud-
based scripting language based on JavaScript, designed
to augment the functionality of Gmail accounts and
Google Drive documents, in addition to building web
apps [4]. The scripts we embedded in the honey
accounts send notifications to a dedicated webmail
account under our control whenever an email is opened,
sent, or “starred.” In addition, the scripts send us copies
of all draft emails created in the honey accounts. We
also added a “heartbeat message” function, to send us a
message once a day from each honey account, to attest
that the account was still functional and had not been
blocked by Google.
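The daily heartbeat described above lends itself to a simple liveness check on the receiving side. The following Python sketch is only an illustration and is not taken from the released system: the function name `silent_accounts`, the account names, the data layout, and the 36-hour grace period are all assumptions.

```python
from datetime import datetime, timedelta


def silent_accounts(last_heartbeat, now, grace=timedelta(hours=36)):
    """Return the accounts whose latest heartbeat is older than `grace`.

    `last_heartbeat` maps an account name to the datetime of the most
    recent heartbeat message received from it (hypothetical layout).
    """
    return sorted(
        account
        for account, seen in last_heartbeat.items()
        if now - seen > grace
    )


# Made-up accounts and timestamps, for illustration only.
now = datetime(2015, 7, 10, 12, 0)
heartbeats = {
    "honey01": datetime(2015, 7, 10, 9, 0),  # 3 hours ago: still alive
    "honey02": datetime(2015, 7, 7, 9, 0),   # 3 days ago: possibly blocked
}
print(silent_accounts(heartbeats, now))
```

A grace period longer than 24 hours is used here so that a slightly delayed daily message does not raise a false alarm.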
In each honey account, we hid the script in a Google
Docs spreadsheet. We believe that this measure makes
it unlikely for attackers to find and delete our scripts.
To minimize abuse, we changed each honeypot account’s
default send-from address to an email address pointing
to a mailserver under our control. All emails sent from
the honeypot accounts are delivered to the mailserver,
which simply dumps the emails to disk and does not
forward them to the intended destination.
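The "dump to disk, never forward" behavior of the sinkhole can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' mailserver: the function `dump_message`, the spool directory, and the content-hash file naming are hypothetical, and the SMTP handling that would call this function is omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def dump_message(raw_message: bytes, spool_dir: Path) -> Path:
    """Write one raw email to disk and return its path.

    Nothing is relayed: storing the message is the only action taken.
    """
    spool_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: name the file after a digest of its content, so that
    # delivering the same message twice does not create a second file.
    digest = hashlib.sha256(raw_message).hexdigest()
    path = spool_dir / f"{digest}.eml"
    path.write_bytes(raw_message)
    return path


# Illustrative message using reserved example domains.
spool = Path(tempfile.mkdtemp()) / "sinkhole"
raw = (
    b"From: honey@example.org\r\n"
    b"To: someone@example.com\r\n"
    b"Subject: test\r\n\r\nbody\r\n"
)
saved = dump_message(raw, spool)
print(saved.name)
```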
Monitoring infrastructure. Google Apps Scripts are
quite powerful, but they do not provide enough
information in some cases. For example, they do not provide
location information and IP addresses of accesses to
webmail accounts. To track those accesses, we set up
external scripts to drive a web browser and periodically log
in to each honey account and record information about
visitors (cookie identifier, geolocation information, and
times of accesses, among others). The scripts navigate
to the visitor activity page in each honey account, and
dump the pages to disk, for offline parsing. By col-
lecting information from the visitor activity pages, we
obtain location and system configuration information of
accesses, as provided by Google’s geolocation and sys-
tem fingerprinting system.
We believe that our honey account and monitoring
framework unleashes multiple possibilities for researchers
who want to further study the behavior of attackers in
webmail accounts. For this reason, we release the source
code of our system (see footnote 1).
3.2 Experiment setup
As part of our experiments, we first set up a number
of honey accounts on Gmail, and then leaked their
credentials through multiple outlets used by cybercriminals.
Honey account setup. We created 100 Gmail ac-
counts and assigned them random combinations of pop-
ular first and last names, similar to what was done
in [31]. Creating and setting up these accounts is a
manual process. Google also rate-limits the creation
of new accounts from the same IP address by presenting
a phone verification page after a few accounts have been
created. These factors imposed limits on the number of
honey accounts we could set up in practice.
We populated the freshly-created accounts with emails
from the public Enron email dataset [22]. This dataset
contains the emails sent by the executives of the en-
ergy corporation Enron, and was publicly released as
evidence for the bankruptcy trial of the company. This
dataset is suitable for our purposes, since the emails
that it contains are the typical emails exchanged by
corporate users. To make the honey accounts believ-
able and avoid raising suspicion from cybercriminals
accessing them, we mapped distinct recipients in the
Enron dataset to our fictional characters (i.e., the ficti-
tious “owners” of the honey accounts), and replaced the
original first names and last names in the dataset with
our honey first names and last names. In addition, we
changed all instances of “Enron” to a fictitious company
name that we came up with.
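The renaming step described above can be illustrated with a single-pass, whole-word substitution. The Python sketch below is an assumption about how such a rewrite could be done, not the authors' code: the function `rewrite_email`, the example names, and the company name "Acme Energy" are hypothetical.

```python
import re


def rewrite_email(text, name_map, company_map):
    """Replace original names and company names in one pass.

    A single combined pattern is used so that a replacement value is
    never rewritten again by a later rule.
    """
    mapping = {**name_map, **company_map}
    if not mapping:
        return text
    # Longer keys first, so a longer name wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


# Hypothetical mapping from dataset names to honey persona names.
names = {"John": "Mary", "Doe": "Smith"}
company = {"Enron": "Acme Energy"}
original = "John Doe asked the Enron team to reply to John."
print(rewrite_email(original, names, company))
```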
In order to have realistic email timestamps, we
translated the old Enron email timestamps to recent
timestamps slightly earlier than our experiment start date.
For instance, given two email timestamps t1 and t2 in
the Enron dataset such that t1 is earlier than t2, we
translate them to more recent timestamps T1 and T2
such that T1 is earlier than T2. We then schedule those
particular emails to be sent to the recipient honey
accounts at times T1 and T2 respectively. We sent
between 200 and 300 emails from the Enron dataset to each
honey account in the process of populating them.
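One simple way to obtain an order-preserving translation of this kind is to shift every timestamp by the same offset. The Python sketch below assumes this constant-offset approach, which the text does not confirm; the function name `translate_timestamps`, the one-day margin, and the example dates t1 and t2 are illustrative.

```python
from datetime import datetime, timedelta


def translate_timestamps(old_times, experiment_start, margin=timedelta(days=1)):
    """Shift all timestamps by one constant offset.

    The latest timestamp lands `margin` before `experiment_start`.
    Because every timestamp moves by the same amount, the original
    ordering and spacing are preserved.
    """
    offset = (experiment_start - margin) - max(old_times)
    return [t + offset for t in old_times]


# Two made-up original timestamps, with t1 earlier than t2.
t1 = datetime(2001, 5, 1, 9, 0)
t2 = datetime(2001, 5, 3, 17, 30)
start = datetime(2015, 6, 25)

T1, T2 = translate_timestamps([t1, t2], start)
print(T1, T2)
```

With a constant shift, T1 is earlier than T2 whenever t1 is earlier than t2, and both fall before the experiment start.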
Leaking account credentials. To achieve our objec-
tives, we had to entice cybercriminals to interact with
our account honeypots while we logged their accesses.
We selected paste sites and underground forums as
appropriate venues for leaking account credentials, since
they tend to be misused by cybercriminals for
dissemination of stolen credentials. In addition, we leaked
some credentials through malware, since this is a
popular way by which professional cybercriminals steal
credentials and compromise accounts [10]. We divided
the honeypot accounts into groups and leaked their
credentials in different locations, as shown in Table 1. We
leaked 50 accounts in total on paste sites. For 20 of
them, we leaked basic credentials (username and
password pairs) on popular paste sites. We leaked 10
account credentials on Russian paste websites. For the
remaining 20 accounts, we leaked username and password
pairs along with UK and US location information of the
fictitious personas that we associated with the honey
accounts. We also included date of birth information of
each persona.
1 students/gmail-honeypot
Group  Accounts  Outlet of leak
1      30        paste websites (no location)
2      20        paste websites (with location)
3      10        forums (no location)
4      20        forums (with location)
5      20        malware (no location)
Table 1: List of account honeypot groupings.
We leaked 30 account credentials on underground
forums. For 10 of them, we only specified username and
password pairs, without additional information. For the
remaining 20, in a manner similar to the paste site leaks
described earlier, we appended UK and US location
information to the leaks, claiming that our fictitious
personas lived in those locations. We also included date of
birth information for each persona.
To leak credentials, we used several underground forums. We selected them
because they were open for anybody to register, and
were highly ranked in Google results. We acknowledge
that some underground forums are not open, and have a
strict vetting policy to let users in [30]. Unfortunately,
however, we did not have access to any private forum.
In addition, the same approach of studying open un-
derground forums has been used by previous work [7].
When leaking credentials on underground forums, we
mimicked the modus operandi of cybercriminals that
was outlined by Stone-Gross et al. in [30]. In the pa-
per, the authors showed that cybercriminals often post
a sample of their stolen datasets on the forum to show
that the accounts are real, and promise to provide
additional data in exchange for a fee. We logged the
messages that we received on underground forums, mostly
inquiring about obtaining the full dataset, but we did
not follow up on them.
Finally, to study the activity in honey accounts of
criminals who obtain credentials through information-stealing
malware, we leaked access credentials of 20 accounts to
information-stealing malware samples. To this end, we
selected malware samples from the Zeus family, which
is one of the most popular malware families performing
information stealing [10], as well as from the Corebot
family. We will provide detailed information on our
malware honeypot infrastructure in the next section.
The reason for leaking different accounts on different
outlets is to study differences in the behavior of cyber-
criminals getting access to stolen credentials through
different sources. Similarly, we provide decoy location
information in some leaks, and not in others, with the
idea of observing differences in malicious activity de-
pending on the amount and type of information avail-
able to cybercriminals. As we will show in Section 4,
the accesses that were observed in our honey accounts
were heavily influenced by the presence of additional
location information in the leaked content.
Malware honeypot infrastructure. Our malware
sandbox system is structured as follows. A web server
entity manages the honey credentials (usernames and
passwords) and the malware samples. The host ma-
chine creates a Virtual Machine (VM), which contacts
the web server to request an executable malware file and
a honey credential file. The structure is similar to the
one explained in [21]. The malware file is then executed
in the VM (that is, the VM is infected with malware),
after which a script drives a browser in the VM to lo-
gin to Gmail using the downloaded credentials. The
idea is to expose the honey credentials to the malware
that is already running in the VM. After some time,
the infected VM is deleted and a fresh one is created.
This new VM downloads another malware sample and
a different honey credential file, and it repeats the in-
fection and login operation. To maximize the efficiency
of the configuration, before the experiment we carried
out a test without the Gmail login process to select only
samples whose C&C servers were still up and running.
3.3 Threats to validity
We acknowledge that seeding the honey accounts with
emails from the Enron dataset may introduce bias into
our results, and may make the honey accounts less be-
lievable to visitors. However, it is necessary to note that
the Enron dataset is the only large publicly available
email corpus, to the best of our knowledge. To make the
emails believable, we changed the names in the emails,
dates, and company name. In the future, we will work
towards obtaining or generating a better email dataset,
if possible. Also, some visitors may notice that the
honey accounts did not receive any new emails during
the period of observation, and this may affect the way
in which criminals interact with the accounts. Another
threat is that we only leaked honey credentials through
the outlets listed previously (namely paste sites, un-
derground forums, and malware), therefore, our results
reflect the activity of participants present on those out-
lets only. Finally, since we selected only underground
forums that are publicly accessible, our observations
might not reflect the modus operandi of actors who are
active on closed forums that require vetting for signing
up.
3.4 Ethics
The experiments performed in this paper require some
ethical considerations. First of all, by giving access to
our honey accounts to cybercriminals, we incur the risk
that these accounts will be used to damage third par-
ties. To minimize this risk, as we said, we configured our
accounts in a way that all emails would be forwarded
to a sinkhole mailserver under our control and never
delivered to the outside world. We also established a
close collaboration with Google and made sure to re-
port to them any malicious activity that needed atten-
tion. Although the suspicious login filters that Google
typically uses to protect their accounts from
unauthorized access were disabled for our honey accounts, all
other malicious activity detection algorithms were still
in place, and in fact Google suspended a number of
accounts under our control that engaged in suspicious
activity. It is important to note, however, that our ap-
proach does not rely on help from Google to work. Our
main reason for enlisting Google’s help to disable sus-
picious login filters was to ensure that all accesses get
through to the honey accounts (most accesses would
be blocked if Google did not disable the login filters).
This does not impact directly on our methodology, and
as a result does not reduce the wider applicability of
our approach. It is also important to note that Google
did not share with us any details on the techniques
used internally for the detection of malicious activity on
Gmail. Another point of risk is ensuring that the mal-
ware in our VMs would not be able to harm third par-
ties. We followed common practices [28] such as restrict-
ing the bandwidth available to our virtual machines and
sinkholing all email traffic sent by them. Finally, our
experiments involve deceiving cybercriminals by provid-
ing them fake accounts with fake personal information
in them. To ensure that our experiments were run in
an ethical fashion, we obtained IRB approval from our
institution.
4 Data analysis
We monitored the activity on our honey accounts for
a period of 7 months, from 25th June, 2015 to 16th
February, 2016. In this section, we first provide an
overview of our results. We then discuss a taxonomy
of the types of activity that we observed. We provide
a detailed analysis of the type of activity monitored
on our honey accounts, focusing on the differences in
modus operandi shown by cybercriminals who obtain
credentials to our honey accounts from different outlets.
We then investigate whether cybercriminals attempt to
evade location-based detection systems by connecting
from locations that are closer to where the owner of the
account typically connects from. We also develop a
metric to infer which keywords attackers search for when
looking for interesting information in an email account.
Finally, we analyze how certain types of cybercriminals
appear to be stealthier and more advanced than others.
Google records each unique access to a Gmail account
and labels the access with a unique cookie identifier.
These unique cookie identifiers, along with more infor-
mation including times of accesses, are included in the
visitor activity pages of Gmail accounts. Our scripts
extract this data, which we analyze in this section. For
the sake of convenience, we will use the terms “cookie”
and “unique access” interchangeably in the remainder of
this paper.
4.1 Overview
We created, instrumented, and leaked 100 Gmail ac-
counts for our experiments. To avoid biasing our re-
sults, we removed all accesses made to honey accounts
by IP addresses from our monitoring infrastructure. We
also removed all accesses that originated from the city
where our monitoring infrastructure is located. After
this filtering operation, we observed 326 unique accesses
to the accounts over the course of the experiment, during
which 147 emails were opened, 845 emails were sent, and
12 unique draft emails were composed by cybercriminals.
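The filtering step described in this overview, which removes accesses made from the monitoring infrastructure's IP addresses or from its city, can be sketched as follows. The record fields `ip` and `city`, the function name `filter_accesses`, and all values are hypothetical; the IP addresses come from ranges reserved for documentation.

```python
def filter_accesses(accesses, monitor_ips, monitor_city):
    """Keep only accesses that did not come from the monitoring setup.

    An access is dropped if its IP address belongs to the monitoring
    infrastructure, or if it originated from the monitoring city.
    """
    return [
        access
        for access in accesses
        if access["ip"] not in monitor_ips and access["city"] != monitor_city
    ]


# Made-up access records, one per unique cookie.
accesses = [
    {"cookie": "c1", "ip": "192.0.2.10", "city": "Monitor City"},     # our IP
    {"cookie": "c2", "ip": "198.51.100.7", "city": "Monitor City"},   # our city
    {"cookie": "c3", "ip": "203.0.113.5", "city": "Elsewhere"},       # kept
]
kept = filter_accesses(accesses, {"192.0.2.10"}, "Monitor City")
print([access["cookie"] for access in kept])
```

Filtering on the city as well as on the IP address is conservative: it may discard genuine attacker accesses from that city, but it avoids counting the monitors' own logins.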
In total, 90 accounts received accesses during the ex-
periment, comprising 41 accounts leaked to paste sites,
30 accounts leaked to underground forums, and 19 ac-
counts leaked through malware. 42 accounts were
blocked by Google during the course of the experiment,
due to suspicious activity. We were able to log activity
in those accounts for some time before Google blocked
them. 36 accounts were hijacked by cybercriminals, that
is, the passwords of such accounts were changed by the
cybercriminals. As a result, we lost control of those
accounts. We did not observe any attempt by
attackers to change the default send-from address of our
honey accounts. Had that happened and had attackers
started sending spam messages, Google would have
blocked such accounts, since we asked them to monitor the
accounts with particular attention. A dataset
containing the parsed metadata of the accesses received by
our honey accounts during our experiments is publicly
available at
4.2 A taxonomy of account activity
From our dataset of activity observed in the honey
accounts, we devise a taxonomy of attackers based on
unique accesses to such accounts. We identify four types
of attackers, described in detail in the following.
Curious. These accesses constitute the most basic type
of access to stolen accounts. After getting hold of
account credentials, people log in to those accounts to
check if such credentials work. Afterwards, they do not
perform any additional action. The majority of the ob-
served accesses belong to this category, accounting for
224 accesses. We acknowledge that this large number
of curious accesses may be due in part to experienced
attackers avoiding interactions with the accounts after
logging in, probably after some careful observations in-
dicating that the accounts do not look real. This could
potentially introduce some bias into our results.
Gold diggers. When getting access to a stolen ac-
count, attackers often want to understand its worth.
For this reason, on logging into honey accounts, some
attackers search for sensitive information, such as ac-
count information and attachments that have financial-
related names. They also seek information that may
be useful in spearphishing attacks. We call these ac-
cesses “gold diggers.” Previous research showed that
this practice is quite common for manual account hi-
jackers [13]. In this paper, we confirm that finding,
provide a methodology to assess the keywords that cy-
bercriminals search for, and analyze differences in the
modus operandi of gold digger accesses for credentials
leaked through different outlets. In total, we observed
82 accesses of this type.
Spammers. One of the main capabilities of webmail
accounts is sending emails. Previous research showed
that large spamming botnets have code in their bots
and in their C&C infrastructure to take advantage of
this capability, by having the bots directly connect to
such accounts and send spam [30]. We consider accesses
to belong to this category if they send any email. We
observed 8 accounts that recorded such accesses. This
low number suggests that sending spam is not one of the
main purposes for which cybercriminals use stolen
accounts, at least when the credentials are obtained
through the outlets that we studied.
Hijackers. A stealthy cybercriminal is likely to keep
a low profile when accessing a stolen account, to avoid
raising suspicion from the account’s legitimate owner.
Less concerned miscreants, however, might just act to
lock the legitimate owner out of their account by chang-
ing the account’s password. We call these accesses “hi-
jackers.” In total, we observed 36 accesses of this type.
A change of password prevents us from scraping the vis-
itor activity page, and therefore we are unable to col-
lect further information about the accesses performed
to that account.
It is important to note that the taxonomy classes that
we described are not exclusive. For example, an at-
tacker might use an account to send spam emails, there-
fore falling in the “spammer” category, and then change
the password of that account, therefore falling into the
“hijacker” category. Such overlaps happened often for
the accesses recorded in our honey accounts. It is in-
teresting to note that there was no access that behaved
exclusively as “spammer.” Miscreants that sent spam
through our honey accounts also acted as “hijackers” or
as “gold diggers,” searching for sensitive information in
the account.
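To make the non-exclusive nature of the taxonomy concrete, the labeling can be sketched as a function that returns a set of classes for each unique access. This is a minimal sketch only: the three boolean inputs are hypothetical summaries of what was observed during an access, and the text does not specify how the classification was implemented.

```python
def classify_access(sent_email, changed_password, opened_sensitive):
    """Return the set of taxonomy labels for one unique access.

    The three flags are hypothetical inputs (assumptions for this
    sketch), summarising what was observed during the access.
    """
    labels = set()
    if sent_email:
        labels.add("spammer")
    if changed_password:
        labels.add("hijacker")
    if opened_sensitive:
        labels.add("gold digger")
    # An access that logs in and does nothing else is "curious".
    if not labels:
        labels.add("curious")
    return labels


# An access that sends spam and then changes the password gets two labels.
print(classify_access(sent_email=True, changed_password=True,
                      opened_sensitive=False))
```

Returning a set rather than a single label reflects the overlaps described above, such as spammers that also act as hijackers or gold diggers.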
We wanted to understand the distribution of different
types of accesses in accounts that were leaked through
different means. Figure 1 shows a breakdown of this
distribution. As can be seen, cybercriminals who get
access to stolen accounts through malware are the
stealthiest, and never lock the legitimate users out of their
account. Instead, they limit their activity to check-
ing if such credentials are real or searching for sensi-
tive information in the account inbox, perhaps in an at-
tempt to estimate the value of the accounts. Accounts
leaked through paste sites and underground forums see
the presence of “hijackers.” 20% of the accesses to
accounts leaked through paste sites, in particular, belong
to this category. Accounts leaked through underground
forums, on the other hand, see the highest percentage
of “gold digger” accesses, with about 30% of all accesses
belonging to this category.
Figure 1: Distribution of types of accesses for different
credential leak accesses. As it can be seen, most accesses
belong to the “curious” category. It is possible to spot
differences in the types of activities for different leak
outlets. For example, accounts leaked by malware do
not present activity of “hijacker” type. Hijackers, on the
other hand, are particularly common among miscreants
who obtain stolen credentials through paste sites.
Figure 2: CDF of the length of unique accesses for
different types of activity on our honey accounts. The vast
majority of unique accesses lasts a few minutes.
Spammers tend to use accounts aggressively for a short time
and then disconnect. The other types of accesses, and
in particular “curious” ones, come back after some time,
likely to check for new activity in the honey accounts.
4.3 Activity on honey accounts
In the following, we provide detailed analysis on the
unique accesses that we recorded for our honey accounts.
4.3.1 Duration of accesses
For each cookie identifier, we recorded the time that
the cookie first appeared in a particular honey account
as $t_0$, and the last time it appeared in the honey
account as $t_{last}$. From this information, we computed the
duration of activity of each cookie as $t_{last} - t_0$. It is
necessary to note that $t_{last}$ of each cookie is a lower bound,
since we cease to obtain information about cookies if, for
instance, the password of the honey account that is
recording cookies is changed. Figure 2 shows the Cumulative
Distribution Function (CDF) of the length of unique
accesses of different types of attackers. As can be seen,
the vast majority of accesses are very short, lasting only
a few minutes and never coming back. “Spammer”
accesses, in particular, tend to send emails in bursts for
a certain period and then disconnect. “Hijacker” and
“gold digger” accesses, on the other hand, have a long
tail of about 10% of accesses that keep coming back for
several days in a row. The CDF shows that most
“curious” accesses are repeated over many days, indicating
that the cybercriminals keep coming back to find out
if there is new information in the accounts. This
conflicts with the finding in [13], which states that most
cybercriminals connect to a compromised webmail
account once, to assess its value within a few minutes.
However, [13] focused only on accounts compromised
via phishing pages, while we look at a broader range
of ways in which criminals can obtain such credentials.
Our results show that the modes of operation of
cybercriminals vary, depending on the outlets they obtain
stolen credentials from.
Figure 3: CDF of the time passed between account
credentials leaks and the first visit by a cookie. Accounts
leaked through paste sites receive on average accesses
earlier than accounts leaked through other outlets.
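The duration computation described in this section can be sketched as follows. This is an illustrative sketch, not the infrastructure's actual code: the event tuple layout `(cookie_id, account, timestamp)` and the function names are assumptions, and timestamps are taken to be seconds.

```python
def access_durations(events):
    """Compute t_last - t_0 for every (cookie, account) pair.

    `events` is an iterable of (cookie_id, account, timestamp_seconds)
    tuples; this layout is an assumption made for the sketch.
    """
    first, last = {}, {}
    for cookie, account, ts in events:
        key = (cookie, account)
        first[key] = min(ts, first.get(key, ts))  # earliest sighting, t_0
        last[key] = max(ts, last.get(key, ts))    # latest sighting, t_last
    return {key: last[key] - first[key] for key in first}


def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs in ascending order."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


events = [("c1", "acct-a", 0), ("c1", "acct-a", 300), ("c2", "acct-a", 50)]
durations = access_durations(events)
print(empirical_cdf(durations.values()))
```

Because monitoring stops once a password is changed, each computed duration should be read as a lower bound, as noted above.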
4.3.2 Time between leak and first access
We then studied how long it takes after credentials
are leaked on different outlets before our infrastructure
records accesses from cybercriminals. Figure 3 reports
a CDF of the time between leak and first access for
accounts leaked through different outlets. As it can be
seen, within the first 25 days after leak, we recorded
80% of all unique accesses to accounts leaked to paste
sites, 60% of all unique accesses to accounts leaked to
underground forums, and 40% of all unique accesses to
accounts leaked to malware. A particularly interesting
observation is the nature of unique accesses to accounts
leaked to malware. A close look at Figure 3 reveals
rapid increases in unique accesses to honey accounts
leaked to malware, about 30 days after the leak, and
also after 100 days, indicated by two sharp inflection
points.
Figure 4 sheds more light on what happened at
those points. The figure reports the unique accesses
to each of our honey accounts over time. An interesting
aspect to note is that accounts that are leaked on public
outlets such as forums and paste sites can be accessed
by multiple cybercriminals at the same time. Account
credentials leaked through malware, on the other hand,
are available only to the botmaster that stole them, un-
til they decide to sell them or to give them to some-
one else. Seeing bursts in accesses to accounts leaked
through malware months after the actual leak happened
could indicate that the accounts were visited again by
the same criminal who operated the malware infrastruc-
ture, or that the accounts were sold on the underground
market and that another miscreant is now using them.
This hypothesis is somewhat confirmed by the fact that
these bursts in accesses were of the “gold digger” type,
while all previous accesses to the same accounts were of
the “curious” type.
In addition, Figure 4 shows that the majority of
accounts leaked to paste sites were accessed within a few
days of leak, while a particular subset was not accessed
for more than 2 months. That subset refers to the ten
credentials we leaked to Russian paste sites. The cor-
responding honey accounts were not accessed for more
than 2 months from the time of leak. This indicates
either that few cybercriminals are active on the Russian
paste sites, or that they did not believe that the
accounts were real and therefore did not bother to access them.
4.3.3 System configuration of accesses
We observed a wide variety of system configurations
for the accesses across groups of leaked accounts, by
leveraging Google’s system fingerprinting information
available to us inside the honey accounts. As shown
in Figure 5a, accesses to accounts leaked on paste sites
were made through a variety of popular browsers, with
Firefox and Chrome taking the lead. We also recorded
many accesses from unknown browsers. It is possible
for an attacker to hide browser information from Google
servers by presenting an empty user agent and hiding
other fingerprintable information [27]. About 50% of
accesses to accounts leaked through paste sites were
not identifiable. Chrome and Firefox take the lead in
groups leaked in underground forums as well, but there
is less variety of browsers there. Interestingly, all ac-
cesses to accounts in malware groups were made from
unknown browsers. This shows that cybercriminals that
accessed groups leaked through malware were stealth-
ier than others. While analyzing the operating systems
used by criminals, we observed that honey accounts
leaked through malware mostly received accesses from
Windows computers, followed by Mac OS X and Linux.
This is shown in Figure 5b. In the paste sites and un-
derground forum groups, we observe a wider range of
operating systems. More than 50% of computers in the
three categories ran on Windows. It is interesting to
note that Android devices were also used to connect
to the honey accounts in paste sites and underground
forum groups.
Figure 4: Plot of duration between time of leak and
unique accesses in accounts leaked through different
outlets. As can be seen, accounts leaked to malware
experience a sudden increase in unique accesses
after 30 days and 100 days from the leak, indicating
that they may have been sold or transferred to some
other party by the cybercriminals behind the malware
command and control infrastructure.
The diversity of devices and browsers in the paste
sites and underground forums groups indicates a mot-
ley mix of cybercriminals with various motives and ca-
pabilities, compared to the malware groups that appear
to be more homogeneous. It is also obvious that
attackers that steal credentials through malware make more
efforts to cover their tracks by evading browser
fingerprinting.
4.3.4 Location of accesses
We recorded the location information that we found
in the accesses that were logged by our infrastructure.
Our goal was to understand patterns in the locations
(or proxies) used by criminals to access stolen accounts.
Out of the 326 accesses logged, 132 were coming from
Tor exit nodes. More specifically, 28 accesses to ac-
counts leaked on paste sites were made via Tor, out of a
total of 144 accesses to accounts leaked on paste sites.
48 accesses to accounts leaked on forums were made
through Tor, out of a total of 125 accesses made to ac-
counts leaked on forums. We observed 57 accesses to
accounts leaked through malware, and all except one of
those accesses were made via Tor. We removed these ac-
cesses for further analysis, since they do not provide in-
formation on the physical location of the criminals. Af-
ter removing Tor nodes, 173 unique accesses presented
location information. To determine this location
information, we used the geolocation provided by Google on
the account activity page of the honey accounts. We
observed accesses from a total of 29 countries. To
understand whether the IP addresses that connected to
our honey accounts had been recorded in previous
malicious activity, we ran checks on all IP addresses we
observed against the Spamhaus blacklist. We found
20 IP addresses that accessed our honey accounts in the
Spamhaus blacklist. Because of the nature of this
blacklist, we believe that the addresses belong to
malware-infected machines that are used by cybercriminals to
connect to the stolen accounts.
(a) Distribution of browsers of honey account accesses (b) Distribution of operating systems of honey account accesses
Figure 5: Distribution of browsers and operating systems of the accesses that we logged to our honey accounts. As
can be seen, accounts leaked through different outlets attracted cybercriminals with different system configurations.
One of our goals was to observe if cybercriminals at-
tempt to evade location-based login risk analysis sys-
tems by tweaking access origins. In particular, we wanted
to assess whether telling criminals the location where
the owner of an account is based influences the loca-
tions that they will use to connect to the account. De-
spite observing 57 accesses to our honey accounts leaked
through malware, we discovered that all these connec-
tions except one originated from Tor exit nodes. This
shows that the malware operators that accessed our ac-
counts prefer to hide their location through the use of
anonymizing systems rather than modifying their lo-
cation based on where the stolen account is typically
connecting from.
While leaking the honey credentials, we chose Lon-
don and Pontiac, MI as our decoy UK and US locations
respectively. The idea was to claim that the honey ac-
counts leaked with location details belonged to fictitious
personas living in either London or Pontiac. However,
we realized that leaking multiple accounts with the same
location might cause suspicion. As a result, we chose de-
coy UK and US locations such that London and Pontiac,
IL were the midpoints of those locations respectively.
To observe the impact of availability of location in-
formation about the honey accounts on the locations
that cybercriminals connect from, we calculated the
median values of distances of the locations recorded
in unique accesses, from the midpoints of the
advertised decoy locations in our account leaks. For
example, for the accesses $A$ to honey accounts leaked
on paste sites, advertised with UK information, we
extracted location information, translated them to
coordinates $L_A$, and computed the $dist\_paste\_UK$ vector as
$distance(L_A, mid_{UK})$, where $mid_{UK}$ are London’s
coordinates. All distances are in kilometers. We extracted
the median values of all distance vectors obtained, and
plotted circles on UK and US maps, specifying those
median distances as radii of the circles, as shown in
Figures 6a and 6b.
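The median-radius computation can be illustrated with the sketch below. It rests on assumptions: the text does not say which distance function was used, so the haversine great-circle distance is used here as one reasonable choice, and the London coordinates are approximate values chosen for the example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_radius_km(access_coords, midpoint):
    """Median distance from a list of (lat, lon) accesses to a midpoint."""
    return median(haversine_km(lat, lon, *midpoint)
                  for lat, lon in access_coords)


MID_UK = (51.5074, -0.1278)  # approximate London coordinates (assumption)
accesses = [(51.5074, -0.1278), (48.8566, 2.3522), (52.5200, 13.4050)]
print(round(median_radius_km(accesses, MID_UK)), "km")
```

The returned value corresponds to the radius of one circle in Figures 6a and 6b; one such value would be computed per leak outlet and per location setting.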
Interestingly, we observe that connections to accounts
with advertised location information originate from places
closer to the midpoints than accounts with leaked infor-
mation containing usernames and passwords only. Fig-
ure 6a shows that connections to accounts leaked on
paste sites and forums result in the smaller median cir-
cles, that is, the connections originate from locations
closer to London, the UK midpoint. The smallest circle
is for the accounts leaked on paste sites, with adver-
tised UK location information (radius 1400 kilometers).
In contrast, the circle of accounts leaked on paste sites
without location information has a radius of 1784 kilo-
meters. The median circle of the accounts leaked in
underground forums, with no advertised location infor-
mation, is the largest circle in Figure 6a, while the one
of accounts leaked in underground forums, along with
UK location information, is smaller.
We obtained similar results in the US plot, with some
interesting distinctions. As shown in Figure 6b, con-
nections to honey accounts leaked on paste sites, with
advertised US locations are clustered around the US
midpoint, as indicated by the circle with a radius of
939 kilometers, compared to the median circle of ac-
counts leaked on paste sites without location informa-
tion, which has a radius of 7900 kilometers. However,
despite the fact that the median circle of accounts leaked
in underground forums with advertised location
information is smaller than the one without advertised
location information, the difference in their radii is not
as pronounced. This again supports the indication that
cybercriminals on paste sites exhibit more location
malleability, that is, masquerading their origins of accesses
to appear closer to the advertised location, when
provided. It also shows that cybercriminals on the studied
forums are less sophisticated, or care less than the ones
on paste sites.
(a) Distance of login locations from the UK midpoint (b) Distance of login locations from the US midpoint
Figure 6: Distance of login locations from the midpoints of locations advertised while leaking credentials. Red
lines indicate credentials leaked on paste sites with no location information, green lines indicate credentials leaked
on paste sites with location information, purple lines indicate credentials leaked on underground forums without
location information, while blue lines indicate credentials leaked on underground forums with location information.
As can be seen, account credentials leaked with location information attract logins from hosts that are closer to
the advertised midpoint than credentials that are posted without any location information.
Statistical significance. As we explained, Figures 6a
and 6b show that accesses to leaked accounts happen
closer to advertised locations if this information is
included in the leak. To confirm the statistical
significance of this finding, we performed a Cramér-von Mises
test [15]. The Anderson version [8] of this test is used
to determine whether two vectors of values are likely to
have the same statistical distribution. The p-value
has to be under 0.01 for us to reject the null hypothesis
(i.e., that the two vectors of distances have the same
distribution); otherwise, it is not possible to state with
statistical significance that the two distance vectors
come from different distributions. The p-values from the
tests on paste site vectors (0.0017415 for UK location
information versus no location, and 0.0000007 for US
location information versus no location) allow us to
reject the null hypothesis, and thus to state that the two
vectors come from different distributions. We cannot say
the same for the tests on forum vectors (p-values of
0.272883 for the UK case and 0.272011 for the US one).
Therefore, the statistical test supports the conclusion
that criminals using paste sites connect from closer
locations when location information is provided along
with the leaked credentials. We cannot reach that
conclusion in the case of accounts leaked to underground
forums, although Figures 6a and 6b indicate that there
are some location differences in this case too.
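For readers who want to reproduce this kind of comparison, the two-sample Cramér-von Mises statistic in Anderson's form can be sketched as below. Two points are assumptions of this sketch rather than details from the text: the samples are assumed to contain no ties, and the p-value is estimated with a permutation test, which is a substitute for whatever reference distribution the original analysis used.

```python
import random


def cvm_two_sample_statistic(x, y):
    """Anderson's two-sample Cramér-von Mises statistic T (no ties assumed)."""
    n, m = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Ranks of each sample's values within the pooled, sorted sample.
    ranks_x = [r for r, (_, src) in enumerate(combined, start=1) if src == 0]
    ranks_y = [r for r, (_, src) in enumerate(combined, start=1) if src == 1]
    u = (n * sum((r - i) ** 2 for i, r in enumerate(ranks_x, start=1))
         + m * sum((s - j) ** 2 for j, s in enumerate(ranks_y, start=1)))
    return u / (n * m * (n + m)) - (4 * m * n - 1) / (6 * (m + n))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate the p-value by reshuffling the pooled sample (assumption)."""
    rng = random.Random(seed)
    observed = cvm_two_sample_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = cvm_two_sample_statistic(pooled[:len(x)], pooled[len(x):])
        if t >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical distance vectors in kilometers, for illustration only.
with_location = [float(v) for v in range(20)]
without_location = [float(v) for v in range(100, 120)]
p = permutation_p_value(with_location, without_location)
print("reject null at 0.01:", p < 0.01)
```

With real data, the two inputs would be the distance vectors for one outlet, with and without advertised location information, and the 0.01 threshold mirrors the one used above.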
4.3.5 What are “gold diggers” looking for?
Cybercriminals compromise online accounts due to
the inherent value of those accounts. As a result, they
assess accounts to decide how valuable they are, and de-
cide exactly what to do with such accounts. We decided
to study the words that they searched for in the honey
accounts, in order to understand and potentially char-
acterize anomalous searches in the accounts. A limiting
factor in this case was the fact that we did not have ac-
cess to search logs of the honey accounts, but only to the
content of the emails that were opened. To overcome
this limitation, we employed Term Frequency–Inverse
Document Frequency (TF-IDF). TF-IDF is used to rank
words in a corpus by importance. As a result we
relied on TF-IDF to infer the words that cybercriminals
searched for in the honey accounts. TF-IDF is a
product of two metrics, namely Term Frequency (TF) and
Inverse Document Frequency (IDF). The idea is that we
can infer the words that cybercriminals searched for, by
comparing the important words in the emails opened by
cybercriminals to the important words in all emails in
the decoy accounts.
Word           TFIDF_R  TFIDF_A  TFIDF_R-TFIDF_A    Word         TFIDF_R  TFIDF_A  TFIDF_R-TFIDF_A
results        0.2250   0.0127    0.2122            transfer     0.2795   0.2949   -0.0154
bitcoin        0.1904   0.0       0.1904            please       0.2116   0.2608   -0.0493
family         0.1624   0.0200    0.1423            original     0.1387   0.1540   -0.0154
seller         0.1333   0.0037    0.1296            company      0.0420   0.1531   -0.1111
localbitcoins  0.1009   0.0       0.1009            would        0.0864   0.1493   -0.0630
account        0.1114   0.0247    0.0866            energy       0.0618   0.1471   -0.0853
payment        0.0982   0.0157    0.0824            information  0.0985   0.1308   -0.0323
bitcoins       0.0768   0.0       0.0768            about        0.1342   0.1226    0.0116
below          0.1236   0.0496    0.0740            email        0.1402   0.1196    0.0207
listed         0.0858   0.0207    0.0651            power        0.0462   0.1175   -0.0713
Table 2: List of top 10 words by $TFIDF_R - TFIDF_A$ (on the left) and list of top 10 words by $TFIDF_A$ (on the
right). The words on the left are the ones that have the highest difference in importance between the emails opened
by attackers and the emails in the entire corpus. For this reason, they are the words that attackers most likely
searched for when looking for sensitive information in the stolen accounts. The words on the right, on the other
hand, are the ones that have the highest importance in the entire corpus.
In its simplest form, TF is a measure of how
frequently term $t$ is found in document $d$, as shown in
Equation 1. IDF is a logarithmic scaling of the fraction
of the number of documents containing term $t$, as shown
in Equation 2, where $D$ is the set of all documents in the
corpus, $N$ is the total number of documents in the
corpus, and $|\{d \in D : t \in d\}|$ is the number of documents
in $D$ that contain term $t$. Once TF and IDF are obtained,
TF-IDF is computed by multiplying TF and IDF, as
shown in Equation 3.
$$tf(t, d) = f_{t,d} \tag{1}$$
$$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2}$$
$$tfidf(t, d, D) = tf(t, d) \times idf(t, D) \tag{3}$$
The output of TF-IDF is a weighted metric that ranges
between 0 and 1. The closer the weighted value is to 1,
the more important the term is in the corpus.
We evaluated TF-IDF on all terms in a corpus of
text comprising two documents, namely, all emails $d_A$
in the honey accounts, and all emails $d_R$ opened by
the attackers. The intuition is that the words that
have a large importance in the emails that have been
opened by a criminal, but have a lower importance in
the overall dataset, are likely to be keywords that the
attackers searched for in the Gmail account. We
preprocessed the corpus by filtering out all words that have
less than 5 characters, and removing all known
header-related words, for instance “delivered” and “charset,”
honey email handles, and also removing signaling
information that our monitoring infrastructure introduced
into the emails. After running TF-IDF on all remaining
terms in the corpus, we obtained their TF-IDF values
as vectors $TFIDF_A$ and $TFIDF_R$, the TF-IDF
values of all terms in the corpus $[d_A, d_R]$. We proceeded
to compute the vector $TFIDF_R - TFIDF_A$. The top
10 words by $TFIDF_R - TFIDF_A$, compared to the
top 10 words by $TFIDF_A$, are presented in Table 2.
Words that have $TFIDF_R$ values that are higher than
$TFIDF_A$ will rank higher in the list, and those are the
words that the cybercriminals likely searched for.
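The ranking procedure can be sketched in a few lines of Python. Several details are assumptions of this sketch and are not taken from the text: with only two documents, the plain IDF of Equation 2 assigns zero weight to any term present in both, so the sketch uses a smoothed IDF, $\ln((1+N)/(1+df)) + 1$, and L2-normalises each document vector so that weights fall between 0 and 1. The tokenizer and function names are likewise hypothetical, and the removal of header words and email handles is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text, min_len=5):
    """Lowercase, split on non-letters, drop words shorter than min_len."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    vocab = set().union(*counts)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for c in counts:
        raw = {t: c[t] * idf[t] for t in vocab}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors


def rank_by_difference(all_emails, opened_emails, top=10):
    """Rank terms by TFIDF_R - TFIDF_A, highest difference first."""
    tfidf_a, tfidf_r = tfidf_vectors([all_emails, opened_emails])
    diff = {t: tfidf_r[t] - tfidf_a[t] for t in tfidf_a}
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy documents for illustration only; not the honey account data.
d_all = "company energy power company energy transfer please payment"
d_opened = "payment bitcoin payment bitcoin transfer"
for term, score in rank_by_difference(d_all, d_opened, top=3):
    print(term, round(score, 4))
```

Terms that are prominent only in the opened emails receive the largest positive differences, while terms that dominate the full corpus but were rarely opened receive negative ones, which mirrors the two halves of Table 2.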
As seen in Table 2, the top 10 important words by
$TFIDF_R - TFIDF_A$ are sensitive words, such as
“bitcoin,” “family,” and “payment.” Comparing these words
with the most important words in the entire corpus
indicates that attackers likely searched for
sensitive information, especially financial information.
In addition, words with the highest importance in the
entire corpus (for example, “company” and “energy”),
shown on the right side of Table 2, have much lower
importance in the emails opened by cybercriminals, and
most of them have negative values in $TFIDF_R - TFIDF_A$.
This is a strong indicator that the emails opened in the
honey accounts were not opened at random, but were
the result of searches for sensitive information.
Originally, the Enron dataset had no “bitcoin” term.
However, that term was introduced into the opened
emails document dR, through the actions of one of the
cybercriminals that accessed some of the honey accounts.
The cybercriminal attempted to send blackmail mes-
sages from some of our honey accounts to Ashley Madi-
son scandal victims [3], requesting ransoms in bitcoin,
in exchange for silence. In the process, many draft
emails containing information about “bitcoin” were cre-
ated and abandoned by the cybercriminal, and other
cybercriminals opened them during later accesses. That
way, our monitoring infrastructure picked up “bitcoin”
related terms, and they rank high in Table 2, showing
that cybercriminals took a lot of interest in those
emails.
4.4 Interesting case studies
In this section, we present some interesting case stud-
ies that we encountered during our experiments. They
help to shed further light on the actions that
cybercriminals take on compromised webmail accounts.
Three of the honey accounts were used by an attacker
to send multiple blackmail messages to some victims of
the Ashley Madison scandal. The blackmailer threat-
ened to expose the victims, unless they made some pay-
ments in bitcoin to a specified bitcoin wallet. Tutorials
on how to make bitcoin payments were also included in
the messages. The blackmailer created and abandoned
many draft emails targeted at more Ashley Madison
victims. As we have already mentioned, some other
visitors to the accounts opened these drafts, thus
contributing to the opened emails that our monitoring
infrastructure recorded.
Two of the honey accounts received notification emails
about the hidden Google Apps Script in both honey
accounts “using too much computer time.” The noti-
fications were opened by an attacker, and we received
notifications about the opening actions.
Finally, an attacker registered on a carding forum
using one of the honey accounts as registration email
address. As a result, registration confirmation
information was sent to the honey account. This shows that
some of the accounts were used as stepping stones by
cybercriminals to perform further illicit activity.
4.5 Sophistication of attackers
From the accesses we recorded in the honey accounts,
we identified 3 peculiar behaviors of cybercriminals that
indicate their level of sophistication, namely, configu-
ration hiding – for instance by hiding user agent in-
formation, location filter evading – by connecting from
locations close to the advertised decoy location if pro-
vided, and stealthiness – avoiding performing clearly
malicious actions such as hijacking and spamming. At-
tackers accessing the different groups of honey accounts
exhibit different types of sophistication. Those access-
ing accounts leaked through malware are stealthier than
others – they don’t hijack the accounts, and they don’t
send spam from them. They also access the accounts
through Tor, and they hide their system configuration,
for instance, their web browser is not fingerprintable by
Google. Attackers accessing accounts leaked on paste
sites tend to connect from locations closer to the ones
specified as decoy locations in the leaked account. They
do this in a bid to evade detection. Attackers access-
ing accounts leaked in underground forums do not make
significant attempts to stay stealthy or to connect from
closer locations. These differences in sophistication could
be used to characterize attacker behavior in future work.
5 Discussion
In this section, we discuss the implications of the
findings we made in this paper. First, we talk about
what our findings mean for current mitigation tech-
niques against compromised online service accounts, and
how they could be used to devise better defenses. Then,
we talk about some limitations of our method. Finally,
we present some ideas for future work.
Implications of our findings. In this paper, we made
multiple findings that provide the research community
with a better understanding of what happens when on-
line accounts get compromised. In particular, we dis-
covered that if attackers are provided with location in-
formation about the online accounts, they then tend
to connect from places that are closer to those adver-
tised locations. We believe that this is an attempt to
evade current security mechanisms employed by online
services to discover suspicious logins. Such systems of-
ten rely on the origin of logins, to assess how suspicious
those login attempts are. Our findings show that there
is an arms race going on, with attackers attempting
to actively evade the location-based anomaly detection
systems employed by Google. We also observed that
many accesses were received through Tor exit nodes, so
it is hard to determine the exact origins of logins. This
problem shows the importance of defense in depth in
protecting online systems, in which multiple detection
systems are employed at the same time to identify and
block miscreants.
Despite confirming existing evasion techniques in use
by cybercriminals, our experiments also highlighted in-
teresting behaviors that could be used to develop effec-
tive systems to detect malicious activity. For example,
our observations about the words searched for by the cy-
bercriminals show that behavioral modeling could work
in identifying anomalous behavior in online accounts.
Anomaly detection systems could be trained adaptively
on words being searched for by the legitimate account
owner over a period of time. A deviation of search be-
havior would then be flagged as anomalous, indicating
that the account may have been compromised. Sim-
ilarly, anomaly detection systems could be trained on
the durations of connections during benign usage, and
deviations from those could be flagged as anomalous.
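As a rough illustration of the search-behavior idea, the sketch below keeps a running profile of the terms an account owner searches for and flags terms that the owner has rarely or never used. It is a deliberately simple sketch under stated assumptions: the class name, the `min_seen` threshold, and the exact-match rule are all hypothetical, and a deployed system would need a far richer model.

```python
from collections import Counter


class SearchProfile:
    """Adaptive profile of an account owner's search terms (hypothetical)."""

    def __init__(self, min_seen=2):
        # A term seen fewer than min_seen times counts as unfamiliar.
        self.min_seen = min_seen
        self.counts = Counter()

    def observe(self, term):
        """Update the profile with a search issued by the legitimate owner."""
        self.counts[term.lower()] += 1

    def is_anomalous(self, term):
        """Flag a search term the owner has rarely or never used."""
        return self.counts[term.lower()] < self.min_seen


profile = SearchProfile()
for term in ["invoice", "invoice", "meeting"]:
    profile.observe(term)
print(profile.is_anomalous("bitcoin"))
```

The same pattern could be applied to connection durations, replacing the term counts with a summary of durations observed during benign usage.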
Limitations. We encountered a number of limitations
in the course of the experiments. For example, we were
able to leak the honey accounts only on a few outlets,
namely paste sites, underground forums, and malware.
In particular, we could only target underground forums
that were open to the public and for which registration
was free. Similarly, we could not study some of the most
recent families of information-stealing malware such as
Dridex, because they would not execute in our virtual
environment. Attackers could find the scripts we hid in
the honey accounts and remove them, making it impos-
sible for us to monitor the activity of the accounts. This
is an intrinsic limitation of our monitoring architecture,
but in principle studies similar to ours could be per-
formed by the online service providers themselves, such
as Google and Facebook. By having access to the full
logs of their systems, such entities would have no need
to set up monitoring scripts, and it would be impossi-
ble for attackers to evade their scrutiny. Finally, while
evaluating what cybercriminals were looking for in the
honey accounts, we were able to observe the emails that
they found interesting in the honey accounts, not every-
thing they searched for. This is because we do not have
access to the search logs of the honey accounts.
Future work. In the future, we plan to continue ex-
ploring the ecosystem of stolen accounts, and gaining a
better understanding of the underground economy sur-
rounding them. We would explore ways to make the
decoy accounts more believable, to attract more cyber-
criminals and keep them engaged with the decoy ac-
counts. We intend to set up additional scenarios, such
as studying attackers who have a specific motivation,
for example compromising accounts that belong to po-
litical activists (rather than generic corporate accounts,
as we did in this paper). We would also like to study
whether demographic information, as well as the lan-
guage that the emails in honey accounts are written
in, influence the way in which cybercriminals interact
with these accounts. To mitigate the fact that our in-
frastructure can only identify search terms for emails
that were found in the accounts, we plan to seed the
honey accounts with some specially crafted emails con-
taining decoy sensitive information, for instance, fake
bank account information and login credentials, along
with other regular email messages. We expect this type
of specialized seeding to increase the variety of hits
when cybercriminals search for content in the honey
accounts, and thus to improve our insight into what
criminals search for.
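As an illustration of the seeding idea described above, the following is a minimal sketch, not the infrastructure used in this paper. It builds a small set of decoy emails (fake bank details and fake login credentials) mixed with regular messages, and records a random marker per decoy so that a later access can be attributed to it. The function names, templates, subjects, and marker scheme are all hypothetical, and delivering the messages to a honey account is left out.

```python
import secrets
from email.message import EmailMessage

# Hypothetical templates: every value is fake and chosen for illustration only.
TEMPLATES = {
    "bank": ("Bank transfer details", "Account number: {marker}\nSort code: 00-00-00\n"),
    "login": ("Your new login", "Username: staff{marker}\nPassword: Pw-{marker}\n"),
}


def make_decoy(kind, sender, recipient):
    """Return (message, marker) for one decoy email of the given kind."""
    subject, body = TEMPLATES[kind]
    # An 8-digit random marker doubles as the fake account number or login
    # suffix, so a later access to this message can be tied to this decoy.
    marker = f"{secrets.randbelow(10**8):08d}"
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body.format(marker=marker))
    return msg, marker


def build_seed_set(sender, recipient, regular_bodies):
    """Mix decoy emails with ordinary ones; return (messages, marker registry)."""
    messages, registry = [], {}
    for kind in TEMPLATES:
        msg, marker = make_decoy(kind, sender, recipient)
        messages.append(msg)
        registry[marker] = kind  # kept by the researchers, never sent
    for i, text in enumerate(regular_bodies):
        msg = EmailMessage()
        msg["From"], msg["To"] = sender, recipient
        msg["Subject"] = f"Weekly update {i + 1}"
        msg.set_content(text)
        messages.append(msg)
    return messages, registry


if __name__ == "__main__":
    msgs, reg = build_seed_set("alice@example.com", "bob@example.com",
                               ["Minutes attached.", "See you on Monday."])
    print(len(msgs), "messages;", len(reg), "decoys registered")
```

Keeping the registry outside the account means an attacker who reads a decoy learns nothing about which values are being tracked.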
In this section, we briefly compare this paper with
previous work, noting that most previous work focused
on spam and social spam. Only a few focused on manual
hijacking of accounts and their activity.
Bursztein et al. [13] investigated manual hijacking of
online accounts through phishing pages. The study fo-
cuses on cybercriminals that steal user credentials and
use them privately, and shows that manual hijacking
is not as common as automated hijacking by botnets.
This paper illustrates the usefulness of honey
credentials (account honeypots) in the study of hijacked
accounts. Compared to the work by Bursztein et al.,
which focused on phishing, we analyzed a much broader
threat model, looking at account credentials automati-
cally stolen by malware, as well as the behavior of cy-
bercriminals that obtain account credentials through
underground forums and paste sites. By focusing on
multiple types of miscreants, we were able to show dif-
ferences in their modus operandi, and provide multi-
ple insights on the activities that happen on hijacked
Gmail accounts in the wild. We also provide an open
source framework that can be used by other researchers
to set up experiments similar to ours and further explore
the ecosystem of stolen Google accounts. To the best
of our knowledge, our infrastructure is the first pub-
licly available Gmail honeypot infrastructure. Despite
the fact that the authors of [13] had more visibility on
the account hijacking phenomenon than we did, since
they were operating the Gmail service, the dataset that
we collected is of comparable size to theirs: we logged
326 malicious accesses to 100 accounts, while they stud-
ied 575 high-confidence hijacked accounts.
A number of papers looked at abuse of accounts on
social networks. Thomas et al. [34] studied Twitter ac-
counts under the control of spammers. Stringhini et
al. [31] studied social spam using 300 honeypot profiles,
and presented a tool for detection of spam on Face-
book and Twitter. Similar work was also carried out
in [9,12,24,38]. Thomas et al. [35] studied underground
markets in which fake Twitter accounts are sold and
then used to spread spam and other malicious content.
Unlike this paper, they focus on fake accounts and not
on legitimate ones that have been hijacked. Wang et
al. [37] proposed the use of patterns of click events to
spot fake accounts in online services.
Researchers also looked at developing systems to de-
tect compromised accounts. Egele et al. [18] presented
a system that detects malicious activity in online social
networks using statistical models. Stringhini et al. [32]
developed a tool for detecting compromised email ac-
counts based on the behavioral modeling of senders.
Other papers investigated the use of stolen credentials
and stolen files by setting up honeyfiles. Liu et al. [25]
deployed honeyfiles containing honey account creden-
tials in P2P shared spaces. The study used a similar
approach to ours, especially in the placement of honey
account credentials. However, they placed more empha-
sis on honeyfiles than on honey credentials. Besides,
they studied P2P networks while our work focuses on
compromised accounts in webmail services. Nikiforakis
et al. [26] studied privacy leaks in file hosting services by
deploying honeyfiles on them. In our previous work [23],
we developed an infrastructure to study malicious activ-
ity in online spreadsheets, using an approach similar to
the one described in this paper. Stone-Gross et al. [30]
studied a large-scale spam operation by analyzing 16
C&C servers of the Pushdo/Cutwail botnet. In the paper,
the authors highlight that the Cutwail botnet, one of
the largest of its time, has the capability of connect-
ing to webmail accounts to send spam. In their paper,
Stone-Gross et al. also describe the activity of cyber-
criminals on spamdot, a large underground forum. They
show that cybercriminals were actively trading account
information such as that provided in this paper,
providing free “teasers” of the overall datasets for sale. In
this paper, we used a similar approach to leak account
credentials on underground forums.
In this paper, we presented a honey account system
able to monitor the activity of cybercriminals that gain
access to Gmail account credentials. Our system is
publicly available to encourage researchers to set up
additional experiments and improve the knowledge of
our community regarding what happens after webmail
accounts are compromised (see footnote 2). We leaked 100 honey
accounts on paste sites, underground forums, and virtual
machines infected with malware, and provided detailed
statistics of the activity of cybercriminals on the ac-
counts, together with a taxonomy of the criminals. Our
findings could help the research community to get a
better understanding of the ecosystem of stolen online
accounts, and potentially help researchers to develop
better detection systems against this malicious activity.
We wish to thank our shepherd Andreas Haeberlen
for his advice on how to improve our paper, and Mark
Risher and Tejaswi Nadahalli from Google for their sup-
port throughout the project. We also thank the anony-
mous reviewers for their comments. This work was
supported by the EPSRC under grant EP/N008448/1,
and by a Google Faculty Award. Jeremiah Onaolapo
was supported by the Petroleum Technology Develop-
ment Fund (PTDF), Nigeria, while Enrico Mariconti
was funded by the EPSRC under grant 1490017.
[1] Apps Script.
[2] Dropbox User Credentials Stolen: A Reminder To
Increase Awareness In House.
[3] Hackers Finally Post Stolen Ashley Madison
[4] Overview of Google Apps Script.
[5] Pastebin.
[6] The Target Breach, By the Numbers.
[7] S. Afroz, A. C. Islam, A. Stolerman,
R. Greenstadt, and D. McCoy. Doppelgänger
finder: Taking stylometry to the underground. In
IEEE Symposium on Security and Privacy, 2014.
[8] T. W. Anderson and D. A. Darling. Asymptotic
theory of certain “goodness of fit” criteria based
on stochastic processes. The Annals of
Mathematical Statistics, 1952.
Footnote 2: students/gmail-honeypot
[9] F. Benevenuto, G. Magno, T. Rodrigues, and
V. Almeida. Detecting Spammers on Twitter. In
Conference on Email and Anti-Spam (CEAS),
[10] H. Binsalleeh, T. Ormerod, A. Boukhtouta,
P. Sinha, A. Youssef, M. Debbabi, and L. Wang.
On the analysis of the Zeus botnet crimeware
toolkit. In Privacy, Security and Trust (PST),
[11] D. Boneh, S. Inguva, and I. Baker. SSL MITM
Proxy. http:// mitm, 2007.
[12] Y. Boshmaf, I. Muslukhov, K. Beznosov, and
M. Ripeanu. The socialbot network: when bots
socialize for fame and money. In Annual
Computer Security Applications Conference
(ACSAC), 2011.
[13] E. Bursztein, B. Benko, D. Margolis,
T. Pietraszek, A. Archer, A. Aquino, A. Pitsillidis,
and S. Savage. Handcrafted Fraud and Extortion:
Manual Account Hijacking in the Wild. In ACM
Internet Measurement Conference (IMC), 2014.
[14] E. Butler. Firesheep.
http:// firesheep, 2010.
[15] H. Cramér. On the composition of elementary
errors. Skandinavisk Aktuarietidskrift, 1928.
[16] A. Das, J. Bonneau, M. Caesar, N. Borisov, and
X. Wang. The Tangled Web of Password Reuse.
In Symposium on Network and Distributed System
Security (NDSS), 2014.
[17] R. Dhamija, J. D. Tygar, and M. Hearst. Why
phishing works. In ACM Conference on Human
Factors in Computing Systems (CHI), 2006.
[18] M. Egele, G. Stringhini, C. Kruegel, and
G. Vigna. COMPA: Detecting Compromised
Accounts on Social Networks. In Symposium on
Network and Distributed System Security (NDSS),
[19] M. Egele, G. Stringhini, C. Kruegel, and
G. Vigna. Towards Detecting Compromised
Accounts on Social Networks. In IEEE
Transactions on Dependable and Secure
Computing (TDSC), 2015.
[20] T. N. Jagatic, N. A. Johnson, M. Jakobsson, and
F. Menczer. Social Phishing. Communications of
the ACM, 50(10):94–100, 2007.
[21] J. P. John, A. Moshchuk, S. D. Gribble, and
A. Krishnamurthy. Studying Spamming Botnets
Using Botlab. In USENIX Symposium on
Networked Systems Design and Implementation
(NSDI), 2009.
[22] B. Klimt and Y. Yang. Introducing the Enron
Corpus. In Conference on Email and Anti-Spam
(CEAS), 2004.
[23] M. Lazarov, J. Onaolapo, and G. Stringhini.
Honey Sheets: What Happens to Leaked Google
Spreadsheets? In USENIX Workshop on Cyber
Security Experimentation and Test (CSET), 2016.
[24] K. Lee, J. Caverlee, and S. Webb. The social
honeypot project: protecting online communities
from spammers. In World Wide Web Conference
(WWW), 2010.
[25] B. Liu, Z. Liu, J. Zhang, T. Wei, and W. Zou.
How many eyes are spying on your shared folders?
In ACM Workshop on Privacy in the Electronic
Society (WPES), 2012.
[26] N. Nikiforakis, M. Balduzzi, S. Van Acker,
W. Joosen, and D. Balzarotti. Exposing the Lack
of Privacy in File Hosting Services. In USENIX
Workshop on Large-Scale Exploits and Emergent
Threats (LEET), 2011.
[27] N. Nikiforakis, A. Kapravelos, W. Joosen,
C. Kruegel, F. Piessens, and G. Vigna. Cookieless
monster: Exploring the ecosystem of web-based
device fingerprinting. In IEEE Symposium on
Security and Privacy, 2013.
[28] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich,
V. Paxson, N. Pohlmann, H. Bos, and M. van
Steen. Prudent practices for designing malware
experiments: Status quo and outlook. In IEEE
Symposium on Security and Privacy, 2012.
[29] B. Stone-Gross, M. Cova, L. Cavallaro,
B. Gilbert, M. Szydlowski, R. Kemmerer,
C. Kruegel, and G. Vigna. Your Botnet is My
Botnet: Analysis of a Botnet Takeover. In ACM
Conference on Computer and Communications
Security (CCS), 2009.
[30] B. Stone-Gross, T. Holz, G. Stringhini, and
G. Vigna. The underground economy of spam: A
botmaster’s perspective of coordinating
large-scale spam campaigns. In USENIX
Workshop on Large-Scale Exploits and Emergent
Threats (LEET), 2011.
[31] G. Stringhini, C. Kruegel, and G. Vigna.
Detecting Spammers on Social Networks. In
Annual Computer Security Applications
Conference (ACSAC), 2010.
[32] G. Stringhini and O. Thonnard. That Ain’t You:
Blocking Spearphishing Through Behavioral
Modelling. In Detection of Intrusions and
Malware, and Vulnerability Assessment (DIMVA),
[33] B. Taylor. Sender Reputation in a Large Webmail
Service. In Conference on Email and Anti-Spam
(CEAS), 2006.
[34] K. Thomas, C. Grier, D. Song, and V. Paxson.
Suspended accounts in retrospect: an analysis of
Twitter spam. In ACM Internet Measurement
Conference (IMC), 2011.
[35] K. Thomas, D. McCoy, C. Grier, A. Kolcz, and
V. Paxson. Trafficking Fraudulent Accounts: The
Role of the Underground Market in Twitter Spam
and Abuse. In USENIX Security Symposium,
[36] D. Wang, Z. Zhang, P. Wang, J. Yan, and
X. Huang. Targeted Online Password Guessing:
An Underestimated Threat. In ACM Conference
on Computer and Communications Security
(CCS), 2016.
[37] G. Wang, T. Konolige, C. Wilson, X. Wang,
H. Zheng, and B. Y. Zhao. You are how you click:
Clickstream analysis for sybil detection. In USENIX
Security Symposium, 2013.
[38] S. Webb, J. Caverlee, and C. Pu. Social
Honeypots: Making Friends with a Spammer
Near You. In Conference on Email and Anti-Spam
(CEAS), 2008.
... Cybercriminals use social engineering techniques such as phishing and spear phishing [2], malware on victims' devices [3] and also exploiting vulnerabilities in authentication databases [4,5] To this end, we monitored during a period of one month the use of compromised Gmail accounts in the Dark Web using the infrastructure proposed by Onaolapo et al. [1]. For that purpose, we created fake Gmail accounts (called honey accounts) whose credentials were leaked in various outlets such as paste sites (online outlets where users can store and share plain text) and underground forums. ...
... Onaolapo et al. [1] analyse the activities cybercriminals perform on compromised Gmail accounts. Their work involves leaking fake Gmail accounts on paste sites, underground forums or by using malware. ...
... Onaolapo et al. [1] provide an analysis of the interaction between cybercriminals and hijacked accounts when stolen credentials are obtained in outlets hosted on the Surface Web. Based on the observations from the accesses to the honey accounts, they identified a classification of the activity conducted by cybercriminals. ...
The world has seen a dramatic increase in illegal activities on the Internet. Prior research has investigated different types of cybercrime, especially in the Surface Web, which is the portion of the content on the World Wide Web that popular engines may index. At the same time, evidence suggests cybercriminals are moving their operations to the Dark Web. This portion is not indexed by conventional search engines and is accessed through network overlays such as The Onion Router network. Since the Dark Web provides anonymity, cybercriminals use this environment to avoid getting caught or blocked, which represents a significant challenge for researchers. This research project investigates the modus operandi of cybercriminals on the Surface Web and the Dark Web to understand how cybercrime unfolds in different layers of the Web. Honeypots, specialised crawlers and extraction tools are used to analyse different types of online crimes. In addition, quantitative analysis is performed to establish comparisons between the two Web environments. This thesis is comprised of three studies. The first examines the use of stolen account credentials leaked in different outlets on the Surface and Dark Web to understand how cybercriminals interact with stolen credentials in the wild. In the second study, malvertising is analysed from the user's perspective to understand whether using different technologies to access the Web could influence the probability of malware infection. In the final study, underground forums on the Surface and Dark Web are analysed to observe differences in trading patterns in both environments. Understanding how criminals operate in different Web layers is essential to developing policies and countermeasures to prevent cybercrime more efficiently.
... Repeat sale and use of credit cards from Dark markets [104]. Repeat use of stolen accounts [158]. Malware spreading to nearby devices [147,35]. ...
... Thomas et al. [201] note that cybercriminals typically rent out compromised computers at varying prices depending on their regions, where computers from the West are usually more expensive in the cybercriminal economy than those from the rest of the world. Concerning the use of stolen email credentials, Onaolapo et al. [158] found that illicit users may ascertain the value of email accounts by executing searches using keywords such as 'bank' and 'money.' Turning to the other dimensions, the 'visibility' of a target from the perspective of a cybercriminal could translate to a victim's online presence, or the presence of well-known vulnerabilities in a service (e.g., a website bug, or a software CVE 3 ). ...
... Various forms of identity fraud, facilitated through Internet-enabled theft of personally identifiably information (PII) (e.g., names, addresses, email addresses) or account credentials for common services (e.g., email, banking, social media) are also problems that the information security community closely monitor. Researchers have recognised that phishing emails and malware are common precursors to identity fraud [170], and they have monitored the illegal activities that subsequently ensue with such credentials [158]. ...
Full-text available
Malware Delivery Networks (MDNs) are networks of webpages, servers, computers, and computer files that are used by cybercriminals to proliferate malicious software (or malware) onto victim machines. The business of malware delivery is a complex and multifaceted one that has become increasingly profitable over the last few years. Due to the ongoing arms race between cybercriminals and the security community, cybercriminals are constantly evolving and streamlining their techniques to beat security countermeasures and avoid disruption to their operations, such as by security researchers infiltrating their botnet operations, or law enforcement taking down their infrastructures and arresting those involved. So far, the research community has conducted insightful but isolated studies into the different facets of malicious file distribution. Hence, only a limited picture of the malicious file delivery ecosystem has been provided thus far, leaving many questions unanswered. Using a data-driven and interdisciplinary approach, the purpose of this research is twofold. One, to study and measure the malicious file delivery ecosystem, bringing prior research into context, and to understand precisely how these malware operations respond to security and law enforcement intervention. And two, taking into account the overlapping research efforts of the information security and crime science communities towards preventing cybercrime, this research aims to identify mitigation strategies and intervention points to disrupt this criminal economy more effectively.
... Prior work has used honeypots to investigate reputation manipulation in online social networks and account/website compromise. First, prior work has used honeypots to monitor attackers' interactions with honeypots by leaking credentials [35,39,53]. For example, Onaolapo et al. [53] leaked credentials of honey email accounts to blackhat forums and paste sites and monitored activities of attackers who accessed their honey email accounts. ...
... First, prior work has used honeypots to monitor attackers' interactions with honeypots by leaking credentials [35,39,53]. For example, Onaolapo et al. [53] leaked credentials of honey email accounts to blackhat forums and paste sites and monitored activities of attackers who accessed their honey email accounts. Second, prior work has used honeypots to monitor attacker interactions with honeypots without explicitly handing over their access, such as by purchasing fake followers on Facebook, Instagram, or Twitter [36,45,74]. ...
... We create a fresh Facebook account to share an email address as the honeytoken with a third-party app. We can either partner with a major email provider [35,53] or set up our own server for email accounts. We assign an email address to the Facebook account. ...
Full-text available
Online social networks support a vibrant ecosystem of third-party apps that get access to personal information of a large number of users. Despite several recent high-profile incidents, methods to systematically detect data misuse by third-party apps on online social networks are lacking. We propose CanaryTrap to detect misuse of data shared with third-party apps. CanaryTrap associates a honeytoken to a user account and then monitors its unrecognized use via different channels after sharing it with the third-party app. We design and implement CanaryTrap to investigate misuse of data shared with third-party apps on Facebook. Specifically, we share the email address associated with a Facebook account as a honeytoken by installing a third-party app. We then monitor the received emails and use Facebook’s ad transparency tool to detect any unrecognized use of the shared honeytoken. Our deployment of CanaryTrap to monitor 1,024 Facebook apps has uncovered multiple cases of misuse of data shared with third-party apps on Facebook including ransomware, spam, and targeted advertising.
... These findings are congruent with the results of this work, facilitating the identification of significant outliers in the data collected for in-depth investigation. Compared to the work of Onoalopo et al. [32], who used similar means of leaking credentials to observe the exploitation of webmail accounts, the authors observed a higher number of unique accesses for the accounts advertised in leaks, provided the access rate does not drastically fall after completion of this paper. One possible explanation addressing this divide might be that cloud storage accounts are more attractive to malicious actors who make out their targets by relying on leaks, especially those using paste sites. ...
Full-text available
Cloud infrastructures and services are of essential importance for enterprise operations. They form a central point for data storage, processing and exchange. Their information security properties are strongly associated with the protection of the most confidential and important data of enterprises. In this work a credential leak on different platforms is simulated, revealing authentication information for several accounts on a cloud application service. Each account associated with the leaks provides more authentication information for further infrastructures such as an e-mail server, an industrial control system and an enterprise-related streaming server. Additionally, a homepage was launched with information on the fictitious persons associated with the leaked accounts. Interaction with those servers is closely monitored. It was found that around one third of all trespassers conducted lateral movement and successful authentications frequently result in system enumeration and file operations. ACM Reference Format:
... This highlights a significant advantage of cookie-based account hijacking over credential-based (e.g., phishing): additional frauddetection checks employed during login [24] (e.g., IP geo-location [71], comparison of browser fingerprints [50]) are ommitted because the cookies are part of a session that has already been verified as legitimate (i.e, when the victim logged in). While certain attackers can pass geo-location checks (e.g., using an IP address near the user's location [67]), deceiving browser-based security checks is significantly more challenging. While spoofing the victim's fingerprints has been theorized [19] it has not been demonstrated in practice. ...
Conference Paper
In this paper, we focus on authentication and authorization flaws in web apps that enable partial or full access to user accounts. Specifically, we develop a novel fully automated black-box auditing framework that analyzes web apps by exploring their susceptibility to various cookie-hijacking attacks while also assessing their deployment of pertinent security mechanisms (e.g., HSTS). Our modular framework is driven by a custom browser automation tool developed to transparently offer fault-tolerance during extended interactions with web apps. We use our framework to conduct the first automated large-scale study of cookie-based account hijacking in the wild. As our framework handles every step of the auditing process in a completely automated manner, including the challenging process of account creation, we are able to fully audit 25K domains. Our framework detects more than 10K domains that expose authentication cookies over unencrypted connections, and over 5K domains that do not protect authentication cookies from JavaScript access while also embedding third party scripts that execute in the first party's origin. Our system also automatically identifies the privacy loss caused by exposed cookies and detects 9,324 domains where sensitive user data can be accessed by attackers (e.g., address, phone number, password). Overall, our study demonstrates that cookie-hijacking is a severe and prevalent threat, as deployment of even basic countermeasures (e.g., cookie security flags) is absent or incomplete, while developers struggle to correctly deploy more demanding mechanisms.
Conference Paper
The internet allows people to make social contacts and to communicate among different Social Networking Sites (SNSs). If users are ignorant of their exposures, it may reveal their identities and may enhance cyber-attacks. Hence, password secrecy is usually prioritized to protect our personal information. Besides, usages of the same usernames across many SNSs expose users’ identities to other users and intruders. Hackers can use usernames to track usage patterns and manipulate social media accounts or systems. As a result, in terms of security, usernames must be treated the as same as passwords. This empirical study illuminates the analyses of usernames’ strengths by predicting weak usernames with machine learning models to limit poor username selections. We have analyzed the Reddit usernames dataset (83958) to see how frequently people choose weak usernames for their accounts. Our predictive models correctly categorize strong and weak usernames with an average accuracy of 87%.
This article analyses the usage of software baits as an information security asset. They provided close research about honeypot types, their advantages and disadvantages, possible security breaches, configuration and overall system effectiveness. Often, the entire electronic business of the organization is at stake, and even with the most reliable system of protection, a one-hundred-per cent guarantee of invulnerability of internal company data will not be given in principle. Depending on the goals pursued by the software lure, it can have various configuration parameters, ranging from software levels that do not require large settings and ending with complex hardware complexes. Depending on the level of complexity of the bait and its capabilities, they can be classified into three groups: weak, medium, and strong levels of interaction. In addition to the purely practical application of Honeypot, described above, no less important is the other side of the issue - research. Unfortunately, one of the most pressing problems for security professionals is the lack of information. Who threatens, why they attack, how and by what means they use - these questions very often do not have a clear answer. Informed means are armed, but in the world of security such information is not enough - there are no data sources. This is a very rare scenario, as no one can even theoretically allow the possibility of using a trap as a starting point to attack other objects. If you allow Honeypot to connect to remote hosts, an attacker could attack other systems using the trap's IP address as the source of the attack, which would cause serious legal issues. This possibility may be prohibited or controlled, but if it is prohibited, it may seem suspicious to the attacker, and if it exists but is controlled, the attacker may assess the restrictions or prohibited requests based on the information received, conclude that the attacked object is a trap.
Hack-for-hire services charging $100-$400 per contract were found to produce sophisticated, persistent, and personalized attacks that were able to bypass 2FA via phishing. The demand for these services, however, appears to be limited to a niche market, as evidenced by the small number of discoverable services, an even smaller number of successful services, and the fact that these attackers target only about one in a million Google users.
Conference Paper
Full-text available
Cloud-based documents are inherently valuable, due to the volume and nature of sensitive personal and business content stored in them. Despite the importance of such documents to Internet users, there are still large gaps in the understanding of what cybercriminals do when they illicitly get access to them by for example compromising the account credentials they are associated with. In this paper, we present a system able to monitor user activity on Google spreadsheets. We populated 5 Google spreadsheets with fake bank account details and fake funds transfer links. Each spreadsheet was configured to report details of accesses and clicks on links back to us. To study how people interact with these spreadsheets in case they are leaked, we posted unique links pointing to the spreadsheets on a popular paste site. We then monitored activity in the accounts for 72 days, and observed 165 accesses in total. We were able to observe interesting modifications to these spreadsheets performed by illicit accesses. For instance, we observed deletion of some fake bank account information, in addition to insults and warnings that some visitors entered in some of the spreadsheets. Our preliminary results show that our system can be used to shed light on cybercriminal behavior with regards to leaked online documents.
Conference Paper
Full-text available
While trawling online/offine password guessing has been intensively studied, only a few studies have examined targeted online guessing, where an attacker guesses a specific victim’s password for a service, by exploiting the victim's personal information such as one sister password leaked from the victim’s another account and some personally identifiable information (PII). A key challenge for targeted online guessing is to choose the most effective password candidates, while the number of guess attempts allowed by a server's lockout or throttling mechanisms is typically very small. We propose TarGuess, a framework that systematically characterizes typical targeted guessing scenarios with seven sound mathematical models, each of which is based on varied kinds of data available to an attacker. These models allow us to design novel and effcient guessing algorithms. Extensive experiments on 10 large real-world password datasets show the effectiveness of TarGuess. Particularly, TarGuess I~IV capture the four most representative scenarios and within 100 guesses: (1) TarGuess-I outperforms its foremost counterpart by 142% against security-savvy users and by 46% against normal users; (2) TarGuess-II outperforms its foremost counterpart by 169% on security-savvy users and by 72% against normal users; and (3) Both TarGuess-III and IV gain success rates over 73% against normal users and over 32% against security-savvy users. TarGuess-III and IV, for the first time, address the issue of cross-site online guessing when given the victim’s one sister password and some PII.
Full-text available
Compromising social network accounts has become a profitable course of action for cybercriminals. By hijacking control of a popular media or business account, attackers can distribute their malicious messages or disseminate fake information to a large user base. The impacts of these incidents range from a tarnished reputation to multi-billion dollar monetary losses on financial markets. In our previous work, we demonstrated how we can detect large-scale compromises (i.e., so-called campaigns) of regular online social network users. In this work, we show how we can use similar techniques to identify compromises of individual high-profile accounts. High-profile accounts frequently have one characteristic that makes this detection reliable -- they show consistent behavior over time. We show that our system, were it deployed, would have been able to detect and prevent three real-world attacks against popular companies and news agencies. Furthermore, our system, in contrast to popular media, would not have fallen for a staged compromise instigated by a US restaurant chain for publicity reasons.
Conference Paper
Full-text available
One of the ways in which attackers steal sensitive information from corporations is by sending spearphishing emails. A typical spearphishing email appears to be sent by one of the victim's coworkers or business partners, but has instead been crafted by the attacker. A particularly insidious type of spearphishing emails are the ones that do not only claim to be written by a certain person, but are also sent by that per-son's email account, which has been compromised. Spearphishing emails are very dangerous for companies, because they can be the starting point to a more sophisticated attack or cause intellectual property theft, and lead to high financial losses. Currently, there are no effective systems to protect users against such threats. Existing systems leverage adaptations of anti-spam techniques. However, these techniques are often inadequate to detect spearphishing attacks. The reason is that spearphishing has very different characteristics from spam and even traditional phishing. To fight the spearphishing threat, we propose a change of focus in the techniques that we use for detecting malicious emails: instead of looking for features that are indicative of attack emails, we look for emails that claim to have been written by a certain person within a company, but were actually authored by an attacker. We do this by modelling the email-sending behavior of users over time, and comparing any subsequent email sent by their accounts against this model. Our approach can block advanced email attacks that traditional protection systems are unable to detect, and is an important step towards detecting advanced spearphishing attacks.
Fake identities and Sybil accounts are pervasive in today's online communities. They are responsible for a growing number of threats, including fake product reviews, malware and spam on social networks, and astroturf political campaigns. Unfortunately, studies show that existing tools such as CAPTCHAs and graph-based Sybil detectors have not proven to be effective defenses. In this paper, we describe our work on building a practical system for detecting fake identities using server-side clickstream models. We develop a detection approach that groups "similar" user clickstreams into behavioral clusters, by partitioning a similarity graph that captures distances between clickstream sequences. We validate our clickstream models using ground-truth traces of 16,000 real and Sybil users from Renren, a large Chinese social network with 220M users. We propose a practical detection system based on these models, and show that it provides very high detection accuracy on our clickstream traces. Finally, we worked with collaborators at Renren and LinkedIn to test our prototype on their server-side data. Following positive results, both companies have expressed strong interest in further experimentation and possible internal deployment.
Stylometry is a method for identifying anonymous authors of anonymous texts by analyzing their writing style. While stylometric methods have produced impressive results in previous experiments, we wanted to explore their performance on a challenging dataset of particular interest to the security research community. Analysis of underground forums can provide key information about who controls a given bot network or sells a service, and the size and scope of the cybercrime underworld. Previous analyses have been accomplished primarily through analysis of limited structured metadata and painstaking manual analysis. However, the key challenge is to automate this process, since this labor-intensive manual approach clearly does not scale. We consider two scenarios. The first involves text written by an unknown cybercriminal and a set of potential suspects. This is a standard, supervised stylometry problem made more difficult by multilingual forums that mix l33t-speak conversations with data dumps. In the second scenario, you want to feed a forum into an analysis engine and have it output possible doppelgangers, or users with multiple accounts. While other researchers have explored this problem, we propose a method that produces good results on actual separate accounts, as opposed to data sets created by artificially splitting authors into multiple identities. For scenario 1, we achieve 77% to 84% accuracy on private messages. For scenario 2, we achieve 94% recall with 90% precision on blogs and 85.18% precision with 82.14% recall for underground forum users. We demonstrate the utility of our approach with a case study that includes applying our technique to the Carders forum and manual analysis to validate the results, enabling the discovery of previously undetected doppelganger accounts.
Online accounts are inherently valuable resources, both for the data they contain and the reputation they accrue over time. Unsurprisingly, this value drives criminals to steal, or hijack, such accounts. In this paper we focus on manual account hijacking, that is, account hijacking performed manually by humans instead of botnets. We describe the details of the hijacking workflow: the attack vectors, the exploitation phase, and post-hijacking remediation. Finally, we share, as a large online company, which defense strategies we found effective to curb manual hijacking.
As web services such as Twitter, Facebook, Google, and Yahoo now dominate the daily activities of Internet users, cyber criminals have adapted their monetization strategies to engage users within these walled gardens. To facilitate access to these sites, an underground market has emerged where fraudulent accounts - automatically generated credentials used to perpetrate scams, phishing, and malware - are sold in bulk by the thousands. In order to understand this shadowy economy, we investigate the market for fraudulent Twitter accounts to monitor prices, availability, and fraud perpetrated by 27 merchants over the course of a 10-month period. We use our insights to develop a classifier to retroactively detect several million fraudulent accounts sold via this marketplace, 95% of which we disable with Twitter's help. During active months, the 27 merchants we monitor appeared responsible for registering 10-20% of all accounts later flagged for spam by Twitter, generating $127-459K for their efforts.
File hosting services (FHSs) are used daily by thousands of people as a way of storing and sharing files. These services normally rely on a security-through-obscurity approach to enforce access control: For each uploaded file, the user is given a secret URI that she can share with other users of her choice. In this paper, we present a study of 100 file hosting services and we show that a significant percentage of them generate secret URIs in a predictable fashion, allowing attackers to enumerate their services and access their file list. Our experiments demonstrate how an attacker can access hundreds of thousands of files in a short period of time, and how this poses a very big risk for the privacy of FHS users. Using a novel approach, we also demonstrate that attackers are aware of these vulnerabilities and are already exploiting them to get access to other users' files. Finally we present SecureFS, a client-side protection mechanism which can protect a user's files when uploaded to insecure FHSs, even if the files end up in the possession of attackers.