Conference PaperPDF Available

# Targeted Online Password Guessing: An Underestimated Threat

Authors:

## Abstract and Figures

While trawling online/offine password guessing has been intensively studied, only a few studies have examined targeted online guessing, where an attacker guesses a specific victim’s password for a service, by exploiting the victim's personal information such as one sister password leaked from the victim’s another account and some personally identifiable information (PII). A key challenge for targeted online guessing is to choose the most effective password candidates, while the number of guess attempts allowed by a server's lockout or throttling mechanisms is typically very small. We propose TarGuess, a framework that systematically characterizes typical targeted guessing scenarios with seven sound mathematical models, each of which is based on varied kinds of data available to an attacker. These models allow us to design novel and effcient guessing algorithms. Extensive experiments on 10 large real-world password datasets show the effectiveness of TarGuess. Particularly, TarGuess I~IV capture the four most representative scenarios and within 100 guesses: (1) TarGuess-I outperforms its foremost counterpart by 142% against security-savvy users and by 46% against normal users; (2) TarGuess-II outperforms its foremost counterpart by 169% on security-savvy users and by 72% against normal users; and (3) Both TarGuess-III and IV gain success rates over 73% against normal users and over 32% against security-savvy users. TarGuess-III and IV, for the first time, address the issue of cross-site online guessing when given the victim’s one sister password and some PII.
Content may be subject to copyright.
An Underestimated Threat
Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, Xinyi Huang
School of EECS, Peking University, Beijing 100871, China
*School of Computing and Communications, Lancaster University, United Kingdom
School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China
{wangdingg, zhangzj, pwang}@pku.edu.cn; jeff.yan@lancaster.ac.uk; xyhuang81@gmail.com
ABSTRACT
While trawling online/ofﬂine password guessing has been inten-
sively studied, only a few studies have examined targeted online
guessing, where an attacker guesses a speciﬁc victim’s password
for a service, by exploiting the victim’s personal information such
as one sister password leaked from her another account and some
personally identiﬁable information (PII). A key challenge for tar-
geted online guessing is to choose the most effective password can-
didates, while the number of guess attempts allowed by a server’s
lockout or throttling mechanisms is typically very small.
We propose TarGuess, a framework that systematically charac-
terizes typical targeted guessing scenarios with seven sound math-
ematical models, each of which is based on varied kinds of data
available to an attacker. These models allow us to design novel and
efﬁcient guessing algorithms. Extensive experiments on 10 large
real-world password datasets show the effectiveness of TarGuess.
Particularly, TarGuess IIV capture the four most representative
scenarios and within 100 guesses: (1) TarGuess-I outperforms its
foremost counterpart by 142% against security-savvy users and by
46% against normal users; (2) TarGuess-II outperforms its fore-
most counterpart by 169% on security-savvy users and by 72%
against normal users; and (3) Both TarGuess-III and IV gain suc-
cess rates over 73% against normal users and over 32% against
security-savvy users. TarGuess-III and IV, for the ﬁrst time, address
the issue of cross-site online guessing when given the victim’s one
Keywords
Password authentication; Targeted online guessing; Personal infor-
1. INTRODUCTION
Passwords ﬁrmly remain the most prevalent mechanism for user
authentication in various computer systems. To understand pass-
word security, a number of probabilistic guessing models, e.g.,
Markov n-grams [21, 25] and probabilistic context-free grammars
(PCFG) [31, 35], have been successively proposed. A common
feature of these guessing models is that they characterize a trawl-
ing ofﬂine guessing attacker who mainly works against the leaked
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for proﬁt or commercial advantage and that copies bear this notice and the full citation
on the ﬁrst page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior speciﬁc permission
and/or a fee. Request permissions from permissions@acm.org.
CCS’16, October 24-28, 2016, Vienna, Austria
© 2016 ACM. ISBN 978-1-4503-4139-4/16/10. . . $15.00 DOI: http://dx.doi.org/10.1145/2976749.2978339 password ﬁles and aims to crack as many accounts as possible. As highlighted in [16], ofﬂine guessing attacks, no matter trawling ones or targeted ones, only pose a real concern in the very limited circumstance: the server’s password ﬁle is leaked, the leakage goes undetected, and the passwords are also properly hashed and salted. Recent research [7, 16] has realized that it should be the role of websites to protect user passwords from ofﬂine guessing by securely storing password ﬁles, while normal users only need to choose passwords that can survive online guessing. Online guessing can be launched against the publicly facing server by anyone using a browser at anytime, with the primary constraint being the number of guesses allowed. Trawling online guessing mainly exploits users’ behavior of choosing popular pass- words [22, 34], and it can be well addressed by various security mechanisms at the server (e.g., suspicious login detection [14], rate-limiting and lockout [18]). However, targeted online guessing (see Fig. 1) can exploit not only weak popular passwords, but also passwords reused across sites and passwords containing personal information. This is a serious security concern, since various Personally Identiﬁable Information (PII) and leaked passwords be- come readily available due to unending data breaches [2, 3, 17]. For instance, the most recent large-scale PII data breach in April 2016 [3] involves 50 million Turkish citizens, accounting for 64% of the population. According to the CNNIC 2015 report [1], over 78.2% of the 668 million Chinese netizens have suffered PII data leakage. In a series of recent breaches, over 253 million American netizens become victims of PII and password leakage [27]. This indicates that the existing password creation rules (e.g., [15, 28]) and strength meters (e.g., [24,32]) grounded on these trawling guessing models [21, 25, 31, 35] can mainly accommodate to the limited ofﬂine guessing threat, taking no account of the targeted online guessing threat which is increasingly more damaging and realistic. This misplaced research focus largely attributes to the failure (see [7, 33]) of the academic world to identify the crux of current practices and to suggest convincingly better password solutions than current practices to lead the industrial world. The main challenge for targeted online password guessing is to effectively characterize an attacker As guessing model, with multi- ple dimensions of available information (see Fig. 2) well captured, while the number of guesses allowed to Ais small – the NIST Authentication Guideline [18] requires Level 1 and 2 systems to keep login failures less than 100 per user account in any 30-day period. The following explains why it is a challenge. First, people’s password choices vary much among each other. When creating a password, some people reuse an existing pass- word, and some modify an existing password; Some incorporate PII into their passwords, yet others do not; Some favor digits, some favor letters, and so on. Thus, a user population’s passwords created for a given web service can differ greatly. Therefore, the In response to the results of this work, NIST 800-63-3 has been revised regarding the threat of online password guessing. The NIST staff notified us about the revisions on 19th Sep., 2016 by email http://bit.ly/2cTOTUk . Figure 1: Targeted online guessing. Figure 2: Multiple info for A. trawling guessing models [21, 25, 31, 35], which aim to produce asingle guess list for all users, are not suitable for characterizing targeted online guessing. Second, users’ PII is highly heterogeneous. Some kinds of PII (e.g., name, and hobby) are composed of letters, some (e.g., birthday and phone number) are composed of digits, and some (e.g., user name) are a mixture of letters, digits and symbols. Some PII (e.g., name, birthday and hobby), as shown in Fig. 2, can be directly used as password components, while others (e.g., gender and education) cannot. As we will show, most of them have an impact on people’s password choice. Thus, it is challenging to, at a large-scale, automatically incorporate such heterogeneous PII into guessing models when the guess attempts allowed is limited. Third, users employ a diversiﬁed set of transformation rules to modify passwords for cross-site reuse. As shown in [12, 32], when given a password, there are over a dozen transformation rules, such as insert, delete, capitalization and leet (e.g., password passw0rd) and the synthesized ones (e.g., password Passw0rd1), that a user can utilize to create a new password. How to prioritize these rules for each individual user is not easy. Moreover, which transformation rules users will apply for pass- word reuse are often context dependent. Suppose attacker Atargets Alice’s eBay account which requires passwords of length 8+, and knows that Alice is in her 30s. With access to a sister pass- word Alice1978Yahoo leaked from Alice’s Yahoo account, A will have a higher chance by guessing Alice1978eBay than by Alice1978 due to the inertia of human behaviors. Yet, when Alice’s leaked password is 123456,Awould more likely succeed by guessing Alice1978 than by Alice1978eBay. When site password policies are also considered, the situation may further vary. Such context dependence necessitate an adaptive, semantics- aware cross-site guessing model. 1.1 Related work Zhang et al. [37] suggested an algorithm for predicting a user’s future password with previous ones for the same account. Das et al. [12] studied the password reuse issue, and proposed a cross-site cracking algorithm. However, their algorithm is not optimal for targeted online guessing for four reasons. First, it does not consider common popular passwords (e.g., iloveyou, and pa$$w0rd) which do not involve reuse behaviors or user PII. Second, it as- sumes that all users employ the transformation rules in a ﬁxed priority. Yet, as we observe, this priority is actually dynamic and context-dependent. Third, their algorithm does not consider various synthesized rules. Fourth, it is heuristics based. Li et al. [20] examined how user’s PII may impact password security, and found that 60.1% of users incorporate at least one kind of PII into their passwords. They proposed a semantics-rich algorithm, Personal-PCFG, which considers six types of personal information: name, birthdate, phone number, National ID, email address and user name. However, as we will show, its length- based PII matching and substitution approach makes it inaccurate to capture user PII usages, greatly hindering the cracking efﬁciency. Our TarGuess-I manages to overcome this issue by using a type- based PII matching approach and gains drastic improvements. 1.2 Our contributions In this work, we make the following key contributions: A practical framework. To overcome the challenges dis- cussed above, we propose TarGuess, a practical framework to characterize typical targeted online guessing attacks, with sound probabilistic models (rather than ad hoc models or heuristics). TarGuess captures seven typical attacking sce- narios, with each based on a different combination of various information available to the attacker. Four probabilistic algorithms. To model the most repre- sentative targeted guessing scenarios, we propose four al- gorithms by leveraging probabilistic techniques including PCFG, Markov and Bayesian theory. Our algorithms all signiﬁcantly outperform prior art. We further demonstrate how they can be readily employed to deal with the other three remaining attacking scenarios. An extensive evaluation. We perform a series of experi- ments to demonstrate that both the efﬁcacy and general ap- plicability of our algorithms. Our empirical results show that an overwhelming fraction of users’ passwords are vulnerable to our targeted online guessing. This suggests that the danger of this threat has been signiﬁcantly underestimated. New insights. For example, Type-based PII-tags are more effective than length-based PII-tags in targeted guessing. Simply incorporating many kinds of PII into algorithms will not increase success rates, which is counter intuitive. The success rate of a guess decreases with a Zipf’s law as the rank of this guess in the guess list increases. 2. PRELIMINARIES We now explicate what kinds of user personal information are considered in this work and elaborate on the security model. 2.1 Explication of personal information The most prominent feature that differentiates a targeted guess- ing attack from a trawling one is that, the former involves user- speciﬁc data, or so-called “personal info”. This term is sometimes used inter-changeably with the term “personally identiﬁable info” [10, 20], while sometimes their deﬁnitions vary greatly in different situations, laws, regulations [23, 29]. Generally, a user’s personal info is “any info relating to” this user [29], and it is broader than PII. For better comprehension, in Table 1we provide the ﬁrst classiﬁcation of personal info in the case of password cracking, making a systematical investigation of targeted guessing possible. We divide user personal information into three kinds, with each kind having a varied degree of secrecy, different roles in passwords and various types of speciﬁc elements. The ﬁrst kind is user PII (e.g., name and gender), which is natively semipublic: public to friends, colleagues, acquaintances, etc., yet private to strangers. The second kind is user identiﬁcation credentials, and parts of them (e.g., user name) are public, while parts of them (e.g., password) are exclusively private. The remaining user personal data falls into the third kind and is irrelevant to this work. We further divide user PII into two types: Type-1 and Type-2. Type-1 PII (e.g., name and birthday) can be the building blocks of passwords, while Type-2 PII (e.g., gender and education [22]) may impact user behavior of password creation yet cannot be directly used in passwords. Each type of PII shapes our guessing algorithms quite distinctly. Here we highlight a special kind of user personal information — a user’s passwords at various web services. As shown in [12, 32], users tend to reuse or modify their existing passwords at other sites (called sister passwords) for new accounts. However, such sister passwords are becoming more and more easily available due to the unending catastrophic password ﬁle leakages (see [2, 4,27]). Table 1: Explication of user personal info (NID stands for National identiﬁcation number, e.g., SSN; PW for password) Different kinds of personal info Degree of secrecy Roles in PWs Considered in this work(X) Not Considered in this work(×) Personally identiﬁable Type-1 Semipublic Explicit Name, Birthday, Phone number, NID Place of birth, Likes, Hobbies, etc. information (PII) Type-2 Semipublic Implicit Gender, Age, Language Faith, Disposition, Education, etc. User identiﬁcation credentials Private Explicit Passwords, Personal Identiﬁcation Numbers Finger prints, Private keys, etc. Public Explicit User name, Email address Debit card number, Health IDs, etc. Other kinds of personal data Employment, Financial records, etc. Table 2: A summary of the four most representative scenarios of targeted online guessing Attacking scenario Exploiting public information Exploiting user personal information Existing literature Our model (e.g., datasets and policies) One sister password Type-1 PII Type-2 PII Trawling #1 XRef. [21, 25, 35] Targeted #1 X X Ref. [20] TarGuess-I Targeted #2 XRef. [12] X X None TarGuess-II Targeted #3 X X X None TarGuess-III Targeted #4 X X X X None TarGuess-IV As public password datasets are readily available, TarGuess-II and [12] is comparable because they exploit the same type of user PII. A total of 7(=C1 3+C2 3+C3 3) scenarios result from combining the three types of personal info. With TarGuess-IIV, all 7 cases will be tackled in Sec. 4. 2.2 Security model Without loss of generality, in this work we mainly focus on the client-server architecture, the most common case of user au- thentication, as shown in the right of Fig. 1. There are three entities involved in a targeted online guessing attack: a user U, an authentication server Sand an attacker A. User Uhas registered a password account at the server S. This password is only known to S, though U’s passwords at other sites may have already been publicly disclosed. Smay be remote (e.g., an e-commerce site) or local (e.g., a password-protected mobile device). To be realistic, we assume that Senforces some security mechanisms such as suspicious login detection and lockout [14,18], and thus the number of guesses allowed to Ais limited (e.g., 102 [8, 18]). Aknows some amount of personal info about U, and may be a curious friend, a jealous wife, a blackmailer, or even an evil hacker group that buys personal info from the underground market. As there is a messy mixture of multiple dimensions of info (see Fig. 2) potentially available to the attacker A, it is challenging to characterize A. We tackle this issue by assuming that all the public info (e.g., leaked PW lists and site policies) should be available to A, and then by deﬁning a series of attacking scenarios (see Table 2) based on varied types of U’s personal info given to A. This is reasonable: (1) Ais smart and likely to exploit the readily available public info to increase her chance; and (2) Awould use different attacking strategies when given different personal info. Once A has successfully guessed the password, the victim’s sensitive info can be disclosed, reputation could be ruined (see [36]), password account may be hijacked and money might be lost (see [26]). Note that, here we only consider scenarios where Ais with at most one sister password of user U. The underlying reason is that, among the 547.56M of leaked password accounts that we have collected over a period of six years, less than 1.02% (resp. 1.73%) of them have more than one match by email (resp. user name). Similarly, among the 7.96M accounts collected by Das et al. in 2014 [12], only 152 (0.00191%) of them have more than one match by email. Therefore, it is realistic to assume that most users have leaked one sister password, and Acan exploit Us this sister password for attacking. 3. HUMAN BEHAVIORS OF PASSWORD CREATION Here we report a large-scale empirical study of human behaviors in creating passwords, in particular, how often they choose popular passwords, how often to reuse passwords, how often to make use of their own PII. Table 3: Basic information about our 10 password datasets Dataset Web service Language When leaked Total PWs With PII Dodonew E-commerce Chinese Dec., 2011 16,258,891 CSDN Programmer Chinese Dec., 2011 6,428,277 126 Email Chinese Dec., 2011 6,392,568 12306 Train ticketing Chinese Dec., 2014 129,303 X Rockyou Social forum English Dec., 2009 32,581,870 000webhost Web hosting English Oct., 2015 15,251,073 Yahoo Web portal English July, 2012 442,834 Rootkit Hacker forum English Feb., 2011 69,418 X XiaomiMobile, cloud Chinese May, 2014 8,281,385 Xato Synthesised English Feb., 2015 9,997,772 Xiaomi passwords are in salted-hash and will be used as real targets. Table 4: Basic information about our personal-info datasets Dataset Language Number of Items Types of PII useful for this work Hotel Chinese 20,051,426 Name, Gender, Birthday, Phone, NID 51job Chinese 2,327,571 Email, Name, Gender, Birthday, Phone 12306 Chinese 129,303 Email, User name, Name, Gender, Birth- day, Phone, NID Rootkit English 69,324 Email, User name, Name, Age, Birthday 3.1 Our datasets Our evaluation builds on ten large real-world password datasets (see Table 3), including ﬁve from English sites and ﬁve from Chinese sites. They were hacked by attackers or leaked by insiders, and disclosed publicly on the Internet, and some of them have been used in trawling password models [13, 19, 21]. Rootkit initially contains 71,228 passwords hashed in MD5, and we recover 97.46% of them by using our TarGuess-IV and various trawling guessing models [21, 30] in one week. In total, these datasets consist of 95.83 million plain-text passwords and cover various popular web services. The role of each dataset will be speciﬁed in Sec. 5. In particular, two of these ten password datasets contain various types of PII as shown in Table 4. Besides, we further employ two auxiliary PII datasets, aiming to augment the password datasets by matching the email address to facilitate a more comprehensive understanding of the role of PII in user-chosen passwords. While most of the PII attributes in Chinese PII-associated datasets are available, 17.90% of names and 54.04% of birthdays in Rootkit are null. These missing attributes may hinder the effectiveness of targeted attacks against Rootkit users. To the best of knowledge, our corpus is the largest and most diversiﬁed ever collected for evaluating the security threat of targeted online guessing. 3.2 Popular passwords Table 5shows how often users from different services choose popular passwords. It is disturbing that 0.79%10.44% of user- chosen passwords can be guessed by just using the top 10 pass- words. Generally, top Chinese passwords are more concentrated than English ones [34], which may imply that the former would be Table 5: Top-10 most popular passwords of each service Rank Dodonew CSDN 126 12306 Rockyou 000webhost Xato Yahoo Rootkit 1 123456 123456789 123456 123456 123456 abc123 123456 123456 123456 2 a123456 12345678 123456789 a123456 12345 123456a password password password 3 123456789 11111111 111111 5201314 123456789 12qw23we 12345678 welcome rootkit 4 111111 dearbook password 123456a password 123abc qwerty ninja 111111 55201314 00000000 000000 111111 iloveyou a123456 123456789 abc123 12345678 6 123123 123123123 123123 woaini1314 princess 123qwe 12345 123456789 qwerty 7 a321654 1234567890 12345678 123123 1234567 secret666 1234 12345678 123456789 8 12345 88888888 5201314 000000 rockyou YfDbUfNjH10305070111111 sunshine 123123 9 000000 111111111 18881888 qq123456 12345678 asd123 1234567 princess qwertyui 10 123456a 147258369 1234567 1qaz2wsx abc123 qwerty123 dragon qwerty 12345 % of top-10 3.28% 10.44% 3.52% 1.28% 2.05% 0.79% 1.46% 1.01% 3.94% The letter-part (i.e., YfDbUfNjH) can be mapped to a Russian word which means “navigator”. Why it is so popular is beyond our comprehension. more prone to online guessing. While most of the top Chinese pass- words are only made of simple digits, popular English ones tend to be meaningful letter strings or keyboard patterns. Love plays an important role — iloveyou and princess are among the top- 10 lists of two English sites, while 5201314 and woaini1314, both of which sound as “I love you forever and ever” in Chinese, are among the top-10 lists of three Chinese sites. Other factors such as culture (see 18881888) and site name (see rockyou and rootkit) also show their impacts on password creation. Figure 3: Fraction of PWs shared between two sites. Fig. 3illustrates the fraction of top-kpasswords shared between two different services with varied thresholds of k. Generally, the fraction of shared passwords from the same language is substantial- ly higher than that of shared passwords from different languages. In addition, the fraction of shared passwords between any two services is less than 60% at any threshold klarger than 10. This implies that both language and service play an important role in shaping users’ top popular passwords. Rockyou and 000webhost share signiﬁcantly fewer common passwords than other pairs do. We examine these two datasets and ﬁnd that 99.29% of 000webhost passwords include both letters and digits, indicating that this site enforces a password creation policy that requires passwords to include both letters and digits. This can also be corroborated by Table 5where all top-10 000webhost passwords are composed of both letters and digits. Similarly, we ﬁnd that CSDN requires passwords to be of length 8+. 3.3 Password reuse While users have to maintain probably several times as many password accounts as they did 10 years ago, human-memory ca- pacity remains stable. As a result, users tend to cope by reusing passwords across different services [16,32]. Several empirical stud- ies [5, 12] have explored the password reuse behaviors of English and European users, yet as far as we know, no empirical results have been reported about Chinese users, who reached 668 million by Dec., 2015 [11] and account for about 25% (and the largest fraction) of the world’s Internet population. To ﬁll this gap, we intersect 12306 with Dodonew by matching email, and further eliminate the users with identical password pairs. This produces a new list 12306&Dodonew with two non-identical sister passwords for each user. Similarly, we obtain two more intersected Chinese password lists and three intersected English lists as shown in Fig. 4. During the matching process, we ﬁnd that 34.02%71.11% of Chinese users’ sister password pairs are identical (and thus are eliminated), while these ﬁgures for English users are 6.25%21.96% (see Sec. 5.1). This suggests that our English users reuse less. Figure 4: Using the Levenshtein-distance similarity metric to measure the similarity of two passwords chosen by the same user across different services. Results suggest that most users modify passwords in a non-trivial way. We employ the widely accepted Levenshtein-distance metric to measure the similarity between two different passwords of a given user. Fig. 4shows that, sister passwords of Chinese users generally have higher similarity than English users, implying that Chinese users modify passwords less complexly. About 30% of the non-identical Chinese password pairs have similarity scores in [0.7, 1.0], while this ﬁgure for our English password pairs is less than 20%. We also employ the longest-common-subsequence metric for measurement. Both metrics show similar results. Our results imply that the majority of users modify passwords in a non- trivial approach, and it would be challenging to model such users’s modiﬁcation behaviors. We have observed that our English users reuse less and modify passwords more complexly. A plausible reason for this observation is that the two english sites are not normal: Rootkit is a hacker fo- rum and 000webhost is mainly used by web administrators. There- fore, the users of both sites are likely to be more security-savvy than normal users. Thus, the lists Rootkit&000webhost, Rootkit&Yahoo and 000webhost&Yahoo will show more secure reuse behaviors than that of normal English/Chinese users. In 2014, Das et al. [12] found that the fraction of identical sister PW pairs of normal English users is 43%, which roughly accords with our Chinese users yet 26 times higher than our English users. They also showed that about 30% of their non-identical English PW pairs have similarity scores in [0.7, 1.0], well in accord with that of our Chinese users. Moreover, the survey results on password reuse behaviors of normal Chinese users [32] are largely consistent with the survey results on normal English users [12]. Both empirical and survey results suggest that normal Chinese and English users have similar reuse behaviors, while our English users would be good representatives of security-savvy users. Table 6: Percentages of users building passwords with (and only with) their own heterogeneous personal information Typical usages of personal information (examples) PII-Dodonew PII-126 PII-CSDN PII-12306 PII-Rootkit PII-Yahoo PII-000web- (161,510) (30,741) (77,439) (129,303) (69,330) (214 ) host(2,950) Full_name (lei wang,john smith) 4.68 0.82 3.00 1.32 4.85 1.81 5.02 1.13 1.38 0.75 2.34 1.87 2.44 1.32 Family_name (wang,smith) 11.15 0.01 6.16 0.00 9.75 0.00 11.23 0.00 2.28 0.78 4.67 1.87 3.73 1.46 Given_name (lei,john) 6.49 0.07 4.10 0.12 6.26 0.08 6.61 0.07 0.49 0.07 0.93 0.00 0.75 0.20 Abbr. full_name (wl,lwang,js,jsmith) 13.64 0.02 6.36 0.00 9.42 0.00 13.13 0.00 0.15 0.01 0.00 0.00 0.20 0.00 Birthday(19820607,06071982,07061982) 3.12 1.00 3.70 2.77 6.29 5.16 4.33 1.77 0.08 0.06 0.47 0.00 0.10 0.07 Year of bithday (1982) 8.92 0.00 8.84 0.01 11.37 0.00 10.78 0.00 0.75 0.01 1.40 0.00 1.12 0.00 Date of bithday (0607,0706) 8.32 0.00 10.48 0.02 11.84 0.00 10.03 0.00 0.44 0.01 0.47 0.00 0.58 0.00 Abbr. bithday(198267,671982,761982,820607,060782) 2.37 0.59 2.60 1.71 2.89 1.45 3.31 1.12 0.10 0.05 0.00 0.00 0.20 0.14 Family_name+bithday (wang19820607,smith06071982) 0.08 0.08 0.05 0.05 0.03 0.03 0.14 0.14 0.00 0.00 0.00 0.00 0.00 0.00 Family_name+Abbr. bithdayÀ(wang198267,smith671982) 0.11 0.11 0.03 0.02 0.05 0.05 0.15 0.14 0.00 0.00 0.00 0.00 0.00 0.00 Family_name+Abbr. bithdayÁ(wang820607,smith060782) 0.17 0.17 0.07 0.07 0.13 0.11 0.17 0.16 0.00 0.00 0.00 0.00 0.00 0.00 Family_name+year of birth (wang1982,smith1982) 0.55 0.22 0.20 0.07 0.22 0.07 0.64 0.25 0.01 0.00 0.00 0.00 0.00 0.00 Family_name+date of birth (wang0607,smith0607) 0.12 0.09 0.05 0.03 0.08 0.04 0.16 0.12 0.01 0.00 0.00 0.00 0.00 0.00 User name (icemoon12,bluebirdz) 1.54 1.14 0.54 0.38 0.61 0.43 1.96 1.32 1.59 0.92 2.34 1.40 2.20 1.32 Email_preﬁx (l0veu4ever@example.com) 5.07 3.07 2.52 1.60 4.35 2.48 3.03 1.82 0.77 0.44 4.21 1.87 1.32 0.78 Phone number (11-digit Chinese mobile number 13511336677) 0.10 0.10 0.48 0.45 0.50 0.45 0.07 0.01 ‘a’+birthday(a19820607,a06071982,a07061982) 0.16 0.13 0.04 0.02 0.03 0.02 0.16 0.12 0.00 0.00 0.00 0.00 0.00 0.00 Full_name+1 (wanglei1,johnsmith1) 1.49 0.22 0.51 0.03 0.84 0.03 1.65 0.17 0.06 0.01 0.00 0.00 0.03 0.00 All the decimals in the table use ‘%’ as the unit. For instance, 4.68 in the top left corner means that 4.68% of the 161,510 PII-associated Dodonew users employ their full name to build passwords; 0.82 means that 0.82% of these 161,510 Dodonew users’ passwords are just their full names. (a) Gender on freq. distribution. (b) Age on length distribution. Figure 5: Impact of type-2 PII on user password creation. Both gender and age show tangible impacts. 3.4 Password containing personal info We show in Table 6how often users employ their own PII to build passwords. Since some password lists have no PII (see Table 3), we correlate them with the PII datasets of the same language in Table 4by matching email. As a result, seven PII-associated password lists are produced, and they are much more diversiﬁed than those in [20]. The sample size of each PII-associated dataset is shown in the ﬁrst row of Table 4. As expected, highly heteroge- neous PII becomes components of passwords, and users like to use names, birthdays and their variations. Particularly, a non-negligible fraction of users employ just their full names (0.75%1.87%) as passwords, and 1.00%5.16% of Chinese users use just their birthdays as passwords. Surprisingly, email and user name prevail in passwords of both user groups, ranging from 0.77% to 5.07% and from 0.54% to 2.34%, respectively. In comparison, English users exhibit a more secure behavior in PII usages, for our English users represent security-savvy ones. Fig. 5illustrates the impact of type-2 PII : (1) passwords of Dodonew female users are more concentrated; (2) passwords of Dodonew users in age24 and age46 have quite similar length distributions (pairwise χ2test, p-value= 0.009), while users in age 2545 are signiﬁcantly different in length distributions (p- values<106). Similar results are found in all other datasets. Type-based PII matching. To achieve accuracy in PII recognition, we propose a type-based PII segment matching method: besides the traditional PCFG-based L,D,Stags [35], we employ a few kinds of PII tags (e.g., Nfor name and Bfor birthday), and each subscript number of our PII tags stands for a particular sub-type of one kind of PII considered. For instance, N1denotes the usage of family name (e.g., li), B5denotes the usage of year in birthday (e.g., 1982) and so on. More details will be given in Sec. 4.1. This is inherently different from the length-based PII matching method given in an independent study [20]. To avoid mismatching, only PII segments with len 3are considered in [20]. For instance, a match with any length 3+substring (e.g., 195,952, 520) of a birthday 19520123 will be considered as a birthday match. However, this introduces both under-estimations and over- estimations in PII matching. For example, the password li.520 of a user named “Wei Li” with birthday 19520123” will be tagged as L2S1Birth3, because the family name li is of length <3. As 20% of the top-50 Chinese family names are with length <3 (e.g., li,wu and he), a large fraction of users’ name usages may be under-estimated by [20]. For instance, 30,926 (23.9%) of the 13K 12306 users are with a family name len 2, and 4,346 of these 30,926 users indeed use their family name in passwords, yet this fact cannot be captured in [20]. On the other hand, the segments (e.g., 123,520 and 201) in top popular digital passwords (e.g., 123456,123456789, 5201314) would often coincide with user birthdays and phone numbers, leading to over-estimations of their usages in passwords. As we will show in Sec. 4.1, this length-based matching method also introduces a weakness in the guess generation process when performing cracking, while either increasing or decreasing their length threshold will not eliminate the problem. Summary. Our PII-associated password corpus is so far the largest and most diversiﬁed ever collected for evaluating targeted online guessing. Particularly, it, for the ﬁrst time, covers (security-savvy) English users. While users’ three vulnerable behaviors might be potentially exploited to improve cracking, our results show that varied circumstances (e.g., language, service and policy), non- trivial transformation rules and highly heterogeneous PII all would make it a challenging task to automate this process, especially when given a limited guessing number (e.g., 100 by NIST [8,18]). 4. TARGUESS: A FRAMEWORK FOR TARGETED ONLINE GUESSING We now propose TarGuess, a practical framework that effec- tively addresses the realistic yet challenging problem of modeling various targeted online guessing scenarios. As shown in Fig. 6, TarGuess consists of three phases (i.e. preparing, training and guessing). The design of the ﬁrst and third phases is straightforward, and the main task lies in the second one. TarGuess captures four types of the most representative targeted online guessing scenarios, with each type based on varied kinds of personal information available to A(see Table 2): (i) only Figure 6: An architectural overview of the TarGuess. Figure 7: TarGuess-I: an illustration. type-1 PII; (ii) one sister password; (iii) combination of iand ii; (iv) combination of iii and type-2 PII. To model these four scenarios, we suggest four guessing models (IIV) by leveraging a number of probabilistic techniques such as PCFG, Markov and Bayesian theory. We also show that, with TarGuess-IIV, the three remaining scenarios can also be well addressed. 4.1 TarGuess-I TarGuess-I aims to online guess a user Us passwords by ex- ploiting U’ some type-1 PII (e.g., name and birthday, not gender). It builds on Weir et al.’s PCFG-based algorithm [35] which has been shown a great success in dealing with trawling guessing scenarios. In the training phase of [35], each password is seen as a combination of letter(L)-, digit(D)- and symbol(S)- segments. For example, loveyou@1314 is parsed into the L-segment “lovey- ou”, S-segment “@” and D-segment “1314”, and its base structure is L7S1D4;wanglei@1982 is also parsed into L7S1D4. Our new algorithm. To capture PII semantics, besides the L,D, Stags as with PCFG [35], we introduce a number of type-based PII tags (e.g., N1N7and B1B10 ). For a type-based PII tag, its subscript number stands for a particular sub-type of one kind of PII usages but not the length matched, as opposed to the L, D,Stags. For instance, Nstands for name usages, while N1for the usage of full name, N2for the abbr. of full name (e.g., lw from “lei wang”), · · · ;Bstands for birthday usages, B1for full birthday in YMD format (e.g., 19820607), B2for full birthday in MDY, · · · . This gives rise to a PII-enriched context-free grammar GI= (V,Σ,S,R), where: 1) S ∈ V is the start symbol. 2) V={S;L,D,S;N1,···,N7;B1,· · · ,B10 ;A1,A2,A3;E1, E2,E3;P1,P2;I1,I2,I3} is a ﬁnite set of variables,1where: (1) N1and N2have been speciﬁed earlier, N3for family name (e.g., wang), N4for given name, N5for the 1st letter of the given name + family name (e.g., lwang), N6for last name+ the 1st letter of the given name (e.g., wangl), N7for family name with its 1st letter capitalized (e.g., Wang); (2) B1and B2have been speciﬁed earlier, B3stands for full birthday in DMY (e.g., 07061982), B4for the date in birthday, B5for the year in birthday, B6for Year+Month (e.g., (e.g., 198206), B7 1The number of PII-based variables and their speciﬁc deﬁnitions depend on the nature of the PII to be trained (e.g., phone number in US is 10 digits while 11 digits in China) and on the granularity the attacker Aprefers (e.g., Amay prefer 4 types of name usages but not 7 as we do). Here we give a typical deﬁnition for attacking Chinese users, and it is easily tailored to other user groups. Besides, it’s feasible to generalize GIby pre-deﬁning a number of running-modes: Chinese-mode, US-mode, German-mode, etc. Then, GIneeds no customization, and the inputs to the algorithm TarGuess- I are the victim’s PII-attributes plus a running-mode. for Month+Year (e.g., 061982), B8for the last two digits of year + date in MD format (e.g., 820607), B9for date in MD format + the last two digits of year (e.g., 060782), B10 for date in DM format + the last two digits of year (e.g., 070682); (3) Astands for account name usages, A1for full account name (e.g., icemoon12), A2for the (ﬁrst) letter-segment of account name (e.g., icemoon), A3for the (ﬁrst) digital- segment of account name (e.g., 12); (4) Estands for email preﬁx usages, E1for the full email preﬁx (e.g., loveu1314 from loveu1314@example.com), E2for the ﬁrst letter- segment of email preﬁx (e.g., loveu), E3for the ﬁrst digital- segment of account name (e.g., 1314); (5) Pstands for mobile phone number usages, P1for the full number, P2for the ﬁrst three digits, P3for last four digits; (6) Istands for the Chinese Notional Identiﬁcation number, I1for the last 4 digits, I2for the ﬁrst 3 digits, I3for the ﬁrst 6 digits.2 3) Σ={95 printable ASCII codes, N ull}is a ﬁnite set disjoint from Vand contains all the terminals of GI. 4) Ris a ﬁnite set of rules of the form Aα, with A∈ V and α∈ V Σ(see Fig. 7). Figure 8: A comparison of TarGuess-I (and its variants) with Personal-PCFG [20], trained on the 50% of 12306 dataset and tested on the remaining 50%. Both TarGuess-I and Personal- PCFG [20] employ six kinds of the 12306 type-1 PII, while TarGuess-Ieliminates phone # and NID, TarGuess-I′′ further eliminates email and user name, and TarGuess-I′′′ further eliminates birthday. TarGuess-I and Igreatly outperform [20]. A probabilistic context-free grammar (PCFG) is a CFG that, for a speciﬁc left-hand side (LHS) variable (e.g., L4), all the proba- bilities associated with its rules (e.g., L4love and L4 Suny) can add up to 1 [35]. This condition is satisﬁed by GI. 2Our deﬁnitions are gained by recursively adjusting and training on the PII- associated 126 dataset. To follow the basic machine learning principles, this dataset hereafter will never be used as the test set. It is not difﬁcult to see that the training phase of TarGuess-I can indeed automatically derive a PCFG. For more background, see [35]. Using this grammar, in the guess generation process (see Fig. 7) we can further derive: (1) all the terminals (e.g., love@1314) which are instantiated from base structures that only consist of L,D and Stags; and (2) all the pre-terminals (e.g., N3B5and N31234) which are intermediate guesses consisting of PII-based tags. The ﬁnal guess candidates come from these terminals as well as from instantiating all the pre-terminals with the victim’s PII. Note that, to improve accuracy, we match using the longest- preﬁx rule and also only consider PII-segments with len 2. For example, if john06071982 matches John Smith’s account name “john0607”, it will be parsed into A1B5using the longest- preﬁx rule, but not N3B2. In addition, we have only considered full MMDD dates in the deﬁnition of B1B10, yet many users tend to use an abbr. of date when possible (e.g., “198267” instead of “19820607”). Thus, when matching a birthday-based segment in the training phase, if an abbreviation happens, the tag related to the corresponding full segment will be counted by one; In the password generation process, both full and abbreviated date segments will be produced. For instance, both “john06071982” and “john671982” will be produced if the structure N3B2is used for guess generation. Our type-based PII tags are widely applicable. In the above, we have shown how to employ type-based PII tags to build a semantic- aware grammar using PCFG. Actually, they can also be employed by various other guessing algorithms (e.g., Markov-based [21] and TarGuess-II in Sec. 4.2) to build PII-enriched cracking algorithms. For instance, to build a PII-enriched Markov-based algorithm, we only need to incorporate the type-based PII tags {N1,···,N7;B1, ···,B10;A1,A2,A3;E1,E2,E3;P1,P2;I1,I2,I3} into the alphabet Σ(e.g., Σ = {95 printable ASCII characters}in [21]) of the Markov n-gram model, and then all operations for these type- based PII tags are the same with the original characters in Σ. GIis highly adaptive. On the one hand, whenever we want to consider new semantic usages (e.g., website name) or new type-1 PII usages (e.g., hobby), we can simply deﬁne new corresponding type-based tags (e.g., Wfor website name and Hfor hobby), the same as we deﬁne the Nand Btags. In the training and guess generation phases, all the operations related to Hand Ware similar to that of Nand Btags. It is “Plug-and-Play”. On the other hand, even if TarGuess-I deﬁnes the Btag yet the training set has no birthday information, GIstill works properly—it will not parse passwords using Btags and simply parse birthday information in passwords using the Dtag. That is, GIis “self-dumping” and we do not need to specially eliminate the B-related tags in such cases. An independent study. In a recent paper (published in April 2016), Li et al. [20] presented a length-based PII matching method. Our work is independent from theirs. Besides the L,D,Stags in PCFG [35], Li et al. introduced six kinds of PII tags: Nfor name, Bfor birthdate, Efor email, Afor account (user) name, Pfor phone number, and Ifor NID. In contrast to our type-based approach, each PII-based tag in [20] uses a sub- script number to denote the length len of the matched PII segment (only len 3are considered in [20]), following the same approach of the L,Dand Stags as in PCFG [35]. As a result, in [20], wanglei@1982 now is parsed into the Nsegment “wanglei”, S segment “@” and Bsegment “1982”, and its base structure will be N7S1B4; “loveyou@1314” is parsed into L7S1D4. Within 100 guesses, their “Personal-PCFG” algorithm cracked about 17% of their test dataset by using “perfect dictionaries”. However, we discovered a weakness in their Personal-PCFG. Their algorithm uses length-based tags, the same as Weir et al.’s algorithm [35]: it differentiates a segment’s length, but is insensi- tive to a segment’s subtype. For example, both john@1982 and wang@1982 will be parsed into N4S1B4, because both “wang and “john” are of length 4, despite the fact that “wang” is a family name while “john” is a given name. As shown in Fig. 9, this not only introduces both under-estimations and over-estimations in the training phase, but also leads to illogical situations in the guess generation phase. In the training phase, since “wang”, “smith and “li” are all family names and “1982” is a user’s year of birth, the probability of N4B4shall be 0.6 but not 0.4, the probability of L2B4shall be 0 but not 0.2. In the guess generation phase, “lee is of length 3, but there is no base structure N3B4available, and thus the guess lee1977 will be given a probability 0 and thus not be generated by Personal-PCFG. Figure 9: A weakness of Personal-PCFG [20]. According to our grammar GI, both wang1982 and li1982 will be parsed into the same base structure N3B5,john1982 is parsed into N4B5, and thus the guess lee1977 can be generated using the base structure N3B5for “David Lee” born in 1977. This addresses the weakness in [20]. Evaluation. For fair comparison, we leverage the 12306 dataset as with [20] and follow their experimental setups (see Fig. 8) as closely as possible. The only exception is that, we do not use “the perfect (L-) dictionary” which is collected directly from the test set, because this not only introduces overﬁtting issues [21, 35] but also is unrealistic in practice. Instead, all our experiments directly learn the L- dictionary from the 12306 training set, a recommended practice in password cracking [13, 21]. Fig. 8shows that, within 10103guesses, our TarGuess-I outperforms Persona-PCFG [20] by 37.11%73.33%, and outperforms the three trawling online guessing algorithms [6, 21, 35] by at least 412%740%. We have further examined to what extent each individual PII would impact TarGuess-I. As shown in Fig. 8, within 100 guesses, our TarGuess-I can successfully crack a 12306 user’s password with an average chance of 20.26% when given this user’s email, account name, name, birthday, phone number and NID. This ﬁgure is 20.18% when given email, account name, name and birthday. This ﬁgure is 13.61% when given name and birthday; this ﬁgure is 6.04% when given only name. Our results suggest that email, account name, name and birthday would be very valuable for an online attacker, while phone number and NID provide marginally improved success rates. Interestingly, Personal-PCFG [20] exploits two more PII attributes than TarGuess-Iyet is much less effective, suggesting that simply incorporating more PII information into algorithms will not always yield more effectiveness. Summary. We are the ﬁrst to propose type-based PII tags for building a semantics-aware PCFG. Such PII tags can also be em- ployed by other trawling algorithms (e.g., Markov n-grams [21]) to build targeted ones. Within 102guesses, TarGuess-I has a success rate of 20.18% when given a 12306 user’s email, user name, name and birthday. Within 10103guesses, TarGuess- I outperforms Personal-PCFG [20] by cracking 37.11%73.33% more passwords. Particularly, TarGuess-I is highly adaptive. 4.2 TarGuess-II TarGuess-I aims to online guess a user Us password P Wxat one service (e.g., CSDN) when given U’s one sister password P Ws leaked from another service (e.g., Dodonew). This is a challenging task for two reasons. Firstly, the online guessing number allowed is small. Secondly, there are over a dozen transformation rules, such as insert, delete, capitalize, leet (e.g., passwordpassw0rd) and the synthesized ones (e.g., passwordPassw0rd), at user- s’ choices to create P Wxby modifying P Ws. This process de- pends on the password creation policy, the value of service and each user’s creativity. Moreover, even if Aknows that Uis likely to insert three digits, which digital sequence will be exactly used by U? It is worth noting that, users also love to use top popular passwords instead of modifying P Ws(see Sec. 3.2). The sole work by Das et al. [12] considers cross-site guessing by using U’s one sister password, yet it is ineffective due to four reasons as we have shown in Sec. 1. Figure 10: The training process of TarGuess-II In this work, we prefer a data-driven approach. We use two lists of passwords as training sets, one leaked from a similar poli- cy/service with the target site, the other is similar with that of P Ws, and look for the same users in these two lists by matching email. Further, the identical PW pairs are eliminated, and this creates a new list of non-identical sister PW pairs {P WA,P WB}. Then, we measure how P WBis modiﬁed from P WA, or whether P WB is simply a popular password. To determine whether P WBis a popular one, we build a top-104list Lfor the target service (e.g., CSDN) from various leaked lists with consideration of policy and language, e.g., L={pw |len(pw)8and the value of Pcsdn(pw) P126(pw)Pdodonew (pw)ranks top-104}. As shown in Fig. 10, in the training phase of TarGuess-II, ﬁrst of all it determines whether P WB L or not. If yes, the occurrence of P WB∈ L increases by one; If not, the pair (P WA, P WB) ﬁrst goes through the structure-level training and then the segment-level training. In the structure-level process, TarGuess- II ﬁrst parses passwords with L,D,Stags as with TarGuess-I. For instance, abc123 is parsed into L3D3. According to [12, 32], we consider six main types of structure-level transformation rules: insertion (e.g., L3D3L3D3S2), deletion (e.g., L3D3S2) L3D3), capitalization C, leet L, substring movement SM (e.g., abc123123abc) and reverse R(e.g., abc123321cba). There are two segment-level rules: insertion (e.g., L3: abc L4: abcd) and deletion (e.g., L3: abc L2: bc). For each of these eight types of rules, there exist a number of sub-types and we consider the most common ones. More speciﬁcally, for both levels of insertion (see Fig. 11), there are tail insertion ti (resp. ti) and head insertion hi (resp. hi); For both levels of deletion, there are tail deletion td(resp. td) and head deletion hd(resp. hd); For capitalization, there are four types as in Table 7; For leet, we consider 5 sub-types as in Table 8; For reverse, we consider 2 sub- types as in Table 10. Note that, a combination of our insertion and deletion operations can transform abc123 to abc!123, achieving the middle insertion. The ﬁrst step of the structure-level training is to employ the Levenshtein-distance (LD) algorithm (with only insertion and deletion enabled) to measure the similarity score d1=LD(P WA, P WB)between the pair P WAand P WB. Then, we use each structure-level rule (except for insertion and deletion) in the C,L, R,SM order to obtain P W Abased on P WA, when considering that their popularity order is C>L>SM >R[12, 32] and that the rule Rwould result in a more drastic change than SM . Upon each rule, we compute d2=LD(P W A, P WB). If d2>d1, such a rule is called a “live” one, then P WAis updated to P W A, and the occurrence of the corresponding rule (see Tables 7to 10) increases by one. Then, we execute the next rule on P W Ato produce P W ′′ A, and compute d3=LD(P W ′′ A, P WB). If d3>d2, this rule is “live” and counted. Upon all these live rules, assume P W ′′′ Awill be created from the original P WA. To avoid futile transformations to dilute the effective ones, we require that if LD(P W ′′′ A, P WB)is smaller than a predeﬁned threshold (e.g., 0.5as suggested in this work), then all these “live” rules are un-counted, and the training process switches to the next password pair in the training set. Otherwise, both P W ′′′ A and P WBare parsed with L,Dand Stags to be, e.g., L4D3S2and L6S1. Since we do not consider the length of a PW segment in the structure-level, L4D3S2and L6S1will be seen as LDS and LS. Now we use the LD metric to compute d3=LD(“LDS”, “LS”) and meanwhile, the LD algorithm returns a LD edit route which records how to arrive “LS” from “LDS”: ﬁrst use the rule td on the S2- segment, then use the rule td on the D3-segment, and ﬁnally use the rule ta on the S1-segment, producing P W ′′′′ Awith a base structure L4S1. Accordingly, the occurrence of all the corresponding items in Fig. 11(a) is updated. Now we come to the segment-level training phase, and the focus is in the inner of the L-, D- or S- segment of a password. For P W ′′′′ A(whose structure is L4S1) and P WB(L6S1), we use the LD metric to measure the similarity of their L-segments. As with the structure-level training, the LD metric is used to update the occurrence of all the corresponding rules in Fig. 11(b). In our experiments, we ﬁnd that the probabilities in the right-most row of Fig. 11(b) are better by computing using Markov n-grams [21] which are trained on a million-sized large password list, than by using the training as stated above. This is mainly because the size of the non-identical PW pairs in our training sets is only moderate and may lead to the sparsity issue. Fortunately, Markov n-gram model trained on million-sized PW lists can overcome this issue. Our above two training phases give rise to a password-reuse based context-free grammar GII = (V,Σ,S,R), where: 1) V={S;L,D,S;L,R,C,S M;ti,td,hi,hd;ti,td,hi, hd} is a ﬁnite set of variables. 2) S ∈ V is the start symbol. 3) Σ={95 printable ASCII codes; C1,···,C4;R1,R2;L1,· · · , L5;Yes,No;No}is a ﬁnite set disjoint from V. 4) Ris a ﬁnite set of rules of the form Aα, with A∈ V and α∈ V Σ(see Fig. 11 and Tables 7to 10). Note that, GII is a probabilistic context-free grammar due to the fact that, for a speciﬁc left-hand side (LHS) variable (e.g., R) of GII , all the probabilities associated with its rules (e.g., RNo, RR1and RR2) can add up to 1. Using GI I , in the guess generation phase we can create a list of guesses with possibilities. For instance, when given password,Pr(“Pa$$word123”)= Pr(S → L8)*Pr(L8ti)Pr(tiD3)*Pr(D3123)P(C C1)Pr(LL2)Pr(LL2)Pr(RNo)Pr(SM No) =1 * 0.1 * 0.15 * 0.08 * 0.03 * 0.01 * 0.01 * 0.97 * 0.97 = 3.39 * 109, where the related probabilities are referred to Tables 7to 10. Then, all the probabilities of guesses generated by GII should be multiplied by the factor α. This αrepresents the fraction of users who do not choose top passwords (e.g., 0.21 in Fig. 10). Then, the probability of each password in the top-104list are multiplied by 1α. Finally, these two probability-associated lists are merged and sorted in decreasing order, and then we select the top k(e.g., k=103) as the ﬁnal guess candidates. In Fig. 12, we provide a comparison of TarGuess-II with Das et al.’s algorithm [12]. These two algorithms are comparable because they employ the same personal information of the victim. When given a user Us Dodonew password, within 100 guesses, Das et al.’s algorithm [12] gained a success rate of 8.98% against U’s CS- DN account, while the ﬁgure for TarGuess-II is 20.19%, reaching Table 7: Training of capitalization C(C1: Cap. all; C2: Cap. the 1st letter; C3: Lower all; C4: Lower 1st) No C1C2C3C4 Probability 0.95 0.01 0.03 0.003 0.007 Table 8: Training of the leet transformation rule L No L1:a@L2:s$L3:o0L4:i1L5:e3
Prob. 0.95 0.02 0.01 0.01 0.005 0.005
Table 9: Training of sub-
string movement
substring moved SM
Yes No
Prob. 0.03 0.97
Table 10: Training of reverse
operation R(R1: Reverse all; R2:
Reverse each segment)
No R1R2
Probability 0.97 0.02 0.01
(a) Structure-level insertion/deletion.
(b)Segment-level insertion/deletion.
The right two rows is better trained
using Markov n-grams.
Figure 11: Training of two levels of insertion and deletion. As over
99% of passwords are with len 16 [21], only segments with len 16
are considered by us. The right-most two rows in Fig. 11(a) is better
trained by using PCFG [35] on a million-sized password list.
a 124.83% improvement. In a series of 10 experiments in Sec.
5, under the same personal information and within 100 guesses,
TarGuess-II outperforms their algorithm [12] by 8.12%300%
(avg. 111.06%). One may conjecture that the two variations of
TarGuess-II employ more personal information than one sister PW,
and thus they are more powerful. In what follows, we, for the ﬁrst
time, provide more than anecdotal evidence to back this conjecture.
4.3 TarGuess-III
TarGuess-III aims to online guess a user Us passwords by ex-
ploiting U’s one sister password as well as some PII. This is
realistic: if the attacker Awants to target Uand knows U’s one
sister password, it is likely that Acan also obtain some PII (e.g.,
email, name) about U. As far as we know, no public literature
has ever paid attention to this kind of attacking scenario. Here we
mainly consider type-1 PII (e.g., name and birthday), while type-2
PII (e.g., gender and age) will be dealt with in Sec. 4.4.
to TarGuess-III generally means more messy things to be consid-
ered and thus more challenges to be addressed. Suppose Awants
to target Alice Smith’s account at eBay which requires passwords
to be of length 8+, and knows Alice was born in 1978 and one
of Alice’s passwords Alice1978Yahoo was leaked from Yahoo.
Given guesses Alice1978eBay,Alice1978 and 12345678,
which one shall Atry ﬁrst? If Alice’s leaked password is 123456,
will the choice vary? Answering this question necessitates an
Fortunately, we ﬁnd that TarGuess-III can fulﬁll this goal by
introducing the PII-based tags (which we have proposed in GI
of TarGuess-I) into the grammar GII of TarGuess-II. In this way,
we can build a PII-enriched, password reuse-based grammar GIII .
More speciﬁcally, besides the L,D,Stags in GII , our grammar
GII I further includes six types of PII usages as with GI, and adds a
number of type-based PII tags (e.g., N1N7and B1B10 as shown
in Sec. 4.1) into Vof GI I .
In the training phase, all the PII-based password segments (each
of which is parsed with one kind of PII tag) only involve the six
structure-level transformation rules as deﬁned in GII , and all the
other things in GIII remain the same with that of GII . In the guess
generation phase, from GII I we derive: (1) all the terminals (e.g.,
love@1314) which are instantiated from base structures that only
consist of L,Dand Stags; and (2) all the pre-terminals (e.g., N4B5
and N31314) which are intermediate guesses that consist of PII-
based tags. For these intermediate guesses, we further instantiate
them with the target user’s PII.
As with GI, our GIII is also highly adaptive. The reasons
are similar with that of GI(see Sec. 4.1). This means that a
new semantic tag, namely W1for website name, can be easily
incorporated into GII I as with these PII tags. Now, GIII can parse
Alice1978Yahoo into the structure N4B5W1, and the guess
Alice1978eBay can be generated with the highest probability,
because no transformation rules will be involved in the process
from N4B5W1to Alice1978eBay.
Figure 12: A comparison of TarGuess IIIV and Das et al.’s algorith-
m [12], trained on the 66,573 non-identical PW pairs of 126CSDN
and tested on the 30,8045 non-identical PW pairs of DodonewCSDN.
Besides a sister password, TarGuess-III uses four types of 51job type-1
PII and TarGuess-IV further uses the gender information.
As shown in Fig. 12, even if Udoes not exactly reuse her
Dodonew password for her CSDN account, TarGuess-III can still
achieve a success rate of 23.48% when allowed to try only 100
guesses, being 16.3% more effective than TarGuess-II. Among
these un-cracked PW pairs by TarGuess-III, over 80% are signif-
icantly different (with LD similarity scores<0.5).
4.4 TarGuess-IV
TarGuess-IV aims to online guess a user Us passwords by
exploiting U’s one sister password as well as both type-1 and type-
2 PII. A major challenge is that, type-2 PII (e.g., gender) can not
be directly measured using any PII tag-based PCFG grammars or
Markov n-grams. We tackle this issue by proving a theorem and
leveraging the Bayesian theory.
THE ORE M 1. Let pw denote the event that the password pw
is selected by Ufor a service, pwdenote the event that pw
is selected by Ufor another service and was leaked, Ai(i=
1,2,··· , n) denote one kind of user PII attributes, including both
type-1 and type-2 ones. We have
Pr(pw|pw, A1, A2,··· , An) = n
i=1 Pr(pw|pw,Ai)
Pr(pw|pw)n1,
under the assumptions that A1, A2,· · · , Anare mutually indepen-
dent, and that they are also mutually independent under the events
pwand (pw, pw).3
Proof. Since A1,··· , Anare assumed to be mutually
independent, we have Pr(A1,· · · , An) = n
i=1 Pr(Ai). Since
3Note that, the assumptions in Theorem 1do not contradict with the fact
that A1,· · · , Anare dependent on the events pwand (pw , pw).
they are also assumed to be mutually independent under the events
pwand (pw, pw), thus Pr(A1,· · · , An|pw)=n
i=1Pr(Ai|pw),
Pr(A1,· · · , An|pw) = n
i=1 Pr(Ai|pw)and Pr(A1,··· ,
An|pw, pw) = n
i=1 Pr(Ai|pw, pw). Now, we can derive:
Pr(pw|pw, A1, A2,··· , An)
=Pr(pw, A1, A2,··· , An|pw)
Pr(A1, A2,· · · , An|pw)
=Pr(A1, A2,· · · , An|pw, pw)·Pr(pw|pw)
n
i=1 Pr(Ai|pw)
=(n
i=1 Pr(Ai|pw, pw))·Pr(pw|pw)
n
i=1 Pr(Ai|pw)
=n
i=1 (Pr(Ai|pw, pw)·Pr(pw|pw))
(n
i=1 Pr(Ai|pw))·Pr(pw|pw)n1
=n
i=1
Pr(pw,Ai|pw)
Pr(Ai|pw)
Pr(pw|pw)n1=n
i=1 Pr(pw|pw, Ai)
Pr(pw|pw)n1.
This theorem indicates that the problem of predicting a user
U’s pwat one service, when given Us PII information A1,A2,
··· , Anand the sister password pw at another service, can be
addressed by a “divide-and-conquer” approach. More speciﬁcally,
we can ﬁrst compute Pr(pw|pw, Ai)for each iand Pr(pw|pw),
and then compute the ﬁnal goal Pr(pw|pw, A1, A2,··· , An).
Fortunately, Pr(pw|pw)can be computed using TarGuess-II, and
Pr(pw|pw, Ai)can be obtained by using TarGuess-III when Ai
is a type-1 PII. When there are 2+type-1 PII attributes to be con-
sidered (suppose Al, Am, An), they together ﬁrst can be deemed
as one virtual attribute (e.g., to be A
l) in Theorem 1and then be
addressed simultaneously by running TarGuess-III. The only issue
left is how to compute Pr(pw|pw, Ai)when Aiis a type-2 PII.
To address it, we introduce the Bayesian theory. Without loss of
generality, assume Akis a type-2 PII. First,
Pr(pw, pw, Ak) = Pr(pw, Ak|pw)·Pr(pw)
= Pr[(Ak|pw)|pw]·Pr(pw|pw)·Pr(pw).
It is reasonable to approximate Pr[(Ak|pw)|pw]by
Pr(Ak|pw). Consequently, we have:
Pr(pw|pw, Ak) = Pr(pw, Ak, pw)
Pr(pw, Ak)
=Pr[(Ak|pw)|pw]·Pr(pw|pw)·Pr(pw)
Pr(pw, Ak)
Pr(pw|pw)·Pr(Ak|pw)·Pr(pw)
Pr(pw, Ak),
(1)
where Pr(pw|pw)is called the “prior” in Bayesian theory, the
factor Pr(Ak|pw)·Pr(pw)/Pr(pw, Ak)represents the impact
of (pw, Ak) on the probability of event (pw, pw).
As a result, Pr(pw|pw, Ak)can be fully computed: (1)
Pr(pw|pw)can be computed using TarGuess-II; (2) Pr(pw, Ak)
=c1is a constant, because the event (pw, Ak)is a known and
ﬁxed fact when we attack U, and the ordering of guesses do not
need the exact value of c1; (3) Pr(pw)=c2is, similarly, a constant
since the event pwis a known and ﬁxed fact when we attack U;
and (4) Pr(Ak|pw)can be computed by counting the training
set—the password pw is selected by what fraction of users that are
with an attribute Ak. When the training data is sufﬁciently large
(e.g., >106), Pr(Ak|pw)can be obtained by direct counting.
Otherwise, smoothing techniques (e.g., Laplace and Good-Turing)
shall be used to overcome the sparsity issue to assure accuracy.
We note that some PII attributes are inherently dependent be-
tween each other (e.g., birthday vs. age, and ﬁrst name vs. gender).
Fortunately, since the majority of PII attributes (see Table 1) are
mutually independent, the practicality of Theorem 1will not be
affected much. This is especially true when many attributes are
simultaneously exploited. We observe that, even if TarGuess-III
only employs birthday and TarGuess-IV employs one more PII
(i.e., age), TarGuess-IV still performs better than TarGuess-III by
now only adjusting the non-birthday-involved guesses using Eq. 1.
As shown in Fig. 12, by exploiting an additional PII (i.e., gen-
der), TarGuess-IV can achieve improvements over TarGuess-III
by 4.38%18.19% within 10103guesses, reaching a success
rate of 24.51% with 102guesses and 30.66% with 103guesses,
respectively. This indicates that type-2 PII, which, as far as we
know, has never been considered in the literature of password
cracking, is indeed valuable for A.
4.5 Dealing with other attacking scenarios
As mentioned in Table 2, seven scenarios can be resulted from
the various combinations of the 3 types of personal info that we
focus in this work. This means that, beyond the four most represen-
tative scenarios #1#4 that we have considered above, three other
ones remain: #5 (type-2 PII), #6 (type-1 PII + type-2 PII) and #7 (1
sister PW + type-2 PII). Besides, there are scenarios involving 2+
sister PWs: #8 (2+sister PWs) and #9 (2+sister PWs+some PII).
When Akis a type-2 PII, it is natural to derive from Eq. 1that:
Pr(pw|Ak)Pr(pw)·Pr(Ak|pw)
Pr(Ak),(2)
where Pr(Ak) = c3is a constant, and both Pr(pw)and
Pr(Ak|pw)can be obtained by counting the training set, as
discussed in Sec. 4.4. Eq. 2well addresses Scenarios #5.
To tackle Scenario #6, we need to develop a new formulation.
From Theorem 1, we can derive that
Pr(pw|A1, A2,··· , An) = n
i=1 Pr(pw|Ai)
Pr(pw)n1,(3)
where Pr(pw|Ai)can be obtained by using TarGuess-I when Aiis
a type-1 PII, or be obtained by using Eq. 2when Aiis a type-2 PII.
As our Theorem 1is suitable for both type-1 and type-2 PII,
Scenario #7 can be readily tackled by ﬁrst using TarGuess-II to
generate a list of guesses and then adjusting the probabilities of the
guesses according to Eq. 1.
els proposed above. A simple approach to tackle them is to employ
our TarGuess in a repeated manner, yet this is not optimal and we
leave these two scenarios for future work. Still, as we have shown
in Sec. 2, only a marginal fraction of users have leaked two or more
passwords, and thus these two scenarios are far less common than
the seven targeted guessing scenarios we have addressed.
Summary. We have designed a series of sound probabilistic mod-
els for targeted online guessing, with each characterizing one of the
seven types of attacking scenarios. Our TarGuess-I and II signiﬁ-
cantly outperform the related algorithms [12, 20], while TarGuess-
III and IV, for the ﬁrst time, tackle the realistic issues of combining
users’ leaked passwords and PII to facilitate online guessing. Based
on TarGuess-IIV, we further show how to address the three re-
maining scenarios. Extensive experiments in the following section
further demonstrate the effectiveness of our TarGuess-IIV.
5. EXPERIMENTS
We now describe our experimental setups and comparatively
evaluate TarGuess-IIV with ﬁve leading algorithms.
5.1 Experiment setup
Among the nine algorithms to be evaluated, three (i.e., Markov
[21], PCFG [35] and Trawling optimal [6]) only need some training
passwords, four (i.e., Das et al.’s algorithm [12], TarGuess-IIIV)
work on password pairs of the same user, and four (i.e., Personal-
PCFG [20], TarGuess-I, III and IV) involve various types of user
Table 11: Training and test settings for each attacking scenario under 9 algorithms
Experimental Training set(s), with policy and language consistent with the test set Test set
scenarioPCFG Markov Tra. opt. Personal-PCFG TarGuess-I Das et al. TarGuess-IITarGuess-III, IV(size; service)
#1: 12306Dodo 126 126 Dodo PII-12306 PII-12306 126 12306, 126 12306, PII-126 49,775; Dodo
#2: 12306CSDN 8+Dodo 8+Dodo CSDN 8+PII-Dodo 8+PII-Dodo 8+Dodo 12306, 8+Dodo 12306, 8+PII-Dodo 12,635; CSDN
#3: DodoCSDN 8+Dodo 8+Dodo CSDN 8+PII-Dodo 8+PII-Dodo 8+Dodo Dodo, 8+126 Dodo, 8+PII-12306 5,997; CSDN
#4: Dodo12306 Dodo Dodo CSDN PII-Dodo PII-Dodo Dodo Dodo, 126 PII-Dodo, 126 49,775; 12306
#5: CSDN12306 Dodo Dodo 12306 PII-Dodo PII-Dodo Dodo CSDN, Dodo CSDN, PII-Dodo 12,635; 12306
#6: CSDNDodo 126 126 Dodo PII-12306 PII-12306 126 CSDN, 126 CSDN, PII-126 5,997; Dodo
#7: RootkitYahoo Rockyou Rockyou Yahoo PII-Rootkit PII-Rootkit Rockyou Rootkit, Xato Rootkit, PII-Xato 214; Yahoo
#8: Rootkit000web L+DRockyou L+DRockyou 000web L+DPII-Rootkit L+DPII-Rootkit L+DRockyou Rootkit, L+DXato Rootkit, L+DPII-Xato 2,949; 000web
#9: 000webRootkit Rockyou Rockyou Rootkit PII-Xato PII-Xato Rockyou 000web, Xato 000web, PII-Xato 2,949; Rootkit
#10: YahooRootkit Rockyou Rockyou Rootkit PII-Xato PII-Xato Rockyou Yahoo, Xato Yahoo, PII-Xato 214; Rootkit
Tra. opt.=Trawling optimal; Dodo=Dodonew; 000web=000webhost; 8+=len8; PII-X=the PII-associated list X;L+D=Passwords with both letters and digits.
ABmeans that: (1) for the four password-reuse-based algorithms (i.e., Das et al.’s algorithm [12], TarGuess-IIIV), a user U’s password at service Acan be used by Ato
help attack U’s account at service B; and (2) for the other ﬁve algorithms, Us password at Ais not involved, and only U’s password at Bis used as the target. Note that, every
user’s passwords in both Aand Bnow have been associated with PII (see Tables 12 and 13) to facilitate the four PII-based algorithms.
When training TarGuess-IIIV, U’s one sister password comes from the 1st dataset, and Auses it to guess Us password from the 2nd dataset.
personal info. However, only two of our original datasets (i.e.,
12306 and Rootkit) are associated with PII (see Table 3). Thus,
as mentioned in Sec. 3.4, we build PII-Dodonew with size 161,510
by matching Dodonew with 51job and 12306 using email address,
and PII-000webhost with size 2,950 by matching 000webhost with
Rootkit. Matching by email ensures that all our PII-associated En-
glish passwords are created by Rootkit hackers, who well represent
security-savvy users. Since Rockyou does not contain email or user
name, we further match Xato with Rootkit to obtain 15,304 PII-
associated Xato passwords to supplement Rockyou.
As shown in Table 12, we further build three lists of password
pairs for Chinese users by matching Dodonew and CSDN with the
instance, the list Dodonew12306 has a total of 49,775 password
pairs, of which 14,380 pairs are with non-identical passwords. Sim-
ilarly, we build three lists of password pairs for English users (see
Table 13), but eliminate one of them (i.e., Yahoo000webhost)
because the limited size of test set (i.e., 96) would make it impossi-
ble to reﬂect the true nature of an algorithm. These ﬁve pair-wised
lists lead to ten experimental scenarios in Table 11.
Table 12: Basic information of the matched Chinese datasets
Original PII-12306(129,303) PII-Dodonew(161,510)
dataset Total Non-identical(%) Total Non-identical(%)
Dodonew 49,775 14,380 (28.89%)
CSDN 12,635 5,538 (43.83%) 5,997 3,957(65.98%)
Table 13: Basic information of the matched English datasets
Original PII-Rootkit(69,330) PII-000webhost(2,950)
dataset Total Non-identical(%) Total Non-identical(%)
000webhost 2,949 2,510 (85.11%)
Yahoo 214 167 (79.04%) 96 90 (93.75%)
To make our experiments as realistic as possible, our choices of
the training set(s) for a given test set (attacking scenario) adhere
to three rules: (1) they never come from the same service; (2)
they are of the same language and password policy; and (3) the
training set(s) shall be as large as possible. Rule 1 prevents our
experiments from the overﬁtting issue, while rules 2 and 3 ensure
the effectiveness of each algorithm. For fair comparison, we further
make sure that all the 9 algorithms work on the same test set, and
that for the same type of algorithms (e.g., TarGuess-I and [20]),
their training sets exploit the same personal information.
5.2 Evaluation results
The guess number allowed is the most scarce resource when per-
forming an online attack, while computational power and bandwith
are not essential. For instance, in every of these ten experiments,
the training phase can be completed on a common PC in less than
65.3s, while the generation of 1000 guesses for each user takes less
than 2.1s. Thus, we use the guess-number-graph to evaluate the
effectiveness of our four probabilistic algorithms with ﬁve leading
algorithms (i.e., PCFG [35], Markov [21], Trawling optimal [6],
Personal-PCFG [20] and Das et al.’s cross-site algorithm [12]).
Figs. 13(a)13(j) show that, under the same personal informa-
tion available to Aand within 100 guesses: (1) TarGuess-I dras-
tically outperforms its foremost counterpart, Personal-PCFG [20],
by 11.17%509% (avg. 84%); (2) TarGuess-II drastically outper-
forms Das et al.’s algorithm [12] by 8.12%300% (avg. 111.06%)
when cracking non-identical password pairs; and (3) TarGuess-III
and IV can gain a success rate of 73.09% when attacking Chinese
normal users and of 31.61% when attacking English security-savvy
users. As the number of guesses increases, in most cases the
superiorities of TarGuess-IIV over their counterparts are further
enhanced. Here we focus on cracking efﬁciency of cross-cite algo-
rithms for non-identical PW pairs, because cracking non-identical
pairs is their primary goal. As mentioned in Sec. 3.1, many PII
attributes in English test sets (i.e., Rootkit and its matched lists) are
missing, otherwise our cracking results would have been higher.
In particular, TarGuess-IV, for the ﬁrst time, characterizes a
very powerful yet realistic attacker who can launch cross-site
guessing by exploiting a victim’s one sister password as well as
both type-1 and type-2 PII. As shown in Figs. 13(a)13(f), within
10 guesses, TarGuess-IV can gain success rates 45.49%85.33%
(avg. 65.70%) against accounts of normal users at various web
services; Within 102guesses, the ﬁgures are 56.96%88.02%
(avg. 73.08%); Within 103guesses, the ﬁgures are 62.95%
89.87% (avg. 77.32%). To achieve such high success rates, the
state-of-the-art trawling algorithms [21, 30] need 1013 guesses per
user and take several days by using high performance computers.
We discover that password strength against both targeted guess-
ing and trawling guessing follows a Zipf distribution. The ﬁrst few
guesses are extremely effective, e.g., at 100 guesses, TarGuess-I
can gain a success rate of 20% against every Chinese service. Yet,
as the number of guesses increases, the success rate of each attempt
decreases rapidly. Interestingly, we ﬁnd that for each of the eight
real-world algorithms (i.e., excluding the trawling optimal one [6]),
the ratio fnof the number of successfully cracked accounts to the
number of guesses per account ncan be well approximated by the
Zipf’s law [34]: fn=Cns, where generally s[0.15, 0.30] and
C[0.001, 0.01] are constants dependent on the test set. Such a
relationship is a straight line when plotted on a log-log scale (see
Fig. 13(k) which depicts the scenario of 12306Dodonew). The
three parallel layers in Fig. 13(k) just correspond to three kinds of
guessing algorithms: trawling, targeted using PII, targeted using a
sister PW. This diminishing principle has an important implication:
Awould stop at some point as her gains do not outweigh her
costs, and there would be three such points corresponding to three
different attacking strategies, yet existing guidelines [16, 18] only
consider the trawling point.
When sister passwords are available, TarGuess-IV can reach a
success rate of 77% against normal users with 100 guesses; Even
when sister passwords are unavailable, TarGuess-I can still achieve
(a) 12306dodonew (b) 12306CSDN (c) DodonewCSDN
(d) dodonew12306 (e) CSDN12306 (f) CSDNDodonew
(g) Rootkit000webhost (h) RootkitYahoo (i) 000webhostRootkit
(j) YahooRootkit (k) Diminish returns in all algorithms (l) A further validation: 12306Xiaomi
Figure 13: Experiment results for 11 targeted guessing scenarios. Sub-ﬁgures (a) to (f) represent attacks on normal users, while sub-ﬁgures (g) to (j)
represent attacks on security-savvy users. Our TarGuess-IIV are highly effective. Sub-ﬁgure (k) depicts the diminish returns in 12306Dodonew.
about 20% success rates against normal users with just 100 guesses,
25% with 103guesses, and 50% with 106guesses. This suggests
that the majority of normal users’ passwords are prone to a small
number of targeted online guesses (e.g., 100 as allowed by NIST
[8, 18]), invalidating the 2016 NIST claim that online guessing
permitted” [18]. As normal users’ passwords are even not strong
enough to resist online guessing and still far away from the [106,
1014] “online-ofﬂine chasm” [16], efforts directed towards resist-
ing ofﬂine attacks (e.g., 1010 guesses or beyond [24, 28]) could
have been better placed. Sites shall primarily focus on defending
against online attacks and protecting the password hash ﬁle, while
users shall mainly avoid the three bad behaviors examined in Sec.
3to survive online guessing. In all, targeted online guessing is
a much more damaging threat to password security than trawling
guessing and than the community (see [8,18]) might have expected.
Our models will facilitate better evaluation of existing and future
password policies (e.g., [9,24, 28]), and they will also be helpful for
forensic investigators to recover passwords in an ofﬂine manner.
5.3 A further validation
In the above experiments, our test sets are in plain-text. Would
our algorithms be still effective when cracking “real accounts”
about which little is known? We conﬁrm this with a further ex-
periment to crack Xiaomi cloud passwords, which are MD5 hashed
with salt, leaked from the world’s 3rd largest phone maker.
We attack Xiaomi as we crack Dodonew in Scenario #2 of Table
11. The test set contains 5,284 Xiaomi hashes obtained after match-
ing the 8M Xiaomi dataset with the 130K 12306 dataset using
email. As shown in Fig. 13(l), within 10103guesses, TarGuess-
I outperforms Personal-PCFG [20] by 70.58%119%; TarGuess-
II outperforms Das et al.’s algorithm [12] by 73.66% 405%;
TarGuess-III and IV can gain success rates of 63.61%73.56%.
These results well accord with our above 10 experiments, especial-
ly the Chinese ones, suggesting the generality of our models.
6. CONCLUSION
We have presented the ﬁrst systematic evaluation of the extent to
which an online guessing attacker can gain advantages by exploit-
ing various types of user personal info including leaked passwords
and common PII. Our study is grounded on a framework that
consists of 7 sound probabilistic models, with each addressing one
typical attacking scenario. Particularly, TarGuess-IIV character-
ize the four most representative scenarios, and for the ﬁrst time, the
problem of how to model context-aware, semantic-enriched cross-
Extensive experimental results show that TarGuess-I and II dras-
tically outperform their foremost counterparts, and TarGuess-III
and IV can gain success rates as high as 73% with just 100 guesses
against normal users and 32% against security-savvy users. Our
results suggest that the currently used security mechanisms would
be largely ineffective against the targeted online guessing threat,
and this threat has already become much more damaging than
expected. We believe that the new algorithms and knowledge of
effectiveness of targeted guessing models can shed light on both
Acknowledgment
The authors are grateful to the anonymous reviewers for their con-
structive comments. We also give our special thanks to Dinei
Florˆencio, Cormac Herley, Hugo Krawczyk, Haining Wang, Yue
Li, Joseph Gardiner, Haibo Cheng and Qianchen Gu for their
insightful suggestions and invaluable help. Ping Wang is the cor-
responding author. This research was in part supported by the Na-
tional Natural Science Foundation of China (NSFC) under Grants
Nos. 61472016 and 61472083, and by the National Key Research
and Development Plan under Grant No. 2016YFB0800600.
7. REFERENCES
[1] Nearly 80 percent of Internet users suffer identity leaks, July
2015. http://bit.ly/2b9TEdn.
[2] All Data Breach Sources, May 2016.
https://breachalarm.com/all-sources.
[3] Turkey: personal data of 50 million citizens leaked online,
April 2016. http://bit.ly/1TPA4j4.
[4] Amid Widespread Data Breaches in China, Dec. 2011.
http://www.techinasia.com/alipay-hack/.
[5] D. V. Bailey, M. Dürmuth, and C. Paar. Statistics on
In Proc. SCN 2014, pages 218–235.
[6] J. Bonneau. The science of guessing: Analyzing an
anonymized corpus of 70 million passwords. In Proc. IEEE
S&P 2012, pages 538–552.
[7] J. Bonneau, C. Herley, P. van Oorschot, and F. Stajano.
Passwords and the evolution of imperfect authentication.
Commun. ACM, 58(7):78–87, 2015.
[8] W. Burr, D. Dodson, R. Perlner, and et al. NIST SP800-63-2:
Electronic authentication guideline. Technical report, NIST,
Reston, VA, Aug. 2013.
[9] X. Carnavalet and M. Mannan. A large-scale evaluation of
high-impact password strength meters. ACM Trans. Inform.
Syst. Secur., 18(1):1–32, 2015.
[10] A. Chaabane, G. Acs, M. A. Kaafar, et al. You are what you
like! information leakage through users’ interests. In Proc.
NDSS 2012, pages 1–15.
[11] C. Custer. China’s Internet users zoom to 668 million, Jan.
2016. http://www.apira.org/news.php?id=1736.
[12] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang.
The tangled web of password reuse. In Proc. NDSS 2014.
[13] M. Dell’Amico and M. Filippone. Monte carlo strength
evaluation: Fast and reliable password checking. In Proc.
ACM CCS 2015, pages 158–169.
[14] M. Dürmuth, D. Freeman, and B. Biggio. Who are you? A
statistical approach to measuring user authenticity. In Proc.
NDSS 2016, pages 1–15.
[15] S. Egelman, A. Sotirakopoulos, K. Beznosov, and C. Herley.
meters on password selection. In Proc. ACM CHI 2013,
pages 2379–2388.
[16] D. Florêncio, C. Herley, and P. van Oorschot. An
USENIX LISA 2014, pages 44–61.
[17] Now it’s easy to see if leaked passwords work on other sites,
July 2016. http://bit.ly/29AJANh.
[18] P. A. Grassi and J. L. Fenton. NIST SP800-63B: Digital
authentication guideline. Technical report, NIST, Reston,
VA, 2016. https://pages.nist.gov/800-63- 3/sp800-63b.html.
[19] S. Ji, S. Yang, X. Hu, W. Han, Z. Li, and R. Beyah. Zero-sum
password cracking game: A large-scale empirical study on
the crackability, correlation, and security of passwords. IEEE
Trans. Depend. Secur. Comput., 2015. Doi:
10.1109/TDSC.2015.2481884.
[20] Y. Li, H. Wang, and K. Sun. A study of personal information
in human-chosen passwords and its security implications. In
Proc. IEEE INFOCOM 2016, pages 1–9.
[21] J. Ma, W. Yang, M. Luo, and N. Li. A study of probabilistic
password models. In Proc. IEEE S&P 2014, pages 689–704.
[22] M. L. Mazurek, S. Komanduri, T. Vidas, L. F. Cranor, P. G.
Kelley, R. Shay, and B. Ur. Measuring password guessability
for an entire university. In Proc. CCS 2013, pages 173–186.
[23] E. McCallister, T. Grance, and K. Scarfone. NIST
SP800-122: Guide to protecting the conﬁdentiality of
personally identiﬁable information (PII). Technical report,
NIST, Reston, VA, April, 2010.
[24] W. Melicher, B. Ur, S. Segreti, S. Komanduri, L. Bauer,
N. Christin, and L. Cranor. Fast, lean and accurate: Modeling
password guessability using neural networks. In Proc.
USENIX SEC 2016, pages 1–17.
[25] A. Narayanan and V. Shmatikov. Fast dictionary attacks on
2005, pages 364–372.
[26] J. Onaolapo, E. Mariconti, and G. Stringhini. What happens
after you are pwnd: Understanding the use of leaked account
credentials in the wild. In Proc. SIGCOMM IMC 2016.
[27] Four Years Later, Anthem Breached Again: Hackers Stole
Credentials, Feb. 2015. http://t.cn/RqWrMKC.
[28] R. Shay, S. Komanduri, A. Durity, and et al. Designing
password policies for strength and usability. ACM Trans. Inf.
Syst. Secur., 18(4):1–34, 2016.
[29] Senate Bill No. 1386: Personal information, Sep. 2002.
http://bit.ly/1WJIIpK.
[30] B. Ur, S. M. Segreti, L. Bauer, and et al. Measuring
real-world accuracies and biases in modeling password
guessability. In USENIX SEC 2015, pages 463–481.
[31] R. Veras, C. Collins, and J. Thorpe. On the semantic patterns
of passwords and their security impact. In Proc. NDSS 2014.
[32] D. Wang, D. He, H. Cheng, and P. Wang. fuzzyPSM: A new
password strength meter using fuzzy probabilistic
context-free grammars. In Proc. IEEE/IFIP DSN 2016, pages
595–606. http://bit.ly/2ahJ8CO.
[33] D. Wang and P. Wang. The emperor’s new password creation
policies. In Proc. ESORICS 2015, pages 456–477.
[34] D. Wang and P. Wang. On the implications of Zipf’s law in
passwords. In Proc. ESORICS 2016, pages 111–131.
[35] M. Weir, S. Aggarwal, B. de Medeiros, and B. Glodek.
grammars. In Proc. IEEE S&P 2009, pages 391–405.
[36] This could be the iCloud ﬂaw that led to celebrity photos
being leaked, Sep. 2014. http://bit.ly/Y5vnNc.
[37] Y. Zhang, F. Monrose, and M. Reiter. The security of modern
password expiration:an algorithmic framework and empirical
analysis. In Proc. ACM CCS 2010, pages 176–186.
... Password guessing is when the attacker tries to gain access to a system by persistently attempting to guess user passwords [44,45]. The passwords attempted are normally derived from either leaked password associated with a particular user or dictionaries of common passwords, in which case the attack is also known as a brute-force attack. ...
... A basic password guessing attack is one where a single account on a single host is targeted with a brute-force attack. It is the most common type of password guessing attack [44,45]. Because it only focuses on one-to-one relations, the graph patterns should be (:ACCOUNT)-[:AUTH ATT]->(:HOST). ...
... The credential stuffing attack, also known as a targeted password guessing attack, consists of trying the same credentials (e.g. user name and password combination) on multiple hosts [45]. The credentials used in these attack are traditionally obtained from leaks of previous attacks or are default passwords in systems that have them. ...
... Despite this, it is still unknown the performance of MultiPSM [19] in more effective probability guessing scenarios (e.g., Markov [38]). Similarly, it is also unknown whether MultiPSM [19] can provide accurate strength evaluation results in online guessing (the primary threat that users need to mitigate [40,41]). ...
... On the contrary, in offline guessing scenarios, the attacker A has already obtained all the data that can be used to locally verify her guesses: It is usually a salt-hashed password file obtained through database leaks. Here, rate-limiting (at the server side) is inapplicable, and A is only constrained by her local computing resources and the time allowed [17,41,46]. Different application scenarios may suffer from different kinds of guessing attacks. ...
... Since the protection measures are quite diverse, the exact value of the online guessing threshold T depends on the target system's risk analysis results. Without loss of generality, we set T =10 4 according to our manual test (see Appendix A) and rule-of-thumb recommendations in [17,40,41,52]. Given T , for an attacker well-informed of the target password distribution, her sensible strategy is to try the most popular passwords (e.g., top-10 4 ) in the known distribution [4]. ...
Conference Paper
Full-text available
... In contrast, to guess a specific user's password, a targeted attacker takes advantage of the user's personal information to facilitate guessing. This is realistic because users tend to employ a variety of personal information (e.g., name, birthday, and old/sister passwords) when generating passwords [4,9,15,22,24]. ...
... These trawling PSMs do not include the user's personal information in password strength evaluation, and are thus unable to accurately measure password strength when facing real-world attacks. Besides, the targeted PPSM [15] relies on sister passwords from different sites in evaluating password strength, which is highly impractical due to two reasons: 1) The server generally does not hold the user's old (sister) passwords; 2) Sister passwords are not easily accessible [4,24]. For example, Das et al. [4] analyzed 7.96 million accounts from different sites and found that only 152 (0.00191%) were successfully matched by email more than once; Wang et al. [24] analyzed 547.56 million accounts and found that less than 1.02% and 1.73% were successfully matched by email and username more than once. ...
Chapter
Full-text available
In recent years, unending breaches of users’ personally identifiable information (PII) have become increasingly severe, making targeted password guessing using PII a practical threat. However, to our knowledge, most password strength meters (PSMs) only consider the traditional trawling password guessing threat, and no PSM has taken into account the more severe targeted guessing threat using PII (e.g., name, birthday, and phone number). To fill this gap, in this paper, we mainly focus on targeted password strength evaluation in the scenario where users’ PII is available to the attacker. First, to capture more fine-grained password structures, we introduce the high-frequency substring as a new grammar tag into leading targeted password probabilistic models TarGuess-I and TarMarkov, and propose TarGuess-I-H and TarMarkov-H. Then, we weight and combine our two improved models to devise PII-PSM, the first practical targeted PSM resistant to common PII-accessible attackers. By using the weighted Spearman (WSpearman) metric recommended at CCS’18, we evaluate the accuracy of our PII-PSM and its counterparts (i.e., our TarGuess-I-H and TarMarkov-H, as well as two benchmarks of Optimal and Min_auto). We conduct evaluation experiments on password datasets leaked from eight high-profile English and Chinese services. Results show that our PII-PSM is more accurate than TarGuess-I-H and TarMarkov-H, and is closer to Optimal and Min_auto, with WSpearman differences of only 0.014∼0.023 and 0.012∼0.031, respectively. This establishes the accuracy of PII-PSM, facilitating to nudge users to select stronger passwords.
... The main idea is that human-chosen passwords and personally identifiable information that a person typically shares at registration time (such as their name, address, and phone number) are naturally correlated with each other. Beyond just individual pairs of users and passwords [56], we show that this correlation actually extends to whole communities; we demonstrate that information about users in the same group can be combined and interpolated together to define a robust and accurate prior over their password distribution. ...
... However, the focus of this search can vary depending on the objective. It can be heavily conditioned on auxiliary information, as in the case of targeted online attacks against individual accounts [41], [56], [16], or based on a less informative prior, as in the more common trawling offline setting [53]. In this work, we focus on the latter case. ...
... Password hashes are always associated with auxiliary data such as email addresses, usernames, and related tags. As previously demonstrated in the targeted guessing attacks literature [56], auxiliary data and users' passwords are not independent. Users often include personal information in their passwords, and an attacker who knows a user's personal information can use it to mount a targeted attack and guess the user's password more efficiently [56], [41], [16]. ...
Preprint
Full-text available
... Despite some benefits offered by passwords in terms of usability and deployability, 'legacy' passwords receive a poor rating on security [2]. Targeted online password guessing is an underestimated threat [3], in view of the wealth of personal data ranging from usernames and passwords to social security numbers stored on those devices. In particular, shoulder-surfing attacks, if successful, can lead to illegal access to all kinds of sensitive data and information on a mobile device or system which can potentially lead to malicious activities. ...
... Passwords are one of the most common methods for mobile user authentication [7]. A password-based authentication matches a user-entered password against a pre-set secret password that typically consists of a string of letters, digits, graphics, and/or symbols [3,4]. Among others, textual passwords are the most common [10]. ...
Conference Paper
Full-text available
Password-based mobile user authentication is vulnerable to a variety of security threats. Shoulder-surfing is the key to those security threats. Despite a large body of research on password security with mobile devices, existing studies have focused on shaping the security behavior of mobile users by enhancing the strengths of user passwords or by establishing secure password composition policies. There is little understanding of how an attacker actually goes about observing the password of a target user. This study empirically examines attackers' behaviors in observing password-based mobile user authentication sessions across the three observation attempts. It collects data through a longitudinal user study and analyzes the data collected through a system log. The results reveal several behavioral patterns of attackers. The findings suggest that attackers are strategic in deploying attacks of shoulder-surfing. The findings have implications for enhancing users' password security and refining organizations' password composition policies.
... Measuring security of passwords is a well-studied topic [37]. Earlier works considered different approaches the attacker could employ and thus used Markov models, probabilistic CFG, neural networks, etc. for the guessability estimations [47,50,59,61,62]. However, there is relatively less work in the domain of passphrases. ...
Preprint
Full-text available
Passwords are the most common mechanism for authenticating users online. However, studies have shown that users find it difficult to create and manage secure passwords. To that end, passphrases are often recommended as a usable alternative to passwords, which would potentially be easy to remember and hard to guess. However, as we show, user-chosen passphrases fall short of being secure, while state-of-the-art machine-generated passphrases are difficult to remember. In this work, we aim to tackle the drawbacks of the systems that generate passphrases for practical use. In particular, we address the problem of generating secure and memorable passphrases and compare them against user chosen passphrases in use. We identify and characterize 72, 999 user-chosen in-use unique English passphrases from prior leaked password databases. Then we leverage this understanding to create a novel framework for measuring memorability and guessability of passphrases. Utilizing our framework, we design MASCARA, which follows a constrained Markov generation process to create passphrases that optimize for both memorability and guessability. Our evaluation of passphrases shows that MASCARA-generated passphrases are harder to guess than in-use user-generated passphrases, while being easier to remember compared to state-of-the-art machine-generated passphrases. We conduct a two-part user study with crowdsourcing platform Prolific to demonstrate that users have highest memory-recall (and lowest error rate) while using MASCARA passphrases. Moreover, for passphrases of length desired by the users, the recall rate is 60-100% higher for MASCARA-generated passphrases compared to current system-generated ones.
... While traditional attacks like brute-force attempts are not impossible, they often come with significant drawbacks, like a high time cost. As a result, attackers are seeking more practical methods to obtain user passwords, such as distributing malware (e.g., keyloggers) onto mobile devices, or guessing passwords based on personal information [34]. On Android devices, malware often uses the accessibility service to steal user passwords. ...
... These stolen credentials often end up on the dark web and become trading assets to cybercriminals [514]. Attackers can then try the stolen credentials on different online accounts at scale through automated login requests (i.e., credential stuffing), meaning that a breach of one service provider's password database could put other accounts at risk when they use the same or similar passwords [557]. Once accounts have been compromised, attackers may use them for spam, fraud, identity theft, or distributing malware [380], leaving users-especially those who reuse their passwords-vulnerable to these security incidents. ...
Thesis
Conference Paper
Full-text available
While much has changed in Internet security over the past decades, textual passwords remain as the dominant method to secure user web accounts and they are proliferating in nearly every new web services. Nearly every web services, no matter new or aged, now enforce some form of password creation policy. In this work, we conduct an extensive empirical study of 50 password creation policies that are currently imposed on high-profile web services, including 20 policies mainly from US and 30 ones from mainland China. We observe that no two sites enforce the same password creation policy, there is little rationale under their choices of policies when changing policies, and Chinese sites generally enforce more lenient policies than their English counterparts. We proceed to investigate the effectiveness of these 50 policies in resisting against the primary threat to password accounts (i.e. online guessing) by testing each policy against two types of weak passwords which represent two types of online guessing. Our results show that among the total 800 test instances, 541 ones are accepted: 218 ones come from trawling online guessing attempts and 323 ones come from targeted online guessing attempts. This implies that, currently, the policies enforced in leading sites largely fail to serve their purposes, especially vulnerable to targeted online guessing attacks.
Conference Paper
Full-text available
Cybercriminals steal access credentials to online accounts and then misuse them for their own profit, release them publicly, or sell them on the underground market. Despite the importance of this problem, the research community still lacks a comprehensive understanding of what these stolen accounts are used for. In this paper, we aim to shed light on the modus operandi of miscreants accessing stolen Gmail accounts. We developed an infrastructure that is able to monitor the activity performed by users on Gmail accounts, and leaked credentials to 100 accounts under our control through various means, such as having information-stealing malware capture them, leaking them on public paste sites, and posting them on underground forums. We then monitored the activity recorded on these accounts over a period of 7 months. Our observations allowed us to devise a taxonomy of malicious activity performed on stolen Gmail accounts, to identify differences in the behavior of cybercriminals that get access to stolen accounts through different means, and to identify systematic attempts to evade the protection systems in place at Gmail and blend in with the legitimate user activity. This paper gives the research community a better understanding of a so far understudied, yet critical aspect of the cybercrime economy.
Conference Paper
Full-text available
Textual passwords are perhaps the most prevalent mechanism for access control over the Internet. Despite the fact that human-beings generally select passwords in a highly skewed way, it has long been assumed in the password research literature that users choose passwords randomly and uniformly. This is partly because it is easy to derive concrete (numerical) security results under the uniform assumption, and partly because we do not know what's the exact distribution of passwords if we do not make a uniform assumption. Fortunately, researchers recently reveal that user-chosen passwords generally follow the Zipf's law, a distribution which is vastly different from the uniform one. In this work, we explore a number of foundational security implications of the Zipf-distribution assumption about passwords. Firstly, we reveal that the attacker's advantages against password-based cryptographic protocols (e.g., authentication, encryption, signature and secret share) can be 2~4 orders of magnitude more accurately captured (formulated) than existing formulation results. This result would impact numerous existing and future password protocols. As password protocols are the most widely used cryptographic protocols, our new formulation is of practical significance. Secondly, we provide new insights into popularity-based password creation policies and point out that, under the current, widely recommended security parameters, usability will be largely impaired. Thirdly, we show that the well-known password strength metric $\alpha$-guesswork, which was believed to be parametric, is actually non-parametric in two of four cases under the Zipf assumption. Particularly, nine large-scale, real-world password datasets are employed to establish the practicality of our findings.
Article
Full-text available
Cybercriminals steal access credentials to webmail ac- counts and then misuse them for their own profit, re- lease them publicly, or sell them on the underground market. Despite the importance of this problem, the research community still lacks a comprehensive under- standing of what these stolen accounts are used for. In this paper, we aim to shed light on the modus operandi of miscreants accessing stolen Gmail accounts. We de- veloped an infrastructure that is able to monitor the ac- tivity performed by users on Gmail accounts, and leaked credentials to 100 accounts under our control through various means, such as having information-stealing mal- ware capture them, leaking them on public paste sites, and posting them on underground forums. We then monitored the activity recorded on these accounts over a period of 7 months. Our observations allowed us to devise a taxonomy of malicious activity performed on stolen Gmail accounts, to identify differences in the be- havior of cybercriminals that get access to stolen ac- counts through different means, and to identify system- atic attempts to evade the protection systems in place at Gmail and blend in with the legitimate user activity. This paper gives the research community a better un- derstanding of a so far understudied, yet critical aspect of the cybercrime economy.
Article