
User-Centric Phishing Threat Detection



This paper presents a context-aware phishing threat detection model from users’ behavioral perspectives. The context of users’ information accesses is investigated to explore the users’ browsing behaviors that confront phishing situations. Large-scale experiments show that our approach achieves an accuracy of 0.9973 and an F1 score of 0.9311 for predicting the phishing threats of users’ next accesses without intelligent content analysis. Error analysis indicates that our proposed model results in a favorably low false positive rate of 0.0006. In practice, our proposed model is complementary to the existing anti-phishing techniques for cost-effectively blocking phishing threats with wisdom of the crowds.
Poster: User-Centric Phishing Threat Detection
Lung-Hao Lee (Student), Kuei-Ching Lee (Student),
Yu-Yun Liu (Student), Hsin-Hsi Chen (Faculty)
Dept. of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan
{d01922005, p00922002, r99922102, hhchen}
Yuen-Hsien Tseng (Faculty)
Information Technology Center
National Taiwan Normal University
Taipei, Taiwan
Keywords—collaborative filtering; browsing behaviors;
collective intelligence; context-aware category prediction
Phishing crimes are significant security threats involving
fraudulent web pages that masquerade as trustworthy ones to
trick users into revealing private and sensitive information,
e.g., bank account numbers, passwords, personal identification
numbers, and credit card numbers. In the past, a content-based
method has adopted Robust Hyperlinks for anti-phishing [1].
Content-based lexical features have been extracted to detect
phishing URLs using online learning [2]. A hybrid approach
has been proposed to detect phishing web pages by identity
discovery and keyword retrieval [3]. Several heuristics have
been introduced to expand existing blacklists that defend against
phishing threats [4].
The previous approaches, which formulate discriminative
patterns from the phishing web pages themselves, remain
vulnerable to threats that follow unknown phishing patterns.
In contrast, we study the phishing context that users fall
into from the users’ points of view. We aim at
exploiting collective intelligence embedded in users’ browsing
behaviors to detect phishing threats without crawling web
pages for intelligent content analysis.
Criminals usually create phishing web pages by exactly
copying the legitimate ones or slightly modifying their page
content for redirecting users’ valuable information to the
criminals rather than the legitimate sites. Users’ browsing
behaviors on the web result in users’ click-through trails, which
are defined as access sequences during web surfing. The
browsing context of users’ information accesses is explored to
understand how users fall into phishing states. The problem
statement in this study is described as follows. Let
u1u2...u(n-1)un be a user’s access sequence, where ui is the
ith clicked URL in the click-through trail. We focus on
determining the category of a user’s next access un, i.e.,
phishing or legitimate, based on the previous accesses
u1u2...u(n-1) and their contextual information from users’
behavioral perspectives.
Behavioral features of each clicked URL in a user’s access
sequence, which are extracted to capture contextual
information, are classified into the following 3 types. (1)
Hostname: phishing URLs tend to look like the original
legitimate ones, so users often fail to notice the
difference. For example, the hostname “” was
verified as a phishing website of the well-known social networking
service Facebook. We identify the hostnames of clicked URLs
as hints for phishing threat detection. (2) IP Address: phishing
criminals usually create and maintain a large number of hosts
or redirections that masquerade as legitimate URLs. These suspected
URLs may be hosted at the same suspicious IP address. We
also look up the referring IPs of clicked URLs as features. (3)
Port Number: Secure Socket Layer (SSL) is a cryptographic
protocol that provides communication security on the web. The
port number for this secure connection is usually 443. In
addition, some content providers use specific ports to achieve
their specific purposes. We also identify the port numbers of
clicked URLs for anti-phishing.
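The three feature types above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `extract_features`, the `DEFAULT_PORTS` table, and the `ip_lookup` mapping (a stand-in for whatever resolution step a real system would use) are all assumptions introduced here.

```python
from urllib.parse import urlsplit

# Assumed default ports per scheme; 443 is the usual port for SSL connections.
DEFAULT_PORTS = {"http": 80, "https": 443}


def extract_features(url, ip_lookup=None):
    """Return hostname, IP address, and port number features for a clicked URL.

    ip_lookup is a hypothetical hostname-to-IP mapping; hostnames that are
    missing from it are reported with the placeholder value "unknown".
    """
    parts = urlsplit(url)
    hostname = (parts.hostname or "").lower()
    # Use the explicit port if the URL has one, else the scheme's default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    ip = (ip_lookup or {}).get(hostname, "unknown")
    return {"hostname": hostname, "ip": ip, "port": port}


# Example with reserved documentation names and addresses (not real data).
features = extract_features(
    "https://login.example.com/account",
    ip_lookup={"login.example.com": "192.0.2.10"},
)
```

Passing the IP mapping in as an argument keeps the sketch free of network calls, so it can be run offline against logged click-through data.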
We employ the Maximum Entropy Markov Model
(MEMM) by learning users’ browsing behaviors for predicting
the category of a user’s next access. A user’s access is regarded
as a state in our behavioral MEMM. Given an observation and
its previous states, which are in terms of the above features, the
probability of reaching a state is trained via maximum entropy.
In the testing phase, the proposed MEMM reports the category
with the largest probability as the predicted result.
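The prediction step can be sketched as a maximum entropy distribution over the next state, conditioned on the previous state and the current observation. The sketch below assumes a first-order model (only the immediately preceding state is used) and a hand-written `weights` table; in practice the weights would come from maximum entropy training, which is not shown. The names `STATES`, `state_probabilities`, and `predict` are hypothetical.

```python
import math

STATES = ("legitimate", "phishing")


def state_probabilities(weights, prev_state, observation):
    """Return P(state | prev_state, observation) as a normalized exponential.

    weights maps (feature, state) pairs to real-valued parameters; pairs
    that are absent contribute 0. Features combine the previous state with
    the observed hostname, IP address, and port number.
    """
    features = [f"prev={prev_state}"]
    features += [f"{name}={value}" for name, value in sorted(observation.items())]
    scores = {s: sum(weights.get((f, s), 0.0) for f in features) for s in STATES}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {s: math.exp(v - top) for s, v in scores.items()}
    z = sum(exp_scores.values())  # normalization constant
    return {s: e / z for s, e in exp_scores.items()}


def predict(weights, prev_state, observation):
    """Report the category with the largest probability."""
    probs = state_probabilities(weights, prev_state, observation)
    return max(probs, key=probs.get)


# Hypothetical weights, chosen only to illustrate the computation.
example_weights = {
    ("port=8080", "phishing"): 2.0,
    ("prev=legitimate", "legitimate"): 0.5,
}
example_observation = {"hostname": "example.test", "ip": "192.0.2.1", "port": 8080}
predicted = predict(example_weights, "legitimate", example_observation)
```

With these illustrative weights the phishing score (2.0) exceeds the legitimate score (0.5), so `predicted` is `"phishing"`.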
The data sets came from click-through data in the Trend
Micro research laboratory. They consist of web browsing
behaviors from 76,943 anonymous worldwide users. The category
of each access was determined by manually checking the
candidate categories proposed from an analysis of the content
signature of each clicked URL.
User click-through trails were divided into two distinct data
sets as follows. (1) Training set: 99,249 clicked URLs
from November 1st to December 31st 2010 were rated as
phishing accesses. A phishing trail is denoted as u1u2...u(n-1)un
where the previous accesses u1u2...u(n-1) are legitimate and the
target URL un is phishing. For balanced learning,
we selected the same number of legitimate trails u1u2...u(m-1)um
in which all the accesses are legitimate and the hostname of um
has an edit distance of less than 4 from at least one phishing
target, because phishing URLs are usually similar to the
legitimate URLs they imitate. A total of 198,498 users’
access trails were used for training. (2) Test set: 134,432
phishing trails from January 1st to March 15th 2011 were used
for testing. All legitimate access trails from the same time
period were used to reflect real-life users’ browsing behaviors.
In total, there are 6,496,860 legitimate trails.
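The selection rule for legitimate training trails (final hostname within an edit distance of less than 4 from some phishing target) can be sketched as below. The text does not say which edit distance was used; this sketch assumes the standard Levenshtein distance, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def resembles_phishing_target(hostname, phishing_hostnames, max_distance=3):
    """True if hostname has edit distance < 4 from any phishing hostname."""
    return any(edit_distance(hostname, p) <= max_distance
               for p in phishing_hostnames)


# Illustrative hostnames only; not taken from the data sets.
keep_trail = resembles_phishing_target("example.com", ["examp1e.com"])
```

Here `keep_trail` is `True` because the two hostnames differ by a single substitution.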
The following two anti-phishing approaches based on click-
through data were compared to demonstrate their performance.
(1) Maximum Entropy: this model is a context-less method,
which only focuses on the features extracted from the target
URLs themselves. (2) Behavioral MEMM: this model is the
proposed approach for context-aware phishing threat detection.
Besides the target URL un, the previous accesses u1u2...u(n-1) are
also considered.
Table 1 shows the experimental results. The performance
difference between the two models was statistically significant
(p<0.01), no matter which metric was adopted. The Maximum
Entropy model performed slightly better than the Behavioral
MEMM model in terms of recall. This implies that features
selected from the target URL alone are already effective for
detecting phishing accesses. The
Behavioral MEMM model has better precision than the
Maximum Entropy model. This reveals that exploring the
collective intelligence embedded in browsing behaviors is
effective for predicting the categories of users’ next accesses.
The proposed model accomplished the best accuracy of 0.9973
and F1 score of 0.9311. These results show that contextual
information extracted from users’ behavioral perspectives has
a strong impact on detecting phishing threats effectively.
Table 1. Evaluation metrics for the Maximum Entropy and Behavioral MEMM models.
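For reference, the metrics discussed above (precision, recall, F1, and accuracy) follow from the four counts of a binary confusion matrix, with phishing as the positive class. The counts in this sketch are made up purely for illustration and are not the paper's results; the function name is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion counts.

    tp/fn: phishing accesses predicted as phishing/legitimate.
    fp/tn: legitimate accesses predicted as phishing/legitimate.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts, chosen only so the arithmetic is easy to follow.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
```

With these hypothetical counts, precision is 90/100, recall is 90/120, and accuracy is 960/1000, which shows how a model can trade a little recall for higher precision.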
Table 2 shows the confusion matrix of using the proposed
Behavioral MEMM model for phishing threat detection.
Experimental results indicate that our model maintained a
favorably low false positive rate of 0.0006 (i.e., 3967/(6492893
+3967)), which is the proportion of legitimate accesses that are
incorrectly predicted as phishing. We found that most of the false
positive cases are related to some specific web sites, e.g., the
error cases containing the hostname “”. This
can be solved with except-lists, which contain legitimate
hostnames that should not be predicted as phishing. Phishing URLs
that were not correctly detected result in false negative cases.
We found that some of these cases appear only in our test set.
This implies that collecting as many users’ access sequences as
possible is needed to reflect real-life users’ browsing
behaviors during web surfing.
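The false positive rate can be checked with a few lines of arithmetic, using 3967 false positives and 6492893 true negatives, which together account for the 6,496,860 legitimate test trails. The function name is an assumption introduced for this sketch.

```python
def false_positive_rate(false_positives, true_negatives):
    """Proportion of legitimate accesses incorrectly predicted as phishing."""
    return false_positives / (false_positives + true_negatives)


fpr = false_positive_rate(3967, 6492893)
print(round(fpr, 4))  # prints 0.0006
```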
We also analyze the data sets to understand the major
categories of previous accesses that lead to phishing
threats. Empirical findings indicate that, before reaching a
phishing page, many users visit web pages rated as the
“Economy,” “Shopping,” or “Auction” category, all of which
involve personal financial payments or investments. This
confirms the guideline that users should be more careful when
visiting such kinds of web pages in order to surf the web more
securely.
Table 2. Confusion matrix of the Behavioral MEMM model (predicted results).
This paper proposes a user-centric model that exploits
only users’ browsing behaviors for context-aware phishing
threat detection. Experimental results show that our behavioral
MEMM model, which explores the browsing contexts of users’
previous accesses, yields favorable results in large-scale
experiments. In practice, our cost-effective approach is a
lightweight process compared to existing content-based
analysis for blocking phishing threats.
This work is our first exploration to adopt URL information
alone for anti-phishing. More discriminative features from
users’ behavioral perspectives will be investigated in the future
to further improve real-time filtering performance.
This research was partially supported by National Science
Council, Taiwan under grant NSC99-2221-E-002-167-MY3,
and the “Aim for the Top University Project” of National
Taiwan Normal University, sponsored by the Ministry of
Education, Taiwan. We are also grateful to Trend Micro
research laboratory for the support of click-through data.
[1] Y. Zhang, J. Hong, and L. Cranor, “CANTINA: A content-based
approach to detecting phishing web sites,” In Proceedings of the 16th
International World Wide Web Conference, pp. 639-648, 2007.
[2] A. Blum, B. Wardman, T. Solorio, and G. Warner, “Lexical feature
based phishing URL detection using online learning,” In Proceedings of
the 3rd CCS Workshop on Artificial Intelligence and Security, pp. 54-60,
[3] G. Xiang and J. Hong, “A hybrid phish detection approach by identity
discovery and keyword retrieval,” In Proceedings of the 18th
International World Wide Web Conference, pp. 571-580, 2009.
[4] P. Prakash, M. Kumar, R. R. Kompella, and M. Gupta, “PhishNet:
predictive blacklisting to detect phishing attacks,” In Proceedings of the
29th IEEE Conference on Computer Communications, 2010.