PhishDef: URL Names Say It All
Anh Le, Athina Markopoulou
University of California, Irvine
{anh.le, athina}@uci.edu
Michalis Faloutsos
University of California, Riverside
michalis@cs.ucr.edu
Abstract—Phishing is an increasingly sophisticated method to
steal personal user information using sites that pretend to be
legitimate. In this paper, we take the following steps to identify
phishing URLs. First, we carefully select lexical features of
the URLs that are resistant to obfuscation techniques used by
attackers. Second, we evaluate the classification accuracy when
using only lexical features, both automatically and hand-selected,
vs. when using additional features. We show that lexical features
are sufficient for all practical purposes. Third, we thoroughly
compare several classification algorithms, and we propose to use
an online method (AROW) that is able to overcome noisy training
data. Based on the insights gained from our analysis, we propose
PhishDef, a phishing detection system that uses only URL
names and combines the above three elements. PhishDef is
a highly accurate method (when compared to state-of-the-art
approaches over real datasets), lightweight (thus appropriate for
online and client-side deployment), proactive (based on online
classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data).
I. INTRODUCTION
Phishing is continuously evolving and becoming an increas-
ingly sophisticated criminal tool to steal sensitive information
and commit crimes on the Internet. According to the latest re-
port from the Anti-Phishing Working Group [11], the number
of commercial brands being attacked by phishing just hit a
new record: 356 brands in October 2009. With major industry
targets such as financial and payment services, phishing
causes billions of dollars in losses annually [21]. Because of the
severity of the problem, the Internet community has put a
significant amount of effort into defense mechanisms.
Currently, two of the most popular services that protect
the Internet users from visiting phishing sites are the Google
Safe Browsing service [1] and the Microsoft Smart Screen
service [4]. Both services provide client browsers with URL
blacklists. The browsers, in turn, protect users from visiting the
blacklisted URLs. The major problem of this protection model
is that it is reactive: a phishing URL can only be included in
the blacklist if it has already appeared somewhere else, e.g.,
in a spam email, or has been reported by a user. A proactive
model, where brand new phishing URLs could be identified
accurately, is highly desirable to better protect the users.
We argue that, in order to provide proactive protection, the
machine learning classification engine, which is typically used
to maintain the blacklists at the server side, must be pushed to
the client browser. This would allow new URLs to be classified
on-the-fly, at the time the users click on or type in the URLs.
One of the biggest challenges of classifying URLs on-the-
fly, as opposed to off-line at the server side, is the latency
constraint. The longer it takes to obtain the classification result
of a URL, the longer a user has to wait to load that URL, and
the worse the user experience. Furthermore, since page loading
time is a decisive factor when benchmarking web browsers,
classifying URLs should not introduce high latency.
There are two types of features that can be used in URL
classification: lexical features, i.e., features which are readily
available from the URL names; and external features, i.e.,
features acquired from queries to remote servers. We refer to
lexical and external features together as full features. Lexical
features are based only on the URL names and are appropriate
for implementation at the client. External features rely on
the availability of remote servers, introduce additional latency
due to the required queries, and consume more resources
of the client, e.g., battery life and bandwidth of mobile
phones. Nonetheless, one would expect that relying on a more
comprehensive set of features, rather than lexical features only,
would lead to higher classification accuracy. In this paper, we
seek to answer the following question:
How well can one detect phishing URLs using only
lexical features compared to using full features?
To the best of our knowledge, this work is the first to
extensively study this question. We show that lexical features
are sufficient (i.e., if properly used, they can achieve accuracy
comparable to full features), and we propose a system called
PhishDef that achieves this goal.
In particular, we first introduce a way to extract lexical
features that are resistant to obfuscation. We then thoroughly
evaluate the classification accuracy achieved when using lexi-
cal features vs. full features with several state-of-the-art learn-
ing algorithms on real datasets. More specifically, we consider
the following algorithms: batch-based Support Vector Machine
(SVM), Online Perceptron (OP), Confidence-Weighted (CW),
and Adaptive Regularization of Weights (AROW). We find
that using lexical features results in a modest decrease (about
1%) in classification accuracy compared to using full features;
however, the overall accuracy is still high (96–98%). This
suggests that using lexical features is sufficient and provides
a better latency-accuracy trade-off. Moreover, our proposed
obfuscation-resistant lexical features help to boost the overall
classification accuracy across all the datasets. In particular,
the reduction of error rate is up to 27%. We also observe
that state-of-the-art online linear classification algorithms,
namely, AROW and CW, are more accurate while imposing
less memory and computing overhead compared to other
techniques. Moreover, when there is noise in the training
data (noisy labels), AROW outperforms CW. Robustness in
a noisy environment is very important because (i) it allows
for training more comprehensive classification models by
working with larger datasets, which typically include noise,
such as blacklisted URLs from Google used in [25]; and (ii)
it improves the system’s resilience to poisoning attacks, where
attackers attempt to maliciously influence the classification
models by injecting mislabeled data.
arXiv:1009.2275v1 [cs.CR] 12 Sep 2010
Based on the insights gained from our analysis, we propose
PhishDef, a classification engine that operates at the client
side, uses only lexical features, and implements the AROW
algorithm. PhishDef has the following desired properties:
• High accuracy: It has 96–97% classification accuracy,
only 1% less than full features.
• Light-weight: It has low latency and imposes a modest
amount of memory and computation overhead.
• Proactive approach: It can classify new URLs on-the-fly,
i.e., at the time the user clicks on or enters the URL at the
client side, as opposed to reactively relying on blacklists.
• Resilience to noise: It maintains high accuracy even when
trained with mislabeled data: accuracy goes from 95% down to
86% as the label noise grows from 5% to 45%.
The rest of this paper is organized as follows. Section II
discusses related work. Section III describes the datasets we
use and the feature extraction process. Section IV describes
the classification algorithms we compare. Section V presents
the evaluation results, i.e., the comparison of all algorithms
over all datasets and feature sets. Section VI discusses and
explains the classification performance. Section VII presents
PhishDef, our proposed solution based on the insights from
the analysis. Section VIII concludes the paper.
II. BACKGROUND
PhishTank [6] defines phishing as “a fraudulent attempt to
get you to provide personal information, including but not
limited to, account information.” This definition is somewhat
restricted. In this work, we adopt a broader definition of
phishing from Whittaker et al. [25], which defines a phishing
page as “any web page that, without permission, alleges to
act on behalf of a third party with the intention of confusing
viewers into performing an action with which the viewer
would only trust a true agent of the third party.”1
Garera et al. [17] studied the structure of phishing URLs.
They find four distinct categories of obfuscation techniques
that phishing URLs use. Based on these categories, they
propose eighteen manually selected features that can help to
produce high classification accuracy. Their selected features
include both lexical features and external features, such as
Google PageRank and Google page quality of the page. Part of
our work builds on these identified categories. We also propose
features that address the four common obfuscation techniques,
which, however, are directly extractable from the URL strings.
Whittaker et al. [25] describe the design of Google’s
phishing classifier used to automatically maintain Google’s
phishing blacklist. This classifier uses a wide variety of fea-
tures: from lexical features, such as whether the URL contains
an IP address, to URL metadata, such as Google PageRank, as
well as features extracted from the page content and hosting
information. While this work describes the classifier used to
maintain blacklists at the server side, our work focuses on the
design of an on-the-fly classifier at the client side.
1 This definition covers the typical case of phishing pages – pages that mimic
financial companies’ sites and request login credentials from the viewers –
and also phishing pages that display trusted companies’ logos to trick
viewers into downloading and executing malicious binaries.
In [19], Ma et al. examine the performance of several batch-
based learning algorithms on classifying malicious URLs,
which include phishing URLs and URLs present in spam
emails. The algorithms are evaluated when working with
various feature sets, for instance, host-based features, such as,
features from WHOIS queries, and lexical features. This work
shows that the combination of host-based and lexical features
results in the highest classification accuracy. This work also
hinted that using lexical features may lead to high accuracy;
however, it did not investigate this direction in sufficient depth.
Our work builds on this initial observation. We extensively
evaluate how both batch-based and online algorithms perform
when using only lexical features compared to full features.
In a follow-up work, Ma et al. [20] compare the perfor-
mance of batch-based algorithms to online algorithms when
using full features. The authors find that online algorithms, es-
pecially Confidence-Weighted (CW), outperform batch-based
algorithms. Our main difference from [20] is that we fo-
cus on lexical features instead of full features. We propose
obfuscation-resistant lexical features, show that online algo-
rithms outperform batch-based algorithms when working with
lexical features, and provide detailed analysis of the datasets
to explain why this is the case. In addition, we introduce the
use of AROW, which performs as well as CW but outperforms
CW when there is noise. To the best of our knowledge, AROW
has not been used before in the phishing context.
Other related work includes PhishNet [23], which proposes
heuristics to predict phishing URLs; the comparative analysis
of phishing and non-phishing URLs drawn from PhishTank
[6] and DMOZ [5] by McGrath and Gupta [22]; CANTINA
[26], which uses a weighted sum of 8 features (4 content-
related, 3 lexical, and 1 WHOIS) to classify phishing URLs;
the classification of phishing emails by Fette et al. [16] and
Bergholz et al. [12]; and the comparison of various tools for
detecting fake websites by Abbasi and Chen [10].
Besides Google Safe Browsing [1] and Microsoft Smart
Screen [4], mentioned in the introduction, there are other
commercial products which aim at protecting users from
phishing sites, such as McAfee SiteAdvisor [3] and WOT
Web of Trust [18]. The former incorporates proprietary feature
analysis, and the latter relies on community feedback. These
approaches are based on blacklists and are thus reactive.
III. DATASETS AND FEATURE EXTRACTION
A. Malicious and Legitimate URLs
PhishTank. PhishTank [6] is a community site where anyone
can submit, verify, and share phishing URLs. A suspected
phishing URL will be manually checked by at least 2 other
members of the site. Once verified as a phishing URL, the
URL will be included in a downloadable database. We collect
our set of phishing URLs during the one month period of
June 2010. The set consists of 4,082 verified phishing URLs
ordered by their submission time.
MalwarePatrol. MalwarePatrol [2] is a free, user-contributed
system where anyone can submit suspicious URLs that
may carry malware, viruses, or trojans. If a submitted URL
is verified as malicious by MalwarePatrol, the URL will be
put into a downloadable blacklist. We collect 2,001 malicious
URLs during the last two weeks of June 2010. We order these
URLs by their appearance time. We note that the URLs here
have different characteristics from the URLs from PhishTank
because they are crafted to spread malware while the URLs
from PhishTank are crafted to steal sensitive information.
Fig. 1. External Feature Collection Process and Datasets
Yahoo Directory. Our first set of benign URLs is collected
from the Yahoo directory. Yahoo provides a generator URL
[9], which randomly generates a URL in its directory whenever
someone visits it. We used this generator URL in mid-June
2010 to collect 4,143 random URLs.
Open Directory. We collect our second set of benign URLs
from DMOZ [5], which is one of the largest open directories
of the Web maintained by volunteer editors. We collect 4,012
random URLs from the DMOZ directory in mid-June 2010.
For the benign URLs, we order them by the order in
which we obtain them. We also note that our methodology of
collecting URL datasets is similar to recent work [19], [23].
B. External Feature Collection
We refer to features that require queries to remote servers as
external features. For each URL, we acquire external features
by querying two different remote servers:
WHOIS. We query the WHOIS server responsible for the top
level domain of the URL for its registration information, which
includes the primary domain name, the registrar, the registrant,
and the registration date. We implement our query engine by
adopting the pywhois module [8]. Intuitively, the features
that come from the WHOIS answers could play an important
role in classification. For example, a newly registered site is
more likely to be a phishing site as opposed to an old site.
Team Cymru. We also query the Team Cymru server [24] to
obtain the network information and the geolocation of each
URL. In particular, we obtain the network BGP prefix, the
AS number, and the country code. This information is
complementary to the former WHOIS information and could
potentially help with the classification as well. For instance,
multiple phishing URLs are often hosted on the same (badly
administered) subnet; as such, the network BGP prefix will
give us the desired feature to correlate these sites.
Fig. 1 illustrates our external feature collection process. We
note that collecting these external features incurs significant
latency. On average, the time it takes to collect all external
features of a URL in the PhishTank dataset is 1.64 seconds.
The latency depends on a variety of elements, such as the
load of the WHOIS and Team Cymru servers, as well as the
geolocations of the WHOIS servers.
TABLE I
COMMONLY USED URL OBFUSCATION TECHNIQUES FROM [17]
Type  Descriptive Examples
I     http://0xd3.0xe9.0x27.0x91:3030/.www.paypal.com/uk/login.html
      http://210.80.154.30/~test3/.signin.ebay.com/ebayisapidllsignin.html
II    http://2-mad.com/hsbc.co.uk/index.html
      http://21photo.cn/https://cgi3.ca.ebay.com/eBayISAPI.dllSignIn.php
III   http://sparkasse.de.redirector.webservices.aktuell.lasord.info
      http://www.volksbank.de.custsupportref1007.dllconf.info/r1/vm
IV    http://mujweb.cz/Cestovani/iom3/SignIn.html?r=7785
      http://www.wamuweb.com/IdentityManagement/
TABLE II
LEXICAL FEATURES OF A PHISHING URL
URL: www.naturenilai.com/form2/paypal/webscr.php?cmd=_login
Auto-Selected
  name=www, name=naturenilai, tld=com, dir=form2, dir=paypal,
  file=webscr, ext=php, arg=cmd, arg=login
Obfuscation-Resistant
  URL          len=54, n_dot=3, blacklist=1
  Domain Name  len=19, IP=0, port=0, n_token=3, n_hyphen=0, max_len=11
  Directory    len=14, n_subdir=2, max_len=6, max_dot=0, max_delim=0
  File Name    len=10, n_dot=1, n_delim=0
  Argument     len=11, n_var=1, max_len=6, max_delim=1
C. Feature Extraction
We now describe our process of extracting lexical and
external features and how we prepare them for classification.
1) Lexical Features: Recall that lexical features can be
directly extracted from the URL string. We adopt the approach
by Ma et al. [19], [20] to automatically select binary lexical
features. In addition, motivated by the work by Garera et al.
[17], we propose a number of obfuscation-resistant lexical
features. We show through empirical results that these features
complement the former set of features and help to capture
additional obfuscated phishing URLs.
Automatically Selected Features. The URL string is broken
down into multiple tokens. Each token constitutes a binary
feature. The delimiters used to obtain the tokens are ‘/’, ‘?’, ‘.’,
‘=’, ‘_’, ‘&’, and ‘-’. Similar to [19], [20], we distinguish
tokens that appear in the domain name, the top level domain,
the directory, and the file extension. Different from [19], [20],
we also distinguish tokens that appear in the argument part of
the URL. In other words, the same token appearing in different
parts of the URL will constitute different binary features. This
representation of the URL is known as “bag-of-words.”
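The tokenization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `tokens` and `bag_of_words`, the way the URL is split into parts, and the `name=`/`tld=`/`dir=`/`file=`/`ext=`/`arg=` prefixes (borrowed from Table II) are assumptions made for the example.

```python
import re

# Delimiters listed in the text: '/', '?', '.', '=', '_', '&', '-'.
DELIMITERS = re.compile(r"[/?.=_&-]")
SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def tokens(part):
    """Split one URL part on the delimiters and drop empty tokens."""
    return [t for t in DELIMITERS.split(part) if t]


def bag_of_words(url):
    """Return the set of position-tagged binary features of a URL."""
    url = SCHEME.sub("", url)                      # drop "http://" if present
    before_query, _, argument = url.partition("?")
    host, _, path = before_query.partition("/")
    host = host.partition(":")[0]                  # ignore an explicit port
    directory, _, file_name = path.rpartition("/")
    domain, _, tld = host.rpartition(".")
    stem, dot, ext = file_name.rpartition(".")
    if not dot:                                    # file name has no extension
        stem, ext = file_name, ""

    # The same token gets a different feature name in each URL part.
    features = {"name=" + t for t in tokens(domain)}
    if tld:
        features.add("tld=" + tld)
    features.update("dir=" + t for t in tokens(directory))
    features.update("file=" + t for t in tokens(stem))
    if ext:
        features.add("ext=" + ext)
    features.update("arg=" + t for t in tokens(argument))
    return features


example = bag_of_words("www.naturenilai.com/form2/paypal/webscr.php?cmd=_login")
```

Because each token is tagged with the part of the URL it came from, `hsbc` in a directory (`dir=hsbc`) and `hsbc` in a host name (`name=hsbc`) become two separate binary features.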
Hand-Selected (Obfuscation-Resistant) Features. In [17],
Garera et al. describe four different URL obfuscation tech-
niques that are commonly used by the attackers: (I) Obfus-
cating the host with an IP address, (II) Obfuscating the host
with another domain, (III) Obfuscating with large host names,
and (IV) Domain unknown or misspelled. Table I illustrates
these techniques. Here we propose the following hand-selected
lexical features to detect the identified obfuscation techniques;
our proposed features are classified into five categories:
(i) Features related to the full URL. These features include
the length of the URL, the number of dots in the URL,
and whether a blacklisted word appears in the URL. The
blacklist we use is similar to the one in [17], which in-
cludes the words: confirm, account, banking, secure,
ebayisapi, webscr, login, and signin; and we add
the words paypal, free, lucky, and bonus. The first
two features address Type II obfuscation while the blacklisted
words enhance the detection of Type IV obfuscation.
TABLE III
SUMMARY OF DATASETS
Pairs                # Malicious  # Legitimate  # Lexical  # External
                     URLs         URLs          Features   Features
Yahoo-Malware        2,001        4,143         8,791      16,665
Yahoo-Phish          4,082        4,143         13,821     18,786
DMOZ-Phish           4,082        4,012         14,165     9,751
DMOZ-Malware         2,001        4,012         9,129      7,548
All Good - All Bad   6,083        8,155         22,100     24,843
(ii) Features related to the domain name. These features
include the length of the domain name, whether an IP address
or a port number is used in the domain name, the number of
tokens of the domain name, the number of hyphens used in
the domain name, and the length of the longest token. These
features address obfuscation Type I, Type III, and a technique
related to Type III, where hyphens are used instead of dots.
(iii) Features related to the directory. These features include
the length of the directory, the number of sub-directory
tokens, the length of the longest sub-directory token, and the
maximum number of dots and other delimiters used in a
sub-directory token. These features mainly address Type II
obfuscation, where the obfuscated host name is put in the
directory. We also observe cases where, instead of using ‘.’,
attackers use another character, such as the underscore,
‘_’, or the dash, ‘-’, as the delimiter of the obfuscated host name.
The features in this category address these instances as well.
(iv) Features related to the file name (page name). These
features include the length of the file name, and the number of
dots and other delimiters (‘_’ and ‘-’) used in the file name.
These features also address Type II obfuscation, but in this
case, the obfuscated host name is put in the file name.
(v) Features related to the argument part. URLs that serve
pages written in server-side scripting languages, such as
php and asp, often have arguments. The features in this
category include the length of the argument part, the number
of variables, the length of the longest variable value, and the
maximum number of delimiters (‘.’, ‘_’, and ‘-’) used in a
value. We observe that phishing URLs often include a long
list of arguments, as well as auto-generated argument values,
which are often unusually long. Also, there are instances where
the host name is obfuscated in the values assigned to variables.
The features here are designed to address these instances.
Table II illustrates how we obtain all lexical features.
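As a rough illustration of the five feature categories, the sketch below computes them for a single URL. It is not the authors' code. The dictionary keys, the helper `n_delims`, the regular expressions, and the conventions chosen (the scheme is not counted in the URL length, the directory length includes its slashes, the argument length includes the ‘?’) are assumptions, picked so that the Table II example URL yields the values listed in Table II.

```python
import re

BLACKLIST = ("confirm", "account", "banking", "secure", "ebayisapi", "webscr",
             "login", "signin", "paypal", "free", "lucky", "bonus")
SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")
# Four dot-separated decimal or hexadecimal numbers, e.g. 210.80.154.30.
IP_HOST = re.compile(r"^(?:0x[0-9a-fA-F]+|\d+)(?:\.(?:0x[0-9a-fA-F]+|\d+)){3}$")


def n_delims(text, delims):
    """Count how many characters of `text` are one of `delims`."""
    return sum(text.count(d) for d in delims)


def hand_selected_features(url):
    url = SCHEME.sub("", url)                   # assumption: scheme is not counted
    before_query, qmark, query = url.partition("?")
    host_port, slash, path = before_query.partition("/")
    host, _, port = host_port.partition(":")
    directory, _, file_name = (slash + path).rpartition("/")
    directory += slash                          # keep the trailing '/', as in Table II
    subdirs = [d for d in directory.split("/") if d]
    values = [v.partition("=")[2] for v in query.split("&") if v]
    return {
        # (i) full URL
        "url_len": len(url),
        "url_n_dot": url.count("."),
        "url_blacklist": int(any(w in url.lower() for w in BLACKLIST)),
        # (ii) domain name
        "dom_len": len(host),
        "dom_ip": int(bool(IP_HOST.match(host))),
        "dom_port": int(bool(port)),
        "dom_n_token": len([t for t in host.split(".") if t]),
        "dom_n_hyphen": host.count("-"),
        "dom_max_len": max((len(t) for t in host.split(".")), default=0),
        # (iii) directory
        "dir_len": len(directory),
        "dir_n_subdir": len(subdirs),
        "dir_max_len": max((len(d) for d in subdirs), default=0),
        "dir_max_dot": max((d.count(".") for d in subdirs), default=0),
        "dir_max_delim": max((n_delims(d, "_-") for d in subdirs), default=0),
        # (iv) file name
        "file_len": len(file_name),
        "file_n_dot": file_name.count("."),
        "file_n_delim": n_delims(file_name, "_-"),
        # (v) argument part
        "arg_len": len(qmark + query),
        "arg_n_var": len(values),
        "arg_max_len": max((len(v) for v in values), default=0),
        "arg_max_delim": max((n_delims(v, "._-") for v in values), default=0),
    }


features = hand_selected_features(
    "www.naturenilai.com/form2/paypal/webscr.php?cmd=_login")
```

For a Type I URL such as the first example in Table I, the same function sets `dom_ip` and `dom_port` to 1 and reports three dots in the longest sub-directory token, which is where the obfuscated host name sits.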
2) External Features: We extract a number of binary
features and one real-valued feature from the responses we receive
from the WHOIS and Team Cymru servers. The registration
date gives the real-valued feature indicating the number of days
the site has been up. The other pieces of information that we
described in Section III-B give the binary features.
Finally, for all the real-valued features, we shift and scale
them so that their values lie between 0 and 1. The reason is
that we do not want to give an a priori preference to any particular
feature. We want the weights of the features to be adjusted by
the learning algorithms themselves.
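One common way to shift and scale a feature into [0, 1] is min-max scaling. The sketch below assumes that reading; the paper does not name the exact transformation, and the function name `min_max_scale` and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Shift and scale a list of numbers so they lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical "days since registration" values for four sites.
days_up = [3, 40, 365, 4000]
scaled = min_max_scale(days_up)
```

After scaling, a feature measured in thousands of days and a feature measured in tens of characters start on the same footing, so the learner's weights alone decide their relative importance.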
D. Summary of Datasets
We prepare the data for the classification algorithms by
combining the legitimate with the malicious URL datasets.
In total, we have 5 pairs: Yahoo and PhishTank (Yahoo-
Phish); Yahoo and MalwarePatrol (Yahoo-Malware); Open
Directory and PhishTank (DMOZ-Phish); Open Directory and
MalwarePatrol (DMOZ-Malware); and all good and all bad
URLs (All Good - All Bad), where we combine Yahoo with
Open Directory and PhishTank with MalwarePatrol. When
combining a legitimate dataset with a malicious dataset, we
interleave the URLs of the two sets so that the classification
algorithms would get a balanced number of instances of both
classes when training their models. Table III provides the
statistics of these five pairs.
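The text does not say how two lists of different sizes are interleaved. One plausible scheme, sketched below as an assumption, spreads the shorter list evenly through the longer one while preserving the order within each list. The function name `interleave` is hypothetical; labels follow the convention of Section IV (1 for malicious, -1 for legitimate).

```python
def interleave(legit, malicious):
    """Merge two ordered URL lists into one labeled stream.

    Each list keeps its own order; at every step the list that has
    consumed the smaller fraction of its items goes next.
    """
    merged = []
    i = j = 0
    while i < len(legit) or j < len(malicious):
        take_legit = j >= len(malicious) or (
            i < len(legit) and i * len(malicious) <= j * len(legit))
        if take_legit:
            merged.append((legit[i], -1))
            i += 1
        else:
            merged.append((malicious[j], 1))
            j += 1
    return merged


stream = interleave(["a", "b", "c", "d"], ["x", "y"])
```

With four legitimate and two malicious URLs, the malicious ones land in positions two and five rather than being appended at the end, so an online learner sees both classes throughout training.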
IV. CLASSIFICATION ALGORITHMS
In this section, we describe four state-of-the-art classifica-
tion algorithms that we investigate in this work. These include
both batch-learning (Support Vector Machine (SVM)) and on-
line learning algorithms (Online Perceptron (OP), Confidence-
Weighted (CW), and Adaptive Regularization of Weights
(AROW)). All these algorithms come from the machine learn-
ing community; to the best of our knowledge, AROW, which
turns out to outperform the rest and become our choice, has
not been used in the phishing context before. We start by
introducing the notation and describing the general difference
between batch-based and online classification.
Notation. Denote the features of a URL as a vector x and its
label as y ∈ {1,−1}, where 1 indicates the URL is malicious
and −1 indicates otherwise. A classification algorithm receives
a number of data vectors, $x_i$, together with their labels, $y_i$,
and trains its model based on these labeled data. Then, given
a new data vector, x, the goal of the algorithm is to predict
the label, y, of this new data based on its trained model. For
SVM and OP, the model is a weight vector, w. For CW
and AROW, in addition to w, the model also includes the
covariance matrix of w, Σ. For all algorithms, the prediction,
h(x), is the sign of the inner product between w and x:
$$h(x) = \operatorname{sign}(w \cdot x) \tag{1}$$
Batch-based vs. Online. A batch-based algorithm initially
trains its model based on a batch of labeled data. It then uses
the trained model to predict the labels of new data. After
some time, it retrains its model based on a new batch of
labeled data. Meanwhile, an online classification algorithm
continuously retrains its model upon receiving each labeled
example and predicts the label of a new example using the latest
updated model. Because training a model of a batch-based
algorithm requires a batch of data, batch-based algorithms
require significantly more memory than online algorithms.
A. Batch Learning
1) Support Vector Machine (SVM): The SVMs are
widely known for achieving accurate classification of
high-dimensional data. They are also shown recently to perform
well in the arena of classifying malicious URLs [19], [20]. An
SVM constructs a hyperplane that gives the largest distance to
the nearest training data points of any class. Finding this
hyperplane involves solving an instance of quadratic programming.
The label of a new data point is predicted by determining on
which side of the hyperplane this point lies. For a tutorial on
SVMs, we refer the reader to [7]. In this work, we investigate
the performance of batch-based SVMs.
B. Online Learning
The online algorithms discussed below operate in rounds.
In round t, an online algorithm receives $x_t$ and predicts its
label as $\hat{y}_t$ using the current model; it then receives the true
label, $y_t$, and updates its model based on $(x_t, y_t)$.
1) Online Perceptron (OP): OP updates w only
on error. In particular, w is updated if the predicted label,
$\hat{y}_t = \operatorname{sign}(w_t \cdot x_t)$, disagrees with the true label, $y_t$, of $x_t$.
The update is as follows:
$$w_{t+1} \leftarrow w_t + y_t x_t. \tag{2}$$
OP suffers from a significant drawback: the update rate is
fixed and does not take into account the magnitude of the
classification error. As a result, when it makes a prediction error, the
model may not adapt fast enough to the change of the data,
or it may make a drastic change even when the error is small.
Both cases lead to poor classification accuracy.
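A minimal sketch of OP over sparse feature dictionaries follows. The class and method names are hypothetical, and one detail is an assumption: a score of exactly zero is mapped to −1 (legitimate), since the sign function leaves that case open.

```python
def dot(w, x):
    """Inner product of a sparse weight dict and a sparse feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


class OnlinePerceptron:
    def __init__(self):
        self.w = {}                      # all weights start at zero

    def predict(self, x):
        # h(x) = sign(w . x); a zero score is treated as -1 by assumption.
        return 1 if dot(self.w, x) > 0 else -1

    def learn(self, x, y):
        """One round: predict, then apply update (2) only on a mistake."""
        y_hat = self.predict(x)
        if y_hat != y:
            for f, v in x.items():
                self.w[f] = self.w.get(f, 0.0) + y * v
        return y_hat


phish = {"tld=com": 1.0, "dir=paypal": 1.0}       # label  1 (malicious)
benign = {"tld=com": 1.0, "name=example": 1.0}    # label -1 (legitimate)
clf = OnlinePerceptron()
clf.learn(phish, 1)
clf.learn(benign, -1)
```

Every mistake changes each active weight by the same fixed amount, regardless of how wrong the prediction was. This is the fixed update rate discussed above.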
2) Confidence Weighted (CW): CW is a linear binary
classification algorithm recently introduced by Dredze et al.
[13]. CW captures the notion of confidence in the weight of a
feature. Intuitively, if the weight of a feature does not change
very much over time, then one should be more confident that
this weight is what it should be. With this confidence notion,
CW addresses the drawback of OP through two mechanisms:
First, CW updates the weights of the more confident features
less aggressively. For instance, using an IP address in the
domain name is a strong indicator of maliciousness; as a result,
it does not get updated abruptly over a period of time, thereby
having a high confidence value. Then, CW makes sure that this
weight will not change much even when it sees an instance of
legitimate URL using an IP address in its domain name.
Second, CW does not change the weights too much but just
enough to correct for the mistake. In other words, CW updates
its model just enough to adapt to the change of the data, while
trying to avoid changing too much. The rationale is that the
previous model carries a lot of valuable information about the
data and should not be changed too abruptly.
Formally, CW maintains a Gaussian distribution over the
weights with mean µ and covariance matrix Σ. The value $\mu_i$
represents what is known about the weight $w_i$, and the value
$\Sigma_{i,i}$ captures the confidence in the weight of feature i. To
classify a new data point x, the weight w is drawn from N(µ,Σ).
In practice, one can pick w = µ, the average weight vector.
The prediction is then as usual: h(x) = sign(w · x).
Unlike OP, CW updates its model, i.e., µ and Σ, on every
labeled example instead of only when it makes a mistake.
This is because a correct prediction also suggests that
confidence in the current weights should increase. The update
rule is as follows:
$$(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu,\Sigma} D_{KL}\big(\mathcal{N}(\mu,\Sigma) \,\|\, \mathcal{N}(\mu_t,\Sigma_t)\big), \tag{3}$$
$$\text{s.t.} \quad \Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\big[y_t (w \cdot x_t) \ge 0\big] \ge \eta. \tag{4}$$
Eq. (3) expresses that the new distribution given by the new
µ and Σ should be as close to the old distribution as possible.
The distance between the two distributions is measured by
the KL divergence ($D_{KL}$). Eq. (4) expresses that the update
should be large enough that the probability of making a correct
prediction when seeing the same data in the next round is
at least η, where η is a configurable parameter that
must be larger than 50%.
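Because w is Gaussian, the probability in constraint (4) has a closed form. The short derivation below is the standard one for CW; Φ (the standard normal CDF) and φ = Φ⁻¹(η) are symbols introduced here for illustration, and the derivation assumes that the margin variance is strictly positive.

```latex
% Margin of (x_t, y_t) under w ~ N(mu, Sigma); note that y_t^2 = 1.
\[
  M = y_t (w \cdot x_t) \;\sim\;
  \mathcal{N}\!\left( y_t(\mu \cdot x_t),\; x_t^{\top} \Sigma\, x_t \right)
\]
% Probability of a correct prediction, and the equivalent form of (4).
\[
  \Pr[M \ge 0]
  = \Phi\!\left( \frac{y_t(\mu \cdot x_t)}{\sqrt{x_t^{\top} \Sigma\, x_t}} \right)
  \ge \eta
  \;\iff\;
  y_t(\mu \cdot x_t) \ge \phi \sqrt{x_t^{\top} \Sigma\, x_t},
  \qquad \phi = \Phi^{-1}(\eta)
\]
```

Since η is larger than 50%, φ is positive: the mean margin must not only be positive but must also exceed a multiple of its own standard deviation, which is how confidence enters the update.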
We refer the reader to [13] for more details. The compu-
tational complexity of the update is linear in the number of
non-zero features in xt. The memory required is constant in
the input data, i.e., the memory for the current x.
3) Adaptive Regularization of Weights (AROW): The final
algorithm in this category that we examine is the AROW
algorithm by Crammer et al. [14]. AROW can be considered as
a modification of CW so that the classifier is more robust in the
presence of label noise. For example, if ‘whitehouse.gov’
is wrongly labeled as malicious (by an adversary) and fed to
CW, then CW will make changes to all features that this URL
has so that in the next time slot, if it sees this URL again, it will
be likely to flag this URL as malicious. CW, therefore, may
drastically increase the weight of the feature “top level domain
is .gov”. AROW avoids this drastic behavior by softening the
formulation of CW.
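Formally, Crammer et al. [14] recast the constraint (4) of
CW as regularizers. The update rule is now as follows:
$$(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu,\Sigma} D_{KL}\big(\mathcal{N}(\mu,\Sigma) \,\|\, \mathcal{N}(\mu_t,\Sigma_t)\big) + \lambda_1 \ell_{h^2}(y_t, \mu \cdot x_t) + \lambda_2 x_t^{\top} \Sigma x_t, \tag{5}$$
where $\ell_{h^2}(y_t, \mu \cdot x_t) = (\max\{0, 1 - y_t(\mu \cdot x_t)\})^2$ is the
squared-hinge loss suffered using µ to predict the label for $x_t$ when
its true label is $y_t$, and $\lambda_1$ and $\lambda_2$ are configurable parameters.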
Compared to CW, the optimization problem becomes un-
constrained. Consider the right hand side of (5): its first term
expresses that the new distribution should be as close to the old
distribution as possible. Intuitively, AROW tries to preserve the
valuable information of the old model as much as possible.
The second term expresses that the new parameters should
be able to predict the current example with low loss. Through
this term, AROW adapts to the change of the data. Finally, the
last term expresses that the confidence in the weights should
generally grow.
Similarly to CW, the running time of the update is linear
in the number of non-zero features in xt. The memory
requirement is constant in terms of the input data. We refer the
reader to [14] for more details. To the best of our knowledge,
this is the first time that AROW is used in the phishing context.
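For illustration, the sketch below implements the closed-form AROW update given by Crammer et al. [14] for the special case λ1 = λ2 = 1/(2r). Two simplifications are assumptions made here to keep the example short and the memory footprint linear in the number of features: only the diagonal of Σ is stored, and a zero score is mapped to −1. The class and method names are hypothetical.

```python
class DiagonalAROW:
    """AROW with a diagonal covariance (an approximation of the full Sigma)."""

    def __init__(self, r=1.0):
        self.r = r          # regularization parameter, lambda1 = lambda2 = 1/(2r)
        self.mu = {}        # mean weights, default 0.0
        self.sigma = {}     # per-feature variances, default 1.0

    def margin(self, x):
        return sum(self.mu.get(f, 0.0) * v for f, v in x.items())

    def predict(self, x):
        return 1 if self.margin(x) > 0 else -1

    def learn(self, x, y):
        """One round: predict, then update mu and sigma if the hinge loss is positive."""
        m = self.margin(x)
        y_hat = 1 if m > 0 else -1
        loss = max(0.0, 1.0 - y * m)
        if loss > 0.0:
            # v is the margin variance x^T Sigma x for a diagonal Sigma.
            v = sum(self.sigma.get(f, 1.0) * val * val for f, val in x.items())
            beta = 1.0 / (v + self.r)
            alpha = loss * beta
            for f, val in x.items():
                s = self.sigma.get(f, 1.0)
                self.mu[f] = self.mu.get(f, 0.0) + alpha * y * s * val
                self.sigma[f] = s - beta * s * s * val * val   # variance shrinks
        return y_hat


clf = DiagonalAROW(r=1.0)
clf.learn({"dir=paypal": 1.0, "tld=com": 1.0}, 1)
```

Each update shrinks the variance of the features it touches, so later examples move those weights less. A single mislabeled URL therefore has a bounded effect on well-established weights, which is the behavior motivating the choice of AROW for noisy training data.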
V. EVALUATION RESULTS
We conduct four sets of experiments on various datasets
in order to (i) compare batch-based to online algorithms
when using just lexical features, (ii) compare using lexical
features to using full features, (iii) evaluate the effectiveness
of obfuscation-resistant lexical features, and (iv) evaluate the
resilience of AROW when working with noisy data. Table IV
summarizes the experiment scenarios.