Use of HOG Descriptors in Phishing Detection
Ahmet Selman Bozkir
Hacettepe University Dept. of Computer Engineering
Ankara, Turkey
selman@cs.hacettepe.edu.tr
Ebru Akcapinar Sezer
Hacettepe University Dept. of Computer Engineering
Ankara, Turkey
ebru@hacettepe.edu.tr
Abstract— Phishing is a scamming activity that creates a visual illusion for computer users by providing fake web pages which mimic their legitimate targets in order to steal valuable digital data such as credit card information or e-mail passwords. In contrast to other anti-phishing attempts, this paper proposes to evaluate and address this problem with a purely computer vision based method built on the concept of web page layout similarity. The proposed approach employs the Histogram of Oriented Gradients (HOG) descriptor in order to capture cues of page layout without the time-consuming intermediate stage of segmentation. Moreover, the histogram intersection kernel has been used as the similarity metric. Thus, an efficient and fast phishing page detection scheme has been developed in order to combat zero-day phishing page attacks. To verify the efficiency of our phishing page detection mechanism, 50 unique phishing pages and their legitimate targets have been collected. Furthermore, 100 pairs of legitimate pages have been gathered. As the next stage, the similarity scores in these two groups were computed and compared. According to the promising results, a similarity degree of around 75% and above is adequate for raising an alarm.
Keywords— Phishing; Anti phishing; Computer vision; HOG;
Page layout
I. INTRODUCTION
With the advent of e-commerce and on-line payment systems, traditional banking transactions have evolved into online banking operations. This progress has not only eased daily life but also attracted scammers seeking to steal private personal data (e.g. credit card information, online governmental credentials). Thus, a web based attack called 'phishing' emerged. Regarding the term itself, [1] reports that it derives from the concept of 'fishing' for a target. In essence, phishing is a scamming activity that creates a visual illusion for computer users by providing fake web pages which mimic their legitimate counterparts in order to steal valuable digital data such as usernames or e-mail passwords.
Even though various attempts exist to protect internet users from phishing attacks, the amount of financial loss and the number of new phishing web sites are still rising. According to the quarterly reports of the Anti-Phishing Working Group (APWG) [2], 123,741 unique phishing web pages were reported in the first half of 2014, while the second half of 2013 had seen 115,565 phishing cases. According to [2], the average lifetime of phishing sites was determined to be around 32 hours, and the median was reported as 8 hours and 42 minutes. This implies that a typical phishing web site lives for less than one day. Therefore, the need for zero-day phishing detection mechanisms has emerged in recent years, since the well-known blacklist approaches have remained incapable of combating these state-of-the-art attacks.
Phishing detection works can be classified in more than one way. In general, however, the studies in the anti-phishing literature fall into four groups: (i) real-time black-list, whitelist, e-mail filtering and related reactive solutions, (ii) content based works, (iii) page structure similarity based attempts, and (iv) computer vision based studies.
Black-list approaches rely on gathering phishing web site URLs from various sources. In general, whenever a user visits a web page, its URL is queried against the generated black-list corpus and access is allowed or blocked according to the result. As an instance, Google Safe Browsing for Firefox [3], a black-list based Firefox browser extension, warns users whenever sensitive financial information is requested by a suspicious page. However, due to the previously stated fast take-down cycles, black-list based solutions have lost their effectiveness. On the other hand, as stated in [4], whitelist approaches are based on building a feature library from those legitimate web pages which are likely to be imitated by phishers. Zhang et al. in [4] pointed out the limitation of this technique by stating that "as the white list approach is based on similarity search instead of exact matching, its detection speed is greatly affected by the feature library size and searching strategy".
Anti-phishing studies based on page structure similarity focus on seeking similarities among structural features of web pages such as DOM trees and the style and size of HTML elements in corresponding blocks. Liu et al. in [5] proposed an anti-phishing system that computes block-level, layout and style similarities by considering size and style information embedded in HTML elements. Similarly, Medvet et al. in [6] suggested an approach based on page features extracted from the DOM tree representation, such as the style of text pieces (e.g. font color, font size) and the 2D Haar wavelet transform of images.
Although structure based studies achieve good results, they suffer from three main shortcomings. First, different DOM organizations can be rendered in the same way, which makes these types of attempts vulnerable to attackers. Second, DOM trees are not a fully reliable source of information, since scammers are able to create dynamically loading phishing web pages by utilizing client side programming techniques. Third, as of 2016, instead of inline coding, the style information of page elements is usually stored in CSS (Cascading Style Sheets) files, which forces a DOM based anti-phishing mechanism into complex and time consuming deeper analyses. These facts reduce the effectiveness of DOM based solutions.
In recent years there has been a growing trend towards computer vision techniques in phishing detection, for several reasons. First of all, since a web page is itself a kind of visual stimulus, computer vision techniques are well suited to analyzing and evaluating visual similarities via appropriate features. Second, pure vision based approaches are classified as proactive solutions which are robust to zero-day attacks. On the other hand, as the main trick of phishing is mimicking legitimate web sites, scammers have started to create polymorphic web pages to breach the defenses of anti-phishing mechanisms. In order to evade phishing detection systems, attackers apply different representation techniques to create visually similar web pages, a practice called phishing page polymorphism [7]. Since one or several portions of a fake web page can be composed of images or other types of interactive content, polymorphic pages cannot be detected by traditional DOM based methods. Furthermore, the addressed drawbacks and limitations of structure based methods strengthen the case for computer vision methods in phishing detection.
In the literature, there is a limited but growing number of studies on phishing detection via image processing or computer vision methods. For instance, Maurer and Herzner in [8] employed texture and color histograms gathered from phishing and legitimate web pages in order to detect phishing. As another work, Lam et al. in [7] proposed a layout-aware similarity metric to be used in phishing detection. They first segmented the screenshots of web pages in order to reveal page blocks and then measured block pair matches by considering properties such as size, location and symmetry. Although it achieves good detection rates, Lam et al.'s approach struggles to segment web pages having complex background textures. On the other hand, in [9] Fu et al. applied the Earth Mover's Distance (EMD) algorithm in an image-based anti-phishing framework in order to compute the similarity between legitimate and fake web pages. However, as stated by Lam et al. in [7], the EMD based approach has some shortcomings: (1) all the web pages must have the same aspect ratio; (2) EMD can cause false alarms when the color dispositions of two legitimate pages are similar. As another study, Wang et al. in [10] presented a phishing detection strategy based on capturing logo similarity via scale and rotation invariant SIFT [11] features. While their approach differs from the others by focusing only on logos, some important issues related to diversity and typography in logo types were addressed. The approach closest to our study was proposed in Rao and Ali's paper [1]. They employed the SURF detector to extract visual features from suspicious and genuine web pages and then measured the overall visual similarity by use of these scale and rotation invariant features. Nevertheless, while their work obtained scale and rotational invariance, it lacks support for partial similarity.
In this paper, we try to detect phishing web pages by considering the following aspects:
a. A well designed phishing page must mimic the legitimate one. Therefore the layout of the phishing page must be similar or identical to that of its target in order to deceive even expert users. The importance and role of layout in visual perception has been emphasized in [12].
b. Smart attackers who are aware of the state-of-the-art anti-phishing solutions consciously make modifications (e.g. adding or removing minor contents) to the legitimate pages while developing their phishing pages. Therefore an efficient anti-phishing solution must consider computing partial similarities as well as overall similarity.
c. Considering complex backgrounds, the segmentation stage should be eliminated in order to build a robust, efficient and effective anti-phishing mechanism.
d. In contrast to other approaches such as EMD, an efficient phishing detection solution should be able to generate fast and easy-to-calculate page signatures which can be employed for real-time corpus querying at later stages.
With these design considerations in mind, we propose a phishing detection scheme which employs Histogram of Oriented Gradients (HOG) descriptors in order to capture visual cues of page layout. By use of HOG, efficient and effective layout signatures belonging to suspicious and legitimate web pages were generated and compared. The main motivation of this study is to verify whether HOG descriptors are suitable for the field of phishing detection. According to the promising results of the conducted experiments, a similarity value of 75% appears to be a good threshold for a phishing alarm.
The remainder of this paper is organized as follows. Section 2 overviews the HOG descriptor and its features. Section 3 describes the proposed approach and its application environment. Section 4 reports the results of the conducted experiments and Section 5 concludes the paper.
II. METHODOLOGY
Introduced by Dalal and Triggs [13], the Histogram of Oriented Gradients is a powerful computer vision method which has been used for characterizing and capturing local object appearance or shape by utilizing the distribution of intensity gradients or edge directions. In essence, HOG descriptors are designed to represent and reveal the orientations in a local patch of an image. Since their introduction, HOG descriptors have been employed in various fields such as moving object detection [14] and shape representation [15]. HOG descriptors were preferred in this study for the following reasons: (i) they are able to capture visual cues of the overall page layout; (ii) they provide a certain degree of rotation and translation invariance.
Extracting HOG descriptors requires three main steps: (i) gradient computation, (ii) orientation binning and (iii) block normalization. At the first stage, a grid of equal sized cells is obtained by dividing the image. At the second stage, the gradient vector of each pixel is converted to an angle, and orientation bins are built according to angle ranges. Next, a normalization stage which works on grouped cells (blocks) is carried out in order to compensate for illumination variations and obtain more robust results. At the final stage, the normalized histograms are concatenated and the final descriptor is formed.
During the gradient computation, different kinds of derivative masks (e.g. Sobel or 2×2 diagonal masks) can be employed. In [13], for the particular case of human detection, it was found that the simplest $[-1, 0, 1]$ and $[-1, 0, 1]^{T}$ kernels give the best classification results. Given an image $I$, the gradients along the two axes are first computed by convolving $I$ with these kernels: $G_x = I * [-1, 0, 1]$ and $G_y = I * [-1, 0, 1]^{T}$. Due to the limited number of pages, HOG descriptors are not explained comprehensively in this paper. For further reading, Dalal and Triggs' paper [13] is recommended.
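To make the first two stages concrete, the following C++ sketch (using OpenCV, on which the implementation described in Section III also builds) computes the gradients with the simple $[-1, 0, 1]$ kernels and accumulates magnitude-weighted orientation votes per cell. The function name and parameter defaults are illustrative only; block normalization and concatenation are omitted.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Sketch of HOG gradient computation and orientation binning (steps i and ii).
std::vector<float> cellHistograms(const cv::Mat& gray, int cellSize = 128, int nBins = 9)
{
    cv::Mat img;
    gray.convertTo(img, CV_32F);

    // Gradients with the simple [-1, 0, 1] kernels: Sobel with ksize = 1
    // applies exactly this centered difference, without smoothing.
    cv::Mat gx, gy;
    cv::Sobel(img, gx, CV_32F, 1, 0, 1);
    cv::Sobel(img, gy, CV_32F, 0, 1, 1);

    // Per-pixel magnitude and orientation (in degrees).
    cv::Mat mag, ang;
    cv::cartToPolar(gx, gy, mag, ang, true);

    // Magnitude-weighted votes into nBins unsigned-orientation bins per cell.
    int cellsX = img.cols / cellSize, cellsY = img.rows / cellSize;
    std::vector<float> hist(cellsX * cellsY * nBins, 0.f);
    for (int y = 0; y < cellsY * cellSize; ++y)
        for (int x = 0; x < cellsX * cellSize; ++x) {
            float a = ang.at<float>(y, x);
            if (a >= 180.f) a -= 180.f;                   // unsigned gradients
            int bin = std::min(nBins - 1, int(a / (180.f / nBins)));
            int cell = (y / cellSize) * cellsX + (x / cellSize);
            hist[cell * nBins + bin] += mag.at<float>(y, x);
        }
    return hist;   // block normalization and concatenation would follow
}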
Following the construction of the concatenated feature vector, the similarity of two pages is calculated via the histogram intersection kernel expressed in equation (1):

$K_{HI}(h_1, h_2) = \sum_{i=1}^{n} \min\big(h_1(i), h_2(i)\big)$   (1)

where $h_1$ and $h_2$ are the concatenated HOG histograms of the two pages and $n$ is their length.
Application of HOG descriptors and histogram intersection
kernel is depicted in Fig. 1.
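As an illustration, a minimal C++ sketch of the histogram intersection kernel over two concatenated HOG vectors is given below; it assumes both vectors have the same length and are L1-normalized, so the score lies in [0, 1] and can be read as a percentage.

#include <algorithm>
#include <vector>

// Histogram intersection similarity between two concatenated HOG vectors.
double histogramIntersection(const std::vector<float>& h1,
                             const std::vector<float>& h2)
{
    double inter = 0.0;
    for (size_t i = 0; i < h1.size() && i < h2.size(); ++i)
        inter += std::min(h1[i], h2[i]);
    return inter;   // in [0, 1] for L1-normalized inputs
}

// Example use: a score of 0.75 (75%) or above would raise a phishing alarm
// against the threshold reported in Section IV.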
III. PROPOSED APPROACH
The main trick of phishing attacks is to deceive users by creating visually similar fake pages which seem identical to their legitimate counterparts. In this way, even expert users can be deceived, since the visual appearance of the fake page and its target cannot easily be differentiated. Our proposed approach is primarily designed to detect these types of zero-day phishing attacks by using HOG descriptors to capture layout cues.
Our system consists of two modules. The first module, called "Wrapper", was designed and implemented in order to determine the effective page boundaries and take a screenshot of the web page. After the target ROI (Region of Interest) is revealed, the effective portion is cropped and prepared as the input to the next module. The "Wrapper" module was coded in C# and employs the Mozilla GeckoFX API [16].
The second module, called "Hogger", was implemented in order to take a JPEG file and output a concatenated HOG feature vector. The "Hogger" module was coded in native C++ for high performance. According to the given parameters, "Hogger" outputs a concatenated feature vector of varying size.
A. Identifying Region of Interest
Web pages are currently designed especially for wide screens. However, due to backwards compatibility concerns, most web pages are still compatible with a 1024 pixel wide screen resolution. On the other hand, a web page may be wider than 1024 pixels, and there is no limit on its absolute height. This makes it essential to determine the effective and discriminative region of interest on web pages. Bozkir and Akcapinar Sezer in [12] have pointed out that the most significant and discriminative visual information in web pages is found in the topmost 1024 pixels, which are visible without scrolling. Therefore, it was decided to crop and use the topmost 1024 pixels. For the sake of computational simplicity and the convention addressed above, the width of the ROI was also set to 1024 pixels.
In order to take a proper screenshot, we employed the Mozilla GeckoFx.NET browser API. The "Wrapper" window was precisely set for taking 1024 pixel wide screenshots. At the next stage, we cropped away the portion below 1024 pixels. For the cases where the height of the web page is lower than 1024 pixels, we applied a dominant color detection method to fill the empty lowest part in order to obtain a full square input image. In this way, input images were generated in accordance with the existing dominant color of the web page. Finally, the output image was converted to grayscale in order to increase the gradient computation accuracy.
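The ROI preparation step can be sketched as follows in C++ with OpenCV (the "Wrapper" itself is a C# component; this sketch only mirrors the described behavior). Using the mean color as the dominant color is a simplifying assumption, since the exact dominant color detection method is not detailed here.

#include <opencv2/opencv.hpp>

// Keep the topmost 1024x1024 pixels, pad short pages with a "dominant" color,
// and convert to grayscale. Assumes a 1024 px wide BGR screenshot.
cv::Mat prepareRoi(const cv::Mat& screenshot)
{
    const int side = 1024;
    cv::Mat roi(side, side, screenshot.type());

    if (screenshot.rows >= side) {
        screenshot.rowRange(0, side).copyTo(roi);      // crop below 1024 px
    } else {
        roi.setTo(cv::mean(screenshot));               // fill with mean color (assumption)
        screenshot.copyTo(roi.rowRange(0, screenshot.rows));
    }

    cv::Mat gray;
    cv::cvtColor(roi, gray, cv::COLOR_BGR2GRAY);       // grayscale input for HOG
    return gray;
}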
B. Revealing the Cues of Page Layout via HOG Descriptors
As stated before, the main idea behind this approach is to represent the page layout with the help of the distribution of oriented gradients in a grid of equal sized cells. In order to determine the appropriate cell size, we applied two different grid configurations. In the first configuration (HOG128), the 1024×1024 pixel input image was divided into a grid of 8×8 cells with a side length of 128 pixels. In the second configuration (HOG64), the side length of the square cells was reduced to 64 pixels, which yields a 16×16 grid in total. By using these two grid configurations, we aimed to understand and evaluate the effect of the level of detail. Moreover, the translational and rotational invariance properties of HOG were also examined under the different configurations.

Fig. 1. HOG features generation and computing similarity
For block normalization, the L2-norm scheme, $v \rightarrow v / \sqrt{\lVert v \rVert_2^2 + \epsilon^2}$, was selected. For general use, "Hogger" was enabled to take parameters such as cell size, block size and bin number. The native HOG implementation was adopted from the open source OpenCV [17] project. According to our measurements, HOG feature extraction takes less than one second on a computer with an Intel® Core™ i5-2430M processor and 4 GB RAM.
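For illustration, the two grid configurations can be expressed through OpenCV's HOGDescriptor as sketched below. The block size, block stride and the use of a single cell per block are assumptions made for this sketch, and stock OpenCV applies L2-Hys block normalization rather than the plain L2-norm stated above, so the adopted implementation necessarily differs in that detail.

#include <opencv2/opencv.hpp>
#include <vector>

// Layout signature of a 1024x1024 grayscale ROI for cellSide = 64 or 128.
std::vector<float> pageSignature(const cv::Mat& gray1024, int cellSide)
{
    cv::HOGDescriptor hog(
        cv::Size(1024, 1024),              // window = whole ROI
        cv::Size(cellSide, cellSide),      // block = one cell (assumption)
        cv::Size(cellSide, cellSide),      // block stride
        cv::Size(cellSide, cellSide),      // cell size: 64 -> 16x16 grid, 128 -> 8x8 grid
        9);                                // orientation bins
    std::vector<float> descriptor;
    hog.compute(gray1024, descriptor);     // concatenated layout signature
    return descriptor;                     // 2304 floats for 64, 576 for 128
}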
C. Use Case Scenario
As mentioned before, the proposed system was designed to detect zero-day phishing attacks. In order to achieve this goal, we first collect the URLs of legitimate pages LPi which carry a potential phishing risk, and the layout signature of each LPi is stored in the legitimate corpus database along with its root domain. Once all the pages that need phishing protection have been loaded into the central corpus, a suspicious page SPj can be checked against the legitimate corpus in order to verify whether it has a highly similar legitimate target. During the verification process, the Histogram Intersection Kernel (HIK) is employed as the similarity metric. If a corresponding legitimate page is found, the root domains of LPi and SPj are compared. If the root domains are different, the system notifies the user of a phishing page. The proposed system is depicted in Fig. 2.
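A minimal sketch of this verification flow is given below; the corpus entry layout, function names and the 0.75 threshold (taken from the experiments in Section IV) are illustrative rather than a description of the deployed system.

#include <string>
#include <vector>

struct CorpusEntry {
    std::string rootDomain;          // e.g. "example.com"
    std::vector<float> signature;    // concatenated HOG vector of LPi
};

// Sketched in Section II: histogram intersection similarity in [0, 1].
double histogramIntersection(const std::vector<float>&, const std::vector<float>&);

bool isPhishing(const std::string& suspiciousDomain,
                const std::vector<float>& suspiciousSignature,
                const std::vector<CorpusEntry>& corpus,
                double threshold = 0.75)
{
    for (const CorpusEntry& lp : corpus) {
        double sim = histogramIntersection(suspiciousSignature, lp.signature);
        // High layout similarity to a protected page served from a different
        // root domain is treated as a phishing indicator.
        if (sim >= threshold && suspiciousDomain != lp.rootDomain)
            return true;
    }
    return false;
}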
IV. EXPERIMENT AND RESULTS
A. Experiments
In order to verify whether or not the HOG method can be a suitable feature extraction method for phishing detection, we conducted an experiment using two test data sets. To establish the first test set, we collected 50 unique phishing pages reported on Phishtank [18] between 14 December 2015 and 5 January 2016. The adjective "unique" here means that the pages have unique visual appearances. As expected, most of the gathered phishing pages target e-commerce, online payment and banking web sites. We also gathered the legitimate targets of these pages. Since the take-down cycles of phishing pages are very short, we decided to store the phishing pages in a local folder. Thus, these web page pairs were saved in HTML format by using a freeware utility.
For the second test set, we collected 18 legitimate home pages from the Alexa [19] top 500 web site directory. Afterwards, we shuffled the page URLs in order to obtain 100 distinct legitimate home page pairs.
In order to assess whether or not HOG descriptors are applicable features for phishing detection, we followed this procedure:
1. For test set 1, we computed the similarity scores of the pairs with the HOG-64 and HOG-128 configurations. The results are depicted in Fig. 3 and the related statistics are listed in Table I.
2. For test set 2, we computed the similarity scores of the unique legitimate page pairs with the HOG-64 and HOG-128 configurations. The results are depicted in Fig. 4 and the related statistics are listed in Table II.
TABLE I. STATISTICS OF PHISHING PAIRS IN HOG-64 AND HOG-128

Similarity of Pairs of Phishing Pages (50 pages)
Statistics            HOG-64 px cells    HOG-128 px cells
min                   51.873 %           49.910 %
max                   98.861 %           98.390 %
mean                  78.868 %           78.637 %
standard deviation    12.147 %           10.963 %
TABLE II. STATISTICS OF UNIQUE LEGITIMATE PAGE PAIRS IN HOG-64 AND HOG-128

Similarity of Pairs of Legitimate Pages (100 unique pairs)
Statistics            HOG-64 px cells    HOG-128 px cells
min                   38.420 %           45.683 %
max                   74.459 %           77.092 %
mean                  60.739 %           66.012 %
standard deviation    11.026 %            9.492 %
Fig. 2. Module design of proposed system
B. Results and Discussion
According to the obtained results, the following conclusions
were deduced:
The similarity scores between the pairs in test set 1 and test set 2 were found to be notably different, which indicates that HOG descriptors are suitable for phishing detection tasks.
When the charts in Fig. 3 and Fig. 4 are examined, it can be clearly seen that a similarity value of around 75% is a good threshold for a phishing alarm.
While the HOG-64 configuration yields slightly higher scores for phishing page-legitimate page pairs (test set 1), it produces lower scores for legitimate page pairs (test set 2) compared to the HOG-128 configuration. Therefore, it can be deduced that feature intersection over smaller local patches gives more robust and discriminative results.
Fig. 4. Similarity scores of unique legitimate page pairs
Fig. 3. Similarity scores of phishing pages and their legitimate targets
It is observed that, in most phishing pages, scammers may use different images (e.g. via the IMG tag) while keeping the page layout identical to that of their legitimate targets. Since HOG features are affected by image content, as future work we plan to detect and replace image contents with a placeholder in order to improve detection accuracy.
V. CONCLUSION
In this paper, the use of an efficient and effective computer vision method is proposed in the field of phishing detection. To this end, the Histogram of Oriented Gradients descriptor has been employed and evaluated. The primary aim is to detect zero-day attacks by capturing visual cues of a web page's layout via HOG features and creating an easy to compare page layout signature. In order to compute the similarity, the histogram intersection kernel has been employed. According to the results, a similarity value of 75% was found to be an appropriate threshold for a phishing alarm. However, we believe that the proposed approach can be enhanced by providing image content invariance. Therefore, we are planning to detect image contents with computer vision and image processing techniques and represent them with a virtually created gradient bin.
REFERENCES
[1] R.S. Rao and S.T. Ali, “A Computer Vision Technique to Detect Phishing
Attack”, Fifth International Conference on Communication Systems and
Network Technologies, 2015.
[2] APWG, Phishing activity trends report. [Online]. Available at http://www.antiphishing.org/resources/apwg-papers/
[3] Google Safe Browsing for Firefox. [Online]. Available at
http://www.google.com/tools/firefox/safebrowsing
[4] W. Zhang, H. Lu, B. Xu and H. Yang, "Web Phishing Detection Based on Page Spatial Layout Similarity", Informatica, vol. 37, pp. 231-244, 2013.
[5] W. Liu, X. Deng, G. Huang and A.Y. Fu, "An Antiphishing Strategy Based on Visual Similarity Assessment", IEEE Internet Computing, vol. 10, pp. 58-65, March 2006.
[6] E. Medvet, E. Kirda and C. Kruegel, "Visual-Similarity-Based Phishing Detection", SecureComm '08 International Conference on Security and Privacy in Communication Networks, 2008.
[7] I.F. Lam, W.C. Xiao, S.C. Wang and K.T. Chen, “Counteracting Phishing
Page Polymorphism: An Image Layout Analysis Approach”, Third
International Conference and Workshops, ISA2009, 2009.
[8] M.E. Maurer and D. Herzner, "Using visual website similarity for phishing detection and reporting", in CHI '12 Extended Abstracts on Human Factors in Computing Systems, 2012.
[9] A.Y. Fu, L. Wenyin and X. Deng, "Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)", IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006.
[10] G. Wang, H. Liu, S. Becerra, K. Wang, "Verilogo: Proactive Phishing Detection via Logo Recognition", Technical Report CS2011-0669, UC San Diego, 2011.
[11] D.G. Lowe, “Distinctive image features from scale-invariant keypoints”,
International Journal of Computer Vision. vol. 60, 2004.
[12] A.S. Bozkir and E. Akcapinar Sezer, “SimiLay: A Developing Web Page
Layout Based Visual Similarity Search Engine”, 10th International
Conference on Machine Learning and Data Mining, St.Petersburg, 2014.
[13] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", Computer Vision and Pattern Recognition, 2005.
[14] C.W. Liang and C.F. Juang, "Moving object classification using local shape and HOG features in wavelet-transformed space with hierarchical SVM classifiers", Applied Soft Computing, vol. 28, 2015.
[15] A. Bosch, A. Zisserman and X. Munoz, “Representing shape with a
spatial pyramid kernel”, Conference of CIVR’07, Netherlands, 2007.
[16] Gecko FX, [Online], Available at https://code.google.com/p/geckofx/
[17] OpenCV, [Online], Available at http://opencv.org/
[18] Phishtank, [Online], Available at https://www.phishtank.com/
[19] Alexa, [Online], Available at http://www.alexa.com/topsites/