Conference Paper
PDF Available

Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors


Abstract

Phishing, a continuously growing cyber threat, aims to obtain innocent users' credentials by deceiving them with fake web pages that mimic their legitimate targets. To date, various attempts have been made to detect phishing pages. In this study, we treat phishing web page identification as an image classification task and propose a machine-learning-augmented, purely vision-based approach that extracts and classifies compact visual features from web page screenshots. For this purpose, we employ several MPEG-7 and MPEG-7-like compact visual descriptors (SCD, CLD, CEDD, FCTH and JCD) to reveal color- and edge-based discriminative visual cues. During feature extraction we follow two different schemes, working either on whole screenshots in a "holistic" manner or on equal-sized "patches" that form a coarse-to-fine "pyramidal" representation. For the image classification task, we build SVM- and Random-Forest-based machine learning models. To assess the performance and generalization capability of the proposed approach, we collected a mid-sized corpus covering 14 distinct brands and comprising 2852 samples. According to the conducted experiments, our approach reaches an F1 score of up to 90.5% with SCD. Compared to other studies, the suggested approach therefore offers a lightweight scheme with competitive accuracy and superior feature extraction and inference speed, which makes it suitable for use as a browser plugin.
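To make the pipeline concrete, the following sketch approximates the described approach under stated assumptions: standard MPEG-7 descriptor implementations (SCD, CLD, CEDD, FCTH, JCD) are not readily available in Python, so an HSV colour histogram stands in for the SCD, a 1 + 2x2 patch split approximates the coarse-to-fine pyramidal representation, and a Random Forest serves as the classifier. The data, labels, and parameters are hypothetical; this is not the authors' implementation.

```python
# Illustrative sketch only: an HSV histogram stands in for the MPEG-7 SCD, and a
# 1 + 2x2 patch split approximates the paper's coarse-to-fine pyramidal scheme.
import numpy as np
import cv2                                    # pip install opencv-python
from sklearn.ensemble import RandomForestClassifier

def hsv_histogram(img, bins=(8, 4, 4)):
    """Compact colour descriptor (rough stand-in for the Scalable Color Descriptor)."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def pyramidal_features(img):
    """Concatenate the holistic descriptor with descriptors of four equal patches."""
    h, w = img.shape[:2]
    feats = [hsv_histogram(img)]                          # holistic level
    for y in (0, h // 2):
        for x in (0, w // 2):
            feats.append(hsv_histogram(img[y:y + h // 2, x:x + w // 2]))
    return np.concatenate(feats)

# Hypothetical corpus: screenshots would normally be read with cv2.imread(path);
# random images are used here only so the sketch runs end to end.
rng = np.random.default_rng(0)
screenshots = [rng.integers(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(40)]
brand_labels = rng.integers(0, 4, 40)                     # 4 hypothetical brand classes

X = np.array([pyramidal_features(s) for s in screenshots])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, brand_labels)
print(clf.predict(X[:3]))
```

In a real setting the random images would be replaced by actual web page screenshots and the brand labels by the 14 classes of the collected corpus.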
... URL clustering has also been used in different domains, e.g., the detection of phishing pages. Such systems deduce page similarity either through visual analysis [41], [40], [13], [18] and comparing benign to suspicious webpages, or by utilizing URL and HTML related features that mainly focus on a page's textual contents [38], [34], [59]. Despite being successful for their respective goals, such approaches are not suitable in our context. ...
... [63] K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM J. Comput., vol. 18, 1989. ...
... For example, in some papers (see, e.g., [67], [105]) features are extracted from properties such as Histogram of Oriented Gradients - representing local object appearance in a semi-rigid way through the distribution of intensity gradients and edge directions - and color histograms - representing the spatial distribution of colors within an image. Similarly, in [106] features are extracted from color-based visual descriptors, such as Scalable Color Descriptor, Color Layout Descriptor, Fuzzy Color and Texture Histogram. ...
... Researchers making the datasets used in their papers available to the community are commendable (see, e.g., [49], [67], [76], [82], [91], [106], [128]). Nevertheless, the availability of these datasets heavily depends on the availability of the authors' websites where they are published, so these datasets tend to disappear after some years. ...
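For context, the sketch below illustrates the two feature types mentioned in the excerpts above, a Histogram of Oriented Gradients and a colour histogram, using scikit-image and NumPy. It is a generic illustration with assumed parameters and a synthetic placeholder input, not code from any of the cited works.

```python
# Sketch of the two feature types mentioned above: HOG (gradient/edge layout) and
# a colour histogram (global colour distribution). Inputs are synthetic placeholders.
import numpy as np
from skimage import color
from skimage.feature import hog

# A real system would load a screenshot, e.g. skimage.io.imread("screenshot.png").
image = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
gray = color.rgb2gray(image)

# HOG: distribution of intensity gradients / edge directions over local cells.
hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
              cells_per_block=(2, 2), block_norm="L2-Hys")

# Colour histogram: 8 bins per RGB channel, concatenated and normalised.
col_hist = np.concatenate([
    np.histogram(image[..., c], bins=8, range=(0, 255))[0] for c in range(3)
]).astype(float)
col_hist /= col_hist.sum()

feature_vector = np.concatenate([hog_vec, col_hist])
print(feature_vector.shape)
```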
Article
Full-text available
Phishing is a security threat with serious effects on individuals as well as on the targeted brands. Although this threat has been around for quite a long time, it is still very active and successful. In fact, the tactics used by attackers have been evolving continuously in the years to make the attacks more convincing and effective. In this context, phishing detection is of primary importance. The literature offers many diverse solutions that cope with this issue and in particular with the detection of phishing websites. This paper provides a broad and comprehensive review of the state of the art in this field by discussing the main challenges and findings. More specifically, the discussion is centered around three important categories of detection approaches, namely, list-based, similarity-based and machine learning-based. For each category we describe the detection methods proposed in the literature together with the datasets considered for their assessment and we discuss some research gaps that need to be filled.
... The last two decades of the information security field have witnessed the establishment of numerous approaches aiming at the identification of phishing characteristics in several modalities such as emails [95], [96], web pages [97], URLs [96], [98], SMS messages [99], and visual contents [97], [100]. The incorporation of machine learning methodologies and the continuous progress in AI systems have resulted in the pro-... However, it should be noted that those studies generally focus on either phish/legitimate discrimination or brand-based identification, requiring the class labels to be obtained through automated threat services or human annotations. ...
Article
Full-text available
Phishing attacks are still seen as a significant threat to cyber security, and large parts of the industry rely on anti-phishing simulations to minimize the risk imposed by such attacks. This study conducted a large-scale anti-phishing training with more than 31,000 participants and 144 different simulated phishing attacks to develop a data-driven model to classify how users would perceive a phishing simulation. Furthermore, we analyze the results of our large-scale anti-phishing training and give novel insights into users' click behavior. Analyzing our anti-phishing training data, we find that 66% of users do not fall victim to credential-based phishing attacks even after being exposed to twelve weeks of phishing simulations. To further enhance the effectiveness of phishing awareness training, we developed a novel manifold-learning-powered machine learning model that can predict how many people would fall for a phishing simulation, using several structural and state-of-the-art NLP features extracted from the emails. In this way, we present a systematic approach for training implementers to estimate the average "convincing power" of the emails prior to rolling them out. Moreover, we reveal the most important factors in the classification. In addition, our model presents significant benefits over traditional rule-based approaches in classifying the difficulty of phishing simulations. Our results clearly show that anti-phishing training should focus on the training of individual users rather than on large user groups. Additionally, we present a promising generic machine learning model for predicting phishing susceptibility.
... Content-based features include JavaScript, HTML, and images derived from or used to build websites. In terms of application, several studies have utilized these types of features, either focusing on a single type [11,13], or utilizing a mixture of the different feature categories [14]. ...
... Their experimental results showed that the proposed method achieves high test accuracy. Dalgic et al. proposed an image classification approach for malicious website detection [13]. Their idea was to extract several MPEG-7 features and study each one's influence on detection. ...
Article
Full-text available
Malicious websites in general, and phishing websites in particular, attempt to mimic legitimate websites in order to trick users into trusting them. These websites, often a primary method for credential collection, pose a severe threat to large enterprises. Credential collection enables malicious actors to infiltrate enterprise systems without triggering the usual alarms. Therefore, there is a vital need to gain deep insights into the statistical features of these websites that enable Machine Learning (ML) models to distinguish them from their benign counterparts. Our objective in this paper is to provide this necessary investigation; more specifically, our contribution is to observe and evaluate combinations of feature sources that have not been studied in the existing literature, primarily involving embeddings extracted with Transformer-type neural networks. The second contribution is a new dataset for this problem, GAWAIN, constructed in a way that offers other researchers not only access to the data but also to our whole data acquisition and processing pipeline. The experiments on our new GAWAIN dataset show that the classification problem is much harder than reported in other studies: we are able to obtain around 84% test accuracy. Among individual feature contributions, the most relevant ones come from URL embeddings, indicating that this additional step in the processing pipeline is needed in order to improve predictions. A surprising outcome of the investigation is the absence of content-related features (HTML, JavaScript) from the top-10 list. When comparing prediction outcomes between models trained on features commonly used in the literature versus embedding-related features, the gain with embeddings is slightly above 1% in test accuracy. However, we argue that even this somewhat small increase can play a significant role in detecting malicious websites, and thus these types of feature categories are worth investigating further.
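As a rough illustration of the URL-embedding idea described in this abstract, the sketch below encodes raw URLs with a generic pre-trained sentence encoder and feeds the vectors to a simple classifier. The encoder choice, the toy data, and the classifier are assumptions and do not reproduce the GAWAIN pipeline.

```python
# Minimal sketch: embed raw URLs with a pre-trained sentence encoder and classify
# them. The model name and the toy URLs/labels are illustrative assumptions.
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers
from sklearn.linear_model import LogisticRegression

# Toy data; a real experiment would use thousands of labelled URLs.
urls = ["https://example.com/login", "https://example.com/shop",
        "http://paypa1-secure.example.net/verify", "http://acc0unt-update.example.org/"]
labels = [0, 0, 1, 1]                                    # 0 = benign, 1 = phishing

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed generic encoder
X = encoder.encode(urls)                                 # fixed-length URL embeddings

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["http://secure-login.example.biz/paypal"])))
```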
... From another perspective, clustering similar pages can reduce repeated access to forms with similar functionality and improve the efficiency of vulnerability detection. Existing research mainly uses visual analysis [23][24][25][26] or webpage text content [27][28][29] to determine similar pages. Although these methods are effective in their application scenarios, they are not suitable for our research objectives. ...
Article
Full-text available
Black-box web application scanning has been a popular technique to detect Cross-Site Scripting (XSS) vulnerabilities without prior knowledge of the application. However, several limitations lead to the low efficiency of current black-box scanners, including (1) the scanners waste time by repetitively visiting similar states, such as similar HTML forms of two different products, and (2) using a First-In-First-Out (FIFO) fuzzing order for the collected forms leads to low efficiency in detecting XSS vulnerabilities, as different forms have different likelihoods of containing an XSS vulnerability. In this paper, we present a state-sensitive black-box web application scanning method, including a filtering method for excluding similar states and a heuristic ranking method for optimizing the fuzzing order of forms. The filtering method excludes similar states by comparing readily available characteristic information that does not require visiting the states. The ranking method sorts forms based on the number of injection points, since it is commonly observed that forms with a greater number of injection points have a higher probability of containing XSS vulnerabilities. To demonstrate the effectiveness of our scanning method, we implement it in our black-box web scanner and conduct experimental evaluations on eight real-world web applications within a limited scanning time. Experimental results demonstrate that the filtering method improves code coverage by about 17% on average and the ranking method helps detect 53 more XSS vulnerabilities. The combination of the filtering and ranking methods helps detect 81 more XSS vulnerabilities.
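The ranking heuristic described above (forms with more injection points are fuzzed first) can be sketched in a few lines; the HTML parsing and the scoring rule below are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of the ranking heuristic: sort forms by their number of injection points.
from bs4 import BeautifulSoup    # pip install beautifulsoup4

html = """
<form action="/search"><input name="q"></form>
<form action="/checkout">
  <input name="name"><input name="address"><textarea name="notes"></textarea>
</form>
"""  # toy page; a scanner would use the crawled HTML instead

def rank_forms(page_html):
    """Return forms ordered by descending count of candidate injection points."""
    soup = BeautifulSoup(page_html, "html.parser")
    scored = [(len(f.find_all(["input", "textarea", "select"])), f)
              for f in soup.find_all("form")]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored]

for form in rank_forms(html):
    print(form.get("action"), len(form.find_all(["input", "textarea", "select"])))
```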
Article
In recent years, phishing attacks have evolved considerably, causing existing adversarial features that were widely utilised for detecting phishing websites to become less discriminative. These developments have fuelled growing interests among security researchers towards an anti-phishing strategy known as the identity-based detection technique. Identity-based detection techniques have consistently achieved high true positive rates in a rapidly changing phishing landscape, owing to its capitalisation on fundamental brand identity relations that are inherent in most legitimate webpages. However, existing identity-based techniques often suffer higher false positive rates due to complexities and challenges in establishing the webpage’s brand identity. To close the existing performance gap, this paper proposes a new hybrid identity-based phishing detection technique that leverages webpage visual and textual identity. Extending earlier anti-phishing work based on the website logo as visual identity, our method incorporates novel image features that mimic human vision to enhance the logo detection accuracy. The proposed hybrid technique integrates the visual identity with a textual identity, namely, brand-specific keywords derived from the webpage content using textual analysis methods. We empirically demonstrated on multiple benchmark datasets that this joint visual-textual identity detection approach significantly improves phishing detection performance with an overall accuracy of 98.6%. Benchmarking results against an existing technique showed comparable true positive rates and a reduction of up to 3.4% in false positive rates, thus affirming our objective of reducing the misclassification of legitimate webpages without sacrificing the phishing detection performance. The proposed hybrid identity-based technique is proven to be a significant and practical contribution that will enrich the anti-phishing community with improved defence strategies against rapidly evolving phishing schemes.
Chapter
Phishing is one of the cyberattacks most feared by users of transactional services over the Internet. Although there are many studies focused on detecting phishing attacks with high accuracy, they struggle to act with the effectiveness required to keep people from falling for these attacks in the early stages. In this article, a state-of-the-art overview of phishing detection is presented using a systematic literature review methodology covering studies published between 2016 and 2022, as well as other survey papers between 2020 and 2022, focusing on the different detection stages, information sources, phishing characterization, and the different methods used in the literature. We found that 83% of the selected works focus on the mitigation stage, where the methodologies act in a reactive way using static features that provide high accuracy but cause the models to fail over time. Finally, conclusions are presented to highlight the importance of using brand information and combining different methods to improve detection at earlier stages and ensure the durability of the detection model. The article's contribution is focused on establishing another perspective that encourages future research and related works to consider their models beyond high accuracy and to start thinking about how these models can provide effective solutions that could be integrated into production environments to protect users.
Article
Full-text available
Phishing attacks are one of the most challenging social engineering cyberattacks due to the large number of entities involved in online transactions and services. In these attacks, criminals deceive users in order to hijack their credentials or sensitive data through a login form which replicates the original website and submits the data to a malicious server. Many anti-phishing techniques have been developed in recent years, using different resources such as the URL and HTML code from legitimate index websites and phishing ones. These techniques have some limitations when predicting legitimate login websites, since, usually, no login forms are present in the legitimate class used for training the proposed model. Hence, in this work we present a methodology for phishing website detection in real scenarios, which uses URL, HTML, and web technology features. Since there is no updated and multipurpose dataset for this task, we crafted the Phishing Index Login Websites Dataset (PILWD), an offline phishing dataset composed of 134,000 verified samples, which offers researchers a wide variety of data to test and compare their approaches. Since approximately three-quarters of the collected phishing samples request the introduction of credentials, we decided to crawl legitimate login websites to match the phishing standpoint. The developed approach is independent of third-party services, and the method relies on a new set of features used for the very first time for this problem, some of them extracted from the web technologies used on each specific website. Experimental results show that phishing websites can be detected with 97.95% accuracy using a LightGBM classifier and the complete set of 54 selected features, when evaluated on the PILWD dataset.
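The reported classifier setup (LightGBM over 54 URL, HTML, and web technology features) could look roughly like the sketch below; the synthetic feature matrix and the hyperparameters are placeholders, not the PILWD code.

```python
# Rough sketch of the reported setup: a LightGBM classifier over 54 features.
# Synthetic data stands in for the URL / HTML / web-technology feature matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier          # pip install lightgbm

X, y = make_classification(n_samples=5000, n_features=54, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```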
Chapter
In today’s Internet era, Internet of Things (IoT) based products and applications are adopted by users for many different purposes such as shopping, managing finances, smart home security hubs, etc. Some of them are implemented as web page applications hosted over the Internet, which essentially inherit the existing threats and attacks on such applications. One of the most common security attacks on web applications is phishing. Phishing is a social engineering attack in which an adversary tries to steal users’ sensitive information, including credentials, by tricking them into believing they are on a legitimate web page. Adversaries tend to adopt new and sophisticated ways to forge web page designs and trick users into visiting malicious links. These crafted phishing web pages are used as a medium to carry out phishing even against IoT-based applications. This chapter focuses on the state-of-the-art technologies that can be utilized to defend against phishing attacks in the era of IoT. Specifically, technologies to detect web page, zero-day, and adversarial phishing attacks, including their features, are introduced and discussed.
Article
Full-text available
Phishing is a cyber-attack that targets naive online users, tricking them into revealing sensitive information such as usernames, passwords, social security numbers or credit card numbers. Attackers fool Internet users by masking a webpage as a trustworthy or legitimate page to retrieve personal information. Many anti-phishing solutions, such as blacklist or whitelist, heuristic and visual similarity-based methods, have been proposed to date, but online users are still being trapped into revealing sensitive information on phishing websites. In this paper, we propose a novel classification model based on heuristic features that are extracted from the URL, source code, and third-party services to overcome the disadvantages of existing anti-phishing techniques. Our model has been evaluated using eight different machine learning algorithms, of which the Random Forest (RF) algorithm performed the best with an accuracy of 99.31%. The experiments were repeated with different (orthogonal and oblique) random forest classifiers to find the best classifier for phishing website detection. Principal component analysis Random Forest (PCA-RF) performed the best of all oblique Random Forests (oRFs) with an accuracy of 99.55%. We have also tested our model with and without third-party-based features to determine the effectiveness of third-party services in the classification of suspicious websites. We also compared our results with baseline models (CANTINA and CANTINA+). Our proposed technique outperformed these methods and also detected zero-day phishing attacks.
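As an approximation of the PCA-RF variant mentioned above, the sketch below projects features onto principal components before training a standard Random Forest. A true oblique forest applies such projections at each split, so this pipeline is only a simplified stand-in; the synthetic data replaces the paper's heuristic features.

```python
# Approximation of PCA-RF: project features onto principal components, then
# train a Random Forest. A true oblique forest applies PCA at each split.
from sklearn.datasets import make_classification      # synthetic stand-in data
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic features standing in for the URL / source-code / third-party features.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=15, random_state=0)

model = make_pipeline(PCA(n_components=20),
                      RandomForestClassifier(n_estimators=300, random_state=0))
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```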
Conference Paper
Full-text available
Phishing is a scamming activity that creates a visual illusion for computer users by presenting fake web pages which mimic their legitimate targets in order to steal valuable digital data such as credit card information or e-mail passwords. In contrast to other anti-phishing attempts, this paper proposes to evaluate and solve this problem by leveraging a purely computer vision based method built on the concept of web page layout similarity. The proposed approach employs the Histogram of Oriented Gradients (HOG) descriptor in order to capture cues of page layout without the need for a time-consuming intermediate segmentation stage. Moreover, the histogram intersection kernel has been used as the similarity metric. Thus, an efficient and fast phishing page detection scheme has been developed to combat zero-day phishing page attacks. To verify the efficiency of our phishing page detection mechanism, 50 unique phishing pages and their legitimate targets have been collected, together with 100 pairs of legitimate pages. The similarity scores in these two groups were then computed and compared. According to the promising results, a similarity degree of around 75% or above is adequate for raising an alarm.
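The similarity computation described above can be sketched as follows. The HOG parameters, image sizes, and synthetic inputs are assumptions; the histogram intersection kernel and the 75% alarm threshold follow the description in the abstract.

```python
# Sketch: compare two page screenshots via HOG descriptors and the histogram
# intersection kernel sum(min(a_i, b_i)). Inputs here are synthetic placeholders.
import numpy as np
from skimage import color, transform
from skimage.feature import hog

def page_descriptor(rgb, size=(512, 512)):
    """L1-normalised HOG descriptor of a screenshot, so intersection lies in [0, 1]."""
    gray = transform.resize(color.rgb2gray(rgb), size)
    d = hog(gray, orientations=9, pixels_per_cell=(32, 32), cells_per_block=(2, 2))
    return d / (d.sum() + 1e-12)

def histogram_intersection(a, b):
    return float(np.minimum(a, b).sum())

# A real system would load screenshots of a suspicious page and its claimed target.
rng = np.random.default_rng(0)
suspicious = rng.integers(0, 255, (480, 640, 3), dtype=np.uint8)
legitimate = rng.integers(0, 255, (480, 640, 3), dtype=np.uint8)

sim = histogram_intersection(page_descriptor(suspicious), page_descriptor(legitimate))
print("similar layout" if sim >= 0.75 else "dissimilar", round(sim, 3))
```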
Article
Full-text available
The success of sparse representations in image modeling and recovery has motivated its use in computer vision applications. Object recognition has been effectively performed by aggregating sparse codes of local features in an image at multiple spatial scales. Though sparse coding guarantees a high-fidelity representation, it does not exploit the dependence between the local features. By incorporating suitable locality constraints, sparse coding can be regularized to obtain similar codes for similar features. In this paper, we develop an algorithm to design dictionaries for local sparse coding of image descriptors and perform object recognition using the learned dictionaries. Furthermore, we propose to perform kernel local sparse coding in order to exploit the non-linear similarity of features and describe an algorithm to learn dictionaries when the Radial Basis Function (RBF) kernel is used. In addition, we develop a supervised local sparse coding approach for image retrieval using sub-image heterogeneous features. Simulation results for object recognition demonstrate that the two proposed algorithms achieve higher classification accuracies in comparison to other sparse coding based approaches. By performing image retrieval on the Microsoft Research Cambridge image dataset, we show that incorporating supervised information into local sparse coding results in improved precision-recall rates.
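A plain, unsupervised, linear version of the dictionary learning and sparse coding step discussed above can be written with scikit-learn, as sketched below. The locality constraints, RBF-kernel coding, and supervision proposed in the paper are not reproduced, and synthetic descriptors stand in for real local features.

```python
# Baseline sketch: learn a dictionary over local descriptors, sparse-code them, and
# max-pool the codes into one image-level feature. Synthetic descriptors stand in
# for real local features (e.g. SIFT or HOG patches).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
X_local = rng.standard_normal((500, 64))            # 500 local descriptors of dim 64

dico = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
dictionary = dico.fit(X_local).components_          # (128, 64) learned atoms

codes = sparse_encode(X_local, dictionary, algorithm="omp", n_nonzero_coefs=5)
image_feature = np.abs(codes).max(axis=0)           # max pooling over the patches
print(image_feature.shape)
```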
Conference Paper
The large-scale deployment of modern phishing attacks relies on the automatic exploitation of vulnerable websites in the wild, to maximize profit while hindering attack traceability, detection and blacklisting. To the best of our knowledge, this is the first work that specifically leverages this adversarial behavior for detection purposes. We show that phishing webpages can be accurately detected by highlighting HTML code and visual differences with respect to other (legitimate) pages hosted within a compromised website. Our system, named DeltaPhish, can be installed as part of a web application firewall, to detect the presence of anomalous content on a website after compromise, and eventually prevent access to it. DeltaPhish is also robust against adversarial attempts in which the HTML code of the phishing page is carefully manipulated to evade detection. We empirically evaluate it on more than 5,500 webpages collected in the wild from compromised websites, showing that it is capable of detecting more than 99% of phishing webpages, while only misclassifying less than 1% of legitimate pages. We further show that the detection rate remains higher than 70% even under very sophisticated attacks carefully designed to evade our system.
Article
Phishing is a fraudulent technique that is used over the Internet to deceive users with the goal of extracting their personal information such as username, passwords, credit card, and bank account information. The key to phishing is deception. Phishing uses email spoofing as its initial medium for deceptive communication followed by spoofed websites to obtain the needed information from the victims. Phishing was discovered in 1996, and today, it is one of the most severe cybercrimes faced by the Internet users. Researchers are working on the prevention, detection, and education of phishing attacks, but to date, there is no complete and accurate solution for thwarting them. This paper studies, analyzes, and classifies the most significant and novel strategies proposed in the area of phished website detection, and outlines their advantages and drawbacks. Furthermore, a detailed analysis of the latest schemes proposed by researchers in various subcategories is provided. The paper identifies advantages, drawbacks, and research gaps in the area of phishing website detection that can be worked upon in future research and developments. The analysis given in this paper will help academia and industries to identify the best anti-phishing technique.
Conference Paper
Phishing refers to cybercrime that uses social engineering and technical subterfuge techniques to fool online users into revealing sensitive information such as usernames, passwords, bank account numbers or social security numbers. In this paper, we propose a novel solution to defend against zero-day phishing attacks. Our proposed approach is a combination of whitelist and visual similarity based techniques. We use a computer vision technique, the SURF detector, to extract discriminative keypoint features from both suspicious and targeted websites. These are then used to compute the similarity degree between the legitimate and suspicious pages. Our proposed solution is efficient, covers a wide range of website phishing attacks and results in a low false positive rate.
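A keypoint-matching similarity in the spirit of the approach above can be sketched with OpenCV. Note that SURF is patented and only available in opencv-contrib, so the sketch substitutes the freely available ORB detector; the similarity score itself is an assumption, not the paper's measure.

```python
# Sketch of keypoint-based page similarity. ORB is used as a freely available
# substitute for SURF (SURF is patented and requires opencv-contrib).
import cv2
import numpy as np

def keypoint_similarity(img_a, img_b):
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    # Fraction of keypoints with a cross-checked match as a crude similarity score.
    return len(matches) / max(1, min(len(kp_a), len(kp_b)))

# A real system would load grayscale screenshots, e.g. cv2.imread(path, cv2.IMREAD_GRAYSCALE).
rng = np.random.default_rng(0)
page_a = rng.integers(0, 255, (480, 640), dtype=np.uint8)
page_b = rng.integers(0, 255, (480, 640), dtype=np.uint8)
print(keypoint_similarity(page_a, page_b))
```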
Article
Web phishing is becoming an increasingly severe security threat in the web domain. Effective and efficient phishing detection is very important for protecting web users from the loss of sensitive private information and even personal property. One of the keys to phishing detection is to efficiently search the legitimate web page library and find those pages that are the most similar to a suspicious phishing page. Most existing phishing detection methods focus on text and/or image features and have paid very limited attention to the spatial layout characteristics of web pages. In this paper, we propose a novel phishing detection method that makes use of the informative spatial layout characteristics of web pages. In particular, we develop two different options to extract the spatial layout features as rectangle blocks from a given web page. Given two web pages, with their respective spatial layout features, we propose a page similarity definition that takes their spatial layout characteristics into account. Furthermore, we build an R-tree to index all the spatial layout features of a legitimate page library. As a result, phishing detection based on spatial layout feature similarity is facilitated by relevant spatial queries via the R-tree. A series of simulation experiments are conducted to evaluate our proposals. The results demonstrate that the proposed phishing detection method is effective and efficient.
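The R-tree indexing step described above can be illustrated with the Python rtree package, as sketched below; the rectangle coordinates and the query are placeholders rather than the output of the paper's layout extraction.

```python
# Sketch: index rectangular layout blocks of legitimate pages in an R-tree and
# query it with a suspicious page's block. Coordinates are illustrative only.
from rtree import index   # pip install rtree (requires libspatialindex)

idx = index.Index()
# Hypothetical layout blocks: (page_id, (min_x, min_y, max_x, max_y)).
legitimate_blocks = [(1, (0, 0, 800, 90)),      # page 1: header bar
                     (1, (20, 120, 380, 520)),  # page 1: login panel
                     (2, (0, 0, 800, 60))]      # page 2: header bar
for page_id, bbox in legitimate_blocks:
    idx.insert(page_id, bbox)

# For each block of the suspicious page, find overlapping blocks of known pages.
suspicious_block = (25, 110, 390, 530)
candidate_pages = set(idx.intersection(suspicious_block))
print("candidate legitimate pages:", candidate_pages)
```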
Article
Phishing is a security threat which combines social engineering and website spoofing techniques to deceive users into revealing confidential information. In this paper, we propose a phishing detection method to protect Internet users from the phishing attacks. In particular, given a website, our proposed method will be able to detect if it is a phishing website. We use a logo image to determine the identity consistency between the real and the portrayed identity of a website. Consistent identity indicates a legitimate website and inconsistent identity indicates a phishing website. The proposed method consists of two processes, namely logo extraction and identity verification. The first process will detect and extract the logo image from all the downloaded image resources of a webpage. In order to detect the right logo image, we utilise a machine learning technique. Based on the extracted logo image, the second process will employ the Google image search to retrieve the portrayed identity. Since the relationship between the logo and domain name is exclusive, it is reasonable to treat the domain name as the identity. Hence, a comparison between the domain name returned by Google with the one from the query website will enable us to differentiate a phishing from a legitimate website. The conducted experiments show reliable and promising results. This proves the effectiveness and feasibility of using a graphical element such as a logo to detect a phishing website.