Conference Paper

Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors

Abstract

Phishing, a continuously growing cyber threat, aims to obtain innocent users' credentials by deceiving them with fake web pages that mimic their legitimate targets. To date, various attempts have been made to detect phishing pages. In this study, we treat phishing web page identification as an image classification task and propose a purely vision-based approach, augmented with machine learning, that extracts and classifies compact visual features from web page screenshots. For this purpose, we employed several MPEG7 and MPEG7-like compact visual descriptors (SCD, CLD, CEDD, FCTH and JCD) to reveal color- and edge-based discriminative visual cues. For feature extraction we followed two different schemas, working either on whole screenshots in a "holistic" manner or on equal-sized "patches" that construct a coarse-to-fine "pyramidal" representation. For the image classification task, we built SVM and Random Forest based machine learning models. To assess the performance and generalization capability of the proposed approach, we collected a mid-sized corpus covering 14 distinct brands and comprising 2852 samples. In our experiments, the approach reaches an F1 score of up to 90.5% with SCD. Compared to other studies, the suggested approach is a lightweight schema that offers competitive accuracy and superior feature extraction and inference speed, which enables its use as a browser plugin.
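The holistic and pyramidal schemas described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a coarse joint RGB histogram stands in for the MPEG7 descriptors (SCD, CLD and the others are not computed here), and the names `color_histogram` and `pyramid_features`, the `levels` and `bins` parameters, and the nested-list image format are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels, each channel in 0..255


def color_histogram(image: Image, top: int, left: int,
                    height: int, width: int, bins: int = 2) -> List[float]:
    """Normalized joint RGB histogram of one rectangular region.

    Stand-in descriptor (assumption): NOT an MPEG7 descriptor such as SCD.
    """
    hist = [0.0] * (bins ** 3)
    for y in range(top, top + height):
        for x in range(left, left + width):
            r, g, b = image[y][x]
            # Quantize each channel into `bins` levels and combine them.
            idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins \
                + (b * bins // 256)
            hist[idx] += 1.0
    total = height * width
    return [count / total for count in hist]


def pyramid_features(image: Image, levels: int = 2, bins: int = 2) -> List[float]:
    """Concatenate region descriptors from coarse to fine.

    Level 0 covers the whole image (the holistic case); level k splits the
    image into a 2**k by 2**k grid of equal-sized patches. Rows or columns
    left over when the size is not divisible by the grid are ignored.
    """
    rows, cols = len(image), len(image[0])
    features: List[float] = []
    for level in range(levels):
        grid = 2 ** level
        patch_h, patch_w = rows // grid, cols // grid
        for gy in range(grid):
            for gx in range(grid):
                features.extend(color_histogram(
                    image, gy * patch_h, gx * patch_w, patch_h, patch_w, bins))
    return features


# Toy 4x4 "screenshot": left half red, right half blue.
RED, BLUE = (255, 0, 0), (0, 0, 255)
toy_image = [[RED, RED, BLUE, BLUE] for _ in range(4)]
toy_features = pyramid_features(toy_image, levels=2, bins=2)
```

With `levels=1` only the whole screenshot is described; each additional level appends the descriptors of a finer grid of patches. The resulting vector could then be passed to a classifier such as an SVM or a Random Forest.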
... Content-based features include JavaScript, HTML, and images derived from or used to build websites. In terms of application, several studies have utilized these types of features, either focusing on a single type [11,13], or utilizing a mixture of the different feature categories [14]. ...
... Their experimental results portrayed that the proposed method resulted in a high test accuracy. Dalgic et al. proposed an image classification approach for malicious website detection [13]. Their idea was to extract several MPEG-7 features and study each one's influence on detection. ...
Article
Malicious websites in general, and phishing websites in particular, attempt to mimic legitimate websites in order to trick users into trusting them. These websites, often a primary method for credential collection, pose a severe threat to large enterprises. Credential collection enables malicious actors to infiltrate enterprise systems without triggering the usual alarms. Therefore, there is a vital need to gain deep insights into the statistical features of these websites that enable Machine Learning (ML) models to classify them from their benign counterparts. Our objective in this paper is to provide this necessary investigation, more specifically, our contribution is to observe and evaluate combinations of feature sources that have not been studied in the existing literature—primarily involving embeddings extracted with Transformer-type neural networks. The second contribution is a new dataset for this problem, GAWAIN, constructed in a way that offers other researchers not only access to data, but our whole data acquisition and processing pipeline. The experiments on our new GAWAIN dataset show that the classification problem is much harder than reported in other studies—we are able to obtain around 84% in terms of test accuracy. For individual feature contributions, the most relevant ones are coming from URL embeddings, indicating that this additional step in the processing pipeline is needed in order to improve predictions. A surprising outcome of the investigation is lack of content-related features (HTML, JavaScript) from the top-10 list. When comparing the prediction outcomes between models trained on commonly used features in the literature versus embedding-related features, the gain with embeddings is slightly above 1% in terms of test accuracy. However, we argue that even this somewhat small increase can play a significant role in detecting malicious websites, and thus these types of feature categories are worth investigating further.
... The embedding techniques are used in the feature extraction phase and classification is performed using one-class SVM. A. Altaher [9] combined two machine learning algorithms, KNN and SVM, in a hybrid model that classifies with micro class labels such as legitimate, suspicious and phishing. F.C. Dalgic et al. [10] propose visual comparison by treating detection of phishing sites as an image classification problem. First, visual descriptor tools like MPEG7 are used to discriminate between different screenshots. ...
... In their paper, F.C. Dalgic et al. [14] propose visual comparison for phishing website detection. They approached phishing detection as an image classification problem. ...
Article
The advancement in network technology has led to an exponential rise in the number of internet users across the globe. The increase in internet usage has resulted in an increase in both the number of malicious websites and cybercrimes reported over the years. Therefore, it has become critical to devise an intelligent solution that can detect malicious websites and be used in real-time systems. In our paper, we perform a comparative analysis of various feature selection techniques to build a time-efficient and accurate predictive model. To build our predictive model, a set of features is selected by feature selection methods. The selected features consist of at least 70% of the categorical features in all feature selection techniques examined in this paper. Keeping the end goal of real-time deployment in mind, the cost of processing or storing these features is far lower than that of text- or image-based features. Our data initially had a class imbalance, which we addressed using the Synthetic Minority Oversampling Technique. Our proposed model also outperformed the existing work in the literature across various evaluation metrics. The results indicated that Embedded feature selection was the best technique in terms of model accuracy. The Filter-based technique may also be used to develop a low-latency system at the cost of some model accuracy.
... Wenyin [10] first proposed this technique to calculate the similarity of two pages by measuring web page layout, frame, block level, and overall style. F.C. Dalgic [11] proposed to utilize SPM to count the features of a website screenshot. Usually, the victims are attacked because they clicked on the URL of the phishing website posted by the attacker. ...
Article
A parallel neural joint model algorithm is proposed for the analysis and detection of malicious Uniform Resource Locators (URLs). By detecting and analyzing the characteristics of malicious URLs, semantic and visual information is extracted. First, a visualization algorithm maps the URL to a gray image with texture characteristics. Second, the lexical and character features of the URL are extracted and further processed through word vector technology. These extracted features are transformed into lexical embedding vectors and character embedding vectors. To combine the texture features with text features, a parallel joint neural network combining a capsule network (CapsNet) and an independent recurrent neural network (IndRNN) is utilized to capture multi-modal vectors of visual and semantic information synchronously. The last layer utilizes the attention mechanism to further filter the deep features extracted from the overall network while concentrating on effective features, improving classification accuracy in analyzing and detecting malicious URLs. The experimental results demonstrate that this algorithm has higher accuracy than traditional algorithms.
... The results highlighted the effectiveness of using the HOG descriptor for webpage phishing detection, especially for zero-hour attacks. An extension to the research study in [10] was presented by the authors of [11]. They conducted a comparative study on the performance of five compact visual descriptors (SCD, FCTH, JCD, CEDD and CLD) with two machine learning techniques (random forest and SVM) to detect web phishing attacks from screenshots of legitimate and phished websites. According to their results, the SCD with RF delivered the highest F1 score at 0.895 for the analysis of the whole website image. ...
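The first step of that pipeline, mapping a URL to a gray image, can be pictured with a small Python sketch. The abstract does not specify the mapping, so everything here is an assumption for illustration: the hypothetical function `url_to_gray_image` treats each UTF-8 byte of the URL as one pixel intensity (0..255), lays the bytes out row by row at a fixed `width`, and pads the last row with zeros.

```python
from typing import List


def url_to_gray_image(url: str, width: int = 8) -> List[List[int]]:
    """Map a URL to a 2D grid of gray levels (assumed byte-per-pixel scheme)."""
    data = url.encode("utf-8")
    # Zero-pad so the byte string fills a whole number of rows.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


gray = url_to_gray_image("http://a.b", width=4)
```

A grid like this could be fed to an image model, while the lexical and character embeddings described in the abstract would be computed separately.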
Article
The most popular way to deceive online users nowadays is phishing. Consequently, to increase cybersecurity, more efficient web page phishing detection mechanisms are needed. In this paper, we propose an approach that relies on website images and URLs to treat phishing website recognition as a classification problem. Our model uses convolutional neural networks (CNNs) to extract the most important features of website images and URLs and then classifies pages as benign or phishing. The accuracy in the experiments was 99.67%, supporting the effectiveness of the proposed model in detecting web phishing attacks. © 2020, International Journal of Computer Networks and Communications. All rights reserved.
Article
Phishing attacks are one of the most challenging social engineering cyberattacks due to the large number of entities involved in online transactions and services. In these attacks, criminals deceive users to hijack their credentials or sensitive data through a login form which replicates the original website and submits the data to a malicious server. Many anti-phishing techniques have been developed in recent years, using different resources such as the URL and HTML code from legitimate index websites and phishing ones. These techniques have some limitations when predicting legitimate login websites, since, usually, no login forms are present in the legitimate class used for training the proposed model. Hence, in this work we present a methodology for phishing website detection in real scenarios, which uses URL, HTML, and web technology features. Since there is no up-to-date, multipurpose dataset for this task, we crafted the Phishing Index Login Websites Dataset (PILWD), an offline phishing dataset composed of 134,000 verified samples, which offers researchers a wide variety of data to test and compare their approaches. Since approximately three-quarters of the collected phishing samples request the introduction of credentials, we decided to crawl legitimate login websites to match the phishing standpoint. The developed approach is independent of third-party services, and the method relies on a new set of features used for the very first time in this problem, some of them extracted from the web technologies used by each specific website. Experimental results show that phishing websites can be detected with 97.95% accuracy using a LightGBM classifier and the complete set of 54 selected features, when evaluated on the PILWD dataset.
Article
Network security has become an area of greater importance than ever, as highlighted by the eye-opening numbers of data breaches, attacks on critical infrastructure, and malware/ransomware/cryptojacker attacks that are reported almost every day. Increasingly, we are relying on networked infrastructure, and with the advent of IoT, billions of devices will be connected to the Internet, providing attackers with more opportunities to exploit. Traditional machine learning methods have been frequently used in the context of network security. However, such methods are mostly based on statistical features extracted from sources such as binaries, emails, and packet flows. On the other hand, recent years have witnessed phenomenal growth in computer vision, mainly driven by advances in convolutional neural networks. At a glance, it is not trivial to see how computer vision methods are related to network security. Nonetheless, there is a significant amount of work that highlights how methods from computer vision can be applied in network security for detecting attacks or building security solutions. In this paper, we provide a comprehensive survey of such work under three topics: i) phishing attempt detection, ii) malware detection, and iii) traffic anomaly detection. We also discuss existing research gaps and future research directions, especially focusing on how the network security research community and the industry can leverage the exponential growth of computer vision methods to build much more secure networked systems. Finally, we review a set of commercial products for which public information is available, explore how computer vision methods are effectively used in those products, and conclude with a brief overview of commonly used computer vision methods in this domain.
Article
Phishing is a cyber-attack that targets naive online users, tricking them into revealing sensitive information such as a username, password, social security number or credit card number. Attackers fool Internet users by disguising a webpage as a trustworthy or legitimate page to retrieve personal information. Many anti-phishing solutions have been proposed to date, such as blacklists or whitelists and heuristic and visual similarity-based methods, but online users are still being trapped into revealing sensitive information on phishing websites. In this paper, we propose a novel classification model, based on heuristic features that are extracted from the URL, source code, and third-party services, to overcome the disadvantages of existing anti-phishing techniques. Our model has been evaluated using eight different machine learning algorithms, of which the Random Forest (RF) algorithm performed the best with an accuracy of 99.31%. The experiments were repeated with different (orthogonal and oblique) random forest classifiers to find the best classifier for phishing website detection. Principal component analysis Random Forest (PCA-RF) performed the best out of all oblique Random Forests (oRFs) with an accuracy of 99.55%. We have also tested our model with and without third-party-based features to determine the effectiveness of third-party services in the classification of suspicious websites. We also compared our results with the baseline models (CANTINA and CANTINA+). Our proposed technique outperformed these methods and also detected zero-day phishing attacks.
Conference Paper
Phishing is a scamming activity that creates a visual illusion for computer users by presenting fake web pages which mimic their legitimate targets in order to steal valuable digital data such as credit card information or e-mail passwords. In contrast to other anti-phishing attempts, this paper proposes to evaluate and solve this problem by leveraging a pure computer vision based method built on the concept of web page layout similarity. The proposed approach employs the Histogram of Oriented Gradients (HOG) descriptor in order to capture cues of page layout without the need for a time-consuming intermediate segmentation stage. Moreover, the histogram intersection kernel has been used as the similarity metric. Thus, an efficient and fast phishing page detection scheme has been developed in order to combat zero-day phishing page attacks. To verify the efficiency of our phishing page detection mechanism, 50 unique phishing pages and their legitimate targets have been collected. Furthermore, 100 pairs of legitimate pages have been gathered. As the next stage, the similarity scores in these two groups were computed and compared. According to the promising results, a similarity degree of around 75% and above can be adequate for raising an alarm.
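The similarity step in that abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's code: it assumes the two pages have already been reduced to non-negative histogram vectors of equal length (for example HOG descriptors, which are not computed here), and the names `histogram_intersection` and `is_suspicious` are hypothetical. The default 0.75 threshold mirrors the "around 75%" figure quoted in the abstract.

```python
from typing import Sequence


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection of two L1-normalized histograms, in [0, 1]."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same length")
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return 0.0  # an empty histogram shares no mass with anything
    # Sum the overlapping mass bin by bin after normalizing each histogram.
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))


def is_suspicious(similarity: float, threshold: float = 0.75) -> bool:
    """Flag a page whose layout is at least `threshold` similar to a target."""
    return similarity >= threshold


score = histogram_intersection([2, 2], [3, 1])
```

Identical histograms score 1.0 and histograms with no overlapping bins score 0.0, so the value can be read directly as a similarity degree.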
Article
The success of sparse representations in image modeling and recovery has motivated its use in computer vision applications. Object recognition has been effectively performed by aggregating sparse codes of local features in an image at multiple spatial scales. Though sparse coding guarantees a high-fidelity representation, it does not exploit the dependence between the local features. By incorporating suitable locality constraints, sparse coding can be regularized to obtain similar codes for similar features. In this paper, we develop an algorithm to design dictionaries for local sparse coding of image descriptors and perform object recognition using the learned dictionaries. Furthermore, we propose to perform kernel local sparse coding in order to exploit the non-linear similarity of features and describe an algorithm to learn dictionaries when the Radial Basis Function (RBF) kernel is used. In addition, we develop a supervised local sparse coding approach for image retrieval using sub-image heterogeneous features. Simulation results for object recognition demonstrate that the two proposed algorithms achieve higher classification accuracies in comparison to other sparse coding based approaches. By performing image retrieval on the Microsoft Research Cambridge image dataset, we show that incorporating supervised information into local sparse coding results in improved precision-recall rates. (Preprint submitted to Pattern Recognition Letters, December 31, 2011.)
Article
Phishing is a fraudulent technique that is used over the Internet to deceive users with the goal of extracting their personal information such as usernames, passwords, credit card, and bank account information. The key to phishing is deception. Phishing uses email spoofing as its initial medium for deceptive communication, followed by spoofed websites to obtain the needed information from the victims. Phishing was discovered in 1996, and today, it is one of the most severe cybercrimes faced by Internet users. Researchers are working on the prevention and detection of phishing attacks and on user education, but to date, there is no complete and accurate solution for thwarting them. This paper studies, analyzes, and classifies the most significant and novel strategies proposed in the area of phished website detection, and outlines their advantages and drawbacks. Furthermore, a detailed analysis of the latest schemes proposed by researchers in various subcategories is provided. The paper identifies advantages, drawbacks, and research gaps in the area of phishing website detection that can be worked upon in future research and developments. The analysis given in this paper will help academia and industry to identify the best anti-phishing technique.
Conference Paper
Phishing refers to cybercrime that uses social engineering and technical subterfuge techniques to fool online users into revealing sensitive information such as a username, password, bank account number or social security number. In this paper, we propose a novel solution to defend against zero-day phishing attacks. Our proposed approach is a combination of whitelist and visual similarity based techniques. We use a computer vision technique called the SURF detector to extract discriminative key point features from both suspicious and targeted websites. These features are then used to compute the degree of similarity between the legitimate and suspicious pages. Our proposed solution is efficient, covers a wide range of website phishing attacks and results in a lower false positive rate.
Article
Web phishing is becoming an increasingly severe security threat in the web domain. Effective and efficient phishing detection is very important for protecting web users from loss of sensitive private information and even personal property. One of the keys to phishing detection is to efficiently search the legitimate web page library and to find those pages that are the most similar to a suspicious phishing page. Most existing phishing detection methods are focused on text and/or image features and have paid very limited attention to spatial layout characteristics of web pages. In this paper, we propose a novel phishing detection method that makes use of the informative spatial layout characteristics of web pages. In particular, we develop two different options to extract the spatial layout features as rectangle blocks from a given web page. Given two web pages, with their respective spatial layout features, we propose a page similarity definition that takes into account their spatial layout characteristics. Furthermore, we build an R-tree to index all the spatial layout features of a legitimate page library. As a result, phishing detection based on the spatial layout feature similarity is facilitated by relevant spatial queries via the R-tree. A series of simulation experiments are conducted to evaluate our proposals. The results demonstrate that the proposed novel phishing detection method is effective and efficient.
Article
Phishing is a security threat which combines social engineering and website spoofing techniques to deceive users into revealing confidential information. In this paper, we propose a phishing detection method to protect Internet users from phishing attacks. In particular, given a website, our proposed method will be able to detect if it is a phishing website. We use a logo image to determine the identity consistency between the real and the portrayed identity of a website. Consistent identity indicates a legitimate website and inconsistent identity indicates a phishing website. The proposed method consists of two processes, namely logo extraction and identity verification. The first process will detect and extract the logo image from all the downloaded image resources of a webpage. In order to detect the right logo image, we utilise a machine learning technique. Based on the extracted logo image, the second process will employ the Google image search to retrieve the portrayed identity. Since the relationship between the logo and domain name is exclusive, it is reasonable to treat the domain name as the identity. Hence, a comparison between the domain name returned by Google and the one from the query website will enable us to differentiate a phishing website from a legitimate one. The conducted experiments show reliable and promising results. This proves the effectiveness and feasibility of using a graphical element such as a logo to detect a phishing website.
Conference Paper
In this paper, we present a new solution, BaitAlarm, to detect phishing attacks using features that are hard to evade. The intuition of our approach is that phishing pages need to preserve the visual appearance of the target pages. We present an algorithm to quantify the suspicious ratings of web pages based on the similarity of visual appearance between the web pages. Since CSS is the standard technique to specify page layout, our solution uses CSS as the basis for detecting visual similarities among web pages. We prototyped our approach as a Google Chrome extension and used it to rate the suspiciousness of web pages. The prototype shows the correctness and accuracy of our approach with a relatively low performance overhead.