Conference Paper

PhishNet: Predictive Blacklisting to Detect Phishing Attacks

Authors: Pawan Prakash et al. (2010)

Abstract

Phishing has been an easy and effective way of practicing trickery and deception on the Internet. While solutions such as URL blacklisting have been effective to some degree, their reliance on exact matches with blacklisted entries makes them easy for attackers to evade. We start with the observation that attackers often employ simple modifications (e.g., changing the top-level domain) to URLs. Our system, PhishNet, exploits this observation using two components. In the first component, we propose five heuristics to enumerate simple combinations of known phishing sites to discover new phishing URLs. The second component consists of an approximate matching algorithm that dissects a URL into multiple components that are matched individually against entries in the blacklist. In our evaluation with real-time blacklist feeds, we discovered around 18,000 new phishing URLs from a set of 6,000 new blacklist entries. We also show that our approximate matching algorithm leads to very few false positives (3%) and negatives (5%).
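The first component can be illustrated with its simplest heuristic, TLD replacement. The Python sketch below enumerates candidate variants of a known phishing URL by swapping its top-level domain; the `COMMON_TLDS` list and the function name `enumerate_tld_variants` are illustrative assumptions, and the paper's actual heuristic draws on a far larger TLD set.

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative TLD list; PhishNet uses a much larger set.
COMMON_TLDS = ["com", "net", "org", "info", "biz"]

def enumerate_tld_variants(url):
    """Sketch of the TLD-replacement heuristic: generate candidate
    phishing URLs by swapping the top-level domain of a known
    blacklisted URL, keeping scheme, path, and query intact."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")
    variants = []
    for tld in COMMON_TLDS:
        if tld == host_labels[-1]:
            continue  # skip the original TLD
        new_host = ".".join(host_labels[:-1] + [tld])
        # No port/userinfo assumed, so the new host doubles as netloc.
        variants.append(urlunsplit(
            (parts.scheme, new_host, parts.path, parts.query, parts.fragment)))
    return variants

print(enumerate_tld_variants("http://paypal-secure.com/login"))
```

Each variant can then be probed or pre-emptively added to the blacklist, which is how the paper turns 6,000 entries into roughly 18,000 new phishing URLs.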

... Current phishing detection methods analyze specific features of suspicious websites. Traditional approaches involve filtering based on lists of previously identified malicious or legitimate Web addresses [8]- [10]. Additionally, visual analysis considers design elements such as company logos, color schemes, font types, and alignment [11]. ...
... In this subsection, we review some existing literature on phishing attack detection and examine their techniques. These methodologies were grouped into three categories: URL analysis [8]- [10], visual page similarity analysis [11] and page content analysis [21]- [23]. ...
... This tactic, known as URL spoofing, involves incorporating the name of a trusted entity into the address of the phishing site. Prakash et al. [10] proposed an algorithm that matches phishing URLs with legitimate URLs by using regular expressions. However, this approach is inadequate when phishing sites utilize Web addresses that are distinct from legitimate domains, thereby evading detection. ...
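The regex-based matching described in this snippet can be sketched as a brand-in-URL check: flag a URL when a trusted brand name appears in it but the registered domain is not the brand's own. The `BRANDS` list, the naive eTLD+1 extraction, and the function name `looks_spoofed` are illustrative assumptions, not the authors' implementation; the sketch also exhibits the stated limitation, since a phishing address that never mentions a brand passes undetected.

```python
import re
from urllib.parse import urlsplit

# Illustrative brand list; a real deployment would use a curated set.
BRANDS = ["paypal", "amazon", "apple"]

def looks_spoofed(url):
    """Flag a URL as suspicious if a trusted brand name appears in the
    hostname or path while the registered domain is not the brand's own."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    registered = ".".join(host.split(".")[-2:])  # naive eTLD+1 guess
    for brand in BRANDS:
        pattern = re.compile(re.escape(brand), re.IGNORECASE)
        if pattern.search(host) or pattern.search(parts.path):
            if not registered.startswith(brand + "."):
                return True  # brand mentioned, but domain isn't the brand's
    return False

print(looks_spoofed("http://paypal.secure-login.com/verify"))  # spoofed
print(looks_spoofed("https://www.paypal.com/signin"))          # legitimate
```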
Article
Full-text available
Phishing attacks are a growing threat that has evolved along with technological advancements. Existing detection methods struggle with constantly evolving tactics and "zero-day" attacks that exploit unknown vulnerabilities. This paper proposes a novel approach for identifying phishing web pages built on phishing kits and, consequently, detecting such attack attempts. Our method, Phish Fighter, analyzes the HTML code structure, focusing on recurring blocks across different phishing pages derived from the same source. These features are then fed to clustering and classification components to detect common structural patterns without relying on textual or visual content. This approach overcomes the limitations of existing solutions, and is robust against attacks that target various brands. Furthermore, we implemented a dedicated module for continuous data updates for the Phish Fighter. This module effectively recognizes even "zero-day" phishing attempts by analyzing only three pages associated with a new phishing kit. In addition, we successfully identified the phishing pages created by cloning the original source code of legitimate entities without requiring prior knowledge to distinguish such clones. The results support the efficiency and accuracy of this approach: weighted precision, recall and F1 score are all greater than 90%, and the respective micro-averaged metrics are above 95%.
... ii. PhishNet: Predictive Blacklisting PhishNet [28] solves the problem of exact matching (if a URL is a slightly changed version of a blacklisted one, it remains undetected). It generates almost all possible variants of a URL using five different variation heuristics, which are: i) Replace Top Level Domains (TLD) ii) ...
... To control phishing emails successfully, various solutions have been proposed, as discussed in [19][20][21][22][23][24][25][26]. Similarly, to control phishing website attacks, various solutions have also been proposed in [16][17][18][27][28][29][30][31][32][33][34][35][36][37][38][39]. Phishing emails are not the only means of fraud: phishers also use fake websites, and phishing emails often contain links to those fake websites. ...
... The list is maintained at the client side and is updated periodically; however, a URL changed even slightly from a blacklisted URL would result in no match. PhishNet [28] addresses the exact-match limitation found in blacklists. As the lifetime of these phishing attacks is very short, a large amount of storage is consumed by blacklisted URLs and domains that will be of no use in the near future. ...
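The approximate-matching idea that addresses the exact-match limitation can be sketched as component-wise comparison: dissect each URL into hostname, directory structure, and filename, then score a candidate against each blacklist entry component by component. The scoring scheme and threshold below are illustrative assumptions, not PhishNet's actual algorithm.

```python
from urllib.parse import urlsplit

def dissect(url):
    """Split a URL into individually comparable components:
    hostname, directory structure, and filename."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "host": parts.hostname or "",
        "dirs": tuple(segments[:-1]),
        "file": segments[-1] if segments else "",
    }

def approx_match(url, blacklist, threshold=2):
    """Count matching components against each blacklist entry and flag
    the URL when the score reaches the threshold, so a small change to
    one component no longer defeats the lookup."""
    cand = dissect(url)
    for entry in blacklist:
        known = dissect(entry)
        score = sum(cand[k] == known[k] for k in ("host", "dirs", "file"))
        if score >= threshold:
            return True
    return False

blacklist = ["http://evil-bank.com/secure/update/login.php"]
# Same host and directories, different filename: still caught.
print(approx_match("http://evil-bank.com/secure/update/verify.php", blacklist))
```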
Preprint
Internet technology is so pervasive today, from online social networking to online banking, that it has made people's lives more comfortable. Due to the growth of Internet technology, security threats to systems and networks are relentlessly inventive. One such serious threat is "phishing", in which attackers attempt to steal users' credentials using fake emails or websites or both. It is true that both industry and academia are working hard to develop solutions to combat phishing threats. It is therefore very important that organisations pay attention to end-user awareness in phishing threat prevention. The aim of our paper is twofold. First, we will discuss the history of phishing attacks and the attackers' motivations in detail, and then provide a taxonomy of the various types of phishing attacks. Second, we will provide a taxonomy of the various solutions proposed in the literature to protect users from phishing, based on the attacks identified in our taxonomy. Moreover, we also discuss the impact of phishing attacks on the Internet of Things (IoT). We conclude our paper by discussing various issues and challenges that still remain in the literature and that are important in the fight against phishing threats.
... [18] worked on an efficient approach for generating blacklist URLs that takes advantage of the current harmful URL search structure neighborhoods to discover and validate unknown malicious websites in order to grow the URL blacklist database. [19] provided a strategy for detecting phishing based on the monitoring of URL alterations according to the blacklist method. They presented combinations of known phishing sites and an approximate matching method. ...
... [26] proposed a heuristic malicious URL detection technique by using scraping and web crawling methods. PhishNet presented in [19], which proposed to detect phishing URLs based on a combination of five heuristic rules and a matching algorithm. [27] described a heuristic technique for detecting phishing websites based on a set of 12 rules by evaluating the static features employed and observing the behaviour of the current phishing URLs. ...
... The authors in [18] worked on an efficient approach for generating blacklist URLs that takes advantage of the current harmful URL search structure neighborhoods to discover and validate unknown malicious websites in order to grow the URL blacklist database. [19] provides a strategy for detecting phishing based on the monitoring of URL alterations. They presented combinations of known phishing sites and an approximate matching method. ...
Article
Full-text available
Malicious Uniform Resource Locators (URLs) pose a significant cybersecurity threat by carrying out attacks such as phishing and malware propagation. Conventional malicious URL detection methods, relying on blacklists and heuristics, often struggle to identify new and obfuscated malicious URLs. To address this challenge, machine learning and deep learning have been leveraged to enhance detection capabilities, albeit relying heavily on large and frequently updated datasets. Furthermore, the efficacy of these methods is intrinsically tied to the quality of the training data, a requirement that becomes increasingly challenging to fulfill in real-world scenarios due to constraints such as data scarcity, privacy concerns, and the dynamic nature of evolving cyber threats. In this study, we introduce an innovative framework for malicious URL detection based on predefined static feature classification by allocating priority coefficients and feature evaluation methods. Our feature classification encompasses 42 classes, including blacklist, lexical, host-based, and content-based features. To validate our framework, we collected a dataset of 5000 real-world URLs from prominent phishing and malware websites, namely URLhaus and PhishTank. We assessed our framework’s performance using three supervised machine learning methods: Support Vector Machine (SVM), Random Forest (RF), and Bayesian Network (BN). The results demonstrate that our framework outperforms these methods, achieving an impressive detection accuracy of 98.95% and a precision value of 98.60%. Furthermore, we conducted a benchmarking analysis against three comprehensive malicious URL detection methods (PDRCNN, the Li method, and URLNet), demonstrating that our proposed framework excels in terms of accuracy and precision. In conclusion, our novel malicious URL detection framework substantially enhances accuracy, significantly bolstering cybersecurity defences against emerging threats.
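The static lexical features such frameworks feed to classifiers like SVM and Random Forest can be sketched as follows. This tiny feature set (URL length, digit count, IP-literal host, suspicious tokens) is an illustrative subset of the kind of features described, not the paper's 42 feature classes, and the `SUSPICIOUS_TOKENS` list is an assumption.

```python
from urllib.parse import urlsplit

SUSPICIOUS_TOKENS = ["login", "verify", "update", "secure", "account"]  # illustrative

def lexical_features(url):
    """Extract a few static lexical features of the kind commonly fed
    to classical URL classifiers (SVM, Random Forest, Bayesian Network)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_dots": host.count("."),
        "has_at_symbol": "@" in url,
        "has_ip_host": host.replace(".", "").isdigit(),  # crude IPv4-literal check
        "suspicious_tokens": sum(tok in url.lower() for tok in SUSPICIOUS_TOKENS),
    }

print(lexical_features("http://192.168.0.1/secure/login.php?acct=1"))
```

Each dictionary would become one row of the training matrix, with the PhishTank/URLhaus label as the target.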
... Phishing Websites and URLs: Prior research proposes many methods to detect phishing websites and URLs [13,37,56,52,42,50,26] and analyzed the infrastructure used to host phishing websites [4,31]. To detect malicious URLs, detectors can extract lexical features [13,37], or features based on a URL's domain [26]. ...
... Phishing Websites and URLs: Prior research proposes many methods to detect phishing websites and URLs [13,37,56,52,42,50,26] and analyzed the infrastructure used to host phishing websites [4,31]. To detect malicious URLs, detectors can extract lexical features [13,37], or features based on a URL's domain [26]. Prior work has also proposed machine learning detectors that use features relating to a website's content, such as text from the rendered HTML DOM and images embedded in web pages [52,56,50,33]. ...
Preprint
Full-text available
Phishing attacks on enterprise employees present one of the most costly and potent threats to organizations. We explore an understudied facet of enterprise phishing attacks: the email relay infrastructure behind successfully delivered phishing emails. We draw on a dataset spanning one year across thousands of enterprises, billions of emails, and over 800,000 delivered phishing attacks. Our work sheds light on the network origins of phishing emails received by real-world enterprises, differences in email traffic we observe from networks sending phishing emails, and how these characteristics change over time. Surprisingly, we find that over one-third of the phishing email in our dataset originates from highly reputable networks, including Amazon and Microsoft. Their total volume of phishing email is consistently high across multiple months in our dataset, even though the overwhelming majority of email sent by these networks is benign. In contrast, we observe that a large portion of phishing emails originate from networks where the vast majority of emails they send are phishing, but their email traffic is not consistent over time. Taken together, our results explain why no singular defense strategy, such as static blocklists (which are commonly used in email security filters deployed by organizations in our dataset), is effective at blocking enterprise phishing. Based on our offline analysis, we partnered with a large email security company to deploy a classifier that uses dynamically updated network-based features. In a production environment over a period of 4.5 months, our new detector was able to identify 3-5% more enterprise email attacks that were previously undetected by the company's existing classifiers.
... Larger coverage ensures flexibility in the attack phase. By the nature of these attacks, they could be thwarted by blacklists capable of approximate matching [20]. Nevertheless, these techniques are suited for ad-hoc swift exploits, to cause impact before a defender has time to respond. ...
... While the AP attacks have a higher EAR than the RE attacks (Table 1), the RE approach provides better diversity of attacks, as it uses a larger R Exploit = 0.5 (for AP,R Exploit = 0.1 ). Diversity ensures that simple countermeasures of blacklisting will not thwart an attack [20]. The effect of increasing the R Exploit for the AP attack, in an attempt to increase its diversity, is shown in Fig. 4a). ...
Preprint
The increasing scale and sophistication of cyberattacks has led to the adoption of machine learning based classification techniques, at the core of cybersecurity systems. These techniques promise scale and accuracy, which traditional rule or signature based methods cannot. However, classifiers operating in adversarial domains are vulnerable to evasion attacks by an adversary, who is capable of learning the behavior of the system by employing intelligently crafted probes. Classification accuracy in such domains provides a false sense of security, as detection can easily be evaded by carefully perturbing the input samples. In this paper, a generic data driven framework is presented, to analyze the vulnerability of classification systems to black box probing based attacks. The framework uses an exploration exploitation based strategy, to understand an adversary's point of view of the attack defense cycle. The adversary assumes a black box model of the defender's classifier and can launch indiscriminate attacks on it, without information of the defender's model type, training data or the domain of application. Experimental evaluation on 10 real world datasets demonstrates that even models having high perceived accuracy (>90%), by a defender, can be effectively circumvented with a high evasion rate (>95%, on average). The detailed attack algorithms, adversarial model and empirical evaluation, serve.
... As such, maintaining diversity is key, as larger coverage ensures more flexibility in attack generation. By the nature of these attacks, they can be thwarted by blacklists capable of approximate matching (Prakash et al., 2010). Nevertheless, they are suited for ad-hoc swift blitzkriegs, before the defender has time to respond. ...
... Blacklists are ubiquitous in security applications, as an approach to flag and block known malicious samples (Kantchelian et al., 2013). Modern blacklists are implemented using approximate matching techniques, such as Locality Sensitive Hashing, which can detect perturbations to existing flagged samples (Prakash et al., 2010). The goal of an attacker is to avoid detection by these blacklists, as they can make a large number of attack samples unusable with a quick filtering step. ...
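The Locality Sensitive Hashing idea mentioned here (detecting small perturbations of already-flagged samples) can be sketched with MinHash over character shingles of a URL: near-duplicate URLs share most shingles, so their signatures agree in most slots. The shingle size, hash count, and SHA-1-based hash family below are illustrative choices, not the cited systems' parameters.

```python
import hashlib

def shingles(s, k=4):
    """Set of character k-grams of a string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the shingle set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.sha1(f"{seed}:{it}".encode()).hexdigest(), 16)
            for it in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("evil-bank.com/secure/login.php"))
b = minhash_signature(shingles("evil-bank.com/secure/log1n.php"))  # one-char perturbation
c = minhash_signature(shingles("example.org/index.html"))          # unrelated URL
print(estimated_jaccard(a, b) > estimated_jaccard(a, c))
```

A blacklist built this way flags the perturbed URL because its signature stays close to the flagged original, which is exactly why an attacker needs diverse rather than lightly perturbed samples.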
Preprint
While modern day web applications aim to create impact at the civilization level, they have become vulnerable to adversarial activity, where the next cyber-attack can take any shape and can originate from anywhere. The increasing scale and sophistication of attacks, has prompted the need for a data driven solution, with machine learning forming the core of many cybersecurity systems. Machine learning was not designed with security in mind, and the essential assumption of stationarity, requiring that the training and testing data follow similar distributions, is violated in an adversarial domain. In this paper, an adversary's view point of a classification based system, is presented. Based on a formal adversarial model, the Seed-Explore-Exploit framework is presented, for simulating the generation of data driven and reverse engineering attacks on classifiers. Experimental evaluation, on 10 real world datasets and using the Google Cloud Prediction Platform, demonstrates the innate vulnerability of classifiers and the ease with which evasion can be carried out, without any explicit information about the classifier type, the training data or the application domain. The proposed framework, algorithms and empirical evaluation, serve as a white hat analysis of the vulnerabilities, and aim to foster the development of secure machine learning frameworks.
... Research on malicious URL detection can be divided into three sub-categories: blacklist-based detection, machine learning methods, and deep learning based on feature extraction. In the blacklist detection sub-category, Pawan Prakash [1] established their blacklist using the Google browser. URL detection was done by matching the IP address, hostname, and directory structure. ...
... Malicious URL classification using the proposed CNN-BiGRU-Att model achieved comparatively higher accuracy than other models. Accuracy is calculated using formula (1). As shown in Table 5, the accuracy of the CNN-BiGRU-Att model is 99.084%. ...
Research
Full-text available
The increasing popularity of cyber threats has led to a surge in malicious Uniform Resource Locators (URLs) that pose a significant risk to individuals, organizations, and the overall security of the digital landscape. As cybercriminals continuously evolve their techniques to bypass traditional security measures, the development of effective and accurate malicious URL classification systems has become a critical research area. New developments to the identification of malicious URLs pages has been made with Deep learning methods. The author aims to provide a comprehensive analysis of various state-of-the-art techniques/methodologies employed in classifying malicious URLs and proposes a malicious URL detection method based on one-dimensional convolutional neural network (CNN) and bi-directional gated recurrent unit (Bi-GRU) with attention mechanism. The experimental results demonstrate that the proposed method can achieve better classification results in malicious URL detection, which has high significance for practical applications.
... The list-based approach uses a list of known phishing websites to identify and block suspicious URLs. This approach involves the implementation of three distinct techniques: whitelist-based, blacklist-based, or a combination of both (Prakash et al., 2010;Li et al., 2014;Jain and Gupta, 2016;Rao and Pais, 2017;Azeez et al., 2021). In all three cases, the detection of phishing sites relies on the comparison of predefined databases containing approved and unapproved URLs, domains, IP addresses, etc. ...
... In all three cases, the detection of phishing sites relies on the comparison of predefined databases containing approved and unapproved URLs, domains, IP addresses, etc. Several studies have demonstrated the effectiveness of these techniques, such as their speed and ease of use (Ludl et al., 2007;Prakash et al., 2010;Li et al., 2014;Azeez et al., 2021). However, a majority of current studies contended that these techniques might have struggled to detect unlisted phishing sites, commonly referred to as zero-hour or zero-day attacks (Sonowal and Kuppusamy, 2018;Aljofey et al., 2022;Sanchez-Paniagua et al., 2022). ...
... Early Detection. PhishNet [68] proposes five heuristics to enumerate simple combinations of known phishing sites to discover new phishing URLs. Felegyhazi et al. [69] identify new malicious domains leveraging DNS zone files and WHOIS records. ...
Preprint
Internet miscreants increasingly utilize short-lived disposable domains to launch various attacks. Existing detection mechanisms are either too late to catch such malicious domains due to limited information and their short life spans or unable to catch them due to evasive techniques such as cloaking and captcha. In this work, we investigate the possibility of detecting malicious domains early in their life cycle using a content-agnostic approach. We observe that attackers often reuse or rotate hosting infrastructures to host multiple malicious domains due to increased utilization of automation and economies of scale. Thus, it gives defenders the opportunity to monitor such infrastructure to identify newly hosted malicious domains. However, such infrastructures are often shared hosting environments where benign domains are also hosted, which could result in a prohibitive number of false positives. Therefore, one needs innovative mechanisms to better distinguish malicious domains from benign ones even when they share hosting infrastructures. In this work, we build MANTIS, a highly accurate practical system that not only generates daily blocklists of malicious domains but also is able to predict malicious domains on-demand. We design a network graph based on the hosting infrastructure that is accurate and generalizable over time. Consistently, our models achieve a precision of 99.7%, a recall of 86.9% with a very low false positive rate (FPR) of 0.1% and on average detects 19K new malicious domains per day, which is over 5 times the new malicious domains flagged daily in VirusTotal. Further, MANTIS predicts malicious domains days to weeks before they appear in popular blocklists.
... Prakash et al. propose five different heuristics that allow synthesizing new URLs from existing ones. The authors use this idea to enlarge the existing blacklist of malicious URLs [29]. ...
Preprint
Unsafe websites consist of malicious as well as inappropriate sites, such as those hosting questionable or offensive content. Website reputation systems are intended to help ordinary users steer away from these unsafe sites. However, the process of assigning safety ratings for websites typically involves humans. Consequently it is time consuming, costly and not scalable. This has resulted in two major problems: (i) a significant proportion of the web space remains unrated and (ii) there is an unacceptable time lag before new websites are rated. In this paper, we show that by leveraging structural and content-based properties of websites, it is possible to reliably and efficiently predict their safety ratings, thereby mitigating both problems. We demonstrate the effectiveness of our approach using four datasets of up to 90,000 websites. We use ratings from Web of Trust (WOT), a popular crowdsourced web reputation system, as ground truth. We propose a novel ensemble classification technique that makes opportunistic use of available structural and content properties of webpages to predict their eventual ratings in two dimensions used by WOT: trustworthiness and child safety. Ours is the first classification system to predict such subjective ratings and the same approach works equally well in identifying malicious websites. Across all datasets, our classification performs well with average F1_1-score in the 74--90\% range.
... To enhance protection, various software-based techniques have been developed. Traditional methods like blacklists [9] and whitelists [10] have limitations in identifying new or modified phishing sites. Machine learning approaches use heuristic features such as URLs [11] [12] [13], HTML content [14] [15] [16], website traffic, search engine data, and WHOIS records for improved classification. ...
... Phishing detection techniques are classified into five approaches: whitelist [1], [2], blacklist [3], [4], content [5], [6], [7], [8], [9], visual similarity [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], and uniform resource locator (URL)based approaches [22], [23], [24], [25], [26]. ...
Article
Phishing is a cyberattack method employed by malicious actors who impersonate authorized personnel to illicitly obtain confidential information. Phishing URLs constitute a form of URL-based intrusion designed to entice users into disclosing sensitive data. This study focuses on URL-based phishing detection, which involves identifying phishing webpages by analyzing their URLs. We primarily address two unresolved issues of previous research: (i) insufficient word tokenization of URLs, which leads to the inclusion of meaningless and unknown words, thereby diminishing the accuracy of phishing detection, (ii) dataset-dependent lexical features, such as phish-hinted words extracted exclusively from a given dataset, which restricts the robustness of the detection method. To solve the first issue, we propose a new segmentation-based word-level tokenization algorithm called SegURLizer. The second issue is addressed by dataset-independent natural language processing (NLP) features, incorporating various features extracted from domain-part and path-part of URLs. Then, we employ the segmentation-based features with SegURLizer on two neural network models - long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM). We then train the 36 selected NLP-based features on deep neural networks (DNNs). Afterward, we combine the outputs of the two models to develop a hybrid DNN model, named “PhiSN, ” representing phishing URL detection through segmentation and NLP features. Our experiments, conducted on five open datasets, confirmed the higher performance of our PhiSN model compared to state-of-the-art methods. Specifically, our model achieved higher F1-measure and accuracy, ranging from 95.30% to 99.75% F1-measure and from 95.41% to 99.76% accuracy, on each dataset.
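The word-tokenization problem that SegURLizer addresses can be illustrated with a toy greedy longest-match segmenter over a small dictionary: concatenated URL tokens like "paypalsecurelogin" contain meaningful words that naive tokenizers miss. Both the `VOCAB` set and the greedy algorithm below are illustrative stand-ins, not the paper's method.

```python
# Tiny illustrative dictionary; a real segmenter uses a far richer model.
VOCAB = {"pay", "paypal", "secure", "login", "bank", "account", "update"}

def greedy_segment(token, vocab=VOCAB):
    """Greedy longest-match segmentation of a concatenated URL token.
    Characters matching no vocabulary word are emitted singly, which is
    exactly the 'unknown word' problem word-level tokenizers face."""
    words, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):  # try the longest match first
            if token[i:j] in vocab:
                words.append(token[i:j])
                i = j
                break
        else:
            words.append(token[i])  # no match: emit single character
            i += 1
    return words

print(greedy_segment("paypalsecurelogin"))
```

The recovered words ("paypal", "secure", "login") are precisely the phish-hinting signals that poor tokenization would otherwise lose.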
... To overcome this, many heuristic methods were used to improve the results' accuracy. It then attained 97% more accurate results compared with other state-of-the-art tools [16]. CANTINA was then upgraded and renamed CANTINA+ by [17], and this was thought to be the most thorough and feature-rich solution for content-based phishing detection. ...
Article
Full-text available
Phishing is one of the most widely observed types of internet cyber-attack, through which hundreds of clients using different internet services are targeted every day through different replicated websites. The phishing attacker spreads messages containing false URL links through emails, social media platforms, or messages, targeting people to steal sensitive data like credentials. Attackers generate phishing URLs that resemble those of legitimate websites to gain these confidential data. Hence, there is a need to prevent the siphoning of data through the duplication of trustworthy websites and raise public awareness of such practices. For this purpose, many machine learning and deep learning models have been employed to detect and prevent phishing attacks, but due to the ever-evolving nature of these attacks, many systems fail to provide accurate results. In this study, we propose a deep learning-based system using a 1D convolutional neural network to detect phishing URLs. The experimental work was performed using datasets from Phish-Tank, UNB, and Alexa, which successfully generated 200 thousand phishing URLs and 200 thousand legitimate URLs. The experimental results show that the proposed system achieved 99.7% accuracy, which was better than the traditional models proposed for URL-based phishing detection.
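The preprocessing step that character-level 1D-CNN detectors such as this one share (mapping each URL character to an integer index and padding or truncating to a fixed length) can be sketched as follows; the character vocabulary and maximum length are illustrative choices, not the paper's configuration.

```python
import string

# Illustrative character vocabulary; index 0 is reserved for padding
# and index 1 for characters outside the vocabulary.
CHARS = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_IDX = {ch: i + 2 for i, ch in enumerate(CHARS)}

def encode_url(url, max_len=64):
    """Character-level encoding used as 1D-CNN input: each character
    becomes an integer, and the sequence is truncated or zero-padded
    to a fixed length so every URL yields a same-shaped tensor."""
    ids = [CHAR_TO_IDX.get(ch, 1) for ch in url.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

vec = encode_url("http://evil.com/login")
print(len(vec))
```

The resulting fixed-length integer vector is what an embedding layer followed by 1D convolutions would consume.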
... In these cases, it is possible to modify blacklisting techniques to take URL heuristics into account [13]. In addition, it is important to understand how long such websites live on average. ...
Article
Full-text available
Cybercriminals create phishing websites that mimic legitimate websites to get sensitive information from companies, individuals, or governments. Therefore, using state-of-the-art artificial intelligence and machine learning technologies to correctly classify phishing and legitimate URLs is imperative. We report the results of applying deterministic and probabilistic neural network models to URL classification. Key achievements of this work are: (1) the development of a unique approach based on probabilistic neural networks that improves classification accuracy; (2) we show for the first time in URL phishing research that a machine learning model trained on a combination of open-source and private datasets is successful in production. The dataset is constructed from open sources such as Alexa, PhishTank, and OpenPhish and, most importantly, real-world production data from EasyDMARC. Daily validation of the model, using daily reported URL data and corresponding labels from both open-source platforms and private production, reaches on average 97% accuracy on the validation dataset labeled by PhishTank, OpenPhish, and EasyDMARC; possible mislabeled data cannot be excluded and could not be checked due to the large number of URLs. Feature engineering was done without third-party dependencies. Lastly, the evaluation of both deterministic and probabilistic models shows high accuracy on short and long URLs, where short URLs are defined as having fewer than 50 characters.
... Traditional detection methodologies heavily depend on blacklist databases or heuristic rules. However, these methods often lag behind the rapidly evolving strategies employed by perpetrators of phishing attacks [5,3]. Consequently, machine learning and, in more recent years, deep learning techniques have been leveraged to automate and enhance the detection process. ...
... The urgency to identify phishing URLs has escalated with the advent of online transactions and the refinement of phishing ploys [2]. While earlier detection methods were reliant on static blacklists or heuristic analysis, their efficacy diminishes against the constantly evolving tactics of phishing attacks [3]. Consequently, machine learning and, notably, deep learning have emerged as pivotal tools to automate and refine the detection process, offering more adaptive and potent defenses against this escalating cyber threat [4]. ...
Article
Full-text available
Phishing attacks are a major digital threat, impacting individuals and organizations globally. This review paper examines evolving anti-phishing strategies by analyzing five key techniques: URL blacklists, visual similarity detection, heuristic methods, machine learning models, and deep learning techniques. Each technique is evaluated for its mechanisms, unique features, and challenges. A systematic literature review (SLR) is conducted to compare these methods' effectiveness. The paper highlights significant research challenges and suggests future directions, emphasizing the integration of artificial intelligence and behavioral analytics to combat evolving phishing tactics. This study aims to advance understanding and inspire more effective anti-phishing solutions.
Chapter
Phishing is still one of the most common and dangerous types of cyber threats, and it keeps getting more sophisticated. The increasing prevalence of phishing attempts has made it necessary for enterprises across the globe to prioritize the development of offensive and defensive cybersecurity solutions. This case study explores the current state of phishing and the developments in cyber protection tactics that have accompanied them, showcasing actual situations in which businesses have successfully reduced these risks. Phishing uses human psychology to trick people into disclosing private data, such as login passwords, bank account details, or other personal information. It frequently takes the form of official correspondence. A more advanced variation called spear-phishing is more difficult to identify and counter since it uses carefully constructed communications that are targeted at certain people or companies. More proactive threat identification and response are now possible thanks to the integration of AI and ML with conventional SIEM systems.
Article
Malicious assaults, with an emphasis on URLs, are detected using a new technique that makes use of machine learning. We use hybrid machine learning models in conjunction with ensemble approaches for Natural Language Processing (NLP). To extract pertinent information, we preprocess a dataset that includes both malicious and genuine URLs. We improve our models' accuracy and efficiency by using strategies such as Grid Search Hyperparameter Optimization and Canopy feature selection. Evaluation measures that show the effectiveness of our method include precision, accuracy, recall, F1-score, and specificity. According to comparative analysis, our hybrid machine learning system, which incorporates natural language processing (NLP), performs better than current models, providing strong protection against malevolent threats and improving cybersecurity. Keywords: machine learning, Natural Language Processing (NLP), cybersecurity, Canopy feature selection, Grid Search Hyperparameter Optimization, evaluation metrics
Article
Full-text available
In the dynamic landscape of cybersecurity, organizations face an ever-evolving array of cyber threats that necessitate proactive defense strategies. Predictive analytics, leveraging advanced statistical methods, machine learning (ML), and artificial intelligence (AI), offers significant potential in anticipating and mitigating cyber threats before they materialize. This paper provides a comprehensive examination of predictive analytics in the context of cybersecurity, detailing its core methodologies, applications, and the benefits it brings to threat detection and prevention. Through an extensive literature review and analysis of recent case studies, we assess the effectiveness of predictive analytics in identifying emerging threats, enhancing incident response, and reducing the impact of cyberattacks. The study also explores the challenges associated with implementing predictive analytics, including data quality, model accuracy, and integration with existing security frameworks. Future research directions are proposed to address these challenges, emphasizing the need for more sophisticated models, improved data integration techniques, and collaborative threat intelligence sharing. The findings underscore the critical role of predictive analytics in fortifying organizational defenses and fostering a proactive cybersecurity posture.
Article
Phishing is a common cybercrime event with great harm. Various phishing attacks have occurred repeatedly and have caused huge economic losses. With the booming development of blockchain and cryptocurrency, the huge amount of money in the field and the immature ecosystem have induced phishing attacks to flood the field in large quantities. Unfortunately, phishing has become the main means of attack in the field, posing a huge security threat to users’ digital assets. The existing methods for detecting phishing websites rely on the quality of URL feature extraction, and the extraction angle is becoming increasingly rigid. Therefore, this paper proposes a phishing URL detection model that utilizes feature extension. This method uses the TextRank algorithm to generate a feature extension library and embeds the extracted features into the URL to be detected. After the URL is vectorized, it is input into the two-layer classification network proposed in this paper to classify the website. This classifier consists of an upstream task Bert layer and a downstream task CNN layer. It is possible to simultaneously learn the comprehensive representation information and local feature information of URLs, effectively avoiding overfitting problems and improving the ability to identify phishing websites. Comparative experiments are conducted using a dataset of real phishing websites. The experimental results show that this model has higher accuracy and stability compared to other phishing website detection models.
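The TextRank step in the abstract above ranks candidate terms by running PageRank over a word co-occurrence graph. A tiny self-contained sketch of that keyword-extraction idea (window size, damping, and iteration count are illustrative defaults, not the paper's settings):

```python
from collections import defaultdict

def textrank_keywords(words, window=2, iters=30, d=0.85, top_k=3):
    """Rank words by PageRank over a co-occurrence graph built from a
    sliding window. A simplified sketch of TextRank, not the paper's pipeline."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:                 # no self-loops
                graph[w].add(words[j])
                graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):                    # power-iteration PageRank
        score = {w: (1 - d) + d * sum(score[v] / len(graph[v]) for v in graph[w])
                 for w in graph}
    return [w for w, _ in sorted(score.items(), key=lambda x: -x[1])[:top_k]]
```

The top-ranked words would then seed the feature-extension library that gets embedded into the URL representation.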
Chapter
Full-text available
BitM (Browser-in-the-Middle) is a new generation of phishing with a high probability of defeating present-day two-factor or multi-factor authentication logins that are presumed protected. To counter this security threat, an internally curated dataset was generated to facilitate the detection of BitM in the context of phishing. Because existing approaches emphasize mitigating BitM phishing rather than detecting it, this study introduces a novel empirical approach that applies machine learning algorithms to data packets to detect BitM phishing attacks. The limited availability of training data, due to inadequate BitM testing facilities, restricted the expansion of our dataset. This study is motivated by the high severity score assigned to the BitM attack by the Common Attack Pattern Enumeration and Classification (CAPEC). The accuracies of five classifiers, namely SVM, MLP, Naive Bayes, Random Forest, and Decision Tree, are examined, with Random Forest achieving the highest performance at 99.1% accuracy.
Article
Full-text available
In recent years, the advancement of Artificial Intelligence (AI) has significantly impacted various fields, particularly cybersecurity. However, current approaches to combating cyber threats, such as phishing attacks, remain limited by their inability to address evolving vulnerabilities in online systems effectively. Despite this challenge, extensive research has demonstrated the efficacy of learning-based models, notably Machine Learning (ML), Ensemble Learning (EL), and Deep Learning (DL), in developing defensive mechanisms against these threats. However, these methods encounter challenges when dealing with adversarial examples (AEs). Multimodal models (MM) have emerged as a promising approach to address this issue. Despite their potential, there is a notable lack of research employing multimodal techniques for phishing website detection (PWD), especially in the context of adversarial websites. To tackle this challenge, this paper assesses 15 learning-based models, particularly multimodal ones, for phishing and adversarial detection, aiming to enhance their defense capabilities. Due to the scarcity of adversarial website examples, training and testing of models are limited. Therefore, this study proposes an innovative attack framework, AWG (Adversarial Website Generation), that employs Generative Adversarial Networks (GAN) and transfer-based black-box attacks to create AEs. This framework closely mirrors real-world attack scenarios, ensuring high effectiveness and realism. Finally, we present defense strategies with straightforward implementation and high effectiveness to enhance the resistance of models. The models underwent training and testing on a dataset collected from reputable sources such as OpenPhish, PhishTank, Phishing Database, and Alexa. This approach was chosen to ensure the dataset's diversity and relevance to reflect real-world conditions.
Experimental results highlight that the Generator’s effectiveness is demonstrated by a domain structure generation rate exceeding 90%. Moreover, AEs generated by this Generator effectively bypass most state-of-the-art ML, DL, and EL models with an evasion rate of up to 88%. Notably, the Support Vector Machine (SVM) model is the most vulnerable, with a detection rate of only 10.02%. On the other hand, the Multimodal model Shark-Eyes demonstrates outstanding resistance against AEs, with a detection rate of up to 99%. Upon applying our defense strategy, the resistance of models is significantly boosted, with all detection rates surpassing 90%. These findings underscore the robustness of our methods and pave the way for further exploration into advanced attack and defense strategies in the context of phishing website detection and adversarial attacks.
Article
Phishing attacks are increasing every year, in both number and technique. Using only human weaknesses, an attacker can easily obtain the victim's credentials or access their network. The problem persists despite the many approaches offered by researchers, due to its dynamic nature, in which new phishing tactics are created all the time. We therefore need more robust and effective methods to detect phishing emails. In this paper, we aim to detect phishing emails using the body text of the email with a hybrid approach combining case-based reasoning (CBR) and a deep learning model. Our proposed model, called DL-CBR, consists of a Bidirectional Long Short-Term Memory (Bi-LSTM) + Temporal Convolutional Network (TCN) network with an attention mechanism followed by a CBR classifier. The deep learning model is used for email representation, where it is trained using the N-pair loss function. To demonstrate the performance of DL-CBR, evaluation metrics such as precision, accuracy, recall, and F-measure were used, and we obtained an accuracy of 98.28%. The results show that our model outperformed other CBRs that utilize classical text representations like TF-IDF and Bag-of-Words. Additionally, while our model's performance is slightly below that of the state-of-the-art models, it offers several advantages inherent to CBR. For instance, it can learn from new cases and update its database accordingly.
Article
Full-text available
Protecting sensitive data and preventing its misuse has become a challenging task. Even a small mistake in securing data can be exploited by phishing attacks to leak private information such as passwords or financial details to a malicious actor. Phishing has proven so successful that it is now the number one attack vector. Many approaches have been proposed to protect against this cyber-attack, from additional staff training and enriched spam filters to large collaborative databases of known threats such as PhishTank and OpenPhish. However, these mostly rely on a user falling victim to an attack and manually adding the new threat to the shared pool, which presents a constant disadvantage in the fight against phishing. In this paper, we propose a novel approach to protect against phishing attacks using machine learning. Unlike previous work in this field, our approach uses an automated detection process and requires no further user interaction, which allows for faster and more accurate detection. Experimental results show that our approach has a high detection rate and eliminates the disadvantages of earlier methods. Having thoroughly reviewed the literature, we suggest a new method for detecting phishing websites using feature extraction and a machine learning algorithm, and we use the collected dataset to train ML models and deep neural networks to anticipate phishing websites.
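Feature extraction from a raw URL typically means computing simple lexical statistics that a classifier can consume. A minimal sketch, assuming a small illustrative feature set (these names are not the paper's exact features):

```python
import re
from urllib.parse import urlparse

def url_lexical_features(url: str) -> dict:
    """Extract a few lexical features commonly used in URL classifiers.
    The feature set here is illustrative, not the paper's exact list."""
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),                 # subdomain depth proxy
        "num_digits": sum(c.isdigit() for c in url),
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host)),
        "num_special": sum(url.count(c) for c in "-_@?=&%"),
    }
```

Each URL becomes a fixed-length feature vector, which is the input format most ML classifiers expect.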
Article
Attackers perform malicious activities by sending URLs to victims via e-mail, SMS, social network messages, and other means. Recently, intruders have been generating malicious URLs algorithmically, and they use shortening or obfuscation services to bypass firewalls and other security barriers. Some machine learning methods have been presented to distinguish malicious URLs from normal ones, but all are subject to classification errors. On the other hand, it is impractical to maintain a complete and up-to-date blacklist because of the large number of malicious URLs generated daily. Therefore, computing a URL's security risk would be more helpful than binary classification: a user can correctly decide whether to use an unfamiliar URL if they know its associated security risk. In this study, the problem of URL security risk computation is introduced and two effective novel criteria for it are proposed. Based on these criteria, a security risk score can be estimated for each incoming URL. In the first criterion, based on previous malicious and non-malicious URL instances, the extracted features of a URL are divided into those that increase the risk and those that reduce it. In the second criterion, the security risk score of an unknown URL is estimated from its distances to the nearest known malicious and safe URLs. For both criteria, the corresponding formulations and algorithms are designed and described. Extensive empirical evaluations on various real datasets show the effectiveness of the proposed criteria in terms of malicious URL detection rate. Moreover, our experiments show that the proposed metrics significantly outperform previously proposed risk score criteria.
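The second criterion described above scores a URL by its distances to the nearest known malicious and safe URLs. A minimal sketch of that idea using Levenshtein edit distance and a distance-ratio score (this exact formula is an illustrative stand-in, not the paper's):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def risk_score(url, malicious, benign):
    """Risk in [0, 1]: closer to known-malicious than known-benign URLs
    yields a higher score. The ratio form is an illustrative stand-in."""
    d_mal = min(edit_distance(url, m) for m in malicious)
    d_ben = min(edit_distance(url, b) for b in benign)
    total = d_mal + d_ben
    return d_ben / total if total else 0.5
```

A score near 1 means the URL sits in the neighbourhood of known phishing URLs; near 0, in the neighbourhood of known safe ones.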
Article
This study focuses primarily on phishing attacks, a prevalent form of cybercrime conducted over the internet. Despite originating in 1996, phishing has evolved into one of the most severe threats online. It relies on email deception, often coupled with fraudulent websites, to trick individuals into divulging sensitive information. While various studies have explored preventive measures and detection techniques, there remains a lack of a comprehensive solution. Hence, leveraging machine learning is crucial in combating such cybercrimes effectively. The study utilizes a phishing URL-based dataset sourced from a renowned repository, comprising attributes of both phishing and legitimate URLs collected from over 11,000 websites. Following data preprocessing, several machine learning algorithms are employed to detect phishing URLs and safeguard users. These algorithms include decision trees (DT), logistic regression (LR), random forest (RF), naive Bayes (NB), gradient boosting classifier (GBM), K-neighbors classifier (KNN), support vector classifier (SVC), and a novel hybrid model, LSD, which integrates logistic regression, support vector machine, and decision tree (LR+SVC+DT) with soft and hard voting mechanisms. Additionally, the canopy feature selection technique, cross-fold validation, and Grid Search hyperparameter optimization are employed with the proposed LSD model. To assess the effectiveness of the proposed approach, various evaluation metrics such as precision, accuracy, recall, F1-score, and specificity are employed.
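The soft and hard voting mechanisms mentioned above combine several base classifiers in two standard ways: averaging their class probabilities, or taking a majority over their predicted labels. A minimal sketch of both rules (the inputs are stub classifier outputs, not the paper's trained models):

```python
def soft_vote(probas):
    """Average per-class probabilities from several classifiers, pick argmax.
    `probas` is a list of per-classifier probability vectors."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

def hard_vote(labels):
    """Majority vote over predicted labels; ties go to the smallest label."""
    return max(sorted(set(labels)), key=labels.count)
```

Soft voting generally exploits classifier confidence, while hard voting only needs discrete predictions; ensembles like LR+SVC+DT can be combined either way.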
Chapter
Phishing attacks have become a rampant cybercrime, where hackers use deceptive methods to trick individuals into sharing their personal information. However, with the advent of AI models like ChatGPT, detecting phishing URLs accurately has become increasingly challenging for traditional machine learning techniques due to the creation of fake content by hackers. To address this issue, we propose a new feature called “human-content” that helps in differentiating legitimate and phishing websites based on human-generated content on a website. To classify phishing URLs, we employed various features, such as domain-based, JavaScript-based, and lexical-based features, along with the novel “human-content” feature. To perform the classification, we used five machine learning classifiers, including Gradient Boosting Classifier, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Logistic Regression. Our experimental results on a dataset consisting of both legitimate URLs obtained from Alexa and phishing URLs from PhishTank demonstrate that the Gradient Boosting Classifier has the best performance, achieving an accuracy of 95.8%. Our proposed approach provides promising results in detecting phishing URLs and is more robust to AI-generated fake content. As phishing attacks continue to become more sophisticated, our method can help enhance online user security by effectively detecting phishing URLs.
Chapter
In this paper, we have described the use of distributed representations for malicious domain detection using Random Indexing and Machine Learning. At first, the proposed approach focuses on distributed representations of the context accumulated from domains, subdomains, and the path of each URL in the given set using Random Indexing, and then applies machine learning approaches for classification to detect malicious and benign domains. In order to measure the classification performance, we have built five machine learning classifiers using Logistic Regression, Decision Tree, k-Nearest Neighbors, Support Vector Machines, and Random Forest. All these machine learning models are used to detect malicious domains in a given set of URLs. We have used two datasets: one consisting of malicious domains collected from 360.net Lab and another consisting of benign domains collected from Alexa's top 1 million domains. We have compared the performance of an existing malicious domain detection approach with the proposed Random Indexing and machine learning-based approach on different distributions of the training and test dataset. It has been observed that the proposed approach with the Random Forest classifier identifies malicious URLs with a precision score of 99.5%.
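Random Indexing builds distributed representations by assigning each token a sparse random "index vector" and summing those vectors over a context. A minimal sketch, assuming illustrative dimensionality and sparsity choices (not the paper's parameters):

```python
import random

def index_vector(token, dim=100, nnz=4, seed=42):
    """Sparse ternary index vector for a token, seeded deterministically
    from the token so repeated lookups agree within a process.
    `dim` and `nnz` are illustrative choices."""
    rng = random.Random(hash((seed, token)) & 0xFFFFFFFF)
    vec = [0] * dim
    for pos in rng.sample(range(dim), nnz):
        vec[pos] = rng.choice((-1, 1))
    return vec

def context_vector(tokens, dim=100):
    """Sum the index vectors of a token sequence, e.g. the domain,
    subdomain, and path components of a URL."""
    ctx = [0] * dim
    for t in tokens:
        for i, v in enumerate(index_vector(t, dim)):
            ctx[i] += v
    return ctx
```

The resulting fixed-length context vectors serve as the features fed to the classifiers.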
Conference Paper
Full-text available
Banks and other organisations deal with fraudulent phishing websites by pressing hosting service providers to remove the sites from the Internet. Until they are removed, the fraud- sters learn the passwords, personal identication numbers (PINs) and other personal details of the users who are fooled into visiting them. We analyse empirical data on phishing website removal times and the number of visitors that the websites attract, and conclude that website removal is part of the answer to phishing, but it is not fast enough to com- pletely mitigate the problem. The removal times have a good t to a lognormal distribution, but within the general pat- tern there is ample evidence that some service providers are faster than others at removing sites, and that some brands can get fraudulent sites removed more quickly. We particu- larly examine a major subset of phishing websites (operated by the 'rock-phish' gang) which accounts for around half of all phishing activity and whose architectural innovations have extended their average lifetime. Finally, we provide a ballpark estimate of the total loss being suered by the banking sector from the phishing websites we observed.
Conference Paper
Full-text available
The notion of blacklisting communication sources has been a well-established defensive measure since the origins of the Internet community. In particular, the practice of compiling and sharing lists of the worst offenders of unwanted traffic is a blacklisting strategy that has remained virtually unquestioned over many years. But do the individuals who incorporate such blacklists into their perimeter defenses benefit from the blacklisting contents as much as they could from other list-generation strategies? In this paper, we will argue that there exist better alternative blacklist generation strategies that can produce higher-quality results for an individual network. In particular, we introduce a blacklisting system based on a relevance ranking scheme borrowed from the link-analysis community. The system produces customized blacklists for individuals who choose to contribute data to a centralized log-sharing infrastructure. The ranking scheme measures how closely related an attack source is to a contributor, using that attacker's history and the contributor's recent log production patterns. The blacklisting system also integrates substantive log prefiltering and a severity metric that captures the degree to which an attacker's alert patterns match those of common malware-propagation behavior. Our intent is to yield individualized blacklists that not only produce significantly higher hit rates, but that also incorporate source addresses that pose the greatest potential threat. We tested our scheme on a corpus of over 700 million log entries produced from the DShield data center, and the result shows that our blacklists not only enhance hit counts but can also proactively incorporate attacker addresses in a timely fashion. An early form of our system has been fielded to DShield contributors over the last year.
Conference Paper
Unsolicited bulk e-mail, or SPAM, is a means to an end. For virtually all such messages, the intent is to attract the recipient into entering a commercial transaction -- typically via a linked Web site. While the prodigious infrastructure used to pump out billions of such solicitations is essential, the engine driving this process is ultimately the "point-of-sale" -- the various money-making "scams" that extract value from Internet users. In the hopes of better understanding the business pressures exerted on spammers, this paper focuses squarely on the Internet infrastructure used to host and support such scams. We describe an opportunistic measurement technique called spamscatter that mines emails in real-time, follows the embedded link structure, and automatically clusters the destination Web sites using image shingling to capture graphical similarity between rendered sites. We have implemented this approach on a large real-time spam feed (over 1M messages per week) and have identified and analyzed over 2,000 distinct scams on 7,000 distinct servers.
Conference Paper
Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
Conference Paper
Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF information retrieval algorithm. We also discuss the design and evaluation of several heuristics we developed to reduce false positives. Our experiments show that CANTINA is good at detecting phishing sites, correctly labeling approximately 95% of phishing sites.
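The TF-IDF step at the heart of a content-based detector like the one above scores each term of a page by its frequency in that page weighted against its rarity across a corpus; the top-scoring terms form a lexical signature. A minimal stdlib sketch of that scoring (the smoothing choice is an illustrative assumption, not CANTINA's exact formula):

```python
import math
from collections import Counter

def tfidf_top_terms(doc_tokens, corpus, k=3):
    """Score terms of one page by TF-IDF against a small corpus of
    token lists and return the top-k. A sketch of the lexical-signature
    idea, not CANTINA itself."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    def idf(term):
        df = sum(term in doc for doc in corpus)        # document frequency
        return math.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
    scores = {t: tf[t] * idf(t) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

In a CANTINA-style pipeline the top terms would be submitted to a search engine; a page whose own domain is absent from the results is flagged as suspicious.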
Conference Paper
We develop new techniques to map botnet membership using traces of spam email. To group bots into botnets we look for multiple bots participating in the same spam email campaign. We have applied our technique against a trace of spam email from Hotmail ...
Article
The Open Directory Project is the largest Web directory in existence today, maintained entirely by people. The presentation gives an overview of the work done to maintain the Catalan section.
Article
Thesis (M.S.)--University of California, San Diego, 2007. Includes bibliographical references (leaves 43-45).
NetScape, "Dmoz open directory project." http://www.dmoz.org
YURG, "Dmoz open directory project." http://random.yahoo.com/bin/ryl