Chapter

A Framework to Detect Compromised Websites Using Link Structure Anomalies: Proceedings of the Computational Intelligence in Information Systems Conference (CIIS 2018)


Abstract

Compromised and malicious websites are a serious threat to cyber security. Malicious users prefer to carry out activities such as phishing and spamming through compromised websites because they can mask their original identities behind these sites. Compromised websites are more difficult to detect than purpose-built malicious websites because they operate in masquerade mode, which is one of the main motivations for this research. This paper first reviews the related work and then introduces a framework to detect compromised websites using link structure analysis. PageRank, the popular link-structure-based ranking algorithm used by the Google search engine, is implemented in our experiment in the Java language; it is computed before and after a website is compromised, and the results are compared. The results show that when a website is compromised, its PageRank can drop, indicating that the website has been compromised.
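The before/after comparison described in the abstract can be illustrated with a small power-iteration PageRank computation. The Java sketch below is only an illustration of that idea, not the authors' code: the toy graph, the damping factor of 0.85, and the iteration count are assumptions, and the "compromise" is simulated by injecting hidden outlinks to attacker-controlled pages from one page of the site.

import java.util.*;

// Toy PageRank comparison before and after a simulated compromise.
public class PageRankCompromiseDemo {

    // Power-iteration PageRank; dangling pages spread their rank uniformly.
    static Map<String, Double> pageRank(Map<String, List<String>> graph,
                                        double d, int iterations) {
        Set<String> pages = new HashSet<>(graph.keySet());
        graph.values().forEach(pages::addAll);
        int n = pages.size();
        Map<String, Double> rank = new HashMap<>();
        for (String p : pages) rank.put(p, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String p : pages) next.put(p, (1 - d) / n);
            double danglingMass = 0.0;
            for (String p : pages) {
                List<String> out = graph.getOrDefault(p, Collections.emptyList());
                if (out.isEmpty()) {
                    danglingMass += rank.get(p);            // no outlinks
                } else {
                    double share = rank.get(p) / out.size();
                    for (String q : out) next.merge(q, d * share, Double::sum);
                }
            }
            for (String p : pages) next.merge(p, d * danglingMass / n, Double::sum);
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // A small site whose pages link to each other normally.
        Map<String, List<String>> before = new HashMap<>();
        before.put("A", Arrays.asList("B", "C"));
        before.put("B", Arrays.asList("A", "C"));
        before.put("C", Collections.singletonList("A"));
        before.put("D", Collections.singletonList("C"));

        // After the compromise, page C also carries hidden links to attacker pages.
        Map<String, List<String>> after = new HashMap<>(before);
        after.put("C", Arrays.asList("A", "spam1", "spam2", "spam3"));

        System.out.printf("PageRank of C before: %.4f%n", pageRank(before, 0.85, 100).get("C"));
        System.out.printf("PageRank of C after:  %.4f%n", pageRank(after, 0.85, 100).get("C"));
    }
}

In this toy graph the printed rank of C is lower after the injected links, which matches the drop the paper reports; note that part of the drop comes simply from the rank mass being shared among more pages, and the exact numbers depend on the assumed graph and damping factor.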

References
Conference Paper
The World Wide Web supports a wide range of criminal activities such as spam-advertised e-commerce, financial fraud and malware dissemination. Although the precise motivations behind these schemes may differ, the common denominator lies in the fact that unsuspecting users visit their sites. These visits can be driven by email, web search results or links from other web pages. In all cases, however, the user is required to take some action, such as clicking on a desired Uniform Resource Locator (URL). In order to identify these malicious sites, the web security community has developed blacklisting services. These blacklists are in turn constructed by an array of techniques including manual reporting, honeypots, and web crawlers combined with site analysis heuristics. Inevitably, many malicious sites are not blacklisted either because they are too recent or were never or incorrectly evaluated. In this paper, we address the detection of malicious URLs as a binary classification problem and study the performance of several well-known classifiers, namely Naïve Bayes, Support Vector Machines, Multi-Layer Perceptron, Decision Trees, Random Forest and k-Nearest Neighbors. Furthermore, we adopted a public dataset comprising 2.4 million URLs (instances) and 3.2 million features. The numerical simulations have shown that most classification methods achieve acceptable prediction rates without requiring either advanced feature selection techniques or the involvement of a domain expert. In particular, Random Forest and Multi-Layer Perceptron attain the highest accuracy.
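As a concrete illustration of the kind of input such classifiers consume, the Java sketch below derives a handful of lexical features from a URL string. The specific features chosen here are assumptions for illustration only; the study above works with a public dataset of 2.4 million URLs and 3.2 million features, which this sketch does not attempt to reproduce.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.regex.Pattern;

public class UrlLexicalFeatures {

    private static final Pattern IP_HOST = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    // A few illustrative lexical features extracted from a URL string.
    static double[] extract(String url) {
        String host = "";
        try {
            String h = new URI(url).getHost();
            if (h != null) host = h;
        } catch (URISyntaxException e) {
            // a URL that does not even parse is itself suspicious; leave host empty
        }
        return new double[] {
            url.length(),                                 // overall length
            url.chars().filter(c -> c == '.').count(),    // number of dots
            url.chars().filter(c -> c == '-').count(),    // number of hyphens
            IP_HOST.matcher(host).matches() ? 1 : 0,      // host is a raw IPv4 address
            url.toLowerCase().contains("login") ? 1 : 0,  // suspicious token
            url.startsWith("https://") ? 1 : 0            // scheme is HTTPS
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("http://192.0.2.15/secure-login/update.php")));
    }
}

Any of the classifiers named above could be trained on vectors like these once labels are attached; feature engineering of this sort is what the cited dataset automates at scale.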
Article
Link spammers are constantly seeking new methods and strategies to deceive the search engine ranking algorithms. The search engines need to come up with new methods and approaches to challenge the link spammers and to maintain the integrity of the ranking algorithms. In this paper, we propose a methodology to detect link spam contributed by zero-out-link (dangling) pages. We randomly selected a target page from live web pages, induced link spam according to our proposed methodology, and applied our algorithm to detect the link spam. The detailed results from amazon.com pages showed a considerable improvement in the target page's PageRank after the link spam was induced; our proposed method detected the link spam using eigenvectors and eigenvalues.
Conference Paper
Malicious URLs have been widely used to mount various cyber attacks including spamming, phishing and malware. Detection of malicious URLs and identification of threat types are critical to thwart these attacks. Knowing the type of a threat enables estimation of the severity of the attack and helps adopt an effective countermeasure. Existing methods typically detect malicious URLs of a single attack type. In this paper, we propose a machine learning method that detects malicious URLs of all the popular attack types and identifies the nature of the attack a malicious URL attempts to launch. Our method uses a variety of discriminative features including textual properties, link structures, webpage contents, DNS information, and network traffic. Many of these features are novel and highly effective. Our experimental studies with 40,000 benign URLs and 32,000 malicious URLs obtained from real-life Internet sources show that our method delivers a superior performance: the accuracy was over 98% in detecting malicious URLs and over 93% in identifying attack types. We also report our studies on the effectiveness of each group of discriminative features, and discuss their evadability.
Conference Paper
Web-based malware attacks have become one of the most serious threats that need to be addressed urgently. Several approaches that have attracted attention as promising ways of detecting such malware employ various blacklists. However, these conventional approaches often fail to detect new attacks owing to the versatility of malicious websites. Thus, it is difficult to maintain up-to-date blacklists with information regarding new malicious websites. To tackle this problem, we propose a new method for detecting malicious websites using the characteristics of IP addresses. Our approach leverages the empirical observation that IP addresses are more stable than other metrics such as URL and DNS. While the strings that form URLs or domain names are highly variable, IP addresses are less variable, i.e., the IPv4 address space is mapped onto 4-byte strings. We develop a lightweight and scalable detection scheme based on machine learning techniques. The aim of this study is not to provide a single solution that effectively detects web-based malware but to develop a technique that compensates for the drawbacks of existing approaches. We validate the effectiveness of our approach by using real IP address data from existing blacklists and real traffic data on a campus network. The results demonstrate that our method can expand the coverage/accuracy of existing blacklists and also detect unknown malicious websites that are not covered by conventional approaches.
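The stability argument can be illustrated with a very small sketch: an IPv4 address decomposes into four octets, so a candidate address can be compared against a blacklist at the network-prefix level. The /24 prefix length and the sample addresses below are assumptions for illustration; the paper itself builds a machine-learning scheme on such address features rather than a simple prefix match.

import java.util.*;

public class IpPrefixCheck {

    // Split a dotted-quad IPv4 address into its four octets.
    static int[] octets(String ip) {
        String[] parts = ip.split("\\.");
        int[] o = new int[4];
        for (int i = 0; i < 4; i++) o[i] = Integer.parseInt(parts[i]);
        return o;
    }

    // True if the candidate shares its /24 prefix with any blacklisted address.
    static boolean sharesPrefix(String candidate, Set<String> blacklist) {
        int[] c = octets(candidate);
        for (String listed : blacklist) {
            int[] b = octets(listed);
            if (b[0] == c[0] && b[1] == c[1] && b[2] == c[2]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<>(Arrays.asList("203.0.113.7", "198.51.100.23"));
        System.out.println(sharesPrefix("203.0.113.99", blacklist));  // true: same /24
        System.out.println(sharesPrefix("192.0.2.10", blacklist));    // false
    }
}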
Article
Phishing is a fraudulent act of acquiring sensitive information from unsuspecting users by masquerading as a trustworthy entity in electronic commerce. Several mechanisms such as spoofed e-mails, DNS spoofing and chat rooms which contain links to phishing websites are used to trick the victims. Though there are many existing anti-phishing solutions, phishers continue to lure the victims. In this paper, we present a novel approach that not only overcomes many of the difficulties in detecting phishing websites but also identifies the phishing target that is being mimicked. We propose an anti-phishing technique that groups the domains from hyperlinks having a direct or indirect association with the given suspicious webpage. The domains gathered from the directly associated webpages are compared with the domains gathered from the indirectly associated webpages to arrive at a target domain set. On applying the Target Identification (TID) algorithm on this set, we zero in on the target domain. We then perform a third-party DNS lookup of the suspicious domain and the target domain and, on comparison, identify the legitimacy of the suspicious page.
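The domain-grouping step described above can be sketched as follows in Java. The regex-based link extraction and the sample HTML snippets are simplifying assumptions, and the TID algorithm itself is not reproduced; the sketch only shows how domains from directly and indirectly associated pages can be intersected into a candidate target set.

import java.net.URI;
import java.util.*;
import java.util.regex.*;

public class DomainGrouping {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect the distinct host names referenced by a page's hyperlinks.
    static Set<String> domains(String html) {
        Set<String> out = new HashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                String host = new URI(m.group(1)).getHost();
                if (host != null) out.add(host.toLowerCase());
            } catch (Exception ignored) {
                // skip malformed links
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String suspiciousPage = "<a href=\"https://www.example-bank.com/logo.png\">"
                              + "<a href=\"http://evil.example.net/collect.php\">";
        String indirectPages  = "<a href=\"https://www.example-bank.com/\">";

        Set<String> candidates = domains(suspiciousPage);    // directly associated domains
        candidates.retainAll(domains(indirectPages));        // keep only the shared ones
        System.out.println("Candidate target domain(s): " + candidates);
    }
}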
Conference Paper
Link analysis algorithms for Web search engines determine the importance and relevance of Web pages. Among the link analysis algorithms, PageRank is the state-of-the-art ranking mechanism used in the Google search engine today. The PageRank algorithm models the behavior of a randomized Web surfer; this model can be seen as a Markov chain that predicts the behavior of a system travelling from one state to another while considering only the current state. However, this model suffers from the dangling node or hanging node problem, because such nodes cannot be represented in a Markov chain model. This paper focuses on the application of Markov chains to the PageRank algorithm and discusses a few methods to handle the dangling node problem. The experiment is run on the WEBSPAM-UK2007 dataset to show the rank results of the dangling nodes.
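One standard way to handle the dangling-node problem mentioned above is to make the transition matrix stochastic by replacing an all-zero row with a uniform distribution. The small Java sketch below shows that adjustment on a three-node example; the matrix is an illustrative assumption and is unrelated to the WEBSPAM-UK2007 data used in the paper.

import java.util.Arrays;

public class DanglingNodeFix {

    // Rows are source pages, columns are transition probabilities to each page.
    static double[][] makeStochastic(double[][] m) {
        int n = m.length;
        for (int i = 0; i < n; i++) {
            double rowSum = 0;
            for (double v : m[i]) rowSum += v;
            if (rowSum == 0) {                     // dangling node: no outlinks
                Arrays.fill(m[i], 1.0 / n);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] link = {
            {0.0, 0.5, 0.5},   // page 0 links to pages 1 and 2
            {1.0, 0.0, 0.0},   // page 1 links to page 0
            {0.0, 0.0, 0.0}    // page 2 is a dangling (hanging) node
        };
        System.out.println(Arrays.deepToString(makeStochastic(link)));
    }
}

After the adjustment every row sums to one, so the chain has a well-defined stationary distribution that PageRank's power iteration can converge to.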
Conference Paper
Compromised websites are often used by attackers to deliver malicious content or to host phishing pages designed to steal private information from their victims. Unfortunately, most of the targeted websites are managed by users with little security background, who are often unable to detect this kind of threat or to afford an external professional security service. In this paper we test the ability of web hosting providers to detect compromised websites and react to user complaints. We also test six specialized services that provide security monitoring of web pages for a small fee. During a period of 30 days, we hosted our own vulnerable websites on 22 shared hosting providers, including 12 of the most popular ones. We repeatedly ran five different attacks against each of them. Our tests included a bot-like infection, a drive-by download, the upload of malicious files, an SQL injection stealing credit card numbers, and a phishing kit for a famous American bank. In addition, we also generated traffic from seemingly valid victims of phishing and drive-by download sites. We show that most of these attacks could have been detected by free network or file analysis tools. After 25 days, if no malicious activity was detected, we started to file abuse complaints to the providers. This allowed us to study the reaction of the web hosting providers to both real and bogus complaints. The general picture we drew from our study is quite alarming. The vast majority of the providers, and the "add-on" security monitoring services, are unable to detect the simplest signs of malicious activity on hosted websites.
Article
Malicious URLs, a.k.a. malicious websites, are a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), causing losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the use of blacklists. However, blacklists cannot be exhaustive, and they lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of malicious URL detection techniques using machine learning. We present the formal formulation of malicious URL detection as a machine learning task, and categorize and review the contributions of studies that address different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only machine learning researchers and engineers in academia, but also professionals and practitioners in the cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design and open research challenges, and point out some important directions for future research.
Article
This article presents an algorithm to determine the relevancy of hanging pages in link-structure-based ranking algorithms. Recently, hanging pages have become major obstacles in web information retrieval (IR). As an increasing number of meaningful hanging pages appear on the Web, their relevancy has to be determined according to the query term in order to make the search engine result pages fair and relevant. Excluding these pages from the ranking calculation can give biased or inconsistent results, but including them reduces the speed significantly. Most IR ranking algorithms exclude the hanging pages, yet there are relevant and important hanging pages on the Web that cannot be ignored. In our proposed method, we use anchor text to determine the relevancy of hanging pages, and we use stability analysis to show that the rank results are consistent before and after altering the link structure.
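The anchor-text idea can be sketched with a simple term-overlap score: because a hanging page has no outgoing links of its own, the anchor texts of the links pointing to it are used to judge its relevancy to a query. The scoring function and the sample anchors below are assumptions for illustration, not the article's full method or its stability analysis.

import java.util.*;

public class HangingPageRelevancy {

    // Fraction of anchor-text terms that also occur in the query.
    static double relevancy(List<String> anchorTexts, String query) {
        Set<String> queryTerms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        int hits = 0, total = 0;
        for (String anchor : anchorTexts) {
            for (String term : anchor.toLowerCase().split("\\s+")) {
                total++;
                if (queryTerms.contains(term)) hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        // Anchor texts of links pointing to a hanging page (e.g. a PDF with no outlinks).
        List<String> anchors = Arrays.asList("annual financial report", "report 2018 pdf");
        System.out.println(relevancy(anchors, "financial report"));   // prints 0.5
    }
}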
Article
Malicious JavaScript code in webpages on the Internet is an emerging security issue because of its universality and potentially severe impact. Because of its obfuscation and complexity, detecting it has a considerable cost. Over the last few years, several machine learning-based detection approaches have been proposed; most of them use shallow discriminating models with features that are constructed using hand-crafted rules. However, with the advent of the big data era for information transmission, these existing methods can no longer satisfy actual needs. In this paper, we present a new deep learning framework for the detection of malicious JavaScript code, which obtained the highest detection accuracy compared with the control group. The architecture is composed of a sparse random projection, a deep learning model, and logistic regression. Stacked denoising auto-encoders were used to extract high-level features from JavaScript code; logistic regression was used as a classifier to distinguish between malicious and benign JavaScript code. Experimental results indicated that our architecture, with over 27,000 labeled samples, can achieve an accuracy of up to 95%, with a false positive rate of less than 4.2% in the best case.
Article
Standard Web graph representation fails to capture topic association and functional groupings of links and their occurrence across pages in the site. That limits its applicability and usefulness. In this paper we introduce a novel method for representing hypertext organization of Web sites in the form of Link Structure Graphs (LSGs). The LSG captures both the organization of links at the page level and the overall hyperlink structure of the collection of pages. It comprises vertices that correspond to link blocks of several types and edges that describe reuse of such blocks across pages. Identification of link blocks is approximated by the analysis of the HTML Document Object Model (DOM). Further differentiation of blocks into types is based on the recurrence of block elements across pages. The method gives rise to a compact representation of all the hyperlinks on the site and enables novel analysis of the site organization. Our approach is supported by the findings of an exploratory user study that reveals how the hyperlink structure is generally perceived by the users. We apply the algorithm to a sample of Web sites and discuss their link structure properties. Furthermore, we demonstrate that selective crawling strategies can be applied to generate key elements of the LSG incrementally. This further broadens the scope of LSG applicability.
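A rough approximation of the link-block idea is to group a page's hyperlinks by their nearest block-level ancestor in the DOM. The Java sketch below does this with the jsoup HTML parser; the choice of parser, the set of block-level tags, and the sample markup are assumptions, and the LSG's block typing and cross-page edges are not reproduced.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

public class LinkBlockSketch {

    private static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("ul", "ol", "nav", "div", "td", "p", "body"));

    // Walk up from an anchor to the nearest block-level ancestor.
    static Element enclosingBlock(Element a) {
        for (Element ancestor : a.parents()) {
            if (BLOCK_TAGS.contains(ancestor.tagName())) return ancestor;
        }
        return a.parent();
    }

    public static void main(String[] args) {
        String html = "<ul id=\"nav\"><li><a href=\"/home\">Home</a></li>"
                    + "<li><a href=\"/about\">About</a></li></ul>"
                    + "<p>See <a href=\"/paper.pdf\">the paper</a>.</p>";
        Document doc = Jsoup.parse(html);

        // Group hyperlinks by the CSS path of their enclosing block.
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        for (Element a : doc.select("a[href]")) {
            String key = enclosingBlock(a).cssSelector();
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(a.attr("href"));
        }
        blocks.forEach((block, links) -> System.out.println(block + " -> " + links));
    }
}

In this example the two navigation links end up in one block and the in-text link in another, which is the kind of grouping the LSG formalizes with typed vertices and cross-page edges.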
Article
Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
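The online-learning idea can be illustrated with the simplest possible online learner, a perceptron that updates its weights one labeled URL at a time. The perceptron and the tiny hand-made feature vectors below are assumptions for illustration; the article evaluates more sophisticated online algorithms over millions of lexical and host-based features.

public class OnlineUrlClassifier {

    private final double[] w;

    OnlineUrlClassifier(int dims) { this.w = new double[dims]; }

    // +1 = malicious, -1 = benign
    int predict(double[] x) {
        double score = 0;
        for (int i = 0; i < w.length; i++) score += w[i] * x[i];
        return score >= 0 ? +1 : -1;
    }

    // Perceptron rule: adjust the weights only when the prediction is wrong.
    void update(double[] x, int label) {
        if (predict(x) != label) {
            for (int i = 0; i < w.length; i++) w[i] += label * x[i];
        }
    }

    public static void main(String[] args) {
        OnlineUrlClassifier clf = new OnlineUrlClassifier(3);
        // Illustrative features: [url length / 100, number of dots, host is a raw IP]
        clf.update(new double[] {0.2, 2, 0}, -1);   // labeled benign
        clf.update(new double[] {0.8, 6, 1}, +1);   // labeled malicious
        System.out.println(clf.predict(new double[] {0.9, 5, 1}));   // expected: +1
    }
}

Because the model updates incrementally, it can adapt to newly observed malicious URLs without retraining on the full history, which is the property the article exploits.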
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full-text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/. Engineering a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine, the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google.
Cucu, P.: How Malicious Websites Infect You in Unexpected Ways. Retrieved June 30, 2018, from https://heimdalsecurity.com/blog/malicious-websites/
StopBadware.org: Compromised Websites: An Owner's Perspective. Retrieved June 14, 2018, from https://www.stopbadware.org/files/compromised-websites-an-owners-perspective.pdf
McEntee, D.: The 10 Signs You Have a Compromised Website. Retrieved June 27, 2018, from https://www.webwatchdog.io/2016/10/21/the-10-signs-youve-a-hacked-or-compromisedwebsite/
Wang, Y., Cai, W.D., Wei, P.C.: Deep Learning Approach for Detecting Malicious JavaScript Code. Security and Communication Networks, Wiley Online Library, 9(11), 1520-1534. DOI: 10.1002/sec.1441 (2016).
Sahoo, D., Liu, C., Hoi, S.C.H.: Malicious URL Detection using Machine Learning: A Survey. arXiv:1701.07179v2 [cs.LG] (2017).