Conference Paper

Cloak and dagger: Dynamics of web search cloaking


Abstract

Cloaking is a common 'bait-and-switch' technique used to hide the true nature of a Web site by delivering blatantly different semantic content to different user segments. It is often used in search engine optimization (SEO) to obtain user traffic illegitimately for scams. In this paper, we measure and characterize the prevalence of cloaking on different search engines, how this behavior changes for targeted versus untargeted advertising and ultimately the response to site cloaking by search engine providers. Using a custom crawler, called Dagger, we track both popular search terms (e.g., as identified by Google, Alexa and Twitter) and targeted keywords (focused on pharmaceutical products) for over five months, identifying when distinct results were provided to crawlers and browsers. We further track the lifetime of cloaked search results as well as the sites they point to, demonstrating that cloakers can expect to maintain their pages in search results for several days on popular search engines and maintain the pages themselves for longer still.
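
The measurement approach above (comparing what a search crawler is served with what a regular browser is served) can be illustrated with a minimal sketch. This is not the Dagger crawler itself; the URL, User-Agent strings, and similarity threshold below are illustrative assumptions.

```python
# Minimal sketch (not the authors' Dagger crawler): fetch one URL twice, once
# with a browser User-Agent and once with a search-crawler User-Agent, and flag
# large content divergence as potential cloaking. URL, UA strings, and the
# threshold are illustrative assumptions.
import re
import requests

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def tokens(html: str) -> set:
    """Strip tags and return the set of lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fetch(url: str, user_agent: str) -> str:
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return resp.text


def looks_cloaked(url: str, threshold: float = 0.4) -> bool:
    """Low Jaccard similarity between the two views suggests cloaking."""
    browser_view = tokens(fetch(url, BROWSER_UA))
    crawler_view = tokens(fetch(url, CRAWLER_UA))
    union = browser_view | crawler_view
    if not union:
        return False
    return len(browser_view & crawler_view) / len(union) < threshold


if __name__ == "__main__":
    print(looks_cloaked("https://example.com/"))  # hypothetical test URL
```

A fuller deployment, as described in the abstract, would also vary the Referer header and repeat the measurement over months to track the lifetime of cloaked search results and pages.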

... Web page comparison has been intensively performed in the detection of phishing web pages [20,27,31,32,35,36,39,42,48,51,52] and web cloaking [28,46,49]. Page comparison using the HTML document tree [28,35,36,42,46,49,51] or the entire web page's look and feel [20,27,28,51] is ineffective for detecting proactive web defacements [24]. For example, the comparison based on the document tree could fail to recognize a page proactively defaced in mask mode, where only one or two deliberately styled elements are added to the HTML document, as illustrated in Section 2.3. ...
... Web page similarity has been intensively used in the detection of phishing web pages [20,27,31,32,35,36,39,42,48,51,52], which have similar or the same look and feel as legitimate pages, and web cloaking [28,46,49], a common technique to deliver inconsistent content to different user segments and hide the true nature of a website. Web page similarity can be computed based on the HTML document tree [28,35,36,42,46,49,51] or the entire web page's look and feel [20,27,28,51]. ...
Conference Paper
Web defacement is one of the major promotional channels for online underground economies. It regularly compromises benign websites and injects fraudulent content to promote illicit goods and services. It inflicts significant harm to websites’ reputations and revenues and may lead to legal ramifications. In this paper, we uncover proactive web defacements, where the involved web pages (i.e., landing pages) proactively deface themselves within browsers using JavaScript (i.e., control scripts). Proactive web defacements have not yet received attention from research communities, anti-hacking organizations, or law-enforcement officials. To detect proactive web defacements, we designed a practical tool, PACTOR. It runs in the browser and intercepts JavaScript API calls that manipulate web page content. It takes snapshots of the rendered HTML source code immediately before and after the intercepted API calls and detects proactive web defacements by visually comparing every two consecutive snapshots. Our two-month empirical study, using PACTOR, on 2,454 incidents of proactive web defacements shows that they can evade existing URL safety-checking tools and effectively promote the ranking of their landing pages using legitimate content/keywords. We also investigated the vendor network of proactive web defacements and reported all the involved domains to law-enforcement officials and URL-safety checking tools.
... Cloaking techniques can be categorized into two groups: server-side and client-side (Table I shows examples of each type). Server-side cloaking techniques identify users via information in HTTP requests [59]. Client-side cloaking is implemented through code that runs in the visitor's browser (JavaScript) to apply filters using attributes such as cookies or mouse movement. ...
... Existing anti-cloaking methodologies focus on bypassing server-side cloaking by comparing the visual and textual features of different versions of a crawled website retrieved by sending multiple web requests with different configurations (e.g., user agents or IP addresses) [25,30,59]. Client-side cloaking techniques, however, are still poorly understood due to challenges in automatically analyzing JavaScript code and understanding its semantics. ...
... In our dataset, each website generated 46.3 screenshots on average. CrawlPhish compares the screenshots of each execution path within one website against the original screenshot to detect if cloaking exists, because the presence of cloaking will result in significant visual layout changes [59]. The code structure features include web API calls, web event handlers, and ASTs, which can characterize different types of cloaking techniques and reveal how the cloaking techniques are implemented. ...
Article
Full-text available
Phishing websites with advanced evasion techniques are a critical threat to Internet users because they delay detection by current antiphishing systems. We present CrawlPhish, a framework for automatically detecting and categorizing the client-side (e.g., JavaScript) evasion used by phishing websites.
... Cloaking, also known as 'bait and switch', is a common technique used to hide the true nature of a Web site by delivering different semantic content to selected user groups [41,42]. Wang and Savage [41] presented four cloaking types: Repeat cloaking (delivering different web content based on the visitor's number of prior visits), User-agent cloaking (delivering specific web content based on the visitor's User-Agent string), Redirection cloaking (redirecting users to another website by using JavaScript), and IP cloaking (delivering specific web content based on the visitor's IP address). ...
... Other approaches mainly target compromised webservers and identify clusters of URLs with trending keywords that are irrelevant to the other content hosted on page [45]. Wang et al. [41] identify cloaking in near real-time by examining the dynamics of cloaking over time. Invernizzi et al. [42] develop an anti-cloaking system that detects split-view content returned to two or more distinct browsing profiles by building a classifier that detects deviations in the content. ...
Preprint
Full-text available
Online social networks (OSNs) are ubiquitous, attracting millions of users all over the world. Being popular communication media, OSNs are exploited in a variety of cyber attacks. In this article, we discuss the Chameleon attack technique, a new type of OSN-based trickery where malicious posts and profiles change the way they are displayed to OSN users to conceal themselves before the attack or avoid detection. Using this technique, adversaries can, for example, avoid censorship by concealing true content when it is about to be inspected; acquire social capital to promote new content while piggybacking on a trending one; or cause embarrassment and serious reputation damage by tricking a victim into liking, retweeting, or commenting on a message that they would not normally endorse, with no indication of the trickery within the OSN. An experiment performed with closed Facebook groups of sports fans shows that (1) Chameleon pages can pass the moderation filters by changing the way their posts are displayed and (2) moderators do not distinguish between regular and Chameleon pages. We list the OSN weaknesses that facilitate the Chameleon attack and propose a set of mitigation guidelines.
... Various approaches have been proposed to detect cloaking sites and measure their prevalence. Previous detection studies [26,24,7,14,8,23,17] focused mainly on heuristics, such as search query monetizability and website features such as page size and HTML tags, to improve detection accuracy. However, these approaches have three drawbacks. ...
... Although many systems [17,26,24,7,14,8,23] have been proposed to detect cloaking, they share three drawbacks. First, they cannot detect IP cloaking. ...
... Third, these systems detect cloaking on the server side and cannot protect individual user visits in real time. Wang et al. [23] measured the search engine response time to cloaking and the cloaking duration of websites. The search engine response time is defined as the time from when cloaked pages appear in the top 100 results for a specific keyword until they no longer do. ...
Article
Cloaking has long been exploited by spammers for the purpose of increasing the exposure of their websites. In other words, cloaking has long served as a major malicious technique in search engine optimization (SEO). Cloaking hides the true nature of a website by delivering blatantly different content to users versus web crawlers. Recently, we have also witnessed a rising trend of employing cloaking in search engine marketing (SEM). However, detecting cloaking is challenging. Existing approaches cannot detect IP cloaking and are not suitable for detecting cloaking in SEM because their search-and-visit method leads to click fraud. In addition, they focus on detecting and measuring cloaking on the server side, but the results are not visible to users to help them avoid fraud. Our work focuses on mitigating IP cloaking and SEM cloaking, and on providing client-based real-time cloaking detection services. To achieve these goals, we first propose the Simhash-based Website Model (SWM), a condensed representation of websites that can model natural page dynamics. Based on SWM, we design and implement Cloaker Catcher, an accurate, efficient and privacy-preserving system that consists of a server that crawls websites visited by users on demand and a client-side extension that fetches spider views of websites from the server and compares them with user views to detect cloaking. Since Cloaker Catcher checks on the client side for each real user, IP cloaking can be detected whenever it occurs and click fraud in SEM can also be prevented. Using our system, we conducted the first analysis of SEM cloaking and found that the main purpose of SEM cloakers is to provide illicit services.
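
The simhash-based page comparison described in this abstract can be sketched roughly as follows. This is a generic 64-bit simhash over word tokens, not the authors' SWM implementation; the token weighting and Hamming-distance threshold are illustrative assumptions.

```python
# Minimal 64-bit simhash sketch to illustrate the idea behind SWM-style page
# comparison; it is not the Cloaker Catcher implementation. Token weighting and
# the Hamming-distance threshold are illustrative assumptions.
import hashlib
import re


def simhash64(text: str) -> int:
    """Compute a 64-bit simhash over word tokens."""
    weights = [0] * 64
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def same_page(spider_view: str, user_view: str, max_distance: int = 3) -> bool:
    """Small Hamming distance -> pages differ only by natural dynamics."""
    return hamming(simhash64(spider_view), simhash64(user_view)) <= max_distance


# Example: identical text gives distance 0; a completely different "user view"
# typically lands far above the threshold.
print(same_page("cheap flights and hotel deals", "cheap flights and hotel deals"))
print(same_page("cheap flights and hotel deals", "buy pills online no prescription"))
```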
... Such blocklists are populated with the help of web security crawlers that regularly scout web-pages to evaluate them. However, in order to evade these crawlers, miscreants employ many cloaking techniques [23,38,39,49,52]. ...
... Note that PhishPrint can also be easily re-deployed to evaluate crawlers against many of these cloaking attacks in [32,52] (except timing attacks) in the future. Many research works have focused on studying in-the-wild cloaking and evasive techniques [23,38,39,42,49,52] which was not our focus. ...
Conference Paper
Full-text available
Security companies often use web crawlers to detect phishing and other social engineering attack websites. We built a novel, scalable, low-cost framework named PhishPrint to enable the evaluation of such web security crawlers against multiple cloaking attacks. PhishPrint is unique in that it completely avoids the use of any simulated phishing sites and blocklisting measurements. Instead, it uses web pages with benign content to profile security crawlers. We used PhishPrint to evaluate 23 security crawlers including highly ubiquitous services such as Google Safe Browsing and Microsoft Outlook e-mail scanners. Our 70-day evaluation found several previously unknown cloaking weaknesses across the crawler ecosystem. In particular, we show that all the crawlers' browsers are either not supporting advanced fingerprinting related web APIs (such as Canvas API) or are severely lacking in fingerprint diversity thus exposing them to new fingerprinting-based cloaking attacks. We confirmed the practical impact of our findings by deploying 20 evasive phishing web pages that exploit the found weaknesses. 18 of the pages managed to survive indefinitely despite aggressive self-reporting of the pages to all crawlers. We confirmed the specificity of these attack vectors with 1150 volunteers as well as 467K web users. We also proposed countermeasures that all crawlers should take up in terms of both their crawling and reporting infrastructure. We have relayed the found weaknesses to all entities through an elaborate vulnerability disclosure process that resulted in some remedial actions as well as multiple vulnerability rewards.
... New spam tactics emerge from time to time, and spammers use different tricks for different types of pages. These tricks vary from violating recommended practices (such as keyword stuffing [2], cloaking and redirection [3], etc.) to violating laws (such as compromising web sites to poison search results [4], [5], etc.). After a heuristic for web spam detection is developed, the bubble of Web visibility tends to resurface somewhere else. ...
... There are many surveys on web spam and detection methods [1], [6]. Many modern spamming techniques are analyzed by the authors of [1], [2], [3], [4], [5], [6]. To our knowledge, there is no scholarly paper that covers the actual implementation of spam detection algorithms by today's search engines such as Google. ...
Article
The growing importance of search engines for companies' commercial performance has led to the development of techniques for optimizing the ranking of websites (SEO). Alongside the more ethical techniques encouraged by search engines, so-called "Black Hat SEO" (BH) techniques have developed. Regularly subjected to countermeasures from search engines such as Google, these techniques have seen a revival with the development of online task-automation tools and generative AIs such as ChatGPT. In this exploratory research, after presenting the classic techniques of abusive ranking (spamindexing), we analyze the techniques promoted by a network of managers of websites and social media accounts. These techniques are based on transforming content collected massively online using no-code tools or programming. They raise questions regarding intellectual property, but also regarding the means available to search engines to preserve the quality of their index in the face of the proliferation of massively and automatically generated content, without hampering the development of legitimate practices.
... In the context of illicit promotion on information retrieval (IR) systems (e.g., web search engine, Wiki search), cybercriminals have two main goals [62], [65], [66], [85], [89]: (1) to advertise illegal businesses on compromised pages, and (2) to enhance the ranking of the compromised pages in search results to increase their visibility. ...
... Another blackhat SEO technique is link farm spam [89], where links directing users to illicit businesses are built on compromised websites. Cybercriminals also use cloaking [85], where malicious pages are cloaked by benign pages with popular search keywords, to increase their rankings under such keywords. It is worth noting that adversarial ranking attacks have been proposed to attack deep learning-based ranking models, with the aim of enhancing the ranking of targeted articles. ...
Preprint
As a prominent instance of vandalism edits, Wiki search poisoning for illicit promotion is a cybercrime in which the adversary aims at editing Wiki articles to promote illicit businesses through Wiki search results of relevant queries. In this paper, we report a study that, for the first time, shows that such stealthy blackhat SEO on Wiki can be automated. Our technique, called MAWSEO, employs adversarial revisions to achieve real-world cybercriminal objectives, including rank boosting, vandalism detection evasion, topic relevancy, semantic consistency, user awareness (but not alarming) of promotional content, etc. Our evaluation and user study demonstrate that MAWSEO is able to effectively and efficiently generate adversarial vandalism edits, which can bypass state-of-the-art built-in Wiki vandalism detectors, and also get promotional content through to Wiki users without triggering their alarms. In addition, we investigated potential defenses, including coherence-based detection and adversarial training of vandalism detection, against our attack in the Wiki ecosystem.
... Unfortunately, the popularity of social platforms has attracted the attention of scammers and other malicious users, who use social platforms to distribute malicious links, exposing users to a plethora of security risks ranging from online scams and spam to more concerning risks such as the exploitation of 0-day vulnerabilities in mobile devices (see, e.g., [3]). The security risks of visiting malicious web pages have been at the center of the past decades of research activity, focusing on, for example, detection techniques [13], [20], evaluation of defenses (e.g., [22], [30], [21]), studying attacker behavior (e.g., [16], [5]), and detection of evasion techniques (e.g., [39], [18]). Only recently has attention shifted to studying the extent to which these attacks have entered and adapted to social platforms. ...
... However, cloaking attacks can be detected, and over the past years the research community has proposed several ideas. For example, Wang et al. [39] show four techniques to detect user-agent and IP cloaking put in place by web sites to deceive search engine crawlers. Similarly, Invernizzi et al. [18] used ready-to-use cloaking programs retrieved from the underground market to create a classifier for detection. ...
... In particular, we first compare two pages crawled with a common browser user-agent string and a spider user-agent string, determining whether cloaking is performed, a technique widely used for blackhat SEO. Then we follow the method proposed by Wang et al. [17] to find cloaking pages: if there is no similarity in visual appearance or page structure between the two pages, the domain is labeled as cloaking. Next, we go through the page content and check whether it is used to promote illegal businesses such as porn, gambling or fake shops. ...
... As such, we use the new gTLD list published by ICANN [23] to filter the e2LDs in DS All. It turns out that a prominent share of e2LDs (17,716, 27.63% of 64,124) are under new gTLDs, which aligns with the findings of previous work. We think the major reason is that most new gTLDs are cheap and poorly maintained. ...
Conference Paper
Full-text available
In this paper, we present a large-scale analysis of an emerging new type of domain-name fraud, which we call levelsquatting. Unlike existing frauds that impersonate well-known brand names (like google.com) by using similar second-level domain names, adversaries here embed the brand name in the subdomain section, deceiving users, especially mobile users, who do not pay attention to the entire domain name. First, we develop a detection system, LDS, based on passive DNS data and web-page content. Using LDS, we successfully detect 817,681 levelsquatting domains. Second, we perform a detailed characterization of levelsquatting scams. Existing blacklists are less effective against levelsquatting domains, with only around 4% of domains reported by VirusTotal and PhishTank respectively. In particular, we find that a number of levelsquatting domains impersonate well-known search engines. So far, the Baidu security team has acknowledged our findings and removed these domains from its search results. Finally, we analyze how levelsquatting domain names are displayed in different browsers. We find 2 mobile browsers (Firefox and UC) and 1 desktop browser (Internet Explorer) that can confuse users when showing levelsquatting domain names in the address bar. In summary, our study sheds light on the emerging levelsquatting fraud, and we believe new approaches are needed to mitigate this type of fraud.
... • Referer: Referer is an HTTP request header that identifies the location a client followed to reach a resource on the website. Referer settings can significantly influence the number of attacks a user is subjected to, as shown in studies on web spam and cross-site scripting attacks [23][24][25][26][27][28]. The Google search result referer was selected based on our experiment on a large number of malicious websites, which yielded a significantly higher number of attacks compared to a null-value referer. ...
... If the referer setting of IP1's retrieval phase 1 were changed to blank/null and that of retrieval phase 2 to Google's referer setting, it would result in significantly more detected attacks in phase 2. Google's referer has been confirmed by studies to mimic a regular user more accurately and to result in a higher number of attacks [23][24][25][26]. ...
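
The Referer-dependent behavior discussed in these excerpts can be probed with a simple two-request comparison. The sketch below is an assumption-laden toy (hypothetical URL, query, and similarity threshold), not the cited experimental setup.

```python
# Assumption-laden sketch of a referer-cloaking probe: the same URL is fetched
# once with no Referer and once with a Google search-result Referer, and the two
# responses are compared. This is not the cited study's setup.
import difflib
import requests

BASE_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
GOOGLE_REFERER = "https://www.google.com/search?q=discount+pharmacy"  # hypothetical query


def fetch_with_referer(url: str, referer: str = "") -> str:
    headers = dict(BASE_HEADERS)
    if referer:
        headers["Referer"] = referer
    return requests.get(url, headers=headers, timeout=10).text


def referer_sensitive(url: str, threshold: float = 0.9) -> bool:
    """True if the page served with a Google Referer differs markedly from the
    page served with no Referer at all."""
    plain = fetch_with_referer(url)
    via_google = fetch_with_referer(url, GOOGLE_REFERER)
    similarity = difflib.SequenceMatcher(None, plain, via_google).ratio()
    return similarity < threshold


if __name__ == "__main__":
    print(referer_sensitive("https://example.com/"))  # hypothetical test URL
```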
Article
Full-text available
IP tracking and cloaking are practices for identifying users which are used legitimately by websites to provide services and content tailored to particular users. However, it is believed that these practices are also used by malicious websites to avoid detection by anti-virus companies crawling the web to find malware. In addition, malicious websites are also believed to use IP tracking in order to deliver targeted malware based upon a history of previous visits by users. In this paper, we empirically investigate these beliefs and collect a large data-set of suspicious URLs in order to identify at what level IP tracking takes place, that is, at the level of an individual address or at the level of the user's network provider or organization (network tracking). We perform our experiments using a HAZard and OPerability (HAZOP) study to control the effects of a large number of other attributes which may affect the result of the analysis. Our results illustrate that IP tracking is used in a small subset of domains within our data-set, while no strong indication of network tracking was observed.
... Attackers leverage browser fingerprinting to target clients by redirecting an exploitable user to a malicious URL on the basis of the client's fingerprint. This technique, called "cloaking," is also abused for circumventing the detection of CSIRTs/security vendors by redirecting them to a benign URL [11]. ...
... Wang et al. [11] examined the dynamics of cloaking and uncovered the lifetime of cloaked websites using a system designed to crawl search results three times with different user-agents and referers. They measured and characterized the prevalence of cloaking on different search engines and search terms in addition to user-agent cloaking and referer cloaking. ...
Article
An incident response organization such as a CSIRT contributes to preventing the spread of malware infection by analyzing compromised websites and sending abuse reports with detected URLs to webmasters. However, these abuse reports with only URLs are not sufficient to clean up the websites. In addition, it is difficult to analyze malicious websites across different client environments because these websites change behavior depending on a client environment. To expedite compromised website clean-up, it is important to provide fine-grained information such as malicious URL relations, the precise position of compromised web content, and the target range of client environments. In this paper, we propose a new method of constructing a redirection graph with context, such as which web content redirects to malicious websites. The proposed method analyzes a website in a multi-client environment to identify which client environment is exposed to threats. We evaluated our system using crawling datasets of approximately 2,000 compromised websites. The result shows that our system successfully identified malicious URL relations and compromised web content, and the number of URLs and the amount of web content to be analyzed were sufficient for incident responders by 15.0% and 0.8%, respectively. Furthermore, it can also identify the target range of client environments in 30.4% of websites and a vulnerability that has been used in malicious websites by leveraging target information. This fine-grained analysis by our system would contribute to improving the daily work of incident responders.
... Beyond a good choice of keywords, an SEO strategy has to take other aspects into account, namely good placement of the website. That is, the use of links to sites where, for example, the following is observed should be avoided [8]: ...
Article
Full-text available
Abstract: Communication strategies and the creation of digital content give visibility to an artistic, social and/or political intervention. The intervention that gave rise to the project Merak(i): en franchissant la Ligne Verte was created by a Cypriot citizen who intended to create a space for dialogue between Cypriots themselves and the international community about the conflict experienced in Cyprus to this day. The project was an original initiative in the application of digital communication strategies to social awareness-raising through art. The methodology used was a case study drawing on document analysis, interviews, direct observation and participant observation. The work presents the development of a digital marketing strategy operationalized through the creation of a website that serves as a platform to present and provide information about the artistic intervention, together with the development of an SEO and social media strategy. The results showed that the website and social networks enabled effective communication during the period in which the project ran. With this project it was possible to spread the intended message, gain the public's engagement and interest, and establish contact with people interested in supporting the artistic intervention in the future.
... Client-side protections also help to overcome split view or time-of-check versus time-of-use attacks. For example, malicious content on the web is often cloaked, appearing benign to security crawlers, but delivering scams and malware when rendered on a user's device [78]. ...
Preprint
Full-text available
With the accelerated adoption of end-to-end encryption, there is an opportunity to re-architect security and anti-abuse primitives in a manner that preserves new privacy expectations. In this paper, we consider two novel protocols for on-device blocklisting that allow a client to determine whether an object (e.g., URL, document, image, etc.) is harmful based on threat information possessed by a so-called remote enforcer in a way that is both privacy-preserving and trustworthy. Our protocols leverage a unique combination of private set intersection to promote privacy, cryptographic hashes to ensure resilience to false positives, cryptographic signatures to improve transparency, and Merkle inclusion proofs to ensure consistency and auditability. We benchmark our protocols -- one that is time-efficient, and the other space-efficient -- to demonstrate their practical use for applications such as email, messaging, storage, and other applications. We also highlight remaining challenges, such as privacy and censorship tensions that exist with logging or reporting. We consider our work to be a critical first step towards enabling complex, multi-stakeholder discussions on how best to provide on-device protections.
... In terms of evasions, it is well known that attackers engage in crawler evasion to increase the lifetime of their malicious pages and domains [40], [80], [83], [84]. While CryptoScamTracker does take a number of steps to avoid being evaded (e.g. using multiple user-agent headers and a combination of both a real browser as well as an HTTP-requests library) there are additional evasion techniques (e.g. ...
... Manipulation of search results for erectile dysfunction medications was published nearly a decade ago by Leontiadis et al [15,16] and Wang et al [17]. Sildenafil was the first commercially available phosphodiesterase type 5 (PDE5) inhibitor available since 1998, followed by vardenafil, tadalafil, and avanafil [18]. ...
Article
Full-text available
Background: Illegal online pharmacies function as affiliate networks, in which search engine results pages (SERPs) are poisoned by several links redirecting site visitors to unlicensed drug distribution pages upon clicking on the link of a legitimate, yet irrelevant domain. This unfair online marketing practice is commonly referred to as search redirection attack, a most frequently used technique in the online illegal pharmaceutical marketplace. Objective: This study is meant to describe the mechanism of search redirection attacks in Google search results in relation to erectile dysfunction medications in European countries and also to determine the local and global scales of this problem. Methods: The search engine query results regarding 4 erectile dysfunction medications were documented using Google. The search expressions were "active ingredient" and "buy" in the language of 12 European countries, including Hungary. The final destination website legitimacy was checked at LegitScript, and the estimated number of monthly unique visitors was obtained from SEMrush traffic analytics. Compromised links leading to international illegal medicinal product vendors via redirection were analyzed using Gephi graph visualization software. Results: Compromised links redirecting to active online pharmacies were present in search query results of all evaluated countries. The prevalence was highest in Spain (62/160, 38.8%), Hungary (52/160, 32.5%), Italy (46/160, 28.8%), and France (37/160, 23.1%), whereas the lowest was in Finland (12/160, 7.5%), Croatia (10/160, 6.3%), and Bulgaria (2/160, 1.3%), as per data recorded in November 2020. A decrease in the number of compromised sites linking visitors to illegitimate medicine sellers was observed in the Hungarian data set between 2019 and 2021, from 41% (33/80) to 5% (4/80), respectively. Out of 1920 search results in the international sample, 380 (19.79%) search query results were compromised, with the majority (n=342, 90%) of links redirecting individuals to 73 international illegal medicinal product vendors. Most of these illegal online pharmacies (41/73, 56%) received only 1 or 2 compromised links, whereas the top 3 domains with the highest in-degree link value received more than one-third of all incoming links. Traffic analysis of 35 pharmacy specific domains, accessible via compromised links in search engine queries, showed a total of 473,118 unique visitors in November 2020. Conclusions: Although the number of compromised links in SERPs has shown a decreasing tendency in Hungary, an analysis of the European search query data set points to the global significance of search engine poisoning. Our research illustrates that search engine poisoning is a constant threat, as illegitimate affiliate networks continue to flourish while uncoordinated interventions by authorities and individual stakeholders remain insufficient. Ultimately, without a dedicated and comprehensive effort on the part of search engine providers for effectively monitoring and moderating SERPs, they may never be entirely free of compromised links leading to illegal online pharmacy networks.
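
The link analysis summarized above (ranking destination pharmacy domains by the number of compromised links pointing at them, i.e., their in-degree) can be reproduced in miniature with a directed graph. The sketch below uses networkx and fabricated placeholder domain names purely for illustration.

```python
# Toy reconstruction of the kind of in-degree analysis described above: edges
# point from compromised search results to the illegal pharmacy domains they
# redirect to. Domain names and edges here are fabricated placeholders.
import networkx as nx

redirections = [
    ("compromised-blog.example", "pharma-one.example"),
    ("old-forum.example", "pharma-one.example"),
    ("city-news.example", "pharma-one.example"),
    ("hobby-site.example", "pharma-two.example"),
]

graph = nx.DiGraph()
graph.add_edges_from(redirections)

# Rank destination domains by how many compromised pages funnel traffic to them.
destinations = {dst for _, dst in redirections}
by_in_degree = sorted(((graph.in_degree(d), d) for d in destinations), reverse=True)
for in_degree, domain in by_in_degree:
    print(f"{domain}: {in_degree} incoming compromised links")
```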
... For client-side cloaking techniques, Zhang et al. [52] proposed CrawlPhish to force-execute JavaScript snippets in the HTML response to reveal malicious content. As for server-side cloaking in phishing, previous work [14,23,42] categorizes types of server-side cloaking through analysis of compromised phishing kits. ...
... Crawlers and other bots make up approx. 37% of traffic on the Web [29] and it has been shown that this significantly affects crawling studies [62,30,41]. Consequently, some service providers define behavior guidelines to limit crawling traffic, or try to detect and block them altogether [33]. ...
Conference Paper
Full-text available
Web measurement studies can shed light on not yet fully understood phenomena and thus are essential for analyzing how the modern Web works. This often requires building new and adjusting existing crawling setups, which has led to a wide variety of analysis tools for different (but related) aspects. If these efforts are not sufficiently documented, the reproducibility and replicability of the measurements may suffer—two properties that are crucial to sustainable research. In this paper, we survey 117 recent research papers to derive best practices for Web-based measurement studies and specify criteria that need to be met in practice. When applying these criteria to the surveyed papers, we find that the experimental setup and other aspects essential to reproducing and replicating results are often missing. We underline the criticality of this finding by performing a large-scale Web measurement study on 4.5 million pages with 24 different measurement setups to demonstrate the influence of the individual criteria. Our experiments show that slight differences in the experimental setup directly affect the overall results and must be documented accurately and carefully.
... However, this technique is now being abused by adversaries to circumvent static web security checks [18]. Compared with directly delivering malicious content to any visitor, this method ensures the targeted delivery of malicious content by conducting multi-layer verification while visitors are being redirected [13,25]. For instance, the adversary checks whether the current visitor is a static crawler by inspecting the 'User-Agent' field of the HTTP request header. ...
Chapter
Full-text available
An increasing number of adversaries tend to cover up their malicious sites by leveraging elaborate redirection chains. Prior works mostly focused on the specific attacks that users suffered, and seldom considered how users were exposed to such attacks. In this paper, we conduct a comprehensive measurement study on malicious redirections that leverage squatting domain names as the start point. To this end, we collected 101,186 resolved squatting domain names that targeted 2,302 top brands from ISP-level DNS traffic. After dynamically crawling these squatting domain names, we pioneered the application of performance logs to mine the redirection chains they involved. Afterward, we analyzed the nodes that acted as intermediaries in malicious redirections and found that adversaries preferred to conduct URL redirection via imported JavaScript code and iframes. Our further investigation indicates that such intermediaries exhibit clear aggregation, both in their domain names and in the Internet infrastructure supporting them.
... Regarding threat context, as evidenced by Figure 7 below, participants tended to reach later funnel stages, including intention to transact, more often for URLs presented at the top and bottom of the search results. The latter finding is important since previous research has shown that users' click-through rates are influenced by search result placement [40], and scammers often engage in search engine optimization (SEO) to take advantage of this fact [41]. Interestingly, domain and threat type did not have a significant effect on funnel traversal behavior: users interacted with pharmacy and bank concocted and spoof sites in a similar manner. ...
... Researchers have repeatedly reported that large-scale Internet measurements, especially those that use automated crawlers, are prone to being blocked or served fake content by security solutions designed to block malicious bots and content scrapers [49,66]. In order to minimize this risk during our measurement, we used a real browser (i.e., Google Chrome) for most steps in our methodology. ...
Preprint
Full-text available
Web cache deception (WCD) is an attack proposed in 2017, where an attacker tricks a caching proxy into erroneously storing private information transmitted over the Internet and subsequently gains unauthorized access to that cached data. Due to the widespread use of web caches and, in particular, the use of massive networks of caching proxies deployed by content distribution network (CDN) providers as a critical component of the Internet, WCD puts a substantial population of Internet users at risk. We present the first large-scale study that quantifies the prevalence of WCD in 340 high-profile sites among the Alexa Top 5K. Our analysis reveals WCD vulnerabilities that leak private user data as well as secret authentication and authorization tokens that can be leveraged by an attacker to mount damaging web application attacks. Furthermore, we explore WCD in a scientific framework as an instance of the path confusion class of attacks, and demonstrate that variations on the path confusion technique used make it possible to exploit sites that are otherwise not impacted by the original attack. Our findings show that many popular sites remain vulnerable two years after the public disclosure of WCD. Our empirical experiments with popular CDN providers underline the fact that web caches are not plug & play technologies. In order to mitigate WCD, site operators must adopt a holistic view of their web infrastructure and carefully configure cache settings appropriate for their applications.
... In their study of CSRF attacks, Li et al. stated that by setting the Referrer Policy accordingly, one can prevent user agents (UAs) from suppressing the referrer header in HTTP requests that originate from HTTPS domains, i.e., prevent the UA from omitting this header by default [28]. The use of HTTP Referers as a cloaking technique is also analysed in [20,32,34]. ...
Chapter
Full-text available
Online collaboration services (OCS) are appealing since they provide ease of access to resources and the ability to collaborate on shared files. Documents on these services are frequently shared via secret links, which allows easy collaboration between different users. The security of this secret link approach relies on the fact that only those who know the location of the secret resource (i.e., its URL) can access it. In this paper, we show that the secret location of OCS files can be leaked by the improper handling of links embedded in these files. Specifically, if a user clicks on a link embedded into a file hosted on an OCS, the HTTP Referer contained in the resulting HTTP request might leak the secret URL. We present a study of 21 online collaboration services and show that seven of them are vulnerable to this kind of secret information disclosure caused by the improper handling of embedded links and HTTP Referers. We identify two root causes of these issues, both having to do with an incorrect application of the Referrer Policy, a countermeasure designed to restrict how HTTP Referers are shared with third parties. In the first case, six services leak their referrers because they do not implement a strict enough and up-to-date policy. In the second case, one service correctly implements an appropriate Referrer Policy, but some web browsers do not obey it, causing links clicked through them to leak their HTTP Referers. To fix this problem, we discuss how services can apply the Referrer Policy correctly to avoid these incidents, as well as other server and client side countermeasures.
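
The countermeasure this chapter centers on, a strict Referrer Policy, is typically applied on the server side. The sketch below shows one hedged way to do so with Flask (an assumed framework, not one prescribed by the chapter): every response carries a 'no-referrer' policy so that clicking an embedded link does not leak the secret document URL.

```python
# Minimal sketch of the server-side countermeasure discussed above: attach a
# strict Referrer-Policy header to every response so that browsers do not leak
# secret document URLs in the Referer header of outgoing requests.
# Flask is an assumed framework here, not one prescribed by the chapter.
from flask import Flask

app = Flask(__name__)


@app.route("/d/<secret_id>")
def shared_document(secret_id):
    # The secret URL itself acts as the capability; never let it leak via Referer.
    return f"Contents of document {secret_id}"


@app.after_request
def set_referrer_policy(response):
    # 'no-referrer' suppresses the Referer header for all links leaving the page.
    response.headers["Referrer-Policy"] = "no-referrer"
    return response


if __name__ == "__main__":
    app.run(port=8080)
```

Browsers that honor the policy will then omit the Referer header entirely, which addresses the first root cause described in the abstract; the second root cause (browsers that disobey the policy) still requires client-side handling.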
... Downstream processing receives only well-structured, canonicalized HTML documents. Finally, it is harder for the server to detect that it is being accessed by an automated process, which might cause it to send back different material than a human would receive [180]. ...
Thesis
Full-text available
Although Internet censorship is a well-studied topic, to date most published studies have focused on a single aspect of the phenomenon, using methods and sources specific to each researcher. Results are difficult to compare, and global, historical perspectives are rare. Because each group maintains their own software, erroneous methods may continue to be used long after the error has been discovered. Because censors continually update their equipment and blacklists, it may be impossible to reproduce historical results even with the same vantage points and testing software. Because “probe lists” of potentially censored material are labor-intensive to compile, requiring an understanding of the politics and culture of each country studied, researchers discover only the most obvious and long-lasting cases of censorship. In this dissertation I will show that it is possible to make progress toward addressing all of these problems at once. I will present a proof-of concept monitoring system designed to operate continuously, in as many different countries as possible, using the best known techniques for detection and analysis. I will also demonstrate improved techniques for verifying the geographic location of a monitoring vantage point; for distinguishing innocuous network problems from censorship and other malicious network interference; and for discovering new web pages that are closely related to known-censored pages. These techniques improve the accuracy of a continuous monitoring system and reduce the manual labor required to operate it. This research has, in addition, already led to new discoveries. For example, I have confirmed reports that a commonly-used heuristic is too sensitive and will mischaracterize a wide variety of unrelated problems as censorship. I have been able to identify a few cases of political censorship within a much longer list of cases of moralizing censorship. I have expanded small seed groups of politically sensitive documents into larger groups of documents to test for censorship. Finally, I can also detect other forms of network interference with a totalitarian motive, such as injection of surveillance scripts. In summary, this work demonstrates that mostly-automated measurements of Internet censorship on a worldwide scale are feasible, and that the elusive global and historical perspective is within reach.
... and does not redirect the other browsers, by comparing the User-Agent strings. Such a technique is also abused to circumvent detection by security researchers/vendors by redirecting them to a benign URL or responding with empty content, which is called "cloaking" [17]. ...
Article
Security researchers/vendors detect malicious websites based on several website features extracted by honeyclient analysis. However, web-based attacks continue to be more sophisticated along with the development of countermeasure techniques. Attackers detect the honeyclient and evade analysis using sophisticated JavaScript code. The evasive code indirectly identifies vulnerable clients by abusing the differences among JavaScript implementations. Attackers deliver malware only to targeted clients on the basis of the evasion results while avoiding honeyclient analysis. Therefore, we are faced with a problem in that honeyclients cannot analyze malicious websites. Nevertheless, we can observe the evasion nature, i.e., the results in accessing malicious websites by using targeted clients are different from those by using honeyclients. In this paper, we propose a method of extracting evasive code by leveraging the above differences to investigate current evasion techniques. Our method analyzes HTTP transactions of the same website obtained using two types of clients, a real browser as a targeted client and a browser emulator as a honeyclient. As a result of evaluating our method with 8,467 JavaScript samples executed in 20,272 malicious websites, we discovered previously unknown evasion techniques that abuse the differences among JavaScript implementations. These findings will contribute to improving the analysis capabilities of conventional honeyclients.
... Finally, attackers use 'cloaking' on the pages they advertise to evade detection by the search engine crawler. Cloaking has traditionally been used to poison search results [32][33][34]36], and attackers have developed many different kinds of cloaking over the years that fraudulent advertisers now also employ. In this paper, it is precisely these fraudulent advertisers that we characterize on the Bing ad network. ...
Conference Paper
Most search engines generate significant revenue through search advertising, wherein advertisements are served alongside traditional search results. These advertisements are attractive to advertisers because ads can be targeted and prominently presented to users at the exact moment that the user is searching for relevant topics. Deceptive advertising is harmful to all legitimate actors in the search ad ecosystem: Users are less likely to find what they are looking for and may lose trust in ads or the search engine, advertisers lose potential revenue and face unfair competition from advertisers who are not playing by the rules, and the search engine's ecosystem suffers when both users and advertisers are unhappy. This paper explores search advertiser fraud on Microsoft's Bing search engine platform. We characterize three areas: the scale of search advertiser fraud, the targeting and bidding behavior of fraudulent advertisers, and how fraudulent advertisers impact other advertisers in the ecosystem.
... Features regarding web content are deemed effective in detecting web spam [63], phishing sites [85], URL spam [77], and general malicious pages [19]. Malicious sites usually hide themselves behind web redirections, but their redirection pattern is different from legitimate cases, which can be leveraged to spot those malicious sites [49,74,84]. Invernizzi et al. [41] showed that a query result returned from search engines can be used to guide the process of finding malicious sites. ...
Conference Paper
Full-text available
Domain names have been exploited for illicit online activities for decades. In the past, miscreants mostly registered new domains for their attacks. However, the domains registered for malicious purposes can be deterred by existing reputation and blacklisting systems. In response to the arms race, miscreants have recently adopted a new strategy, called domain shadowing, to build their attack infrastructures. Specifically, instead of registering new domains, miscreants are beginning to compromise legitimate ones and spawn malicious subdomains under them. This has rendered almost all existing countermeasures ineffective and fragile because subdomains inherit the trust of their apex domains, and attackers can virtually spawn an infinite number of shadowed domains. In this paper, we conduct the first study to understand and detect this emerging threat. Bootstrapped with a set of manually confirmed shadowed domains, we identify a set of novel features that uniquely characterize domain shadowing by analyzing the deviation from their apex domains and the correlation among different apex domains. Building upon these features, we train a classifier and apply it to detect shadowed domains on the daily feeds of VirusTotal, a large open security scanning service. Our study highlights domain shadowing as an increasingly rampant threat. Moreover, while previously confirmed domain shadowing campaigns are exclusively involved in exploit kits, we reveal that they are also widely exploited for phishing attacks. Finally, we observe that instead of algorithmically generating subdomain names, several domain shadowing cases exploit the wildcard DNS records.
... In particular, crawling and sampling procedures induce biases (Achlioptas et al., 2009), and we have not attempted to correct for those. Another limitation of the crawling process is that cloaking (Wang et al., 2011) was not considered, and that might limit our visibility into the malware files and websites. When it comes to the definition of what constitutes maliciousness of files and websites, we faced a couple of additional trade-offs that should be pointed out. ...
Article
Knowledge about the graph structure of the Web is important for understanding this complex socio-technical system and for devising proper policies supporting its future development. Knowledge about the differences between clean and malicious parts of the Web is important for understanding potential threats to its users and for devising protection mechanisms. In this study, we apply data science methods to a large crawl of surface and deep Web pages with the aim of increasing such knowledge. To accomplish this, we answer the following questions. Which theoretical distributions explain important local characteristics and network properties of websites? How are these characteristics and properties different between clean and malicious (malware-affected) websites? What is the prediction power of local characteristics and network properties to classify malware websites? To the best of our knowledge, this is the first large-scale study describing the differences in global properties between malicious and clean parts of the Web. In other words, our work builds on and bridges the gap between Web science, which tackles large-scale graph representations, and Web cyber security, which is concerned with malicious activities on the Web. The results presented herein can also help antivirus vendors in devising approaches to improve their detection algorithms.
Article
This study investigates the emerging cybersecurity threat of search engine optimization (SEO) poisoning and its impact on small and medium-sized enterprises’ (SMEs) digital marketing efforts. Through a comprehensive analysis of SEO poisoning techniques employed by attackers, the study reveals the significant risks and consequences for SMEs, including reputational damage, financial losses, and disrupted operations. To address these threats, the study proposes tailored mitigation strategies aligned with the principles of the NIST Cybersecurity Framework while considering the resource constraints facing SMEs. The mitigation recommendations encompass technical measures such as website security audits and employee training alongside non-technical initiatives to foster a culture of cybersecurity awareness. Additionally, the study offers several discussions that elucidate the multifaceted challenges posed by SEO poisoning in the SME context from both internal and external perspectives. These contributions will empower SMEs and digital marketers to implement proactive safeguards against SEO poisoning risks and preserve their online presence. The study underscores the need for continued vigilance and adaptive security to combat the evolving tactics of cyber adversaries in the digital marketing domain.
Article
OAuth 2.0 is an important and well studied protocol. However, despite the presence of guidelines and best practices, current implementations are still vulnerable and error-prone. This research mainly focused on the Cross-Site Request Forgery (CSRF) attack. This attack is one of the most dangerous vulnerabilities in the OAuth protocol, and it is mitigated through the state parameter. However, despite the presence of this parameter in OAuth deployments, many websites are still vulnerable to the OAuth-CSRF (OCSRF) attack. We studied one of the most recurrent types of OCSRF attack through a range of novel attack strategies based on different possible implementation weaknesses and the state of the victim's browser at the time of the attack. In order to validate them, we designed a repeatable methodology and conducted a large-scale analysis on 395 high-ranked sites to assess the prevalence of OCSRF vulnerabilities. Our automated crawler discovered that about 36% of targeted sites are still vulnerable and detected about 20% more well-hidden vulnerable sites utilizing the novel attack strategies. Based on our experiment, there was a significant rise in the number of OCSRF protections compared to past scale analyses, and yet over 25% of sites are exploitable by at least one proposed attack strategy. Although a standard countermeasure exists to mitigate OCSRF, our study shows that lack of awareness of implementation mistakes is an important reason for the significant number of vulnerable sites.
Article
Full-text available
URL redirection has become an important tool for adversaries to cover up their malicious campaigns. In this paper, we conduct the first large-scale measurement study on how adversaries leverage URL redirection to circumvent security checks and distribute malicious content in practice. To this end, we design an iteratively running framework to mine the domains used for malicious redirections constantly. First, we use a bipartite graph-based method to dig out the domains potentially involved in malicious redirections from real-world DNS traffic. Then, we dynamically crawl these suspicious domains and recover the corresponding redirection chains from the crawler’s performance log. Based on the collected redirection chains, we analyze the working mechanism of various malicious redirections, involving the abused modes and methods, and highlight the pervasiveness of node sharing. Notably, we find a new redirection abuse, redirection fluxing, which is abused to enhance the concealment of malicious sites by introducing randomness into the redirection. Our case studies reveal the adversary’s preference for abusing JavaScript methods to conduct redirection, even by introducing time-delay and fabricating user clicks to simulate normal users.
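
As a much simpler counterpart to the performance-log mining described above, HTTP-level redirect chains can be recovered directly from an HTTP client, as the hedged sketch below shows. Note that this captures only 3xx redirects; the JavaScript- and iframe-based redirections the article analyzes require a real browser and its performance log.

```python
# Hedged sketch: recover the HTTP-level redirect chain for a URL. This only
# captures 3xx redirects; the JavaScript- and iframe-based redirections analyzed
# in the article require a real browser and its performance log.
import requests


def http_redirect_chain(url: str) -> list:
    """Return the sequence of URLs visited via HTTP 3xx redirects."""
    resp = requests.get(url, timeout=10, allow_redirects=True)
    chain = [r.url for r in resp.history]  # intermediate hops, in order
    chain.append(resp.url)                 # final landing URL
    return chain


if __name__ == "__main__":
    for hop in http_redirect_chain("http://example.com/"):  # hypothetical start URL
        print(hop)
```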
Chapter
OAuth 2.0 is a popular, industry-standard protocol. To date, different attack classes and relevant countermeasures have been proposed. However, despite the presence of guidelines and best practices, current implementations are still vulnerable and error-prone. In this research, we focus on OAuth Cross-Site Request Forgery (OCSRF) as an overlooked attack scenario. We studied one of the most recurrent types of OCSRF attack by proposing several novel attack strategies based on different states of the victim's browser. To validate them, we designed a repeatable methodology and conducted a large-scale analysis of 314 high-ranked sites to assess the prevalence of OCSRF vulnerabilities. Our automated crawler discovered that about 36% of the targeted sites are still vulnerable and detected about 20% more well-hidden vulnerable sites using the novel attack strategies. Although our experiment revealed a significant increase in OCSRF protections compared to past large-scale analyses, over one-fourth of the sites are still vulnerable to at least one proposed attack strategy.
Article
Full-text available
Phishing is a significant security concern for organizations, threatening employees and members of the public. Phishing threats against employees can lead to severe security incidents, whereas those against the public can undermine trust, satisfaction, and brand equity. At the root of the problem is the inability of Internet users to identify phishing attacks even when using anti-phishing tools. We propose the phishing funnel model (PFM), a framework for predicting user susceptibility to phishing websites. PFM incorporates user, threat, and tool-related factors to predict actions during four key stages of the phishing process: visit, browse, consider legitimate, and intention to transact. We evaluated the efficacy of PFM in a 12-month longitudinal field experiment in two organizations involving 1,278 employees and 49,373 phishing interactions. PFM significantly outperformed competing models in terms of its ability to predict user susceptibility to phishing attacks. A follow-up three-month field study revealed that employees using PFM were significantly less likely to interact with phishing threats relative to comparison models and baseline warnings. Results of a cost-benefit analysis suggest that interventions guided by PFM could reduce annual phishing-related costs by nearly $1,900 per employee relative to comparison prediction methods.
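The abstract does not describe PFM's internals, so the following is only a toy illustration of the general idea of stage-wise, conditional prediction over a funnel (visit, browse, consider legitimate, transact); the features, data, and logistic models are synthetic stand-ins, not the authors' model.

```python
# Toy sketch of chaining stage-wise predictions for a phishing "funnel"
# (visit -> browse -> consider legitimate -> transact). This is NOT the PFM
# model itself, whose internals the abstract does not describe; it only
# illustrates conditional, per-stage modelling. Features and data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

STAGES = ["visit", "browse", "consider_legitimate", "transact"]
rng = np.random.default_rng(0)

# Synthetic user/threat/tool features and per-stage outcomes (1 = user progressed).
X = rng.normal(size=(1000, 5))
y = {s: (rng.random(1000) < 0.5).astype(int) for s in STAGES}

models = {}
reached = np.ones(1000, dtype=bool)        # everyone is exposed to the lure initially
for stage in STAGES:
    # Fit each stage only on users who actually reached it (conditional model).
    models[stage] = LogisticRegression().fit(X[reached], y[stage][reached])
    reached = reached & (y[stage] == 1)

def susceptibility(x):
    """Estimated probability of progressing through every funnel stage."""
    p = 1.0
    for stage in STAGES:
        p *= models[stage].predict_proba(x.reshape(1, -1))[0, 1]
    return p

print(round(susceptibility(X[0]), 4))
```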
Article
Malicious websites often mimic top brands to host malware and launch social engineering attacks, e.g., to collect user credentials. Some such sites attempt to hide malicious content from search engine crawlers (e.g., Googlebot) while showing harmful content to users' browsers, a technique known as cloaking. Past studies uncovered various aspects of cloaking using selected categories of websites (e.g., sites mimicking specific types of malicious sites). We focus on understanding cloaking behaviors using a broader set of websites. As a way forward, we built a crawler to automatically browse and analyze content from 100,000 (mostly malicious) squatting domains, i.e., domains generated through typo-squatting and combo-squatting of 2,883 popular websites. We use a headless Chrome browser and a search-engine crawler with user-agent modifications to identify cloaking behaviors, a challenging task due to dynamic content served at random; e.g., consecutive requests can serve very different malicious or benign content. Most malicious sites (e.g., phishing and malware) go undetected by current blacklists; only a fraction of cloaked sites (127, 3.3%) are flagged as malicious by VirusTotal. In contrast, we identify 80% of cloaked sites as malicious via a semi-automated process that extends the content categorization functionality of Symantec's SiteReview tool. Even after 3 months of observation, nearly half (1,024, 45.4%) of the cloaked sites remained active, and only a few (31, 3%) of them are flagged by VirusTotal. This clearly indicates that existing blacklists are ineffective against cloaked malicious sites. Our techniques can serve as a starting point for more effective and scalable early detection of cloaked malicious sites.
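A minimal sketch of the user-agent-swap comparison described here: fetch the same URL as a browser and as a search-engine crawler and compare the visible terms. The user-agent strings and the similarity threshold are illustrative, and, as the abstract stresses, randomly served dynamic content means a real pipeline must repeat fetches per profile before concluding anything.

```python
# Minimal sketch of a user-agent-swap cloaking check: fetch the same URL as a
# browser and as a search-engine crawler and compare the visible text. The
# similarity threshold is arbitrary; production pipelines must also repeat
# fetches per profile to filter out dynamic (randomly served) content.
import re
import requests

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def word_set(html):
    text = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)  # drop script/style bodies
    text = re.sub(r"<[^>]+>", " ", text)                       # drop remaining tags
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def looks_cloaked(url, threshold=0.5):
    browser = requests.get(url, headers={"User-Agent": BROWSER_UA}, timeout=15).text
    crawler = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA}, timeout=15).text
    return jaccard(word_set(browser), word_set(crawler)) < threshold

if __name__ == "__main__":
    print(looks_cloaked("http://example.com/"))  # placeholder URL
```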
Thesis
This Master's thesis examines the validity of automated web measurements in research, data analysis, and security scans. Beyond the common use cases of search queries from Internet users and the advertising industry, web bots are used by researchers to scan millions of websites for evidence supporting their assumptions and hypotheses. While website owners have plenty of tools to detect and block web bots, scientists often rely on open-source databases or ranking platforms such as Alexa and can only access information from the client perspective. In addition, companies like Distil Networks offer services to protect their clients from web bots. Given that website owners use protection mechanisms against web bots, there is a high probability that large-scale website scans by researchers will lead to false results and wrong conclusions. In this study, an experimental approach was used to show how to identify web bot detection from the client perspective. It is shown that inconsistencies in HTTP headers are a key signal for detecting bot detection. To this end, a crawler was developed, and over 300,000 HTTP GET requests were executed and evaluated.
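A hedged sketch of the client-side probing idea: issue the same request under several client profiles and flag inconsistencies in status codes, final URLs, or response headers. The profiles and watched headers are assumptions for illustration, not the thesis's exact configuration.

```python
# Sketch of probing for web-bot detection from the client side: issue the same
# GET with different client profiles and flag inconsistencies in status codes
# and selected response headers. The profiles and header list are illustrative.
import requests

PROFILES = {
    "plain_requests": {},  # default python-requests User-Agent
    "browser_like": {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    "crawler_like": {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
}
WATCHED_HEADERS = ["Server", "Content-Type", "Cache-Control", "Set-Cookie"]

def probe(url):
    observations = {}
    for name, headers in PROFILES.items():
        r = requests.get(url, headers=headers, timeout=15, allow_redirects=True)
        observations[name] = {
            "status": r.status_code,
            "final_url": r.url,
            "headers": {h: r.headers.get(h) for h in WATCHED_HEADERS},
        }
    return observations

def inconsistent(observations):
    # Different status codes or final URLs across profiles suggest the site
    # discriminates between clients, i.e., some form of bot detection.
    statuses = {o["status"] for o in observations.values()}
    finals = {o["final_url"] for o in observations.values()}
    return len(statuses) > 1 or len(finals) > 1

if __name__ == "__main__":
    print(inconsistent(probe("http://example.com/")))  # placeholder URL
```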
Chapter
Web bots are used to automate client interactions with websites, which facilitates large-scale web measurements. However, websites may employ web bot detection. When they do, their response to a bot may differ from responses to regular browsers. The discrimination can result in deviating content, restriction of resources or even the exclusion of a bot from a website. This places strict restrictions upon studies: the more bot detection takes place, the more results must be manually verified to confirm the bot’s findings.
Chapter
The drastic increase in JavaScript exploitation attacks has led to strong interest in developing techniques to analyze malicious JavaScript. Existing analysis techniques fall into two general categories: static analysis and dynamic analysis. Static analysis tends to produce inaccurate results (both false positives and false negatives) and is vulnerable to a wide range of obfuscation techniques. Thus, dynamic analysis is constantly gaining popularity for exposing the typical features of malicious JavaScript. However, existing dynamic analysis techniques have limitations such as limited code coverage and incomplete environment setup, leaving a broad attack surface for evading detection. To overcome these limitations, we present the design and implementation of a novel JavaScript forced execution engine named JSForce, which drives an arbitrary JavaScript snippet to execute along different paths without any input or environment setup. We evaluate JSForce using 220,587 HTML and 23,509 PDF real-world samples. Experimental results show that by adopting our forced execution engine, the malicious JavaScript detection rate can be boosted by 206.29% under the same detection policy without any noticeable increase in false positives.
Conference Paper
There is a common approach to detecting drive-by downloads using a classifier based on the static and dynamic features of malicious websites collected using a honeyclient. However, attackers detect the honeyclient and evade analysis using sophisticated JavaScript code. The evasive code indirectly identifies clients by abusing the differences among JavaScript implementations. Attackers deliver malware only to targeted clients on the basis of the evasion results while avoiding honeyclient analysis. Therefore, we are faced with a problem in that honeyclients cannot extract features from malicious websites and the subsequent classifier does not work. Nevertheless, we can observe the evasion nature, i.e., the results in accessing malicious websites by using targeted clients are different from those by using honeyclients. In this paper, we propose a method of extracting evasive code by leveraging the above differences to investigate current evasion techniques and to use them for analyzing malicious websites. Our method analyzes HTTP transactions of the same website obtained using two types of clients, a real browser as a targeted client and a browser emulator as a honeyclient. As a result of evaluating our method with 8,467 JavaScript samples executed in 20,272 malicious websites, we discovered unknown evasion techniques that abuse the differences among JavaScript implementations. These findings will contribute to improving the analysis capabilities of conventional honeyclients.
Conference Paper
Full-text available
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of “roughly the same” and “roughly contained.” The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints
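A small worked example of these notions: resemblance as |S(A) ∩ S(B)| / |S(A) ∪ S(B)| and containment as |S(A) ∩ S(B)| / |S(A)| over w-shingles, together with the fixed-size-sample estimate of resemblance. Python's built-in hash stands in for Rabin fingerprints purely for illustration.

```python
# Worked sketch of resemblance and containment over w-shingles, plus the
# fixed-size-sample (bottom-k) estimate of resemblance. Python's built-in
# hash stands in for Rabin fingerprints here, purely for illustration.
def shingles(text, w=4):
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(max(len(tokens) - w + 1, 1))}

def resemblance(a, b):          # r(A,B) = |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b) if (a | b) else 1.0

def containment(a, b):          # c(A,B) = |A ∩ B| / |A|
    return len(a & b) / len(a) if a else 1.0

def sketch(s, k=100):
    # Keep the k smallest fingerprint values: a fixed-size sample per document.
    return set(sorted(hash(x) & 0xFFFFFFFF for x in s)[:k])

def estimated_resemblance(sa, sb, k=100):
    # Resemblance is estimated from the two fixed-size sketches alone.
    merged = set(sorted(sa | sb)[:k])
    return len(merged & sa & sb) / len(merged) if merged else 1.0

doc_a = "the quick brown fox jumps over the lazy dog " * 20
doc_b = "the quick brown fox leaps over the lazy dog " * 20
A, B = shingles(doc_a), shingles(doc_b)
print(round(resemblance(A, B), 3), round(containment(A, B), 3))
print(round(estimated_resemblance(sketch(A), sketch(B)), 3))
```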
Article
Full-text available
We perform an in-depth study of SEO attacks that spread malware by poisoning search results for popular queries. Such attacks, although recent, appear to be both widespread and effective. They compromise legitimate Web sites and generate a large number of fake pages targeting trendy keywords. We first dissect one example attack that affects over 5,000 Web domains and attracts over 81,000 user visits. Further, we develop deSEO, a system that automatically detects these attacks. Using large datasets with hundreds of billions of URLs, deSEO successfully identifies multiple malicious SEO campaigns. In particular, applying the URL signatures derived from deSEO, we find 36% of sampled searches to Google and Bing contain at least one malicious link in the top results at the time of our experiment.
Article
Full-text available
We present a study of Fake Anti-Virus attacks on the web. Fake AV software masquerades as a legitimate security product with the goal of deceiving victims into paying registration fees to seemingly remove malware from their computers. Our analysis of 240 million web pages collected by Google's malware detection infrastructure over a 13-month period discovered over 11,000 domains involved in Fake AV distribution. We show that the Fake AV threat is rising in prevalence, both absolutely and relative to other forms of web-based malware. Fake AV currently accounts for 15% of all malware we detect on the web. Our investigation reveals several characteristics that distinguish Fake AVs from other forms of web-based malware and shows how these characteristics have changed over time. For instance, Fake AV attacks occur frequently via web sites likely to reach more users, including spam web sites and online ads. These attacks account for 60% of the malware discovered on domains that include trending keywords. As of this writing, Fake AV is responsible for 50% of all malware delivered via ads, which represents a five-fold increase from just a year ago.
Conference Paper
Full-text available
Rogue antivirus software has recently received extensive attention, justified by the diffusion and efficacy of its propagation. We present a longitudinal analysis of the rogue antivirus threat ecosystem, focusing on the structure and dynamics of this threat and its economics. To that end, we compiled and mined a large dataset of characteristics of rogue antivirus domains and of the servers that host them. The contributions of this paper are threefold. Firstly, we offer the first, to our knowledge, broad analysis of the infrastructure underpinning the distribution of rogue security software by tracking 6,500 malicious domains. Secondly, we show how to apply attack attribution methodologies to correlate campaigns likely to be associated to the same individuals or groups. By using these techniques, we identify 127 rogue security software campaigns comprising 4,549 domains. Finally, we contextualize our findings by comparing them to a different threat ecosystem, that of browser exploits. We underline the profound difference in the structure of the two threats, and we investigate the root causes of this difference by analyzing the economic balance of the rogue antivirus ecosystem. We track 372,096 victims over a period of 2 months and we take advantage of this information to retrieve monetization insights. While applied to a specific threat type, the methodology and the lessons learned from this work are of general applicability to develop a better understanding of the threat economies.
Conference Paper
Full-text available
Spam-based advertising is a business. While it has engendered both widespread antipathy and a multi-billion dollar anti-spam industry, it continues to exist because it fuels a profitable enterprise. We lack, however, a solid understanding of this enterprise's full structure, and thus most anti-spam interventions focus on only one facet of the overall spam value chain (e.g., spam filtering, URL blacklisting, site takedown). In this paper we present a holistic analysis that quantifies the full set of resources employed to monetize spam email, including naming, hosting, payment and fulfillment, using extensive measurements of three months of diverse spam data, broad crawling of naming and hosting infrastructures, and over 100 purchases from spam-advertised sites. We relate these resources to the organizations that administer them and then use this data to characterize the relative prospects for defensive interventions at each link in the spam value chain. In particular, we provide the first strong evidence of payment bottlenecks in the spam value chain: 95% of spam-advertised pharmaceutical, replica and software products are monetized using merchant services from just a handful of banks.
Article
Full-text available
We investigate the manipulation of web search results to promote the unauthorized sale of prescription drugs. We focus on search-redirection attacks, where miscreants compromise high-ranking websites and dynamically redirect traffic to different pharmacies based upon the particular search terms issued by the consumer. We constructed a representative list of 218 drug-related queries and automatically gathered the search results on a daily basis over nine months in 2010-2011. We find that about one third of all search results are one of over 7,000 infected hosts triggered to redirect to a few hundred pharmacy websites. Legitimate pharmacies and health resources have been largely crowded out by search-redirection attacks and blog spam. Infections persist longest on websites with high PageRank and on .edu domains. 96% of infected domains are connected through traffic redirection chains, and network analysis reveals that a few concentrated communities link many otherwise disparate pharmacies together. We calculate that the conversion rate of web searches into sales lies between 0.3% and 3%, and that more illegal drug sales are facilitated by search-redirection attacks than by email spam. Finally, we observe that concentration in both the source infections and redirectors presents an opportunity for defenders to disrupt online pharmacy sales.
Conference Paper
Full-text available
Cloaking and redirection are two possible search engine spamming techniques. In order to understand cloaking and redirection on the Web, we downloaded two sets of Web pages while mimicking a popular Web crawler and a common Web browser. We estimate that 3% of the first data set and 9% of the second data set utilize cloaking of some kind. By manually checking a sample of the cloaking pages from the second data set, nearly one third of them appear to aim to manipulate search engine ranking. We also examined redirection methods present in the first data set. We propose a method of detecting cloaking pages by calculating the difference across three copies of the same page. We examine the different types of cloaking that are found and the distribution of different types of redirection.
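One way to realize the multi-copy comparison sketched in this abstract (the exact pairing of copies the authors use may differ): two crawler-side fetches bound a page's normal churn, and a browser-side fetch that diverges from the crawler copies by much more than that baseline becomes a cloaking candidate. The user agents and margin below are illustrative.

```python
# Sketch of the "three copies" idea: two crawler fetches bound a page's normal
# churn (ads, timestamps); a browser fetch that differs from the crawler copies
# by much more than that baseline is flagged as a cloaking candidate.
# User-agent strings and the margin are illustrative choices.
import re
import time
import requests

CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"

def terms(html):
    return set(re.findall(r"[a-z]{3,}", re.sub(r"<[^>]+>", " ", html.lower())))

def fetch(url, ua):
    return terms(requests.get(url, headers={"User-Agent": ua}, timeout=15).text)

def difference(a, b):
    return 1.0 - (len(a & b) / len(a | b) if (a | b) else 1.0)

def cloaking_candidate(url, margin=0.3):
    crawler1 = fetch(url, CRAWLER_UA)
    time.sleep(2)                              # let legitimate dynamic content vary
    crawler2 = fetch(url, CRAWLER_UA)
    browser = fetch(url, BROWSER_UA)
    baseline = difference(crawler1, crawler2)  # normal churn between crawler copies
    cross = difference(crawler2, browser)      # crawler vs browser gap
    return cross > baseline + margin

if __name__ == "__main__":
    print(cloaking_candidate("http://example.com/"))   # placeholder URL
```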
Conference Paper
Full-text available
By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
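A skeletal version of the two-step pipeline described above: a cheap filtering pass nominates candidate pairs, and a classifier trained on manually labeled candidates makes the final call. The features and the decision-tree stand-in are placeholders, not the paper's actual feature set or classifier.

```python
# Skeleton of the two-step detection pipeline the abstract outlines: a cheap
# filtering pass to nominate candidate pages, then a classifier over features
# of the crawler/browser copy pair. The features and the decision-tree stand-in
# are placeholders, not the paper's actual feature set.
from sklearn.tree import DecisionTreeClassifier

def filter_step(pairs, min_term_diff=50):
    """pairs: iterable of (url, crawler_terms, browser_terms) with term sets."""
    for url, crawler, browser in pairs:
        if len(crawler ^ browser) >= min_term_diff:     # symmetric difference
            yield url, crawler, browser

def features(crawler, browser):
    union = len(crawler | browser) or 1
    return [len(crawler - browser) / union,             # terms only the crawler saw
            len(browser - crawler) / union,             # terms only the browser saw
            len(crawler & browser) / union]             # shared terms

def train(labelled_pairs):
    """labelled_pairs: (url, crawler_terms, browser_terms, label), 1 = cloaking."""
    X = [features(c, b) for _, c, b, _ in labelled_pairs]
    y = [label for _, _, _, label in labelled_pairs]
    return DecisionTreeClassifier(max_depth=5).fit(X, y)

if __name__ == "__main__":
    # Tiny synthetic example of training and applying the second-step classifier.
    spam = ("u1", {"cheap", "pills", "buy"}, {"poetry", "blog", "garden"}, 1)
    ok = ("u2", {"news", "today", "weather"}, {"news", "today", "weather", "radar"}, 0)
    clf = train([spam, ok])
    print(clf.predict([features({"casino", "bonus"}, {"recipe", "cake"})]))
```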
Article
Search spam is an attack on search engines' ranking algorithms to promote spam links into top search ranking that they do not deserve. Cloaking is a well-known search spam technique in which spammers serve one page to search-engine crawlers to optimize ranking, but serve a different page to browser users to maximize potential profit. In this experience report, we investigate a different and relatively new type of cloaking, called Click-Through Cloaking, in which spammers serve non-spam content to browsers who visit the URL directly without clicking through search results, in an attempt to evade spam detection by human spam investigators and anti-spam scanners. We survey different cloaking techniques actually used in the wild and classify them into three categories: server-side, client-side, and combination. We propose a redirection-diff approach to spam detection by turning spammers' cloaking techniques against themselves. Finally, we present eight case studies in which we used redirection-diff in IP subnet-based spam hunting to defend a major search engine against stealth spam pages that use click-through cloaking.
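A hedged sketch of probing for click-through cloaking: fetch a URL once with a search-result Referer (simulating a click-through) and once directly, then compare where each request lands and what it serves. Header values are illustrative; client-side variants of this cloaking require a real browser rather than a plain HTTP client.

```python
# Sketch of probing for click-through cloaking: fetch a URL once with a
# search-result Referer (simulating a click-through) and once without, then
# compare where each request lands and what it serves. Server-side cloaking of
# this kind is visible to plain HTTP clients; client-side variants need a real
# browser. Header values are illustrative.
import requests

SEARCH_REFERER = "https://www.google.com/search?q=cheap+pills"   # simulated click-through
UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"

def fetch(url, referer=None):
    headers = {"User-Agent": UA}
    if referer:
        headers["Referer"] = referer
    r = requests.get(url, headers=headers, timeout=15, allow_redirects=True)
    return r.url, r.text

def click_through_diff(url):
    via_search_url, via_search_body = fetch(url, SEARCH_REFERER)
    direct_url, direct_body = fetch(url)
    return {
        "redirect_targets_differ": via_search_url != direct_url,
        "content_length_gap": abs(len(via_search_body) - len(direct_body)),
    }

if __name__ == "__main__":
    print(click_through_diff("http://example.com/"))              # placeholder URL
```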
Article
Internet sensor networks, including honeypots and log analysis centers such as the SANS Internet Storm Center, are used as a tool to detect malicious Internet traffic. For maximum effectiveness, such networks publish public reports without disclosing sensor locations, so that the Internet community can take steps to counteract the malicious traffic. Maintaining sensor anonymity is critical because if the set of sensors is known, a malicious attacker could avoid the sensors entirely or could overwhelm the sensors with errant data. Motivated by the growing use of Internet sensors as a tool to monitor Internet traffic, we show that networks that publicly report statistics are vulnerable to intelligent probing to determine the location of sensors. In particular, we develop a new "probe response" attack technique with a number of optimizations for locating the sensors in currently deployed Internet sensor networks and illustrate the technique for a specific case study that shows how the attack would locate the sensors of the SANS Internet Storm Center using the published data from those sensors. Simulation results show that the attack can determine the identity of the sensors in this and other sensor networks in less than a week, even under a limited adversarial model. We detail critical vulnerabilities in several current anonymization schemes and demonstrate that we can quickly and efficiently discover the sensors even in the presence of sophisticated anonymity preserving methods such as prefix-preserving permutations or Bloom filters. Finally, we consider the characteristics of an Internet sensor which make it vulnerable to probe response attacks and discuss potential countermeasures.
Conference Paper
Forum spamming has become a major means of search engine spamming. To evaluate the impact of forum spamming on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputedly well-maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that spammers target abandoned pages and that most comment spam is meant to increase page rank rather than generate immediate traffic. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that these analyses are very effective in identifying spam pages.
Article
In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.
Article
Web spam attempts to influence search engine ranking algorithms in order to boost the rankings of specific web pages in search engine results. Cloaking is a widely adopted technique for concealing web spam by returning different content to search engines' crawlers from that displayed in a web browser. Previous work on cloaking detection is mainly based on the differences in terms and/or links between multiple copies of a URL retrieved from web browser and search engine crawler perspectives. This work presents three methods of using differences in tags to determine whether a URL is cloaked. Since the tags of a web page generally do not change as frequently and significantly as the terms and links of the web page, tag-based cloaking detection methods can work more effectively than term- or link-based methods. The proposed methods are tested on a dataset of URLs covering short-, medium- and long-term user interests. Experimental results indicate that the tag-based methods outperform term- or link-based methods in both precision and recall. Moreover, a Weka J4.8 classifier using a combination of term and tag features yields an accuracy rate of 90.48%.
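A brief sketch of the tag-based idea: extract the HTML tag profile of the crawler copy and the browser copy and compare those rather than terms or links. This illustrates the general extraction-and-comparison step, not the paper's three specific methods or its J4.8 classifier.

```python
# Sketch of the tag-based comparison idea: extract the set of HTML tag names
# (with counts) from a crawler copy and a browser copy and compare those rather
# than terms or links, since a page's tag structure changes less often than its
# text. This illustrates the general approach, not the paper's three methods.
from collections import Counter
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_counts(html):
    collector = TagCollector()
    collector.feed(html)
    return Counter(collector.tags)

def tag_difference(crawler_html, browser_html):
    a, b = tag_counts(crawler_html), tag_counts(browser_html)
    total = sum((a | b).values()) or 1          # elementwise max of the two multisets
    shared = sum((a & b).values())              # elementwise min (tags in common)
    return 1.0 - shared / total                 # 0 = identical tag profile, 1 = disjoint

crawler_copy = "<html><body><h1>Blog</h1><p>post</p><p>post</p></body></html>"
browser_copy = "<html><body><h1>Pharmacy</h1><ul><li>pill</li><li>pill</li></ul></body></html>"
print(round(tag_difference(crawler_copy, browser_copy), 3))
```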
Conference Paper
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.