Conference Paper

Hybrid Focused Crawling for Homemade Explosives Discovery on Surface and Dark Web

Authors:
Iliou, Kalpakis, Tsikrika, Vrochidis, and Kompatsiaris

Abstract

This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly traverse the Surface Web and several darknets present in the Dark Web (i.e. Tor, I2P and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the network type. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed approach both for the Surface and the Dark Web.
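
The full text is not available here, but the abstract outlines the core mechanism: a single crawl frontier whose fetching configuration and link-scoring strategy switch with the destination network. The following is a minimal sketch of that idea; the proxy endpoints, the Freenet heuristic, the toy keyword scorer, and all names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative skeleton of a network-adaptive focused-crawl frontier.
# Proxy ports, heuristics and the toy relevance scorer are assumptions.
import heapq
from urllib.parse import urlparse

PROXIES = {                          # assumed local proxy endpoints per darknet
    "tor":     {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"},
    "i2p":     {"http": "http://127.0.0.1:4444"},
    "freenet": {"http": "http://127.0.0.1:8888"},
    "surface": None,                 # direct connection
}

def network_of(url):
    """Guess which network a hyperlink points to from its host/key."""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(".onion"):
        return "tor"
    if host.endswith(".i2p"):
        return "i2p"
    if "USK@" in url or "CHK@" in url:   # crude check for Freenet keys via a local fproxy
        return "freenet"
    return "surface"

def link_score(anchor_text, topic_terms):
    """Toy relevance estimate; the paper uses trained classifiers instead."""
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / (len(topic_terms) or 1)

def crawl(seeds, topic_terms, max_pages=100):
    frontier = [(-1.0, url) for url in seeds]      # max-heap via negated scores
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        net = network_of(url)
        # Fetching (via PROXIES[net]) and link extraction are omitted; extracted
        # links would be pushed back with priority -link_score(anchor, topic_terms).
        order.append((url, net, -neg_score))
    return order

print(crawl(["http://example.com/forum", "http://abcdefghij234567.onion/recipes"],
            {"explosive", "synthesis", "recipe"}))
```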

... The Surface Web constitutes the part of the Web gathered and indexed by conventional general-purpose search engines such as Google, Firefox, Bing, etc. However, such search engines are capable of indexing just a small portion of available Web information (Beshiri & Susuri, 2019; EMCDDA & Europol, 2017; Iliou, Kalpakis, Tsikrika, Vrochidis, & Kompatsiaris, 2016). ...
... Another part of the Internet is the Deep Web, which comprises content that cannot be detected by the crawlers employed by conventional search engines, and includes information on the private networks and intranets that are password-protected behind logins, encrypted, or disallowed by the owner. By definition, private social media profiles on Facebook or Twitter are considered part of the Deep Web, too (Beshiri & Susuri, 2019; Iliou et al., 2016; Schäfer et al., 2019). ...
... There is also a part of the Deep Web, known as the Dark Web, that provides anonymity both from a user and a data perspective, as its content is intentionally hidden and cannot be accessed by standard web browsers, but instead requires the use of special software. For this reason, the Dark Web has become popular for material such as "child pornography, unauthorised leaks of sensitive information, money laundering, copyright infringement, credit card fraud, identity theft, illegal sales of weapons and disseminating extremist content" (Weimann, 2016). The Dark Web is formed by several darknets such as The Onion Router (TOR), which enables online anonymous communication, and the Invisible Internet Project (I2P), which is used for anonymous communication, users' traffic encryption, etc. (Beshiri & Susuri, 2019; EMCDDA & Europol, 2017; Iliou et al., 2016; Schäfer et al., 2019). The anonymity afforded by the Dark Web enables those engaged in the purchase and supply of drugs to conceal their identities. ...
Article
Full-text available
This systematic review attempts to understand how people keep secrets online, and in particular how people use the internet when engaging in covert behaviours and activities regarding the procurement and supply of illicit drugs. With the Internet and social media being part of everyday life for most people in western and non-western countries, there are ever-growing opportunities for individuals to engage in covert behaviours and activities online that may be considered illegal or unethical. A search strategy using Medical Subject Headings terms and relevant key words was developed. A comprehensive literature search of published and unpublished studies in electronic databases was conducted. Additional studies were identified from reference lists of previous studies and (systematic) reviews that had similar objectives as this search, and were included if they fulfilled our inclusion criteria. Two researchers independently screened abstracts and full-texts for study eligibility and evaluated the quality of included studies. Disagreements were resolved by a consensus procedure. The systematic review includes 33 qualitative studies and one cross-sectional study, published between 2006 and 2018. Five covert behaviours were identified: the use of communication channels; anonymity; visibility reduction; limited posts in public; following forum rules and recommendations. The same technologies that provide individuals with easy access to information, such as social networking sites and forums, digital devices, digital tools and services, also increase the prevalence of inaccurate information, loss of privacy, identity theft and disinhibited communication. This review takes a rigorous interdisciplinary approach to synthesising knowledge on the strategies adopted by people in keeping secrets online. Whilst the focus is on the procurement and supply of illicit drugs, this knowledge is transferrable to a range of contexts where people keep secrets online. It has particular significance for those who design online/social media applications, and for law enforcement and security agencies.
... It is an open question whether the fundamental and often necessary protections that Tor provides its users is worth its cost: the same features that protect the privacy of virtuous users also make Tor an effective means to carry out illegal activities and to evade law enforcement. Various positions on this question have been documented [25], but empirical evidence is limited to studies that have crawled, extracted, and analyzed specific subsets of Tor based on the type of hosted information, such as drug trafficking [10], homemade explosives [16], terrorist activities [5], or forums [29]. ...
... Dolliver et al. crawl Silk Road 2 with the goal of comparing its nature in drug trafficking operations with that of the original site [9]. Other related works propose tools to support the collection of specific information, such as a focused crawler by Iliou et al. [16], new crawling frameworks for Tor by Zhang et al. [29], and advanced crawling and indexing systems like LIGHTS by Ghosh et al. [12]. ...
Conference Paper
Full-text available
Tor is among the most well-known darknets in the world. It has noble uses, including as a platform for free speech and information dissemination under the guise of true anonymity, but may be culturally better known as a conduit for criminal activity and as a platform to market illicit goods and data. Past studies on the content of Tor support this notion, but were carried out by targeting popular domains likely to contain illicit content. A survey of past studies may thus not yield a complete evaluation of the content and use of Tor. This work addresses this gap by presenting a broad evaluation of the content of the English Tor ecosystem. We perform a comprehensive crawl of the Tor dark web and, through topic and network analysis, characterize the 'types' of information and services hosted across a broad swath of Tor domains and their hyperlink relational structure. We recover nine domain types defined by the information or service they host and, among other findings, unveil how some types of domains intentionally silo themselves from the rest of Tor. We also present measurements that (regrettably) suggest how marketplaces of illegal drugs and services do emerge as the dominant type of Tor domain. Our study is the product of crawling over 1 million pages from 20,000 Tor seed addresses, yielding a collection of over 150,000 Tor pages. The domain structure is publicly available as a dataset at https://github.com/wsu-wacs/TorEnglishContent.
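
The "topic and network analysis" step summarised above can be approximated with off-the-shelf tooling. Below is a rough sketch using scikit-learn topic modelling over crawled page texts; the sample texts, the choice of nine topics (mirroring the nine reported domain types), and the vectorizer settings are assumptions for illustration, not the authors' pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in page texts; in practice these would be the crawled Tor pages.
pages = [
    "bitcoin escrow vendor market shipping stealth",
    "forum board thread reply member moderator",
    "hosting mirror uptime onion service provider",
]

vec = CountVectorizer(stop_words="english", max_features=5000)
X = vec.fit_transform(pages)

# Nine components to mirror the nine domain types reported in the abstract.
lda = LatentDirichletAllocation(n_components=9, random_state=0)
doc_topics = lda.fit_transform(X)          # per-page topic mixture

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top_terms = [terms[i] for i in comp.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_terms}")
```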
... Dynamically generated materials inaccessible via search engines due to the need for user input, rather than any desire for covertness or privacy (Florescu et al., 1998; Schadd et al., 2012); unseemly or nefarious content such as that created by extremist/hate groups (Abbasi and Chen, 2007; Yang et al., 2009; Li et al., 2013); and networks providing anonymity for content users and content providers (Iliou et al., 2016). This paper exclusively refers to 'dark web' in a manner akin to (Iliou et al., 2016), though with further clarification. Following (Guitton, 2013), we regard anonymity to be the non-coordinatability of traits (Wallace, 1999). ...
Article
Research into the nature and structure of 'Dark Webs' such as Tor has largely focused upon manually labelling a series of crawled sites against a series of categories, sometimes using these labels as a training corpus for subsequent automated crawls. Such an approach is adequate for establishing broad taxonomies, but is of limited value for specialised tasks within the field of law enforcement. Contrastingly, existing research into illicit behaviour online has tended to focus upon particular crime types such as terrorism. A gap exists between taxonomies capable of holistic representation and those capable of detailing criminal behaviour. The absence of such a taxonomy limits interoperability between agencies, curtailing development of standardised classification tools.We introduce the Tor-use Motivation Model (TMM), a two-dimensional classification methodology specifically designed for use within a law enforcement context. The TMM achieves greater levels of granularity by explicitly distinguishing site content from motivation, providing a richer labelling schema without introducing inefficient complexity or reliance upon overly broad categories of relevance. We demonstrate this flexibility and robustness through direct examples, showing the TMM's ability to distinguish a range of unethical and illegal behaviour without bloating the model with unnecessary detail.The authors of this paper received permission from the Australian government to conduct an unrestricted crawl of Tor for research purposes, including the gathering and analysis of illegal materials such as child pornography. The crawl gathered 232,792 pages from 7651 Tor virtual domains, resulting in the collation of a wide spectrum of materials, from illicit to downright banal. Existing conceptual models and their labelling schemas were tested against a small sample of gathered data, and were observed to be either overly prescriptive or vague for law enforcement purposes - particularly when used for prioritising sites of interest for further investigation.In this paper we deploy the TMM by manually labelling a corpus of over 4000 unique Tor pages. We found a network impacted (but not dominated) by illicit commerce and money laundering, but almost completely devoid of violence and extremism. In short, criminality on this 'dark web' is based more upon greed and desire, rather than any particular political motivations.
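
The key idea of the TMM, keeping "what a site hosts" separate from "why it is hosted", maps naturally onto a two-field label record. The sketch below is a minimal illustration of that structure only; the category values are placeholders, not the TMM's actual schema.

```python
# Minimal two-dimensional labelling record in the spirit of the TMM.
# Category values are illustrative placeholders, not the published taxonomy.
from dataclasses import dataclass
from enum import Enum

class Content(Enum):
    MARKETPLACE = "marketplace"
    FORUM = "forum"
    HOSTING = "hosting"
    OTHER = "other"

class Motivation(Enum):
    COMMERCIAL = "commercial"
    IDEOLOGICAL = "ideological"
    PRIVACY = "privacy"
    UNKNOWN = "unknown"

@dataclass
class TorSiteLabel:
    domain: str
    content: Content        # what the site hosts
    motivation: Motivation  # why it is hosted

label = TorSiteLabel("exampleonionaddress.onion", Content.MARKETPLACE, Motivation.COMMERCIAL)
print(label)
```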
... Custom crawlers in the literature either lack technical solutions for the challenges posed by Fu et al. in 2010, or they do not clearly explain how they deal with them, for example, when dealing with login functionality [12] or with CAPTCHA challenges that prevent auto-login [22]. Others opt to exclude forums that require registration [43]. ...
... These include: not identifying individuals (including not publishing usernames); taking care to present results objectively; dealing appropriately with personal data (such as credit card data belonging to victims); and taking steps to protect the researchers. Some researchers also take the step of not disclosing which forums have been analysed [22,29]. When it comes to presenting results, the researchers can ensure that they do not make comments that are likely to offend the community being studied, to protect themselves as well as the research participants. ...
Conference Paper
Underground forums allow criminals to interact, exchange knowledge, and trade in products and services. They also provide a pathway into cybercrime, tempting the curious to join those already motivated to obtain easy money. Analysing these forums enables us to better understand the behaviours of offenders and pathways into crime. Prior research has been valuable, but limited by a reliance on datasets that are incomplete or outdated. More complete data, going back many years, allows for comprehensive research into the evolution of forums and their users. We describe CrimeBot, a crawler designed around the particular challenges of capturing data from underground forums. CrimeBot is used to update and maintain CrimeBB, a dataset of more than 48m posts made from 1m accounts in 4 different operational forums over a decade. This dataset presents a new opportunity for large-scale and longitudinal analysis using up-to-date information. We illustrate the potential by presenting a case study using CrimeBB, which analyses which activities lead new actors into engagement with cybercrime. CrimeBB is available to other academic researchers under a legal agreement, designed to prevent misuse and provide safeguards for ethical research.
... It is an open question whether the fundamental and often necessary protections that Tor provides its users is worth its cost: the same features that protect the privacy of virtuous users also make Tor an effective means to carry out illegal activities and to evade law enforcement. Various positions on this question have been documented [16,22,30], but empirical evidence is limited to studies that have crawled, extracted, and analyzed specific subsets of Tor based on the type of hosted information, such as drug trafficking [12], homemade explosives [20], terrorist activities [7], or forums [39]. ...
... Dolliver et al. also conduct an investigation on psychoactive substances sold on Agora and the countries supporting this dark trade [13]. Other related works propose tools to support the collection of specific information, such as a focused crawler by Iliou et al. [20], new crawling frameworks for Tor by Zhang et al. [39], and advanced crawling and indexing systems like LIGHTS by Ghosh et al. [15]. ...
Preprint
Full-text available
Tor is among the most well-known darknets in the world. It has noble uses, including as a platform for free speech and information dissemination under the guise of true anonymity, but may be culturally better known as a conduit for criminal activity and as a platform to market illicit goods and data. Past studies on the content of Tor support this notion, but were carried out by targeting popular domains likely to contain illicit content. A survey of past studies may thus not yield a complete evaluation of the content and use of Tor. This work addresses this gap by presenting a broad evaluation of the content of the English Tor ecosystem. We perform a comprehensive crawl of the Tor dark web and, through topic and network analysis, characterize the types of information and services hosted across a broad swath of Tor domains and their hyperlink relational structure. We recover nine domain types defined by the information or service they host and, among other findings, unveil how some types of domains intentionally silo themselves from the rest of Tor. We also present measurements that (regrettably) suggest how marketplaces of illegal drugs and services do emerge as the dominant type of Tor domain. Our study is the product of crawling over 1 million pages from 20,000 Tor seed addresses, yielding a collection of over 150,000 Tor pages. We intend to make the domain structure publicly available as a dataset at https://github.com/wsu-wacs/TorEnglishContent.
... In terms of conceptualization, the web is content made up of web sites accessible through search engines such as Google, Firefox, etc. This content is known as the "Surface Web" (Figure 1) [2][3][4][5]. ...
... The number of TOR daily users in the world who used the Internet anonymously during January to December 2018. ...
Article
Full-text available
The Internet as a whole is a network of multiple computer networks and their massive infrastructure. The web is made up of websites accessible through search engines such as Google, Firefox, etc., and it is known as the Surface Web. The Internet is further segmented into the Deep Web, the content that is not indexed and cannot be accessed by traditional search engines. The Dark Web is considered a segment of the Deep Web and is accessed through TOR. Actors within Dark Web websites are anonymous and hidden. Anonymity, privacy, and the possibility of non-detection are three factors provided by special browsers such as TOR and I2P. In this paper, we discuss and provide results about the influence of the Dark Web in different spheres of society. The number of daily anonymous users of the Dark Web (using TOR) is given for Kosovo as well as for the whole world over a period of time. The influence of hidden service websites is shown, and results are gathered from the Ahmia and Onion City Dark Web search engines. Anonymity is not completely verified on the Dark Web; TOR is dedicated to it and is intended to provide anonymous activities. Results are reported on the number of users and on the place(s) they come from. The calculation is based on IP addresses, according to the country codes from which the access originates, and the numbers are reported in aggregate form. In this way, Dark Web users are represented indirectly. The number of users in anonymous networks on the Dark Web is another key element of the results. In such networks, users are counted through client requests for directories (via TOR metrics) as the relay list is updated; the number of users of the anonymous networks is thus calculated indirectly.
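
The user-count methodology described above (directory requests aggregated by country code) boils down to summing per-country daily estimates. A small sketch follows, assuming a CSV of per-day, per-country client estimates in the general form Tor Metrics publishes; the file name and column names are assumptions.

```python
# Sketch: mean daily Tor users per country from a CSV with columns
# date, country, users. File and column names are assumed for illustration.
import csv
from collections import defaultdict

def users_by_country(path="userstats-relay-country.csv"):
    totals, days = defaultdict(float), defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            cc = row.get("country") or "??"
            try:
                users = float(row["users"])
            except (KeyError, ValueError):
                continue                      # skip malformed rows
            totals[cc] += users
            days[cc] += 1
    # mean daily users per country over the covered period
    return {cc: totals[cc] / days[cc] for cc in totals}

if __name__ == "__main__":
    estimates = users_by_country()
    for cc, avg in sorted(estimates.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{cc}: {avg:,.0f} mean daily users")
```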
... Papers were classified as an evaluation of illegal activities over the Dark Net when they reviewed the presence as well as impacts of illegal activities on the Dark Web in general, and its illegal markets in particular, on the users [9, 10, 64, 139, 152, 155, 160, 183-193]. Of the 200 papers in our corpus, 18 (9%) papers explored the illegal aspect of Dark Net activities. ...
... Figure 11 shows more details of the publication timeline of this theme: 2013 [160]; 2014 [193]; 2016 [184, 189]; 2018 [9, 10, 185, 187, 190]; 2019 [155, 183, 186]; 2020 [64, 139, 152, 188, 191]; 2021 [192]. ...
Article
Full-text available
The World Wide Web (www) consists of the surface web, deep web, and Dark Web, depending on the content shared and the access to these network layers. Dark Web consists of the Dark Net overlay of networks that can be accessed through specific software and authorization schema. Dark Net has become a growing community where users focus on keeping their identities, personal information, and locations secret due to the diverse population base and well-known cyber threats. Furthermore, not much is known of Dark Net from the user perspective, where often there is a misunderstanding of the usage strategies. To understand this further, we conducted a systematic analysis of research relating to Dark Net privacy and security on N=200 academic papers, where we also explored the user side. An evaluation of secure end-user experience on the Dark Net establishes the motives of account initialization in overlaid networks such as Tor. This work delves into the evolution of Dark Net intelligence for improved cybercrime strategies across jurisdictions. The evaluation of the developing network infrastructure of the Dark Net raises meaningful questions on how to resolve the issue of increasing criminal activity on the Dark Web. We further examine the security features afforded to users, motives, and anonymity revocation. We also evaluate more closely nine user-study-focused papers revealing the importance of conducting more research in this area. Our detailed systematic review of Dark Net security clearly shows the apparent research gaps, especially in the user-focused studies emphasized in the paper.
... Also, several approaches focus on hybrid crawling [12], searching for data in Tor, i2p, and Freenet. Most of the existing research tries to identify threat intelligence information and other critical information. ...
Preprint
Tor and i2p networks are two of the most popular darknets. Both darknets have become an area of illegal activities highlighting the necessity to study and analyze them to identify and report illegal content to Law Enforcement Agencies (LEAs). This paper analyzes the connections between the Tor network and the i2p network. We created the first dataset that combines information from Tor and i2p networks. The dataset contains more than 49k darknet services. The process of building and analyzing the dataset shows that it is not possible to explore one of the networks without considering the other. Both networks work as an ecosystem and there are clear paths between them. Using graph analysis, we also identified the most relevant domains, the prominent types of services in each network, and their relations. Findings are relevant to LEAs and researchers aiming to crawl and investigate i2p and Tor networks.
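
The "most relevant domains" mentioned above are typically identified with standard graph centrality measures over the combined Tor/i2p hyperlink graph. A minimal sketch with networkx follows; the edge list is invented, whereas the real dataset covers roughly 49k services.

```python
# Sketch: rank darknet domains in a cross-network hyperlink graph by PageRank.
# Edges are made-up examples for illustration.
import networkx as nx

edges = [
    ("marketA.onion", "forumB.onion"),
    ("forumB.onion", "wikiC.i2p"),
    ("wikiC.i2p", "marketA.onion"),
    ("forumB.onion", "mirrorD.i2p"),
]

G = nx.DiGraph(edges)
rank = nx.pagerank(G, alpha=0.85)
for domain, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{domain:15s} {score:.3f}")
```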
... By giving the crawler software some seed addresses as the initial address set, they aim to reach pages that overlap with the content at those addresses. A similar crawler on this subject is described by [14]. Also, in the work of [15], new crawling methods that can be used on TOR are proposed; in the work of [16], advanced crawling and indexing systems like LIGHTS are developed for use on the TOR network. ...
Article
Full-text available
TOR (The Onion Router) is a network structure that has become popular in recent years because it provides anonymity to its users, and it is often preferred by hidden services. In this network, where privacy is essential, the amount of stored data increases day by day, making it difficult to scan and analyze. In addition, it is highly likely that scanning onion services will be treated as a cyber-attack and that access to the relevant address will be blocked. Various crawlers have been developed in order to scan and access the services (onion web pages) in this network. However, crawling here differs from crawling pages on the surface network with extensions such as com, net, or org. This is because the TOR network is located on the lower layers of the surface network, and pages in the TOR network are accessed only through the TOR browser instead of traditional browsers (Chrome, Mozilla, etc.). The crawlers developed to date take this into consideration and, in order to protect confidentiality, obtain data by selecting paths through different relays for the requests made to the addresses. In the TOR network, reaching the target address by passing over different nodes in each request sent by the users slows down this network. In addition, the low performance of a browser that tries to retrieve information through TOR brings long waiting periods. Therefore, working with crawlers that have high crawling and information-acquisition speed will improve the analysis process for researchers. Four different crawlers were evaluated according to various criteria, in order to guide people who will conduct research in this field and to assess the strengths and weaknesses of the crawlers against each other. The study provides an important point of view for choosing the right crawler, in terms of initial starting points, for researchers who want to analyze Tor web services.
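
All of the compared crawlers share the same basic fetching step: requesting onion pages through a local Tor client rather than directly. A minimal sketch of that step is given below; it assumes a Tor SOCKS proxy listening on 127.0.0.1:9050, requests installed with SOCKS support (requests[socks]), and a placeholder onion address.

```python
# Sketch: fetch an onion page through Tor's local SOCKS proxy.
import requests

TOR_PROXIES = {
    "http":  "socks5h://127.0.0.1:9050",   # socks5h so name resolution also goes through Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion(url, timeout=60):
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    # Placeholder address; requires a running Tor client to actually resolve.
    html = fetch_onion("http://exampleonionaddressxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion/")
    print(len(html), "bytes of HTML")
```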
... Regarding the Deep Web, many works have focused their attention on darknets, especially on TOR. For instance, a generic crawling framework is proposed in [29] to discover resources with different contents hosted on either the Surface Web or the Deep Web. In particular, the authors carry out an experiment intended to search for websites with content about homemade explosives. ...
Preprint
Full-text available
The Web is a primary and essential service for sharing information among users and organizations all over the world at present. Despite the current significance of this kind of traffic on the Internet, the so-called Surface Web traffic has been estimated at just about 5% of the total. The rest of the volume of this type of traffic corresponds to the portion of the Web known as the Deep Web. These contents are not accessible by search engines because they are authentication-protected contents or pages that are only reachable through the so-called darknets. To browse darknet websites, special authorization or specific software and configurations are needed. Although TOR is the most used darknet nowadays, there are other alternatives such as I2P or Freenet, which offer different features for end users. In this work, we perform an analysis of the connectivity of websites in the I2P network (named eepsites) aimed at discovering whether different patterns and relationships from those of the legacy web are followed in I2P, and also at getting insights about its dimension and structure. For that, a novel tool is specifically developed by the authors and deployed on a distributed scenario. The main results show the decentralized nature of the I2P network, where there is a structural part of interconnected eepsites while several other nodes are isolated, probably due to their intermittent presence in the network.
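
The finding of a "structural part of interconnected eepsites" versus isolated nodes corresponds to a connected-components analysis of the eepsite hyperlink graph. Below is a toy sketch with networkx; the edges and the isolated sites are invented for illustration.

```python
# Sketch: separate the interconnected core of eepsites from isolated ones.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("siteA.i2p", "siteB.i2p"),
    ("siteB.i2p", "siteC.i2p"),
    ("siteC.i2p", "siteA.i2p"),
])
G.add_nodes_from(["lonelyD.i2p", "lonelyE.i2p"])     # crawled but never linked

components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
core = components[0]
isolated = [n for n in G if G.degree(n) == 0]
print("core size:", len(core))
print("isolated eepsites:", isolated)
```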
... Moreover, in [27], the authors propose a crawler based on the query-intensive interface information extraction protocol. In [28], a crawler for the search of homemade explosives is proposed. Additionally, in [29], an architecture based on Docker for crawling the dark web is presented. ...
... Motivated by the activities of the HOMER project, this framework was configured towards the discovery of resources on the Surface and the Dark Web containing HME recipes. As an extension of a preliminary study [2], this work has the following additional contributions: 1. It expresses the proposed focused crawling approach employing the three classifiers in an individual or in a combined mode based on conditions related to the destination network of a hyperlink (e.g., the hyperlink selection policy differentiates when a link points to the Dark Web or to specific darknets of the Dark Web) and/or to whether the local context representation of the hyperlinks encountered conveys meaningful or sufficient information. ...
Article
Full-text available
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates 11 hyperlink selection methods, among which is a novel strategy based on the dynamic linear combination of a link-based and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed focused crawler both for the Surface and the Dark Web.
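
The "dynamic linear combination of a link-based and a parent Web page classifier" can be pictured as a weighted blend whose weight depends on how much local evidence surrounds the hyperlink. The sketch below illustrates that idea only; the weighting rule and the threshold are assumptions, not the published formula.

```python
# Illustrative blend of a link-context classifier and a parent-page classifier.
# The dynamic weighting rule and min_terms threshold are assumptions.
def combined_link_score(link_context_score, parent_page_score,
                        link_context_terms, min_terms=5):
    """The more local evidence (terms around the hyperlink), the more the
    link-based score is trusted; with little local context the parent page's
    score dominates."""
    weight = min(1.0, link_context_terms / float(min_terms))   # dynamic weight in [0, 1]
    return weight * link_context_score + (1.0 - weight) * parent_page_score

# A link with almost no anchor text falls back on the parent page's relevance.
print(combined_link_score(0.2, 0.9, link_context_terms=1))   # 0.76, parent dominates
print(combined_link_score(0.8, 0.3, link_context_terms=10))  # 0.80, local context dominates
```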
Article
Tor and i2p networks are two of the most popular darknets. Both darknets have become an area of illegal activities highlighting the necessity to study and analyze them to identify and report illegal content to Law Enforcement Agencies (LEAs). This paper analyzes the connections between the Tor network and the i2p network. We created the first dataset that combines information from Tor and i2p networks. The dataset contains more than 49k darknet services. The process of building and analyzing the dataset shows that it is not possible to explore one of the networks without considering the other. Both networks work as an ecosystem and there are clear paths between them. Using graph analysis, we also identified the most relevant domains, the prominent types of services in each network, and their relations. Findings are relevant to LEAs and researchers aiming to crawl and investigate i2p and Tor networks.
Article
The Web is a primary and essential service for sharing information among users and organizations all over the world at present. Despite the current significance of this kind of traffic on the Internet, the so-called Surface Web traffic has been estimated at just about 5% of the total. The rest of the volume of this type of traffic corresponds to the portion of the Web known as the Deep Web. These contents are not accessible by usual search engines because they are authentication-protected contents or pages only reachable through technologies denoted as darknets. To browse darknet websites, special authorization or specific software and configurations are needed. TOR is one of the most used darknets nowadays, but there are several other alternatives such as I2P or Freenet, which offer different features for end users. In this work, we perform a connectivity analysis of the websites in the I2P network (named eepsites) aimed at discovering whether different patterns and relationships from those of legacy webs are followed in I2P, as well as at getting insights about the dimension and structure of this darknet. For that, a novel tool is specifically developed by the authors and deployed on a distributed scenario. The main results show the decentralized nature of the I2P network, where there is a structural part of interconnected eepsites while several other nodes are isolated, probably due to their intermittent presence in the network.
Method
Full-text available
Since the advent of darknet markets, or illicit cryptomarkets, there has been a sustained interest in studying their operations: the actors, products, payment methods, and so on. However, this research has been limited by a variety of obstacles, including the difficulty in obtaining reliable and representative data, which present challenges to undertaking a complete and systematic study. The Australian National University’s Cybercrime Observatory has developed tools that can be used to collect and analyse data obtained from darknet markets. This paper describes these tools in detail. While the proposed methods are not error-free, they provide a further step in providing a transparent and comprehensive solution for observing darknet markets tailored for data scientists, social scientists, criminologists and others interested in analysing trends from darknet markets.
Article
Full-text available
Analogous to the spectacular growth of the information superhighway, the Internet, demands for coherent and economical crawling methods continue to shoot up. Consequently, many innovative techniques have been put forth for efficient crawling. Among them, the significant one is focused crawlers. Focused crawlers are capable of searching for web pages that are relevant to topics defined in advance. Focused crawlers appeal to several search engines on the grounds of efficient filtering and reduced memory and time consumption. This paper furnishes a relevance-computation-based survey on web crawling. A set of fifty-two focused crawlers from the existing literature is categorized into four different classes: classic focused crawlers, semantic focused crawlers, learning focused crawlers, and ontology-learning focused crawlers. The prerequisites and the performance of each class with respect to harvest rate, target recall, precision, and F1-score are discussed. Future outlooks, shortcomings, and strategies are also suggested.
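
For reference, the evaluation measures named above have straightforward definitions; here is a quick sketch under the usual conventions (harvest rate as the fraction of fetched pages that are relevant, target recall as the fraction of known target pages reached).

```python
# Common focused-crawling evaluation metrics, under their usual definitions.
def harvest_rate(relevant_fetched, total_fetched):
    return relevant_fetched / total_fetched if total_fetched else 0.0

def target_recall(targets_found, total_targets):
    return targets_found / total_targets if total_targets else 0.0

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(harvest_rate(120, 1000))            # 0.12
print(target_recall(30, 50))              # 0.6
print(precision_recall_f1(120, 880, 20))  # (0.12, ~0.857, ~0.21)
```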
Article
Full-text available
The quality of translation work mainly depends on the understanding of the words in their domain. If machine translation can accurately translate the words of a domain in different languages, it can even avoid human communication errors. To achieve this, a high-quality bilingual corpus is crucial, as it is always the basis of a state-of-the-art machine translation system. However, it is complicated to construct such a corpus with a large amount of parallel data. In this paper, a new crawling architecture, called Hybrid Crawling Architecture (HCA), is proposed, which efficiently and effectively collects parallel data from the Web for the bilingual corpus. HCA aims at targeted websites which contain articles in at least two different languages. As it is a mixture of the Focused crawling architecture and the Parallel crawling architecture, HCA combines the advantages of both. In intensive experiments on crawling parallel data on relevant topics, HCA significantly outperforms the Focused crawling architecture and the Parallel crawling architecture by 30% and 200%, respectively, in terms of quantity.
Conference Paper
Full-text available
Tor hidden services allow running Internet services while protecting the location of the servers. Their main purpose is to enable freedom of speech even in situations in which powerful adversaries try to suppress it. However, providing location privacy and client anonymity also makes Tor hidden services an attractive platform for every kind of imaginable shady service. The ease with which Tor hidden services can be set up has spurred a huge growth of anonymously provided Internet services of both types. In this paper we analyse the landscape of Tor hidden services. We have studied Tor hidden services after collecting 39824 hidden service descriptors on 4th of Feb 2013 by exploiting protocol and implementation flaws in Tor: we scanned them for open ports; in the case of HTTP services, we analysed and classified their content. We also estimated the popularity of hidden services by looking at the request rate for hidden service descriptors by clients. We found that while the content of Tor hidden services is rather varied, the most popular hidden services are related to botnets.
Article
Full-text available
Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
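
The "parallel best-first search" guided by a classifier reduces, at its core, to ordering the crawl frontier by a model's score for each link context. A compact sketch with scikit-learn follows; the training snippets and URLs are invented, and a real crawler would use far more data and, per the results above, prefer an SVM or neural network over Naive Bayes.

```python
# Sketch: a classifier-scored best-first crawl frontier.
import heapq
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["chemical synthesis recipe instructions",
               "football match results today",
               "improvised device construction guide",
               "holiday travel booking deals"]
train_labels = [1, 0, 1, 0]                       # 1 = on-topic link context

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_texts), train_labels)

def priority(link_context):
    """Classifier margin used as the crawl priority of a link."""
    return float(clf.decision_function(vec.transform([link_context]))[0])

frontier = []
for url, ctx in [("http://a.example/page1", "synthesis recipe pdf download"),
                 ("http://b.example/page2", "latest football results")]:
    heapq.heappush(frontier, (-priority(ctx), url))   # negate: highest score first

while frontier:
    neg_score, url = heapq.heappop(frontier)
    print(url, round(-neg_score, 3))
```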
Article
Full-text available
The unprecedented growth of the Internet has given rise to the Dark Web, the problematic facet of the Web associated with cybercrime, hate, and extremism. Despite the need for tools to collect and analyze Dark Web forums, the covert nature of this part of the Internet makes traditional Web crawling techniques insufficient for capturing such content. In this study, we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums. Several URL ordering features and techniques enable efficient extraction of forum postings. The system also includes an incremental crawler coupled with a recall-improvement mechanism intended to facilitate enhanced retrieval and updating of collected content. Experiments conducted to evaluate the effectiveness of the human-assisted accessibility approach and the recall-improvement-based, incremental-update procedure yielded favorable results. The human-assisted approach significantly improved access to Dark Web forums while the incremental crawler with recall improvement also outperformed standard periodic- and incremental-update approaches. Using the system, we were able to collect over 100 Dark Web forums from three regions. A case study encompassing link and content analysis of collected forums was used to illustrate the value and importance of gathering and analyzing content from such online communities.
Article
Full-text available
Context of a hyperlink or link context is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a support vector machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink as well as the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We analyze our results along various dimensions such as link context quality, topic difficulty, length of crawl, training data, and topic domain. The study was done using multiple crawls over 100 topics covering millions of pages allowing us to derive statistically strong results.
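
Extracting a link context of the kind studied above usually means taking the anchor text plus a window of words around it in the parent page. A small sketch with BeautifulSoup follows; the window size and the HTML snippet are arbitrary illustrative choices, and it requires beautifulsoup4.

```python
# Sketch: anchor text plus a surrounding word window as the "link context".
from bs4 import BeautifulSoup

HTML = """<html><body>
<p>Instructions and safety notes are collected here; see the
<a href="/guide">full guide</a> for the detailed procedure and materials list.</p>
</body></html>"""

def link_contexts(html, window=6):
    soup = BeautifulSoup(html, "html.parser")
    page_words = soup.get_text(" ", strip=True).split()
    for a in soup.find_all("a", href=True):
        anchor = a.get_text(" ", strip=True)
        # locate the anchor in the flattened page text and take a word window
        first = anchor.split()[0] if anchor.split() else ""
        idx = page_words.index(first) if first in page_words else 0
        ctx = page_words[max(0, idx - window): idx + len(anchor.split()) + window]
        yield a["href"], anchor, " ".join(ctx)

for href, anchor, ctx in link_contexts(HTML):
    print(href, "|", anchor, "|", ctx)
```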
Conference Paper
This work investigates the effectiveness of a state-of-the-art concept detection framework for the automatic classification of multimedia content, namely images and videos, embedded in publicly available Web resources containing recipes for the synthesis of Home Made Explosives (HMEs), to a set of predefined semantic concepts relevant to the HME domain. The concept detection framework employs advanced methods for video (shot) segmentation, visual feature extraction (using SIFT, SURF, and their variations), and classification based on machine learning techniques (logistic regression). The evaluation experiments are performed using an annotated collection of multimedia HME content discovered on the Web, and a set of concepts, which emerged both from an empirical study, and were also provided by domain experts and interested stakeholders, including Law Enforcement Agencies personnel. The experiments demonstrate the satisfactory performance of our framework, which in turn indicates the significant potential of the adopted approaches on the HME domain.
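
One common way to realise the kind of pipeline this abstract describes (local descriptors followed by machine-learned classification) is a bag-of-visual-words model. The code below is a rough sketch of that generic pipeline, not the paper's exact system; it assumes opencv-python (>= 4.4) and scikit-learn, real labelled image paths, and enough descriptors overall to build the visual vocabulary.

```python
# Rough bag-of-visual-words sketch: SIFT descriptors -> k-means vocabulary
# -> per-image histograms -> logistic regression concept classifier.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def sift_descriptors(path):
    """128-d SIFT descriptors of one image (empty array if none are found)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def bow_histogram(desc, kmeans):
    """Normalised bag-of-visual-words histogram for one image."""
    hist = np.zeros(kmeans.n_clusters)
    if len(desc):
        for word in kmeans.predict(desc):
            hist[word] += 1
        hist /= hist.sum()
    return hist

def train_concept_detector(image_paths, labels, vocab_size=64):
    """Fit a visual vocabulary and a binary concept classifier.

    Assumes the images jointly yield at least `vocab_size` descriptors.
    """
    all_desc = [sift_descriptors(p) for p in image_paths]
    kmeans = KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(all_desc))
    X = np.array([bow_histogram(d, kmeans) for d in all_desc])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return kmeans, clf

# Hypothetical usage: kmeans, clf = train_concept_detector(paths, labels)
# New images are then scored with clf.predict_proba([bow_histogram(...)]).
```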
Conference Paper
This work proposes a novel framework that integrates diverse state-of-the-art technologies for the discovery, analysis, retrieval, and recommendation of heterogeneous Web resources containing multimedia information about homemade explosives (HMEs), with particular focus on HME recipe information. The framework corresponds to a knowledge management platform that enables the interaction with HME information, and consists of three major components: (i) a discovery component that allows for the identification of HME resources on the Web, (ii) a content-based multimedia analysis component that detects HME-related concepts in multimedia content, and (iii) an indexing, retrieval, and recommendation component that processes the available HME information to enable its (semantic) search and provision of similar information. The proposed framework is being developed in a user-driven manner, based on the requirements of law enforcement and security agencies personnel, as well as HME domain experts. In addition, its development is guided by the characteristics of HME Web resources, as these have been observed in an empirical study conducted by HME domain experts. Overall, this framework is envisaged to increase the operational effectiveness and efficiency of law enforcement and security agencies in their quest to keep the citizen safe.
Conference Paper
This work investigates the effectiveness of a novel interactive search engine in the context of discovering and retrieving Web resources containing recipes for synthesizing Home Made Explosives (HMEs). The discovery of HME Web resources both on Surface and Dark Web is addressed as a domain-specific search problem; the architecture of the search engine is based on a hybrid infrastructure that combines two different approaches: (i) a Web crawler focused on the HME domain; (ii) the submission of HME domain-specific queries to general-purpose search engines. Both approaches are accompanied by a user-initiated post-processing classification for reducing the potential noise in the discovery results. The design of the application is built based on the distinctive nature of law enforcement agency user requirements, which dictate the interactive discovery and the accurate filtering of Web resources containing HME recipes. The experiments evaluating the effectiveness of our application demonstrate its satisfactory performance, which in turn indicates the significant potential of the adopted approaches on the HME domain.
Conference Paper
Focussed crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic based on evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, with visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed focussed crawling approach is applied towards the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate the effectiveness of incorporating visual evidence in the link selection process applied by the focussed crawler over the use of textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
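
The benefit reported above comes from letting page-level visual evidence raise the priority of links whose textual context alone looks irrelevant. Below is a toy scoring rule illustrating that idea; the scores and the boost factor are invented.

```python
# Toy rule: visual evidence on the parent page can only raise a link's priority.
def link_priority(text_score, parent_image_score, visual_boost=0.5):
    """Blend local textual evidence with page-level visual evidence, so links on
    image-rich relevant pages get explored even when their anchor text is
    uninformative."""
    return min(1.0, text_score + visual_boost * parent_image_score)

print(link_priority(text_score=0.1, parent_image_score=0.9))  # 0.55: worth following
print(link_priority(text_score=0.7, parent_image_score=0.0))  # 0.70: text alone suffices
```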
Conference Paper
This talk will review the emerging research in Terrorism Informatics based on a web mining perspective. Recent progress in the internationally renowned Dark Web project will be reviewed, including: deep/dark web spidering (web sites, forums, Youtube, virtual worlds), web metrics analysis, dark network analysis, web-based authorship analysis, and sentiment and affect analysis for terrorism tracking. In collaboration with selected international terrorism research centers and intelligence agencies, the Dark Web project has generated one of the largest databases in the world about extremist/terrorist-generated Internet contents (web sites, forums, blogs, and multimedia documents). Dark Web research has received significant international press coverage, including: Associated Press, USA Today, The Economist, NSF Press, Washington Post, Fox News, BBC, PBS, Business Week, Discover magazine, WIRED magazine, Government Computing Week, Second German TV (ZDF), Toronto Star, and Arizona Daily Star, among others. For more Dark Web project information, please see: http://ai.eller.arizona.edu/research/terror/ .
Article
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
Article
While the Web has become a worldwide platform for communication, terrorists share their ideology and communicate with members on the “Dark Web”—the reverse side of the Web used by terrorists. Currently, the problems of information overload and difficulty to obtain a comprehensive picture of terrorist activities hinder effective and efficient analysis of terrorist information on the Web. To improve understanding of terrorist activities, we have developed a novel methodology for collecting and analyzing Dark Web information. The methodology incorporates information collection, analysis, and visualization techniques, and exploits various Web information sources. We applied it to collecting and analyzing information of 39 Jihad Web sites and developed visualization of their site contents, relationships, and activity levels. An expert evaluation showed that the methodology is very useful and promising, having a high potential to assist in investigation and understanding of terrorist activities by producing results that could potentially help guide both policymaking and intelligence research.
The Deep Web, Trend Micro
  • V Ciancaglini
  • M Balduzzi
  • R McArdle
  • M Rösler