Chapter

Abstract

The Dark Web, a part of the Deep Web that consists of several darknets (e.g. Tor, I2P, and Freenet), provides users with the opportunity of hiding their identity when surfing or publishing information. This anonymity facilitates the communication of sensitive data for legitimate purposes, but also provides the ideal environment for transferring information, goods, and services with potentially illegal intentions. Therefore, Law Enforcement Agencies (LEAs) are very much interested in gathering OSINT on the Dark Web that would allow them to successfully prosecute individuals involved in criminal and terrorist activities. To this end, LEAs need appropriate technologies that would allow them to discover darknet sites that facilitate such activities and identify the users involved. This chapter presents current efforts in this direction by first providing an overview of the most prevalent darknets, their underlying technologies, their size, and the type of information they contain. This is followed by a discussion of the LEAs’ perspective on OSINT on the Dark Web and the challenges they face towards discovering and de-anonymizing such information and by a review of the currently available techniques to this end. Finally, a case study on discovering terrorist-related information, such as home made explosive recipes, on the Dark Web is presented.


... The greater part of the WWW content is hosted on the so-called Deep Web [1]. This content is not accessible to search engines, mainly because it consists of authentication-protected content or pages that are only reachable through the well-known darknets [2], [3]. To browse darknet websites, special authorization or specific software and configurations are needed. ...
... They are distributed in different geographical places, running in different computation clusters too. On the one hand, 7 of the VMs are deployed on the facilities of the University of Granada (Spain): 6 named as i2pProjectM[1-6], plus the i2pProjectBBDD machine. Every VM executes an I2P router instance for accessing the I2P darknet, except for the i2pProjectBBDD machine. Machines i2pProjectM[1-6] were set up as floodfill I2P routers, while the rest, i2pProjectM[7-10], run as normal I2P routers. ...
... On the one hand, 7 of the VMs are deployed on the facilities of the University of Granada (Spain): 6 named as i2pProjectM[1-6], plus the i2pProjectBBDD machine. Every VM executes an I2P router instance for accessing the I2P darknet, except for the i2pProjectBBDD machine. Machines i2pProjectM[1-6] were set up as floodfill I2P routers, while the rest, i2pProjectM[7-10], run as normal I2P routers. Floodfill routers, as mentioned in Section III-A, allow the discovery of additional eepsites since they are in charge of maintaining and managing the NetDB. ...
Preprint
Full-text available
The Web is a primary and essential service for sharing information among users and organizations all over the world. Despite the current significance of this kind of traffic on the Internet, the so-called Surface Web has been estimated at just about 5% of the total. The rest of the volume corresponds to the portion of the Web known as the Deep Web. These contents are not accessible by search engines because they are authentication-protected contents or pages that are only reachable through the so-called darknets. To browse darknet websites, special authorization or specific software and configurations are needed. Although Tor is the most used darknet nowadays, there are other alternatives such as I2P or Freenet, which offer different features for end users. In this work, we perform an analysis of the connectivity of websites in the I2P network (named eepsites), aimed at discovering whether different patterns and relationships from those of the legacy web are followed in I2P, and also at gaining insights into its dimension and structure. For that, a novel tool is specifically developed by the authors and deployed in a distributed scenario. The main results confirm the decentralized nature of the I2P network, where there is a structural part of interconnected eepsites while several other nodes are isolated, probably due to their intermittent presence in the network.
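To make the kind of connectivity analysis summarized above concrete, the following is a minimal sketch (not the authors' tool) of an eepsite link-graph crawl. It assumes a locally running I2P router exposing the default HTTP proxy on 127.0.0.1:4444; the seed eepsite, the page limit, and the regular expression are illustrative placeholders, and a real crawl would need politeness delays, retries, and persistent storage.

```python
# Minimal sketch of an eepsite connectivity analysis, not the authors' tool.
# Assumes a local I2P router exposing an HTTP proxy on 127.0.0.1:4444 and a
# hypothetical seed list; real crawls need politeness, retries, and storage.
import re
from collections import deque

import networkx as nx
import requests

I2P_PROXY = {"http": "http://127.0.0.1:4444"}   # default I2P HTTP proxy port
SEEDS = ["http://identiguy.i2p/"]               # illustrative seed eepsite
LINK_RE = re.compile(r'href="(http://[^"/]+\.i2p)[^"]*"', re.IGNORECASE)

def crawl_eepsites(seeds, max_sites=50):
    """Breadth-first crawl that records which eepsite links to which."""
    graph, queue, seen = nx.DiGraph(), deque(seeds), set(seeds)
    while queue and len(seen) <= max_sites:
        url = queue.popleft()
        try:
            html = requests.get(url, proxies=I2P_PROXY, timeout=60).text
        except requests.RequestException:
            continue                              # unreachable / intermittent eepsite
        for target in set(LINK_RE.findall(html)):
            graph.add_edge(url.rstrip("/"), target)
            if target not in seen:
                seen.add(target)
                queue.append(target + "/")
    return graph

if __name__ == "__main__":
    g = crawl_eepsites(SEEDS)
    # Isolated or weakly connected eepsites hint at the intermittent presence
    # reported above; hubs appear as high in-degree nodes.
    print(nx.number_weakly_connected_components(g), "weak components")
    print(sorted(g.in_degree, key=lambda kv: kv[1], reverse=True)[:5])
```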
... propose OnionCrawler and ATOL: two automated systems which can crawl and classify dark web content with a novel keyword weighting scheme for high classification accuracy. Here, Kalpakis et al. [15] detail the current status of darknets, including Tor, I2P, and Freenet, stressing their dual role in hosting lawful and unlawful activities. The chapter elaborates on the problems that Law Enforcement Agencies face while gathering Open Source Intelligence on the Dark Web. ...
Article
Full-text available
The increasing prevalence of DarkNet traffic poses significant challenges for network security. Despite improvements in machine learning techniques, most of the existing studies have not applied appropriate ensemble voting models on newer datasets like CIC-Darknet 2020. Some noteworthy works include methodologies that use CNN with K-Means for the classification of zero-day applications with very high accuracy, or approaches using GAN for data augmentation and improvement of accuracy and training efficiency. Such techniques, however, are in most cases associated with low model interpretability and high computational complexity. This paper presents a study of a Voting Classifier that combines Random Forest and Gradient Boosting to improve predictive accuracy in a classification task. The research is conducted on a broad dataset with several features, where feature selection is applied to get the best input for the chosen models. The experimental results indicate that the Voting Classifier performs far better than any single classifier, with an accuracy of 99.90%, precision of 99.99%, recall of 99.45%, and an F1 score of 99.72%. This clearly indicates the strength of ensemble methods in handling a diverse set of patterns and improving classification ability, an important lesson for further research on machine learning models.
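As an illustration of the ensemble described above, here is a brief sketch of a soft-voting classifier that combines Random Forest and Gradient Boosting behind a feature-selection step. The hyperparameters and the synthetic stand-in data are assumptions for demonstration only; they are not the paper's CIC-Darknet 2020 configuration.

```python
# Sketch of an ensemble Voting Classifier (Random Forest + Gradient Boosting)
# in the spirit of the study; hyperparameters and the synthetic stand-in data
# are illustrative, not the paper's CIC-Darknet 2020 setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic traffic-like feature matrix standing in for flow statistics.
X, y = make_classification(n_samples=5000, n_features=60, n_informative=20,
                           n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=25)),          # keep the most predictive features
    ("vote", VotingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        voting="soft")),                                # average class probabilities
])

model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```

Soft voting averages the two models' class probabilities, which is one common way such an ensemble can outperform either base learner alone.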
... Circuit reconstruction is a complex attack in which an attacker loads compromised nodes into the network, which can then eavesdrop on traffic at the node level [7]. When a victim's traffic is routed through such a compromised node, their anonymity is broken. ...
... [8] Anonymity makes the Dark Web an ideal environment for transferring information, goods, and services with potentially illegal intentions; therefore, LEAs are very interested in gathering OSINT on the Dark Web [4]. However, OSINT differs quite significantly depending on whether the intelligence is gathered on the Surface Web or the Dark Web. The actor conducting the reconnaissance also determines how well the reconnaissance prepares for the operation. ...
... Content from the Deep Web cannot be accessed in the usual ways, e.g., by querying search engines, mainly because it belongs to private parts of websites or requires specific tools for access. The latter belongs to a specific sub-part of the Deep Web, known as the Dark Web, which is supported by darknets [2,3]. ...
Article
The World Wide Web is the most widely used service on the Internet, although only a small part of it, the Surface Web, is indexed and accessible. The rest of the content, the Deep Web, is split between content that cannot be indexed by usual search engines and content that needs to be accessed through specific methods and techniques. The latter is deployed in the so-called darknets, which preserve anonymity and privacy services and have been the subject of much less study. Although there are several darknets, Tor is the most well-known and widely analyzed. Hence, the current work presents an analysis of website connectivity, relationships and content of one of the less known and explored darknets: Freenet. Given the special features of this study, a new crawling tool, called c4darknet, was developed for the purpose of this work. This tool is, in turn, used in the experimentation that was carried out in a real distributed environment. Our results can be summarized as follows: there is great general availability of websites on Freenet; there are significant nodes within the network connectivity structure; and underage porn or child pornography is predominant among illegal content. Finally, the outcomes are compared against a similar study for the I2P darknet, showing special features and differences between both darknets.
... There is currently a large volume of worthwhile open source data to be analyzed, correlated and linked [32]. This includes social networks, public government documents and reports, online multimedia content, newspapers and even the Deep Web and the Dark Web [33], among others. In fact, both the Deep Web and the Dark Web (the latter circumscribed within the former) contain even more information than the Surface Web (i.e., the Internet known by most users) [34]. ...
Article
Full-text available
The amount of data generated by the current interconnected world is immeasurable, and a large part of such data is publicly available, which means that it is accessible by any user, at any time, from anywhere on the Internet. In this respect, Open Source Intelligence (OSINT) is a type of intelligence that actually benefits from that open nature by collecting, processing and correlating points of the whole cyberspace to generate knowledge. In fact, recent advances in technology are causing OSINT to evolve at a dizzying rate, providing innovative data-driven and AI-powered applications for politics, economy or society, but also offering new lines of action against cyberthreats and cybercrime. The paper at hand describes the current state of OSINT and makes a comprehensive review of the paradigm, focusing on the services and techniques enhancing the cybersecurity field. On the one hand, we analyze the strong points of this methodology and propose numerous ways to apply it to cybersecurity. On the other hand, we cover the limitations of adopting it. Considering there is a lot left to explore in this ample field, we also enumerate some open challenges to be addressed in the future. Additionally, we study the role of OSINT in the public sphere of governments, which constitutes an ideal landscape to exploit open data.
... • Huge amount of worthwhile open source data to be analyzed, crossed and linked [23]. It includes social networks, public government documents and reports, online multimedia content, and even the Deep Web and the Dark Web [13]. Both the Deep Web and the Dark Web (the latter circumscribed within the former) contain even more information than the Surface Web (the Internet known by most users). ...
Conference Paper
Full-text available
Phenomena like Social Networks, Cloud Computing or the Internet of Things are unknowingly generating unimaginable quantities of data. In this context, Open Source Intelligence (OSINT) exploits such information to extract knowledge that is not easily appreciable beforehand by the human eye. Apart from the political, economic or social applications OSINT may bring, there are also serious global concerns that could be covered by this paradigm, such as cybercrime and cyberthreats. The paper at hand presents the current state of OSINT, the opportunities and limitations it poses, and the challenges to be faced in the future. Furthermore, we particularly study Spain as a potential beneficiary of this powerful methodology.
Article
Full-text available
The Dark Web has turned into a platform for a variety of criminal activities, including weapon trafficking, pornography, fake documents, drug trafficking, and, most notably, terrorism, as detailed in this study. This article uses an LDA-based topic modeling approach to identify the topics addressed in discussions on the Dark Web. The main purpose is to present an overview of jihadists’ communication in cyberspace for the detection of unusual behavior or terrorism-related purposes. According to the findings, conversations in the context of recruitment and propaganda predominated at the forum. There was no direct evidence of terrorist collaboration at the conclusion of the investigation. This does not, however, imply that these sites are risk-free. Propaganda and recruitment tools feed terrorist activities.
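For readers unfamiliar with LDA-based topic modeling of the kind used above, the following is a minimal sketch. The tiny toy corpus and the number of topics are assumptions standing in for the (unavailable) forum data.

```python
# Minimal LDA topic-modelling sketch of the kind of analysis described above;
# the toy corpus below stands in for the (unavailable) forum posts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "join our cause and share the message with new recruits",
    "new video released spread the propaganda widely",
    "donation channels and crowdfunding links posted again",
    "discussion about encrypted messaging and safe channels",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(posts)                       # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {idx}: {', '.join(top)}")          # e.g. recruitment vs. propaganda themes
```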
Article
This paper maps communication channels exploited by the Salafi-jihadist violent extremist organisations (VEOs) and their followers between March 2020 and June 2022 on The Onion Router (TOR). It argues that the true scale of digital jihadist presence on TOR has remained insignificant for years. Militant Islamists have mostly used .onion domains as backup propaganda dissemination channels, which enable content takedown policies introduced by countering violent extremism stakeholders to be circumvented. Aside from propaganda distribution, TOR attracts Salafi-jihadist VEOs and their followers for other reasons, as it facilitates anonymous communication, crowdfunding, sharing of terrorist manuals or the organisation of terrorist attacks.
Article
Full-text available
In this contemporary era, where a large part of the world population is deluged by extensive use of the internet and social media, terrorists have found it a potential opportunity to execute their vicious plans. They have got a befitting medium to reach out to their targets to spread propaganda, disseminate training content, operate virtually, and further their goals. To restrain such activities, information over the internet in context of terrorism needs to be analyzed to channel it to appropriate measures in combating terrorism. Open Source Intelligence (OSINT) accounts for a felicitous solution to this problem, which is an emerging discipline of leveraging publicly accessible sources of information over the internet by effectively utilizing it to extract intelligence. The process of OSINT extraction is broadly observed to be in three phases: (i) Data Acquisition, (ii) Data Enrichment, and (iii) Knowledge Inference. In the context of terrorism, researchers have given noticeable contributions in compliance with these three phases. However, a comprehensive review that delineates these research contributions into an integrated workflow of intelligence extraction has not been found. The paper presents the most current review in OSINT, reflecting how the various state-of-the-art tools and techniques can be applied in extracting terrorism-related textual information from publicly accessible sources. Various data mining and text analysis-based techniques, that is, natural language processing, machine learning, and deep learning have been reviewed to extract and evaluate textual data. Additionally, towards the end of the paper, we discuss challenges and gaps observed in different phases of OSINT extraction. This article is categorized under: Application Areas > Government and Public Sector; Commercial, Legal, and Ethical Issues > Social Considerations; Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining.
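To ground the three-phase workflow mentioned above, here is a very small illustrative pipeline. The source URL, watchlist terms, and threshold are hypothetical placeholders, not tools surveyed in the review; real enrichment would use NLP (entity extraction, classification) rather than keyword counting.

```python
# Illustrative three-phase OSINT text pipeline (acquisition, enrichment,
# inference) mirroring the workflow above; URLs and keyword lists are
# hypothetical placeholders, not the tools surveyed in the review.
import re
from collections import Counter

import requests

SOURCES = ["https://example.org/public-report.html"]    # hypothetical open sources
WATCHLIST = {"explosive", "recruitment", "attack", "propaganda"}

def acquire(urls):
    """Phase 1: fetch raw documents from publicly accessible sources."""
    for url in urls:
        try:
            yield url, requests.get(url, timeout=30).text
        except requests.RequestException:
            continue

def enrich(text):
    """Phase 2: very light enrichment - tokenise and count watchlist terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in WATCHLIST)

def infer(doc_scores, threshold=3):
    """Phase 3: flag documents whose watchlist hits exceed a simple threshold."""
    return [url for url, counts in doc_scores if sum(counts.values()) >= threshold]

if __name__ == "__main__":
    scored = [(url, enrich(text)) for url, text in acquire(SOURCES)]
    print("flagged:", infer(scored))
```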
Article
Full-text available
The dark web is an umbrella concept that denotes any kind of illicit activity carried out by anonymous persons or organizations, thereby making it difficult to trace. The illicit content on the dark web is constantly updated and changed. The collection and classification of such illegal activities are challenging tasks, as they are difficult and time-consuming. This problem has in recent times emerged as an issue that requires quick attention from both industry and academia. To this end, in this article a crawler is proposed that is capable of collecting dark web pages, cleaning them, and saving them in a document database. The crawler carries out an automatic classification of the gathered web pages into five classes. The classifiers used in classifying the pages include a Linear Support Vector Classifier (SVC) and Naïve Bayes (NB), using Term Frequency-Inverse Document Frequency (TF-IDF) features. The experimental results revealed that accuracy rates of 92% and 81% were achieved by SVC and NB, respectively.
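The classification stage described above can be sketched as follows: TF-IDF features feeding a Linear SVC and a Naive Bayes baseline. The tiny labelled page set and class names are invented for illustration; the authors' five-class data is not reproduced here.

```python
# Sketch of the classification stage described above: TF-IDF features feeding
# a Linear SVC and a Naive Bayes baseline. The tiny labelled set is invented
# for illustration; the authors' five-class data is not available here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pages = [
    "buy passports and fake identity documents here",
    "cheap pharmaceuticals shipped discreetly worldwide",
    "forum rules and general discussion board",
    "bitcoin escrow marketplace listings and vendor reviews",
]
labels = ["documents", "drugs", "forum", "market"]      # hypothetical classes

svc = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(pages, labels)
nb = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(pages, labels)

sample = ["new vendor selling counterfeit documents"]
print("SVC:", svc.predict(sample)[0], "| NB:", nb.predict(sample)[0])
```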
Article
Full-text available
In order to efficiently manage and operate industrial-level production, an increasing number of industrial devices and critical infrastructure (CI) are now connected to the internet, exposed to malicious hackers and cyberterrorists who aim to cause significant damage to institutions and countries. Throughout the various stages of a cyber-attack, Open-source Intelligence (OSINT) tools could gather data from various publicly available platforms, and thus help hackers identify vulnerabilities and develop malware and attack strategies against targeted CI sectors. The purpose of the current study is to explore and identify the types of OSINT data that are useful for malicious individuals intending to conduct cyber-attacks against the CI industry. Applying and searching keyword queries in four open-source surface web platforms (Google, YouTube, Reddit, and Shodan), search results published between 2015 and 2020 were reviewed and qualitatively analyzed to categorize CI information that could be useful to hackers. Over 4000 results were analyzed from the open-source websites, 250 of which were found to provide information related to hacking and/or cybersecurity of CI facilities to malicious actors. Using thematic content analysis, we identified three major types of data malicious attackers could retrieve using OSINT tools: indirect reconnaissance data, proof-of-concept codes, and educational materials. The thematic results from this study reveal an increasing amount of open-source information useful for malicious attackers against industrial devices, as well as the need for programs, training, and policies required to protect and secure industrial systems and CI.
Article
Full-text available
Crime, terrorism, and other illegal activities are increasingly taking place in cyberspace. Crime on the dark web is one of the most serious challenges confronting governments around the world. The dark web makes it difficult to detect criminals and track activities, as it provides anonymity due to special tools such as TOR. Therefore, it has evolved into a platform that includes many illegal activities such as pornography, weapon trafficking, drug trafficking, fake documents, and, most especially, terrorism, as in the context of this paper. Dark web studies are critical for designing successful counter-terrorism strategies. The aim of this research is to conduct a critical analysis of the literature and to demonstrate research efforts in dark web studies related to terrorism. According to the results of the study, scientific studies related to terrorist activities on the dark web have been conducted only minimally, and the scientific methods used to detect and combat them should be varied. Advanced artificial intelligence, image processing and classification using machine learning, natural language processing methods, hash value analysis, and sock puppet techniques can be used to detect and predict terrorist incidents on the dark web.
Article
The Web is a primary and essential service for sharing information among users and organizations all over the world. Despite the current significance of this kind of traffic on the Internet, the so-called Surface Web has been estimated at just about 5% of the total. The rest of the volume corresponds to the portion of the Web known as the Deep Web. These contents are not accessible by usual search engines because they are authentication-protected contents or pages only reachable through technologies denoted as darknets. To browse darknet websites, special authorization or specific software and configurations are needed. Tor is one of the most used darknets nowadays, but there are several other alternatives such as I2P or Freenet, which offer different features for end users. In this work, we perform a connectivity analysis of the websites in the I2P network (named eepsites), aimed at discovering whether different patterns and relationships from those of the legacy web are followed in I2P, as well as at gaining insights into the dimension and structure of this darknet. For that, a novel tool is specifically developed by the authors and deployed in a distributed scenario. The main results confirm the decentralized nature of the I2P network, where there is a structural part of interconnected eepsites while several other nodes are isolated, probably due to their intermittent presence in the network.
Article
Full-text available
In this paper, the authors examine the adequacy of the counter-terrorism concept, which does not envisage institutional responsibility for collecting, processing, and preserving traces of cyber-related terrorist activities. The starting point is the fact that today numerous human activities and communications take place in cyberspace. Firstly, the focus is on the aspects of terrorism that present a generator of challenges to social stability and, in this context, the elements of the approach adopted by the current National Security Strategy of the Republic of Serbia. In this analysis, adequacy is evaluated from the point of view of functionality. In this sense, it is an attempt to present elements that influence the effectiveness of counter-terrorism in the information age. Related to this is the specification of the role that digital forensics can play in this area. The conclusion is that an effective counter-terrorism strategy must necessarily encompass the institutional incorporation of digital forensics, since it alone can contribute to the timely detection or assertion of responsibility for terrorism in a networked computing environment.
Conference Paper
Full-text available
We perform a comprehensive measurement analysis of Silk Road, an anonymous, international online marketplace that operates as a Tor hidden service and uses Bitcoin as its exchange currency. We gather and analyze data over eight months between the end of 2011 and 2012, including daily crawls of the marketplace for nearly six months in 2012. We obtain a detailed picture of the type of goods sold on Silk Road, and of the revenues made both by sellers and Silk Road operators. Through examining over 24,400 separate items sold on the site, we show that Silk Road is overwhelmingly used as a market for controlled substances and narcotics, and that most items sold are available for less than three weeks. The majority of sellers disappears within roughly three months of their arrival, but a core of 112 sellers has been present throughout our measurement interval. We evaluate the total revenue made by all sellers, from public listings, to slightly over USD 1.2 million per month; this corresponds to about USD 92,000 per month in commissions for the Silk Road operators. We further show that the marketplace has been operating steadily, with daily sales and number of sellers overall increasing over our measurement interval. We discuss economic and policy implications of our analysis and results, including ethical considerations for future research in this area.
Conference Paper
Full-text available
Tor hidden services allow running Internet services while protecting the location of the servers. Their main purpose is to enable freedom of speech even in situations in which powerful adversaries try to suppress it. However, providing location privacy and client anonymity also makes Tor hidden services an attractive platform for every kind of imaginable shady service. The ease with which Tor hidden services can be set up has spurred a huge growth of anonymously provided Internet services of both types. In this paper we analyse the landscape of Tor hidden services. We have studied Tor hidden services after collecting 39824 hidden service descriptors on 4th of Feb 2013 by exploiting protocol and implementation flaws in Tor: we scanned them for open ports; in the case of HTTP services, we analysed and classified their content. We also estimated the popularity of hidden services by looking at the request rate for hidden service descriptors by clients. We found that while the content of Tor hidden services is rather varied, the most popular hidden services are related to botnets.
Article
Full-text available
Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple. Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it.
Article
Full-text available
We perform a comprehensive measurement analysis of Silk Road, an anonymous, international online marketplace that operates as a Tor hidden service and uses Bitcoin as its exchange currency. We gather and analyze data over eight months between the end of 2011 and 2012, including daily crawls of the marketplace for nearly six months in 2012. We obtain a detailed picture of the type of goods being sold on Silk Road, and of the revenues made both by sellers and Silk Road operators. Through examining over 24,400 separate items sold on the site, we show that Silk Road is overwhelmingly used as a market for controlled substances and narcotics, and that most items sold are available for less than three weeks. The majority of sellers disappears within roughly three months of their arrival, but a core of 112 sellers has been present throughout our measurement interval. We evaluate the total revenue made by all sellers, from public listings, to slightly over USD 1.2 million per month; this corresponds to about USD 92,000 per month in commissions for the Silk Road operators. We further show that the marketplace has been operating steadily, with daily sales and number of sellers overall increasing over our measurement interval. We discuss economic and policy implications of our analysis and results, including ethical considerations for future research in this area.
Article
Full-text available
Freenet is a distributed, Internet-wide peer-to-peer overlay network designed to allow anonymized and censorship-resistant publication and distribution of information. The system functions as a distributed hash table, where participating computers store information which can be retrieved by any other participant. In this paper we describe a new architecture for the Freenet system, designed to strengthen the privacy of users and to protect participating nodes from attack in situations where just running a node in the network could be a liability. While in previous incarnations of Freenet the edges of the overlay network were decided algorithmically in a manner intended to optimize routing in the network, in the new version we allow nodes to restrict their connections to nodes that they trust. In this type of network (sometimes called a darknet, or a friend-to-friend network) every participant communicates directly only with his own trusted peers and thus, under some assumptions, reveals his identity only to peers he already trusts. While everybody has to trust somebody in the network, there is no central party whom everybody must trust. We allow participants either to stay hidden in a darknet or to take part in a hybrid model with direct connections also to strangers. The novel algorithm we use for routing in the fixed mesh of trusted connections has been previously described in [29]. This paper focuses on the practical aspects of deploying a functioning data publication system based on the trusted connection model: we describe the system we have settled on and simulate it under realistic conditions.
Conference Paper
Full-text available
This paper analyzes the web browsing behaviour of Tor users. By collecting HTTP requests we show which websites are of interest to Tor users and we determined an upper bound on how vulnerable Tor users are to sophisticated de-anonymization attacks: up to 78% of Tor users do not use Tor as suggested by the Tor community, namely to browse the web with TorButton. They could thus fall victim to de-anonymization attacks by merely browsing the web. Around 1% of the requests could be used by an adversary for exploit piggybacking on vulnerable file formats. Another 7% of all requests were generated by social networking sites which leak plenty of sensitive and identifying information. Due to the design of HTTP and Tor, we argue that HTTPS is currently the only effective countermeasure against de-anonymization and information leakage for HTTP over Tor.
Article
Full-text available
Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
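The best-first search over the Web graph described above can be sketched with a priority-queue frontier. This is a sequential, simplified sketch rather than the parallel crawlers evaluated in the study: the keyword-based relevance scorer stands in for a trained classifier (SVM, Neural Network, or Naive Bayes), and the seed URL and topic terms are illustrative assumptions.

```python
# Sequential sketch of classifier-guided best-first crawling as described
# above: the frontier is a priority queue ordered by a relevance score that a
# trained classifier would normally supply. The scorer and seeds below are
# illustrative stand-ins.
import heapq
import re

import requests

SEEDS = ["https://example.org/"]                      # hypothetical seed pages
TOPIC_TERMS = {"anonymity", "darknet", "privacy", "onion"}

def score(text):
    """Stand-in for a classifier's probability that a page is on-topic."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in TOPIC_TERMS for t in tokens) / (len(tokens) + 1)

def best_first_crawl(seeds, max_pages=20):
    frontier = [(-1.0, url) for url in seeds]         # max-heap via negated scores
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue
        relevance = score(html)
        if relevance > 0.01:
            relevant.append((relevance, url))
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in visited:
                # Priority of an unvisited link is inherited from its parent page.
                heapq.heappush(frontier, (-relevance, link))
    return sorted(relevant, reverse=True)

if __name__ == "__main__":
    for rel, url in best_first_crawl(SEEDS):
        print(f"{rel:.3f}  {url}")
```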
Article
Full-text available
Anonymity systems such as Tor aim to enable users to communicate in a manner that is untraceable by adversaries that control a small number of machines. To provide efficient service to users, these anonymity systems make full use of forwarding capacity when sending traffic between intermediate relays. In this paper, we show that doing this leaks information about the set of Tor relays in a circuit (path). We present attacks that, with high confidence and based solely on throughput information, can (a) reduce the attacker's uncertainty about the bottleneck relay of any Tor circuit whose throughput can be observed, (b) exactly identify the guard relay(s) of a Tor user when circuit throughput can be observed over multiple connections, and (c) identify whether two concurrent TCP connections belong to the same Tor user, breaking unlinkability. Our attacks are stealthy, and cannot be readily detected by a user or by Tor relays. We validate our attacks using experiments over the live Tor network. We find that the attacker can substantially reduce the entropy of a bottleneck relay distribution of a Tor circuit whose throughput can be observed-the entropy gets reduced by a factor of 2 in the median case. Such information leaks from a single Tor circuit can be combined over multiple connections to exactly identify a user's guard relay(s). Finally, we are also able to link two connections from the same initiator with a crossover error rate of less than 1.5% in under 5 minutes. Our attacks are also more accurate and require fewer resources than previous attacks on Tor.
Article
Full-text available
We present Tor, a circuit-based low-latency anonymous communication service. This second-generation Onion Routing system addresses limitations in the original design by adding perfect forward secrecy, congestion control, directory servers, integrity checking, configurable exit policies, and a practical design for location-hidden services via rendezvous points. Tor works on the real-world Internet, requires no special privileges or kernel modifications, requires little synchronization or coordination between nodes, and provides a reasonable tradeoff between anonymity, usability, and efficiency. We briefly describe our experiences with an international network of more than 30 nodes. We close with a list of open problems in anonymous communication.
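The layered encryption behind circuit-based onion routing can be illustrated with a toy example. This sketch only shows the layering principle with symmetric Fernet keys shared out of band; real Tor negotiates circuit keys interactively, uses its own cell format, and adds the protections listed in the abstract, so nothing here should be read as Tor's actual protocol.

```python
# Toy illustration of the layered ("onion") encryption idea behind circuit
# construction: each relay strips exactly one layer. Real Tor negotiates
# circuit keys interactively and uses its own cell format; this sketch only
# shows the layering principle with symmetric Fernet keys.
from cryptography.fernet import Fernet

# One symmetric key per hop: guard -> middle -> exit.
hop_keys = [Fernet.generate_key() for _ in range(3)]
hops = [Fernet(k) for k in hop_keys]

def wrap(payload: bytes) -> bytes:
    """Client side: encrypt for the exit first, then middle, then guard."""
    for hop in reversed(hops):
        payload = hop.encrypt(payload)
    return payload

def relay(payload: bytes) -> bytes:
    """Network side: each hop peels its own layer in circuit order."""
    for hop in hops:
        payload = hop.decrypt(payload)
    return payload

cell = wrap(b"GET http://example.onion/ HTTP/1.0")
assert relay(cell) == b"GET http://example.onion/ HTTP/1.0"
print("layers peeled in order: guard, middle, exit")
```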
Article
Freenet is a popular peer to peer anonymous content-sharing network, with the objective to provide the anonymity of both content publishers and retrievers. Despite more than a decade of active development and deployment and the adoption of well-established cryptographic algorithms in Freenet, it remains unanswered how well the anonymity objective of the initial Freenet design has been met. In this paper we develop a traceback attack on Freenet, and show that the originating machine of a content request message in Freenet can be identified; that is, the anonymity of a content retriever can be broken, even if a single request message has been issued by the retriever. We present the design of the traceback attack, and perform both experimental and simulation studies to confirm the feasibility and effectiveness of the attack. For example, with randomly chosen content requesters (and random contents stored in the) Freenet testbed, Emulab-based experiments show that, for 24% to 43% of the content request messages, we can identify their originating machines. We also briefly discuss potential solutions to address the developed traceback attack. Despite being developed specifically on Freenet, the basic principles of the traceback attack and solutions have important security implications for similar anonymous content-sharing systems.
Conference Paper
This work investigates the effectiveness of a novel interactive search engine in the context of discovering and retrieving Web resources containing recipes for synthesizing Home Made Explosives (HMEs). The discovery of HME Web resources both on Surface and Dark Web is addressed as a domain-specific search problem; the architecture of the search engine is based on a hybrid infrastructure that combines two different approaches: (i) a Web crawler focused on the HME domain; (ii) the submission of HME domain-specific queries to general-purpose search engines. Both approaches are accompanied by a user-initiated post-processing classification for reducing the potential noise in the discovery results. The design of the application is built based on the distinctive nature of law enforcement agency user requirements, which dictate the interactive discovery and the accurate filtering of Web resources containing HME recipes. The experiments evaluating the effectiveness of our application demonstrate its satisfactory performance, which in turn indicates the significant potential of the adopted approaches on the HME domain.
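The hybrid architecture summarized above (focused crawling plus domain-specific queries to general-purpose search engines, followed by user-initiated classification) can be outlined as a simple merge-and-filter step. Both discovery functions and the relevance test below are hypothetical stubs, shown only to make the combination of the two discovery channels explicit; they are not the system's actual components.

```python
# Sketch of the hybrid discovery idea described above: candidate pages coming
# from a focused crawler and from domain-specific queries to general-purpose
# search engines are merged, then filtered by a relevance classifier. The two
# discovery functions are hypothetical stubs; only the merge/filter logic is shown.
from typing import Iterable, List

def crawl_candidates() -> Iterable[str]:
    """Stub standing in for the focused crawler's output."""
    return ["http://forum.example/thread/123", "http://blog.example/recipe-page"]

def query_candidates(queries: List[str]) -> Iterable[str]:
    """Stub standing in for domain-specific queries sent to search engines."""
    return ["http://blog.example/recipe-page", "http://news.example/report"]

def looks_relevant(url: str) -> bool:
    """Stand-in for the post-processing classifier the users trigger."""
    return "recipe" in url or "forum" in url

def discover(queries: List[str]) -> List[str]:
    merged = set(crawl_candidates()) | set(query_candidates(queries))
    return sorted(u for u in merged if looks_relevant(u))

print(discover(["home made explosive recipe forum"]))
```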
Article
Encryption policy is becoming a crucial test of the values of liberal democracy in the twenty-first century.
Article
We created the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) in 2014 as part of the Yahoo Webscope program, which is a reference library of interesting and scientifically useful datasets. The YFCC100M is the largest public multimedia collection ever released, with a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all uploaded to Flickr between 2004 and 2014 and published under a CC commercial or noncommercial license. The dataset is distributed through Amazon Web Services as a 12.5GB compressed archive containing only metadata. However, as with many datasets, the YFCC100M is constantly evolving; over time, we have released and will continue to release various expansion packs containing data not yet in the collection; for instance, the actual photos and videos, as well as several visual and aural features extracted from the data, have already been uploaded to the cloud, ensuring the dataset remains accessible and intact for years to come. The YFCC100M dataset overcomes many of the issues affecting existing multimedia datasets in terms of modalities, metadata, licensing, and, principally, volume.
Article
We lead a double life. We have relationships that enrich our lives, events we participate in, and places we go that become a part of who we are. We also have online friendships, which enrich our lives and make up part of who we are. There are points of intersection between our two lives: events we arrange online that take place offline, places that we have been (offline) that we photograph, tag, and share with our online social community. The amount of data being generated by people online, who share information about their offline lives, is unprecedented and growing. We are a lucky few. We are the "technorati": the people who have access to technology, and the knowledge to use it. We are wealthy, well-travelled, well-educated, and proud owners of a plethora of gadgets that allow us to weave a seamless tapestry between our online and offline lives. We are not most people. The data we generate does not represent most people.
Article
Communication privacy has been a growing concern, particularly with the Internet becoming a major hub of our daily interactions. Revelations of government tracking and corporate profiling have resulted in increasing interest in anonymous communication mechanisms. Several systems have been developed with the aim of preserving communication privacy via unlinkability within a public network environment such as Tor and I2P. As the anonymity networks cannot guarantee perfect anonymity, it is important for users to understand the risks they might face when utilizing such technologies. In this paper, we discuss potential attacks on the anonymity networks that can compromise user identities and communication links. We also summarize protection mechanisms against such attacks. Many attacks against anonymity networks are well studied, and most of the modern systems have built-in mechanisms to prevent these attacks. Additionally, some of the attacks require considerable resources to be effective and hence are very unlikely to succeed against modern anonymity networks.
Article
Instagram is a relatively new form of communication where users can instantly share their current status by taking pictures and tweaking them using filters. It has seen a rapid growth in the number of users as well as uploads since it was launched in October 2010. In spite of the fact that it is the most popular photo sharing application, it has attracted relatively little attention from the web and social media research community. In this paper, we present a large-scale quantitative analysis of millions of users and pictures we crawled over 1 month from Instagram. Our analysis reveals several insights into Instagram which were never studied before: 1) its social network properties are quite different from other popular social media like Twitter and Flickr, 2) people typically post once a week, and 3) people like to share their locations with friends. To the best of our knowledge, this is the first in-depth analysis of user activities, demographics, social network structure and user-generated content on Instagram.
Conference Paper
Focussed crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic based on evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, with visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed focussed crawling approach is applied towards the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate the effectiveness of incorporating visual evidence in the link selection process applied by the focussed crawler over the use of textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
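The link-scoring idea above, combining textual evidence from a hyperlink's local context with visual evidence from the parent page, can be reduced to a small weighted-sum sketch. The image detector is a stub and the weights and topic terms are illustrative assumptions, not the evaluated system.

```python
# Sketch of the link-scoring idea above: the relevance of a hyperlink is a
# combination of textual evidence from its local context (surrounding text)
# and visual evidence from its global context (topic-relevant images detected
# in the parent page). The image detector is a stub; weights are illustrative.
import re

TOPIC_TERMS = {"air", "quality", "forecast", "heatmap", "pollution"}

def text_score(anchor_context: str) -> float:
    tokens = re.findall(r"[a-z]+", anchor_context.lower())
    return sum(t in TOPIC_TERMS for t in tokens) / (len(tokens) + 1)

def visual_score(parent_image_urls) -> float:
    """Stub for an image classifier flagging topic-relevant images (e.g. heatmaps)."""
    return 1.0 if any("heatmap" in u for u in parent_image_urls) else 0.0

def link_priority(anchor_context: str, parent_image_urls, w_text=0.7, w_visual=0.3) -> float:
    return w_text * text_score(anchor_context) + w_visual * visual_score(parent_image_urls)

print(link_priority("hourly air quality forecast for the region",
                    ["http://example.org/maps/pm10-heatmap.png"]))
```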
Conference Paper
Freenet is a popular peer to peer anonymous network, with the objective to provide the anonymity of both content publishers and retrievers. Despite more than a decade of active development and deployment and the adoption of well-established cryptographic algorithms in Freenet, it remains unanswered how well the anonymity objective of the initial Freenet design has been met. In this paper we develop a traceback attack on Freenet, and show that the originating machine of a content request message in Freenet can be identified; that is, the anonymity of a content retriever can be broken, even if a single request message has been issued by the retriever. We present the design of the traceback attack, and perform Emulab-based experiments to confirm the feasibility and effectiveness of the attack. With randomly chosen content requesters (and random contents stored in the Freenet testbed), the experiments show that, for 24% to 43% of the content request messages, we can identify their originating machines. We also briefly discuss potential solutions to address the developed traceback attack. Despite being developed specifically on Freenet, the basic principles of the traceback attack and solutions have important security implications for similar anonymous content sharing systems.
Conference Paper
This talk will review the emerging research in Terrorism Informatics based on a web mining perspective. Recent progress in the internationally renowned Dark Web project will be reviewed, including: deep/dark web spidering (web sites, forums, Youtube, virtual worlds), web metrics analysis, dark network analysis, web-based authorship analysis, and sentiment and affect analysis for terrorism tracking. In collaboration with selected international terrorism research centers and intelligence agencies, the Dark Web project has generated one of the largest databases in the world about extremist/terrorist-generated Internet contents (web sites, forums, blogs, and multimedia documents). Dark Web research has received significant international press coverage, including: Associated Press, USA Today, The Economist, NSF Press, Washington Post, Fox News, BBC, PBS, Business Week, Discover magazine, WIRED magazine, Government Computing Week, Second German TV (ZDF), Toronto Star, and Arizona Daily Star, among others. For more Dark Web project information, please see: http://ai.eller.arizona.edu/research/terror/ .
Article
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
Article
The paradox of the Invisible Web is that it's easy to understand why it exists, but it's very hard to actually define in concrete, specific terms. In a nutshell, the Invisible Web consists of content that's been excluded from general-purpose search engines and Web directories such as Lycos and LookSmart, and yes, even Google. There's nothing inherently "invisible" about this content. But since this content is not easily located with the information-seeking tools used by most Web users, it's effectively invisible because it's so difficult to find unless you know exactly where to look. In this paper, we define the Invisible Web and delve into the reasons search engines can't "see" its content. We also discuss the four different "types" of invisibility, ranging from the "opaque" Web, which is relatively accessible to the searcher, to the truly invisible Web, which requires specialized finding aids to access effectively.
Owen G, Savage N (2015) The Tor dark net. Global Commission on Internet Governance
Paganini P (2015) PunkSPIDER, the crawler that scanned the Dark Web. Retrieved 27 Jul 2016, from http://securityaffairs.co/wordpress/37632/hacking/punkspider-scanned-tor.html
Kalpakis G, Tsikrika T, Iliou C, Mironidis T, Vrochidis S, Middleton J, Kompatsiaris I (2016) Interactive discovery and retrieval of web resources containing home made explosive recipes. In: International Conference on Human Aspects of Information Security, Privacy, and Trust
Olston C, Najork M (2010) Web crawling: foundations and trends in information retrieval
Bergman MK (2001) White paper: the deep web: surfacing hidden value. J Electron Pub 7(1)
Bartlett J (2014) The Dark Net. Random House, London