Conference Paper

A Framework for the Discovery, Analysis, and Retrieval of Multimedia Homemade Explosives Information on the Web


Abstract

This work proposes a novel framework that integrates diverse state-of-the-art technologies for the discovery, analysis, retrieval, and recommendation of heterogeneous Web resources containing multimedia information about homemade explosives (HMEs), with particular focus on HME recipe information. The framework corresponds to a knowledge management platform that enables interaction with HME information, and consists of three major components: (i) a discovery component that identifies HME resources on the Web, (ii) a content-based multimedia analysis component that detects HME-related concepts in multimedia content, and (iii) an indexing, retrieval, and recommendation component that processes the available HME information to enable its (semantic) search and the provision of similar information. The proposed framework is being developed in a user-driven manner, based on the requirements of law enforcement and security agency personnel, as well as HME domain experts. In addition, its development is guided by the characteristics of HME Web resources, as observed in an empirical study conducted by HME domain experts. Overall, this framework is envisaged to increase the operational effectiveness and efficiency of law enforcement and security agencies in their quest to keep citizens safe.
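To make the three-component architecture concrete, here is a minimal, hypothetical Python sketch of how such a platform might wire the components together; every class and method name below is an illustrative assumption, not the paper's actual API.

```python
# Hypothetical wiring of the three components described in the abstract.

class DiscoveryComponent:
    def discover(self, seeds):
        """Crawl the Web from seed URLs; return candidate HME resources."""
        return []  # stub

class MultimediaAnalysisComponent:
    def detect_concepts(self, resource):
        """Detect HME-related concepts in the resource's images/videos."""
        return []  # stub

class IndexRetrievalComponent:
    def index(self, resource, concepts):
        pass       # stub
    def search(self, query):
        return []  # stub

def run_platform(seeds, query):
    discovery = DiscoveryComponent()
    analysis = MultimediaAnalysisComponent()
    retrieval = IndexRetrievalComponent()
    for resource in discovery.discover(seeds):         # (i) discovery
        concepts = analysis.detect_concepts(resource)  # (ii) content analysis
        retrieval.index(resource, concepts)            # (iii) indexing
    return retrieval.search(query)                     # semantic search
```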


... With regard to research efforts related to discovering and analyzing HME Web content, a concept detection mechanism has been developed in the context of the HOMER project with the goal of automatically identifying the relevance of already discovered multimedia files (videos/images) to the HME domain [11]; however, this work has solely addressed the identification of HME-related objects in multimedia, rather than the discovery of such content on the Web. In addition, a Knowledge Management Platform for managing the discovery, analysis, and retrieval of HME-related content has been developed [12]; nevertheless, this effort has mainly addressed issues related to the architecture of the entire framework for HME knowledge management, rather than the discovery of HME-related Web content. Moreover, an interactive search engine for the discovery of HME-related information on the Web has been proposed [13]; however, this effort mainly deals with the interaction of the end users with the framework, rather than with the approaches implemented for the discovery of the HME information. ...
Article
Full-text available
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates 11 hyperlink selection methods, among them a novel strategy based on the dynamic linear combination of a link-based and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed focused crawler both for the Surface and the Dark Web.
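The dynamic linear combination of the two classifiers can be sketched as follows; the weighting function and the classifier interfaces are my assumptions, not the paper's exact formulation.

```python
def link_score(link_text_score: float, parent_page_score: float,
               local_evidence_strength: float) -> float:
    """Blend a link-based and a parent-page classifier score.

    local_evidence_strength in [0, 1] reflects how much topical text
    surrounds the hyperlink; the stronger it is, the more weight the
    link-based classifier receives. This weighting is illustrative.
    """
    w = local_evidence_strength
    return w * link_text_score + (1.0 - w) * parent_page_score

# Weak local evidence -> lean on the parent-page classifier.
print(link_score(link_text_score=0.2, parent_page_score=0.9,
                 local_evidence_strength=0.3))  # 0.69
```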
Conference Paper
This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly traverse the Surface Web and several darknets present in the Dark Web (i.e. Tor, I2P and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the network type. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed approach both for the Surface and the Dark Web.
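Adapting crawling behavior to the network type boils down to routing each request through the right client. A hedged sketch of destination-type detection follows; the proxy endpoints are common local defaults for Tor and I2P (Freenet, which is accessed through its own FProxy interface, is omitted for simplicity).

```python
from urllib.parse import urlparse

def network_type(url: str) -> str:
    """Classify a URL by destination network, as a hybrid crawler must."""
    host = urlparse(url).hostname or ""
    if host.endswith(".onion"):
        return "tor"
    if host.endswith(".i2p"):
        return "i2p"
    return "surface"

# Default local proxy endpoints (assumptions based on common setups).
PROXIES = {
    "tor": {"http": "socks5h://127.0.0.1:9050",
            "https": "socks5h://127.0.0.1:9050"},
    "i2p": {"http": "http://127.0.0.1:4444"},
    "surface": None,  # direct connection
}

def proxies_for(url: str):
    return PROXIES[network_type(url)]
```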
Conference Paper
This work investigates the effectiveness of a novel interactive search engine in the context of discovering and retrieving Web resources containing recipes for synthesizing Home Made Explosives (HMEs). The discovery of HME Web resources on both the Surface and the Dark Web is addressed as a domain-specific search problem; the architecture of the search engine is based on a hybrid infrastructure that combines two different approaches: (i) a Web crawler focused on the HME domain; (ii) the submission of HME domain-specific queries to general-purpose search engines. Both approaches are accompanied by a user-initiated post-processing classification for reducing the potential noise in the discovery results. The design of the application is based on the distinctive nature of law enforcement agency user requirements, which dictate the interactive discovery and the accurate filtering of Web resources containing HME recipes. The experiments evaluating the effectiveness of our application demonstrate its satisfactory performance, which in turn indicates the significant potential of the adopted approaches in the HME domain.
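The hybrid infrastructure can be read as two result streams, crawler hits and search-engine hits, merged and then filtered by the post-processing classification. A minimal sketch under assumed interfaces (the `classify` scorer and threshold are illustrative):

```python
def discover_hme_resources(crawler_results, search_engine_results,
                           classify, threshold=0.5):
    """Merge the two discovery streams and filter noise with a
    user-initiated classification step. `classify(url) -> float` is an
    assumed relevance scorer standing in for the paper's classifier."""
    candidates = set(crawler_results) | set(search_engine_results)
    return sorted(url for url in candidates if classify(url) >= threshold)
```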
Conference Paper
Full-text available
This paper introduces an algorithm for fast temporal segmentation of videos into shots. The proposed method detects abrupt and gradual transitions, based on the visual similarity of neighboring frames of the video. The descriptive efficiency of both local (SURF) and global (HSV histograms) descriptors is exploited for assessing frame similarity, while GPU-based processing is used for accelerating the analysis. Specifically, abrupt transitions are initially detected between successive video frames where there is a sharp change in the visual content, which is expressed by a very low similarity score. Then, the calculated scores are further analysed for the identification of frame sequences where a progressive change of the visual content takes place and, in this way, gradual transitions are detected. Finally, a post-processing step is performed aiming to identify outliers due to object/camera movement and flashlights. The experiments show that the proposed algorithm achieves high accuracy while being capable of faster-than-real-time analysis.
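The abrupt-transition step amounts to flagging successive frames whose similarity drops sharply. A simplified OpenCV sketch of the global (HSV histogram) cue follows; the binning, similarity measure, and threshold are illustrative choices, not the paper's tuned values.

```python
import cv2

def abrupt_cuts(video_path: str, threshold: float = 0.5):
    """Flag frame indices where HSV-histogram similarity drops sharply."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, i = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:   # very low similarity => candidate cut
                cuts.append(i)
        prev_hist, i = hist, i + 1
    cap.release()
    return cuts
```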
Article
Full-text available
This study aims to investigate how Al Qaeda uses the Internet for military training and preparation. What kind of training material is available on jihadi webpages, who produces it, and for what purpose? The article argues that in spite of a vast amount of training-related literature online, there have been few organized efforts by Al Qaeda to train their followers by way of the Internet. The Internet is per today not a “virtual training camp” organized from above, but rather a resource bank maintained and accessed largely by self-radicalized sympathizers.
Chapter
Full-text available
The broad adoption of Web 2.0 tools has signalled a new era of “Medicine 2.0” in the field of medical informatics. The support for collaboration within online communities and the sharing of information in social networks offers the opportunity for new communication channels among patients, medical experts, and researchers. This paper introduces MORMED, a novel multilingual social networking and content management platform that exemplifies the Medicine 2.0 paradigm, and aims to achieve knowledge commonality by promoting sociality, while also transcending language barriers through automated translation. The MORMED platform will be piloted in a community interested in the treatment of rare diseases (Lupus or Antiphospholipid Syndrome). KeywordsMedicine 2.0-Social Networking-Multilingual Web-Information Management-Rare Diseases
Article
Full-text available
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors (software to compute the color descriptors from this paper is available from http://www.colordescriptors.com) in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a data set with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results further reveal that, for light intensity shifts, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the data set and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8 percent on the PASCAL VOC 2007 and by 7 percent on the Mediamill Challenge.
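The recommended OpponentSIFT descriptor computes SIFT over the opponent color space. The channel transformation itself is simple; the formulas below are the standard opponent-space definition used in this line of work.

```python
import numpy as np

def opponent_channels(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 RGB image to the opponent color space; computing
    SIFT per resulting channel yields OpponentSIFT."""
    rgb = rgb.astype(np.float64)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    O1 = (R - G) / np.sqrt(2)           # red-green opponency
    O2 = (R + G - 2 * B) / np.sqrt(6)   # yellow-blue opponency
    O3 = (R + G + B) / np.sqrt(3)       # intensity
    return np.stack([O1, O2, O3], axis=-1)
```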
Conference Paper
Full-text available
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used. To increase illumination invariance and discriminative power, color descriptors have been proposed only recently. As many descriptors exist, a structured overview of color invariant descriptors in the context of image category recognition is required.
Conference Paper
Full-text available
In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.
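Two of the best-known shallow features in this line of work are the link density and text density of a block. A hedged sketch of computing them (the thresholds are invented for illustration; the paper trains a classifier over such features rather than hard-coding rules):

```python
def shallow_features(block_text: str, anchor_text: str, num_lines: int):
    """Compute simple boilerplate cues for one text block."""
    words = block_text.split()
    anchor_words = anchor_text.split()
    link_density = len(anchor_words) / max(len(words), 1)
    text_density = len(words) / max(num_lines, 1)
    return link_density, text_density

def looks_like_boilerplate(block_text, anchor_text, num_lines) -> bool:
    link_density, text_density = shallow_features(
        block_text, anchor_text, num_lines)
    # Navigation and ads: many linked words, little running text.
    return link_density > 0.33 or text_density < 5
```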
Conference Paper
Full-text available
Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.
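Because ORB is a binary descriptor, matching uses Hamming distance rather than the Euclidean metric used for SIFT/SURF. A minimal OpenCV sketch (the image paths are placeholders):

```python
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance is the natural metric for binary descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches")
```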
Conference Paper
Full-text available
We propose to mine structured query templates from search logs, for enabling rich query interpretation that recognizes both query intents and associated attributes. We formalize the notion of a template as a sequence of keywords and domain attributes, and our objective is to discover templates with high precision and recall for matching queries in a domain of interest. Our solution bootstraps from small seed input knowledge to discover relevant query templates, by harnessing the wealth of information available in search logs. We model this information in a tri-partite QueST network of queries, sites, and templates. We propose a probabilistic inferencing framework based on the dual metrics of precision and recall, and we show that the dual inferences correspond, respectively, to random walks in the backward and forward directions. We deployed and tested our algorithm over a real-world search log of 15 million queries. The algorithm achieved accuracy as high as 90% (F-measure), with little seed knowledge and even with an incomplete domain schema.
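A template in this formulation mixes literal keywords with domain attributes. The toy matcher below illustrates the notion only; the attribute lexicon is invented, and the paper's probabilistic inference over the QueST network is not reproduced.

```python
ATTRIBUTE_VALUES = {              # toy domain schema (invented)
    "#city": {"paris", "london", "athens"},
}

def matches_template(query: str, template: list[str]) -> bool:
    """Check a query against a template of keywords and #attributes."""
    tokens = query.lower().split()
    if len(tokens) != len(template):
        return False
    return all(
        tok in ATTRIBUTE_VALUES[slot] if slot.startswith("#") else tok == slot
        for tok, slot in zip(tokens, template)
    )

print(matches_template("flights to paris", ["flights", "to", "#city"]))  # True
```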
Conference Paper
Full-text available
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.
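Since DBpedia Spotlight is deployed as a web service, annotating a document is a single HTTP call. The sketch below targets the publicly documented endpoint and parameters as I understand them; verify the endpoint and response format before relying on them.

```python
import requests

def spotlight_annotate(text: str, confidence: float = 0.5):
    """Annotate free text with DBpedia URIs via the Spotlight service."""
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [(r["@surfaceForm"], r["@URI"])
            for r in resp.json().get("Resources", [])]

print(spotlight_annotate("Tor is an anonymity network."))
```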
Article
Full-text available
Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
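The crawl-as-parallel-best-first-search model is easy to make concrete as a priority queue ordered by classifier score. A sequential sketch with assumed `fetch`, `extract_links`, and `score` callables:

```python
import heapq

def best_first_crawl(seeds, fetch, extract_links, score, budget=1000):
    """Best-first search over the Web graph, guided by a classifier.

    `score(url, context) -> float` is the classifier heuristic; higher
    means more promising. heapq is a min-heap, so scores are negated.
    """
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        for link, context in extract_links(page):
            if link not in visited:
                heapq.heappush(frontier, (-score(link, context), link))
    return visited
```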
Article
Full-text available
This paper addresses the problem of large-scale image search. Three constraints have to be taken into account: search accuracy, efficiency, and memory usage. We first present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual words approach for any given vector dimension. We then jointly optimize dimensionality reduction and indexing in order to obtain a precise vector comparison as well as a compact representation. The evaluation shows that the image representation can be reduced to a few dozen bytes while preserving high accuracy. Searching a 100 million image dataset takes about 250 ms on one processor core.
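A widely used non-probabilistic simplification of this aggregation idea is VLAD, which sums the residuals of local descriptors to their nearest codeword. The NumPy sketch below shows VLAD rather than the full Fisher kernel; the power normalization step is a common practice in this line of work.

```python
import numpy as np

def vlad(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Aggregate (N, d) local descriptors against a (K, d) codebook."""
    codebook = codebook.astype(float)
    # Nearest codeword for every descriptor.
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    v = np.zeros_like(codebook)
    for k in range(codebook.shape[0]):
        members = descriptors[nearest == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)  # residual sum
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))   # power normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v    # L2 normalization
```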
Conference Paper
Full-text available
We expand on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million Web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all Web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
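Near-duplicate detection of this kind is typically built on shingling and set similarity. A minimal sketch (production systems hash shingles, e.g. with MinHash, instead of comparing full sets; the threshold is illustrative):

```python
def shingles(text: str, k: int = 5) -> set:
    """k-word shingles of a document's token stream."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def near_duplicates(doc1: str, doc2: str, threshold: float = 0.9) -> bool:
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```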
Article
Full-text available
Domain-specific Web search engines are effective tools for reducing the difficulty experienced when acquiring information from the Web. Existing methods for building domain-specific Web search engines require human expertise or specific facilities. However, we can build a domain-specific search engine simply by adding domain-specific keywords, called "keyword spices," to the user's input query and forwarding it to a general-purpose Web search engine. Keyword spices can be effectively discovered from Web documents using machine learning technologies. The paper describes domain-specific Web search engines that use keyword spices for locating recipes, restaurants, and used cars.
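The mechanism is simply query augmentation before forwarding. A sketch; the example spice is invented, whereas the paper learns such boolean expressions from labeled Web documents.

```python
def spice_query(user_query: str, spice: str) -> str:
    """Append learned domain-specific 'keyword spices' to a user query
    before forwarding it to a general-purpose search engine."""
    return f"{user_query} {spice}"

# Hypothetical spice for the recipe domain.
print(spice_query("chicken soup", '(ingredients OR "preheat oven")'))
```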
Article
Full-text available
We present Tor, a circuit-based low-latency anonymous communication service. This second-generation Onion Routing system addresses limitations in the original design by adding perfect forward secrecy, congestion control, directory servers, integrity checking, configurable exit policies, and a practical design for location-hidden services via rendezvous points. Tor works on the real-world Internet, requires no special privileges or kernel modifications, requires little synchronization or coordination between nodes, and provides a reasonable tradeoff between anonymity, usability, and efficiency. We briefly describe our experiences with an international network of more than 30 nodes. We close with a list of open problems in anonymous communication.
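From a crawler's perspective, using Tor means routing HTTP through a locally running Tor client's SOCKS port. A sketch under common defaults (port 9050; `socks5h` so hostnames, including .onion addresses, resolve through the proxy; requires the `requests[socks]` extra):

```python
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get("https://check.torproject.org",
                    proxies=TOR_PROXIES, timeout=60)
# The check page's wording may change; this string test is illustrative.
print("via Tor" if "Congratulations" in resp.text else "not via Tor")
```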
Article
Full-text available
Much research has been carried out in order to manage structured documents such as SGML documents and to provide powerful query facilities which exploit document structures as well as document contents. In order to perform structure queries efficiently in a structured document management system, an index structure which supports fast document element access must be provided. However, there has been little research on the index structures for structured documents. In this paper, we propose various kinds of new inverted indexing schemes and signature file schemes for efficient structure query processing. We evaluate the storage requirements and disk access times of our schemes and present the analytical and experimental results. 1 Introduction Since the Standard Generalized Markup Language (SGML) [13] [15] was standardized, many structured document management systems have been built to manage structured documents including [1] [2] [3] [4] [5] [6] [17] [18] [20] [21] [23]. In those syst...
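The core idea of the inverted indexing schemes is that postings record not just the document but the structural element containing each term. A toy sketch; the path notation is an assumption for illustration.

```python
from collections import defaultdict

class ElementInvertedIndex:
    """Minimal inverted index whose postings record the structural
    element (an assumed path string such as 'article/section[1]/title')
    in which each term occurs, enabling simple structure queries."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {(doc_id, element_path)}

    def add(self, doc_id: str, element_path: str, text: str):
        for term in text.lower().split():
            self.postings[term].add((doc_id, element_path))

    def search(self, term: str, path_prefix: str = ""):
        return {p for p in self.postings.get(term.lower(), set())
                if p[1].startswith(path_prefix)}

idx = ElementInvertedIndex()
idx.add("d1", "article/section[1]/title", "structured document retrieval")
print(idx.search("retrieval", path_prefix="article/section[1]"))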
Chapter
This chapter takes a look at the existing search landscape to observe that there are a myriad of search systems, covering all possible domains, which share a handful of theoretical retrieval models and are separated by a series of pre- and post-processing steps as well as a finite set of parameters. We also observe that given the infinite variety of real-world search tasks and domains, it is unlikely that any one combination of these steps and parameters would yield a renaissance-engine — a search engine that can answer any questions about art, sciences, law or entertainment. We therefore set forth to analyze the different components of a search system and propose a model for domain specific search, including a definition thereof, as well as a technical framework to build a domain specific search system.
Conference Paper
In this work we deal with the problem of how different local descriptors can be extended, used and combined for improving the effectiveness of video concept detection. The main contributions of this work are: 1) We examine how effectively a binary local descriptor, namely ORB, which was originally proposed for similarity matching between local image patches, can be used in the task of video concept detection. 2) Based on a previously proposed paradigm for introducing color extensions of SIFT, we define in the same way color extensions for two other non-binary or binary local descriptors (SURF, ORB), and we experimentally show that this is a generally applicable paradigm. 3) In order to enable the efficient use and combination of these color extensions within a state-of-the-art concept detection methodology (VLAD), we study and compare two possible approaches for reducing the color descriptor’s dimensionality using PCA. We evaluate the proposed techniques on the dataset of the 2013 Semantic Indexing Task of TRECVID.
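Reducing a color descriptor's dimensionality with PCA is standard enough to sketch directly with scikit-learn; the descriptor dimensions and the target of 80 components below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 10,000 color-descriptor vectors of dimension 384
# (3 channels x 128), as a color extension of SIFT/SURF/ORB might yield.
descriptors = np.random.rand(10_000, 384)

pca = PCA(n_components=80)
reduced = pca.fit_transform(descriptors)
print(reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```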
Article
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
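The integral image trick that makes SURF fast is worth seeing concretely: once the cumulative table is built, any box filter response costs four lookups regardless of its size. A NumPy sketch:

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """Cumulative-sum table: ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii: np.ndarray, y0: int, x0: int, y1: int, x1: int) -> float:
    """Sum over img[y0:y1, x0:x1] in O(1) using four lookups, which is
    what makes SURF's box filters cheap at any scale."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total
```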
Conference Paper
Focussed crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic based on evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, with visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed focussed crawling approach is applied towards the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate the effectiveness of incorporating visual evidence in the link selection process applied by the focussed crawler over the use of textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
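The "local context" of a hyperlink is the text appearing in its vicinity in the parent page. A hedged sketch of extracting such context with BeautifulSoup; the window size and the use of the enclosing element's text are simplifying assumptions, not the paper's definition.

```python
from bs4 import BeautifulSoup

def link_contexts(html: str, window: int = 10):
    """Pair each hyperlink with nearby words from the parent page."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        parent_words = a.parent.get_text(" ", strip=True).split()
        anchor = a.get_text(" ", strip=True)
        yield a["href"], anchor, " ".join(parent_words[:window])
```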
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
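The algorithm described here is the Porter stemmer, for which a faithful implementation ships with NLTK; a usage sketch:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "caresses", "retrieval", "explosives"]:
    print(word, "->", stemmer.stem(word))
# e.g. "running" -> "run", "caresses" -> "caress"
```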
Article
Robot-generated Web indices such as AltaVista are comprehensive but imprecise; manually generated directories such as Yahoo! are precise but cannot keep up with large, rapidly growing categories such as personal homepages or news stories on the American economy. Thus, if a user is searching for a particular page that is not cataloged in a directory, she is forced to query a web index and manually sift through a large number of responses. Furthermore, if the page is not yet indexed, then the user is stymied. This paper presents Dynamic Reference Sifting — a novel architecture that attempts to provide both maximally comprehensive coverage and highly precise responses in real time, for specific page categories. To demonstrate our approach, we describe Ahoy! The Homepage Finder (http://www.cs.washington.edu/research/ahoy), a fielded web service that embodies Dynamic Reference Sifting for the domain of personal homepages. Given a person's name and institution, Ahoy! filters the output of multiple web indices to extract one or two references that are most likely to point to the person's homepage. If it finds no likely candidates, Ahoy! uses knowledge of homepage placement conventions, which it has accumulated from previous experience, to “guess” the URL for the desired homepage. The search process takes 9 seconds on average. On 74% of queries from our primary test sample, Ahoy! finds the target homepage and ranks it as the top reference. 9% of the targets are found by guessing the URL. In comparison, AltaVista can find 58% of the targets and ranks only 23% of these as the top reference.
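The URL-guessing step can be pictured as pattern expansion over placement conventions. The patterns below are invented examples of the kind of conventions Ahoy! accumulates; the real system learns them from experience.

```python
def guess_homepage_urls(name: str, domain: str):
    """Generate candidate homepage URLs from assumed conventions."""
    parts = name.lower().split()
    first, last = parts[0], parts[-1]
    for pattern in ("~{last}", "~{first}{last}",
                    "people/{last}", "homes/{last}"):
        yield f"http://www.{domain}/{pattern.format(first=first, last=last)}/"

print(list(guess_homepage_urls("Ada Lovelace", "cs.example.edu")))
```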
Conference Paper
This talk will review the emerging research in Terrorism Informatics based on a web mining perspective. Recent progress in the internationally renowned Dark Web project will be reviewed, including: deep/dark web spidering (web sites, forums, Youtube, virtual worlds), web metrics analysis, dark network analysis, web-based authorship analysis, and sentiment and affect analysis for terrorism tracking. In collaboration with selected international terrorism research centers and intelligence agencies, the Dark Web project has generated one of the largest databases in the world about extremist/terrorist-generated Internet contents (web sites, forums, blogs, and multimedia documents). Dark Web research has received significant international press coverage, including: Associated Press, USA Today, The Economist, NSF Press, Washington Post, Fox News, BBC, PBS, Business Week, Discover magazine, WIRED magazine, Government Computing Week, Second German TV (ZDF), Toronto Star, and Arizona Daily Star, among others. For more Dark Web project information, please see: http://ai.eller.arizona.edu/research/terror/ .
Article
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
Article
Goal of the workshop was to bring together experts and prospective researchers around the exciting and future-oriented topic of plagiarism analysis, authorship identification, and high similarity search. This topic receives increasing attention, which results, among others, from the fact that information about nearly any subject can be found on the World Wide Web.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
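The matching step described here (nearest-neighbor lookup with rejection of ambiguous matches) is commonly implemented as Lowe's ratio test. A minimal OpenCV sketch with placeholder image paths:

```python
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Ratio test: keep a match only if it is clearly better than the
# second-best candidate, suppressing ambiguous features.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(len(good), "confident matches")
```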
Enabling Cross-Language Intelligent Information Processing in Multilingual Social Networks
  • K Bratanis
  • D Bibikas
  • I Paraskakis
K. Bratanis, D. Bibikas, and I. Paraskakis, "Enabling Cross-Language Intelligent Information Processing in Multilingual Social Networks", Technical Report, Thessaloniki, Greece: South East European Research Centre (SEERC), 2012.
Tor: The second-generation onion router
  • R Dingledine
  • N Mathewson
  • P Syverson
R. Dingledine, N. Mathewson, and P. Syverson, "Tor: The second-generation onion router", In Proc. of the USENIX Security Symposium, 2004, pp. 303-320.
Dynamic Reference Sifting: a Case Study in the Homepage Domain
  • J Shakes
  • M Langheinrich
  • O Etzioni
J. Shakes, M. Langheinrich, and O. Etzioni, "Dynamic Reference Sifting: a Case Study in the Homepage Domain," Proc. 6th International World Wide Web Conference (WWW6), 1997, pp. 189-200.
Index structures for structured documents
  • Y K Lee
  • S Yoo
  • K Yoon
  • B Berra
Y.K. Lee, S. Yoo, K. Yoon, and B. Berra, "Index structures for structured documents," Proc. First ACM International Conference on Digital Libraries, ACM, 1996, pp. 91-99.
MORMED: Towards a Multilingual Social Networking Platform Facilitating Medicine 2.0
  • E Kargioti
  • D Kourtesis
  • D Bibikas
  • I Paraskakis
  • U Boes
E. Kargioti, D. Kourtesis, D. Bibikas, I. Paraskakis, and U. Boes, "MORMED: Towards a Multilingual Social Networking Platform Facilitating Medicine 2.0," Proc. XII Mediterranean Conference on Medical and Biological Engineering and Computing 2010, Springer Berlin Heidelberg, 2010, pp. 971-974.